Define:
… of non-canonical amino acids, from an extensive database of 466 substances from INClusiveDB, derived from about 687 peer-reviewed publications.
Foreword
Given the amount of text needed to explain everything, I have tucked most of
it into collapsible layers in the Overview section. Moreover, feel free to
click on the Table of Contents (menu bar) on the left to jump to the desired
section!
The presented work was performed as part of my involvement in the recent
PROSPERO
preprint for the EU Horizon 2020 project PARC, where we aim to define the
environmental and medical impact of contaminants.
The datasets used are not presented here until the whole work is published; however, this overview shows how programming can help find substances of interest for research studies.
Click on the buttons below to expand information on each and every step
of the way!
The dataset comes from INClusiveDB, a database deposited at the end of 2023
on its owners' official website [1]; the authors also published an extensive
research paper on their work [2].
The database contains different non-canonical amino acids (ncAAs) that have been successfully incorporated into target proteins. This is usually done either to enable protein imaging in assays that track the occurrence of a specific protein in the cell, or to reshape the functionality of the given protein [3].
The database ships with SMILES descriptors, which made the initial
preparation of the data quite easy. While no missing data (SMILES column)
were found, several substances contained atoms that are problematic for 3D
modeling. Since the pharmacophore search / inverse docking server I used did
not accept those atoms (and thus their corresponding ncAA molecules), I
decided to exclude those substances from the overall analysis.
This left me with 437 substances to work with, out of the total of 466.
For clarity and simplicity, I renamed the molecules sequentially: the first molecule in the INClusiveDB CSV file became P001, and the last (437th) molecule became P437.
Whenever I later wanted to look something up in the initial INClusiveDB CSV file, I simply matched on the SMILES descriptors.
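The renaming scheme can be sketched in a few lines; note that the input here is just a toy list of SMILES strings, since the real INClusiveDB column layout is not shown in this overview.

```python
def assign_ids(smiles_list, prefix="P"):
    """Map each SMILES (in CSV row order) to a zero-padded ID like P001."""
    return {f"{prefix}{i + 1:03d}": smi for i, smi in enumerate(smiles_list)}

# Toy input: glycine and alanine SMILES as stand-ins for the real CSV column
ids = assign_ids(["C(C(=O)O)N", "CC(C(=O)O)N"])
print(ids)  # {'P001': 'C(C(=O)O)N', 'P002': 'CC(C(=O)O)N'}
```

Keeping the mapping as a dictionary makes it easy to go back from a P-number to the original SMILES later in the analysis.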
1) HCA
Given the large number of substances (around 440), HCA was not a very
effective grouping method, as this algorithm becomes unreadable when too
many observations are involved.
Moreover, HCA produces dendrograms, which look like evolutionary trees.
Using it on a big dataset of molecules would therefore result in a huge
image with more than 450 branches to decipher, and such clustering would
only add to the number of correlations one has to read and remember. Not to
mention the huge memory requirements of saving that picture at a
sufficiently high resolution.
Last but not least, HCA results depend heavily on the distance metric and the linkage criterion, meaning the clustering can change substantially depending on the options we choose here.
HCA groups our 437 substances, but does it in a very unhelpful
way.
2) PCA
A much more practical choice here is PCA with k-means clustering, which comes in the form of a scatterplot - much easier to read when we have a lot of input variables. This method can greatly help with reducing the dimensionality of the data, more or less “squeezing the data together” based on the similarities that emerge from the descriptors provided in the columns of the initial dataset.
If PCA is combined with k-means clustering, it offers an elegant and easy-to-read way to understand how our big dataset can be simplified, and which groups contain similar structures.
Furthermore, after such grouping, HCA could additionally be used on the smaller groups, especially if each cluster then contained only around 100 substances.
Scree plot to define the right number of PCA components:
For a proper analysis, the number of components must first be defined - that is, how much variance can be explained by a given number of principal components (PCs).
Usually a number between 1 and 10 is enough, but for huge datasets it can go up to 50-60. A general rule of thumb is that the chosen number of components should explain over 0.9 of the cumulative variance. This way we can be most confident in the resulting division.
Given that the non-canonical amino acids in INClusiveDB are not very consistent in structure, they may vary greatly from one another, hence the number of PCs might be quite large (meaning the dataset is not very homogeneous).
Shockingly, we’d need at least 44 components to reach 0.9 of the
variance!
Hence, I chose 45 components to perform the PCA with enough variance explained.
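The component-selection step above can be sketched with a plain SVD-based PCA. The 0.9 threshold follows the rule of thumb mentioned earlier; the matrix below is random toy data, not the real INClusiveDB descriptor table.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.90):
    """Return the smallest number of principal components whose
    cumulative explained-variance ratio reaches the threshold."""
    Xc = X - X.mean(axis=0)                  # center each descriptor column
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values
    ratios = s**2 / np.sum(s**2)             # explained-variance ratios
    cumulative = np.cumsum(ratios)
    # first index where the cumulative curve crosses the threshold
    return int(np.searchsorted(cumulative, threshold) + 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(437, 200))              # toy stand-in for the descriptor table
k = n_components_for_variance(X)
```

Plotting `cumulative` against the component index gives exactly the scree plot used to justify the choice of 45 components.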
3) k-means
Last but not least, I needed to define the number of clusters/groups
that would best describe all 437 substances from INClusiveDB
according to the k-means algorithm.
Although an elbow was nowhere to be found, the silhouette score suggested
5 clusters more clearly.
In theory I could have stayed with 3 clusters, but considering the huge
number of analyzed molecules, 5 clusters are a much safer option!
Such a result shows how important it can be to sometimes use more than one method!
Amino acids and proteins are very specific molecules, prone to attack by human hydrolases (enzymes commonly nicknamed “molecular scissors”). Nevertheless, their behavior upon oral consumption is still routinely checked with ADME by different researchers for a general perspective. Though not perfect, this still offers a perspective acknowledged by professional researchers worldwide [1, 2] - especially since ncAAs are in many cases environmental eco-waste and can be inhaled or eaten without the receiver’s knowledge.
ADME was estimated by computing 2D and 3D descriptors with the Python module RDKit. While other modules (like Mordred) are also common, RDKit is kept very up to date and does not require setting up a separate environment with an older, circa-2018 Python version.
Besides the graph on solubility, I also show which of the substances passed Lipinski’s Rule of Five, a model that offers a simple estimate of whether a given molecule can perform its function within the body without concern for its half-life (in regard to oral bioavailability).
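The rule itself is simple enough to sketch in plain Python. In practice the descriptor values come from RDKit; the function below just applies the common “at most one violation” convention to precomputed values, and the toy numbers are illustrative only:

```python
def passes_lipinski(mw, logp, h_donors, h_acceptors):
    """Lipinski's Rule of Five (common convention: at most one of the
    four criteria may be violated).
    Thresholds: MW <= 500 g/mol, logP <= 5, H-bond donors <= 5,
    H-bond acceptors <= 10."""
    violations = sum([mw > 500, logp > 5, h_donors > 5, h_acceptors > 10])
    return violations <= 1

# Glycine-like toy values (in the real workflow these come from RDKit)
print(passes_lipinski(mw=75.07, logp=-3.2, h_donors=2, h_acceptors=3))  # True
```

A molecule like P393, with MW over 760 g/mol and many polar groups, fails several criteria at once, which matches its status as the single failure below.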
To properly analyze the functionality and fate of molecules in our body, one can choose a method relying on docking the candidate ncAAs against a database of known, deposited enzyme structures. Such “fishing” for enzymes with our substances of interest adds a layer of insight into how a substance could be used in medicine, and also into its metabolic route.
The server I chose for this job was the open-source PharmMapper [1, 2], which seems to be the most detailed and accessible of those available. Its one drawback is that it does not allow depositing multiple files at once, which is why I had to deposit each one separately.
Before that, all SMILES were converted to SDF files with OpenBabel, and the structures were then minimized with VegaZZ, resulting in mol2 files with adjusted hydrogen counts and atom positions, as well as calculated energies.
Each one of them was then deposited in PharmMapper.
After all of the jobs for the INClusiveDB non-canonical amino acids were complete, instead of clicking to retrieve each and every one of them, I decided to write a script that would do it for me. For that job, I used the BeautifulSoup module in Python.
All acquired PDB IDs were transformed to UniProtIDs by ID mapping.
One of the best Python modules for web scraping is BeautifulSoup.
I created a variable storing all PharmMapper IDs and then built their href links in a loop. I then opened each href to trigger the server-side creation of the CSV files. The resulting links were followed and the final CSV files were downloaded with another loop, in which the IDs were renamed to the simplified P001-style format.
From each CSV file, a for loop deleted the first two rows (they only contained a line with the ID value), so that the column names became the first row. Then only the column with PDB IDs was kept and translated to UniProt IDs using a downloaded copy of the UniProtKB database, a loop, and an ID-conversion function. Last but not least, only human UniProt IDs were saved - and their functionality was examined with enrichment analysis, where each enzyme is checked against a dictionary of gene functions, which basically answers “which organ/disease can be affected by this molecule?”.
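The per-file cleanup and ID mapping can be sketched with the standard library. The two junk header lines, the column name “PDB ID”, and the mapping entries below are all placeholders, since the real files come from PharmMapper and the real mapping from a UniProtKB dump:

```python
import csv
import io

# Toy stand-in for one downloaded PharmMapper CSV: two ID/junk rows,
# then the real column names (actual layout may differ).
raw = "job id,12345\n,\nRank,PDB ID,Score\n1,1ABC,0.91\n2,2XYZ,0.88\n"

lines = raw.splitlines()[2:]                 # drop the first two rows
rows = list(csv.DictReader(io.StringIO("\n".join(lines))))
pdb_ids = [row["PDB ID"] for row in rows]

# Placeholder PDB -> UniProt mapping (in practice built from UniProtKB)
pdb_to_uniprot = {"1ABC": "P12345", "2XYZ": "Q67890"}
uniprot_ids = [pdb_to_uniprot[p] for p in pdb_ids if p in pdb_to_uniprot]
```

The same loop then runs over all 437 downloaded files, keeping only the human UniProt IDs for the enrichment step.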
Enrichment simply means checking each acquired enzyme to understand its functionality and importance inside the human cell. Dysregulation of some enzymes can be the root of diseases, and such a check can help us understand whether a given substance is capable of impacting, for example, cancer progression or diabetes treatment.
I used modules like clusterProfiler and DAVIDpy to access the Gene Ontology and KEGG features.
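At its core, an over-representation test of the kind clusterProfiler runs is a hypergeometric test. A minimal stdlib sketch (the gene counts are made-up toy numbers, not real GO/KEGG set sizes):

```python
from math import comb

def enrichment_pvalue(hits, gene_set_size, universe, sample_size):
    """P(X >= hits) for a hypergeometric draw: probability of seeing at
    least `hits` genes from a set of `gene_set_size` annotated genes,
    when `sample_size` genes are drawn from a `universe` of genes."""
    total = comb(universe, sample_size)
    return sum(
        comb(gene_set_size, k) * comb(universe - gene_set_size, sample_size - k)
        for k in range(hits, min(gene_set_size, sample_size) + 1)
    ) / total

# Toy example: all 5 drawn genes fall in a 5-gene set from a 10-gene universe
p = enrichment_pvalue(hits=5, gene_set_size=5, universe=10, sample_size=5)
```

A small p-value means the overlap between our enzyme list and a functional category is unlikely by chance - which is what the GO/KEGG tools report, after multiple-testing correction.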
GeneCards is a service that links human enzymes to given aspects like
disease or functionality. After scraping the sets of enzymes related to
neurotoxicity, I compared their enzyme IDs to those acquired for the ncAAs
I studied.
Interestingly, 200 molecules in INClusiveDB were found with around
80-99 enzymes relating to neurotoxicity.
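The comparison itself boils down to set intersection over UniProt IDs; a toy sketch (all IDs and molecule hit lists here are placeholders, not the real GeneCards or PharmMapper results):

```python
# Placeholder UniProt IDs for neurotoxicity-related enzymes from GeneCards
neurotox_enzymes = {"P12345", "Q67890", "P04637"}

# Placeholder per-molecule enzyme hits from the PharmMapper pipeline
molecule_hits = {
    "P001": {"P12345", "Q99999"},
    "P002": {"A00001"},
    "P003": {"P12345", "Q67890"},
}

# Keep only molecules whose hit list overlaps the neurotoxicity set
overlap = {mol: hits & neurotox_enzymes
           for mol, hits in molecule_hits.items()
           if hits & neurotox_enzymes}
```

Counting the size of each intersection is what produced the “around 80-99 enzymes per molecule” figures above.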
After closer consideration and a PubChem search, only 8 of those 200 had enough research data to actually support a correlation with neurotoxicity. This shows just how little we know about the toxicity of environmental substances like non-canonical amino acids. Given the rise of peptide and peptidomimetic drugs over the last 10 years, it is important to spread awareness as well as fund research programs that could decipher such cytotoxicity before industrial production.
Although only a small number of “safe bets” were present, INClusiveDB is a great resource from which one could choose several ncAAs to buy or synthesize, and then test in in vitro cell cultures to further confirm their impact on neurotoxicity or any other calamity.
The 8 substances mentioned were included in the recent PROSPERO preprint for the EU Horizon 2020 project PARC, where we aim to define the environmental and medical impact of contaminants.
This website was rendered using R, HTML, CSS and JavaScript, all nested inside an RMarkdown document, which allows the file to be further translated to HTML, DOC or PDF.
HCA, as mentioned, is going to be unclear and not so helpful, but is still shown here as an additional component. Interestingly, while the algorithm groups the data into 2 big clusters, it also shows that one sole molecule seems to be much different from all the others.
That molecule is P393 and can be more or less traced by following the blue line in the dendrogram.
Looking at the chemical structure of P393, it’s no wonder it plots so differently! Its molecular weight is 762.509 g/mol, while all other analyzed molecules have a MW under 400 g/mol!
Moreover, its number of valence electrons is… 332, whereas no other molecule exceeds 150!
Just for a graphical comparison, here are 2 different molecules from INClusiveDB, picked at random:
PCA with 45 components with 5 clusters of k-means for 437 substances in total. 45 components corresponded to 90.52% of explained variance.
Interestingly, P393, the outlier in HCA, is also presented here as
the most distinct entity and even forms a separate cluster (Cluster
5) in k-means - which serves as an additional cross-validation
between all 3 unsupervised machine learning methods presented here
(HCA, PCA and k-means).
Of course, such a difference between P393 and all other points squeezes the datapoints together and makes the graph less clear, so P393 was deleted and the analysis was rerun for a better perspective. Keep in mind that P393 will therefore not be visible in this view.
Moreover, the change was NOT done by limiting the x axis, because that messed up the labels, hence deleting the point P393 was the better option.
Of course several labels are not well visible due to the fact that we have almost 500 points on the same graph. This means that additional comparison with Venn graphs might be useful, when we’d like to define which elements are in a given cluster and for example, a given ADME group, later in the analysis.
The most consistent k-means clusters are the 2nd and 5th, where the grouped molecules are most closely related. [However, remember that these clusters are the most consistent only after the deletion of P393! If it is not deleted, cluster 5 consists of only 1 molecule - P393.]
Other outliers, though much less distinct, were P266 & P366.
The table below shows the initial k-means cluster division counts.
| Clusters  | With P393 | Without P393 |
|-----------|-----------|--------------|
| Cluster 1 | 68        | 103          |
| Cluster 2 | 117       | 132          |
| Cluster 3 | 202       | 56           |
| Cluster 4 | 49        | 84           |
| Cluster 5 | 1         | 62           |
Lipinski’s Rule of Five gave a very consistent result, with only ONE
MOLECULE NOT PASSING THE CRITERIA!
After using the unsupervised ML methods, it was not surprising that this
molecule was indeed… P393! It also was not found to pass the
gastrointestinal tract or the blood-brain barrier when TPSA and logP
values were plotted in a BOILED-Egg graph.
Furthermore:
Number of molecules that WERE found to be passing the blood-brain barrier (BBB): 64
Number of molecules NOT found to be passing the blood-brain barrier (BBB): 373
Number of molecules NOT found to be passing the gastro-intestinal barrier (GIA): 42
The latter are labeled in the graph below.
General conclusion: although very small and generally passing Lipinski’s Rule of Five, non-canonical amino acids seem more likely to be metabolised within the gastrointestinal tract than to affect the brain. The 64 molecules found able to pass the BBB were saved as a separate list, to later be compared with the molecules supposedly linked to neurotoxicity.
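The BOILED-Egg classification can be sketched as simple threshold checks. Note this is a rough rectangular simplification: the published model uses two ellipses in the WLOGP-vs-TPSA plane, and the cutoff values below are only approximate:

```python
def boiled_egg_region(tpsa, wlogp):
    """Very rough, rectangular approximation of the BOILED-Egg regions
    (the published model uses ellipses in WLOGP vs TPSA space).
    Returns 'BBB' (yolk), 'HIA' (white), or 'neither'."""
    if tpsa < 79 and 0.4 < wlogp < 6.0:    # approximate yolk: likely BBB permeant
        return "BBB"
    if tpsa < 142 and -2.3 < wlogp < 6.8:  # approximate white: likely GI absorbed
        return "HIA"
    return "neither"

print(boiled_egg_region(tpsa=30.0, wlogp=2.0))   # BBB
print(boiled_egg_region(tpsa=120.0, wlogp=1.0))  # HIA
```

A very large, very polar molecule like P393 falls outside both regions, matching its position outside the egg in the graph.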
In other words, can we predict if the molecule is passing the blood-brain barrier by the unsupervised clustering? Do these elements correlate with each other in this dataset?
Venn:
– to be filled yet –
Given the extensive amount of data (437 molecules), here I’ll only
provide a snippet for general understanding of what kind of comparisons
I’m able to do.
This specific Venn diagram shows how many shared elements (here,
enzymes/genes) were found for: P212, P266, P343, P366 and P399.
In the upper-right corner we have the number of input genes/enzymes for each of the compared molecules, and the nested Venn diagram shows how much they overlap with each other!
The graph was made with R and the library “venndetail” from the
Bioconductor package family, one of the biggest ecosystems for
visualization of chemoinformatic & bioinformatic data.
# Check how many UniProt IDs are shared between the 5 compared molecules
library(venndetail)  # from Bioconductor
res <- venndetail(list(P212 = P212,
                       P266 = P266,
                       P343 = P343,
                       P366 = P366,
                       P399 = P399))
# Make a Venn-pie graph of the results (log scale for readability)
pie <- vennpie(res, log = TRUE)
# Save the Venn-pie graph to a PNG file
png("pie_test.png", width = 10, height = 8, units = "in", res = 300)
plot(pie)
dev.off()
The graph represents how many total UniProt IDs were acquired for each of the compared molecules (P212, P266, P343, P366, P399).