Define:
… of non-canonical amino acids, from an extensive database of 466 substances from INClusiveDB, derived from about 687 peer-reviewed publications.
Foreword
Given the amount of text needed to explain everything, I have tucked most of
it into collapsible layers in the Overview section. Moreover, feel free to
click on the Table of Contents (menu bar) on the left to jump to the desired
section!
The presented work was performed as part of my involvement in the recent
PROSPERO
preprint for the EU Horizon 2020 project PARC, where we aim to define the
environmental and medical impact of contaminants.
The datasets used are not presented here until the whole work is published; however, this overview shows how programming can help find substances of interest for research studies.
Click on the buttons below to expand information on each and every step
of the way!
The dataset comes from INClusiveDB, a database deposited at the end of 2023
on its owners' official website [1]; the authors also published an extensive
research paper on their work [2].
The database contains different non-canonical amino acids (ncAAs) that have been successfully incorporated into target proteins. This is usually done either to enable protein imaging in assays that track the occurrence of a specific protein in the cell, or to reshape the functionality of the given protein [3].
The database ships with SMILES descriptors, which made the initial
preparation of the data quite easy. While no missing data (SMILES column)
were found, several substances contained atoms that are problematic for 3D
modeling. Since the pharmacophore search / inverse docking server I used did
not accept those atoms (and thus their corresponding ncAA molecules), I
decided to exclude those substances from the overall analysis.
This left me with 437 substances to work with, out of the total of 466.
For clarity and simplicity, I renamed the molecules sequentially: the first molecule in the INClusiveDB CSV file became P001, and the last (437th) molecule became P437.
Whenever I later wanted to look something up in the initial INClusiveDB CSV file, I simply matched on the SMILES descriptors.
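The renaming scheme can be sketched in a few lines; note that the input here is just a toy list of SMILES strings, since the real INClusiveDB column layout is not shown in this overview.

```python
def assign_ids(smiles_list, prefix="P"):
    """Map each SMILES (in CSV row order) to a zero-padded ID like P001."""
    return {f"{prefix}{i + 1:03d}": smi for i, smi in enumerate(smiles_list)}

# Toy input: glycine and alanine SMILES as stand-ins for the real CSV column
ids = assign_ids(["C(C(=O)O)N", "CC(C(=O)O)N"])
print(ids)  # {'P001': 'C(C(=O)O)N', 'P002': 'CC(C(=O)O)N'}
```

Keeping the mapping as a dictionary makes it easy to go back from a P-number to the original SMILES later in the analysis.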
1) HCA
Given the large number of substances (around 440), HCA was not a very
effective grouping method, as this algorithm becomes unreadable when too
many observations are involved.
Moreover, HCA produces dendrograms, which look like evolutionary trees.
Using it on a big dataset of molecules would therefore result in a huge
image with more than 450 branches to decipher, and such clustering would
only add to the number of correlations one has to read and remember. Not to
mention the huge memory requirements of saving that picture at a
sufficiently high resolution.
Last but not least, HCA results depend heavily on the distance metric and the linkage criterion, meaning the clustering can change substantially depending on the options we choose here.
HCA groups our 437 substances, but does it in a very unhelpful
way.
2) PCA
A much more practical choice here is PCA with k-means clustering, which comes in the form of a scatterplot - much easier to read when we have a lot of input variables. This method can greatly help with reducing the dimensionality of the data, more or less “squeezing the data together” based on the similarities that emerge from the descriptors provided in the columns of the initial dataset.
If PCA is combined with k-means clustering, it offers an elegant and easy-to-read way to understand how our big dataset can be simplified, and which groups contain similar structures.
Furthermore, after such grouping, HCA could additionally be used on the smaller groups, especially if each cluster then contained only around 100 substances.
Scree plot to define the right number of PCA components:
For a proper analysis, the number of components must first be defined - that is, how much variance can be explained by a given number of principal components (PCs).
Usually a number between 1 and 10 is enough, but for huge datasets it can go up to 50-60. A general rule of thumb is that the chosen number of components should explain over 0.9 of the cumulative variance. This way we can be most confident in the resulting division.
Given that the non-canonical amino acids in INClusiveDB are not very consistent in structure, they may vary greatly from one another, hence the number of PCs might be quite large (meaning the dataset is not very homogeneous).
Shockingly, we’d need at least 44 components to reach 0.9 of the
variance!
Hence, I chose 45 components to perform the PCA with enough variance explained.
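The component-selection step above can be sketched with a plain SVD-based PCA. The 0.9 threshold follows the rule of thumb mentioned earlier; the matrix below is random toy data, not the real INClusiveDB descriptor table.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.90):
    """Return the smallest number of principal components whose
    cumulative explained-variance ratio reaches the threshold."""
    Xc = X - X.mean(axis=0)                  # center each descriptor column
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values
    ratios = s**2 / np.sum(s**2)             # explained-variance ratios
    cumulative = np.cumsum(ratios)
    # first index where the cumulative curve crosses the threshold
    return int(np.searchsorted(cumulative, threshold) + 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(437, 200))              # toy stand-in for the descriptor table
k = n_components_for_variance(X)
```

Plotting `cumulative` against the component index gives exactly the scree plot used to justify the choice of 45 components.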
3) k-means
Last but not least, I needed to define the number of clusters/groups
that would best describe all 437 substances from INClusiveDB
according to the k-means algorithm.
Although an elbow was nowhere to be found, the silhouette score suggested
5 clusters more clearly.
In theory I could have stayed with 3 clusters, but considering the huge
number of analyzed molecules, 5 clusters are a much safer option!
Such a result shows how important it can be to sometimes use more than one method!
Amino acids and proteins are very specific molecules, prone to attack by human hydrolases (enzymes commonly nicknamed “molecular scissors”). Nevertheless, their behavior upon oral consumption is still routinely checked with ADME by different researchers for a general perspective. Though not perfect, this still offers a perspective acknowledged by professional researchers worldwide [1, 2] - especially since ncAAs are in many cases environmental eco-waste and can be inhaled or eaten without the receiver’s knowledge.
ADME was estimated by computing 2D and 3D descriptors with the Python module RDKit. While other modules (like Mordred) are also common, RDKit is kept very up to date and does not require setting up a separate environment with an older, circa-2018 Python version.
Besides the graph on solubility, I also show which of the substances passed Lipinski’s Rule of Five, a model that offers a simple estimate of whether a given molecule can perform its function within the body without concern for its half-life (in regard to oral bioavailability).
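The rule itself is simple enough to sketch in plain Python. In practice the descriptor values come from RDKit; the function below just applies the common “at most one violation” convention to precomputed values, and the toy numbers are illustrative only:

```python
def passes_lipinski(mw, logp, h_donors, h_acceptors):
    """Lipinski's Rule of Five (common convention: at most one of the
    four criteria may be violated).
    Thresholds: MW <= 500 g/mol, logP <= 5, H-bond donors <= 5,
    H-bond acceptors <= 10."""
    violations = sum([mw > 500, logp > 5, h_donors > 5, h_acceptors > 10])
    return violations <= 1

# Glycine-like toy values (in the real workflow these come from RDKit)
print(passes_lipinski(mw=75.07, logp=-3.2, h_donors=2, h_acceptors=3))  # True
```

A molecule like P393, with MW over 760 g/mol and many polar groups, fails several criteria at once, which matches its status as the single failure below.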
To properly analyze the functionality and fate of molecules in our body, one can choose a method relying on docking the candidate ncAAs against a database of known, deposited enzyme structures. Such “fishing” for enzymes with our substances of interest adds a layer of insight into how a substance could be used in medicine, and also into its metabolic route.
The server I chose for this job was the open-source PharmMapper [1, 2], which seems to be the most detailed and accessible of those available. Its one drawback is that it does not allow depositing multiple files at once, which is why I had to deposit each one separately.
Before that, all SMILES were converted to SDF files with OpenBabel, and the structures were then minimized with VegaZZ, resulting in mol2 files with adjusted hydrogen counts and atom positions, as well as calculated energies.
Each one of them was then deposited in PharmMapper.
After all of the jobs for the INClusiveDB non-canonical amino acids were complete, instead of clicking to retrieve each and every one of them, I decided to write a script that would do it for me. For that job, I used the BeautifulSoup module in Python.
All acquired PDB IDs were transformed to UniProtIDs by ID mapping.
One of the best Python modules for web scraping is BeautifulSoup.
I created a variable storing all PharmMapper IDs and then built their href links in a loop. I then opened each href to trigger the server-side creation of the CSV files. The resulting links were followed and the final CSV files were downloaded with another loop, in which the IDs were renamed to the simplified P001-style format.
From each CSV file, a for loop deleted the first two rows (they only contained a line with the ID value), so that the column names became the first row. Then only the column with PDB IDs was kept and translated to UniProt IDs using a downloaded copy of the UniProtKB database, a loop, and an ID-conversion function. Last but not least, only human UniProt IDs were saved - and their functionality was examined with enrichment analysis, where each enzyme is checked against a dictionary of gene functions, which basically answers “which organ/disease can be affected by this molecule?”.
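The per-file cleanup and ID mapping can be sketched with the standard library. The two junk header lines, the column name “PDB ID”, and the mapping entries below are all placeholders, since the real files come from PharmMapper and the real mapping from a UniProtKB dump:

```python
import csv
import io

# Toy stand-in for one downloaded PharmMapper CSV: two ID/junk rows,
# then the real column names (actual layout may differ).
raw = "job id,12345\n,\nRank,PDB ID,Score\n1,1ABC,0.91\n2,2XYZ,0.88\n"

lines = raw.splitlines()[2:]                 # drop the first two rows
rows = list(csv.DictReader(io.StringIO("\n".join(lines))))
pdb_ids = [row["PDB ID"] for row in rows]

# Placeholder PDB -> UniProt mapping (in practice built from UniProtKB)
pdb_to_uniprot = {"1ABC": "P12345", "2XYZ": "Q67890"}
uniprot_ids = [pdb_to_uniprot[p] for p in pdb_ids if p in pdb_to_uniprot]
```

The same loop then runs over all 437 downloaded files, keeping only the human UniProt IDs for the enrichment step.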
Enrichment simply means checking each acquired enzyme to understand its functionality and importance inside the human cell. Dysregulation of some enzymes can be the root of diseases, and such a check can help us understand whether a given substance is capable of impacting, for example, cancer progression or diabetes treatment.
I used modules like clusterProfiler and DAVIDpy to access the Gene Ontology and KEGG features.
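At its core, an over-representation test of the kind clusterProfiler runs is a hypergeometric test. A minimal stdlib sketch (the gene counts are made-up toy numbers, not real GO/KEGG set sizes):

```python
from math import comb

def enrichment_pvalue(hits, gene_set_size, universe, sample_size):
    """P(X >= hits) for a hypergeometric draw: probability of seeing at
    least `hits` genes from a set of `gene_set_size` annotated genes,
    when `sample_size` genes are drawn from a `universe` of genes."""
    total = comb(universe, sample_size)
    return sum(
        comb(gene_set_size, k) * comb(universe - gene_set_size, sample_size - k)
        for k in range(hits, min(gene_set_size, sample_size) + 1)
    ) / total

# Toy example: all 5 drawn genes fall in a 5-gene set from a 10-gene universe
p = enrichment_pvalue(hits=5, gene_set_size=5, universe=10, sample_size=5)
```

A small p-value means the overlap between our enzyme list and a functional category is unlikely by chance - which is what the GO/KEGG tools report, after multiple-testing correction.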
GeneCards is a service that links human enzymes to given aspects like
disease or functionality. After scraping the sets of enzymes related to
neurotoxicity, I compared their enzyme IDs to those acquired for the ncAAs
I studied.
Interestingly, 200 molecules in INClusiveDB were found with around
80-99 enzymes relating to neurotoxicity.
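The comparison itself boils down to set intersection over UniProt IDs; a toy sketch (all IDs and molecule hit lists here are placeholders, not the real GeneCards or PharmMapper results):

```python
# Placeholder UniProt IDs for neurotoxicity-related enzymes from GeneCards
neurotox_enzymes = {"P12345", "Q67890", "P04637"}

# Placeholder per-molecule enzyme hits from the PharmMapper pipeline
molecule_hits = {
    "P001": {"P12345", "Q99999"},
    "P002": {"A00001"},
    "P003": {"P12345", "Q67890"},
}

# Keep only molecules whose hit list overlaps the neurotoxicity set
overlap = {mol: hits & neurotox_enzymes
           for mol, hits in molecule_hits.items()
           if hits & neurotox_enzymes}
```

Counting the size of each intersection is what produced the “around 80-99 enzymes per molecule” figures above.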
After closer consideration and a PubChem search, only 8 of those 200 had enough research data to actually support a correlation with neurotoxicity. This shows just how little we know about the toxicity of environmental substances like non-canonical amino acids. Given the rise of peptide and peptidomimetic drugs over the last 10 years, it is important to spread awareness as well as fund research programs that could decipher such cytotoxicity before industrial production.
Although only a small number of “safe bets” were present, INClusiveDB is a great resource from which one could choose several ncAAs to buy or synthesize, and then test in in vitro cell cultures to further confirm their impact on neurotoxicity or any other calamity.
The 8 substances mentioned were included in the recent PROSPERO preprint for the EU Horizon 2020 project PARC, where we aim to define the environmental and medical impact of contaminants.
This website was rendered using R, HTML, CSS and JavaScript, all nested inside an RMarkdown document, which allows the file to be further translated to HTML, DOC or PDF.
HCA, as mentioned, is going to be unclear and not so helpful, but is still shown here as an additional component. Interestingly, while the algorithm groups the data into 2 big clusters, it also shows that one sole molecule seems to be much different from all the others.
That molecule is P393 and can be more or less traced by following the blue line in the dendrogram.
Looking at the chemical structure of P393, it’s no wonder it plots so differently! Its molecular weight is 762.509 g/mol, while all other analyzed molecules have a MW under 400 g/mol!
Moreover, its number of valence electrons is… 332, whereas no other molecule exceeds 150!
Just for a graphical comparison, here are 2 different molecules from INClusiveDB, picked at random:
PCA with 45 components with 5 clusters of k-means for 437 substances in total. 45 components corresponded to 90.52% of explained variance.
Interestingly, P393, the outlier in HCA, is also presented here as
the most distinct entity and even forms a separate cluster (Cluster
5) in k-means - which serves as an additional cross-validation
between all 3 unsupervised machine learning methods presented here
(HCA, PCA and k-means).
Of course, such a difference between P393 and all other points squeezes the datapoints together and makes the graph less clear, so P393 was deleted and the analysis was rerun for a better perspective. Keep in mind that P393 will therefore not be visible in this view.
Moreover, the change was NOT done by limiting the x axis, because that messed up the labels, hence deleting the point P393 was the better option.
Of course several labels are not well visible due to the fact that we have almost 500 points on the same graph. This means that additional comparison with Venn graphs might be useful, when we’d like to define which elements are in a given cluster and for example, a given ADME group, later in the analysis.
The most consistent k-means clusters are the 2nd and 5th, where the grouped molecules are most closely related. [However, remember that these clusters are the most consistent only after the deletion of P393! If it is not deleted, cluster 5 consists of only 1 molecule - P393.]
Other outliers, though much less distinct, were P266 & P366.
The table below shows the initial k-means cluster division counts.
| Clusters  | With P393 | Without P393 |
|-----------|-----------|--------------|
| Cluster 1 | 68        | 103          |
| Cluster 2 | 117       | 132          |
| Cluster 3 | 202       | 56           |
| Cluster 4 | 49        | 84           |
| Cluster 5 | 1         | 62           |
Lipinski’s Rule of Five gave a very consistent result, with only ONE
MOLECULE NOT PASSING THE CRITERIA!
After using the unsupervised ML methods, it was not surprising that this
molecule was indeed… P393! It also was not found to pass the
gastrointestinal tract or the blood-brain barrier when TPSA and logP
values were plotted in a BOILED-Egg graph.
Furthermore:
Number of molecules that WERE found to be passing the blood-brain barrier (BBB): 64
Number of molecules NOT found to be passing the blood-brain barrier (BBB): 373
Number of molecules NOT found to be passing the gastro-intestinal barrier (GIA): 42
The latter are labeled in the graph below.
General conclusion: although very small and generally passing Lipinski’s Rule of Five, non-canonical amino acids seem more likely to be metabolised within the gastrointestinal tract than to affect the brain. The 64 molecules found able to pass the BBB were saved as a separate list, to later be compared with the molecules supposedly linked to neurotoxicity.
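The BOILED-Egg classification can be sketched as simple threshold checks. Note this is a rough rectangular simplification: the published model uses two ellipses in the WLOGP-vs-TPSA plane, and the cutoff values below are only approximate:

```python
def boiled_egg_region(tpsa, wlogp):
    """Very rough, rectangular approximation of the BOILED-Egg regions
    (the published model uses ellipses in WLOGP vs TPSA space).
    Returns 'BBB' (yolk), 'HIA' (white), or 'neither'."""
    if tpsa < 79 and 0.4 < wlogp < 6.0:    # approximate yolk: likely BBB permeant
        return "BBB"
    if tpsa < 142 and -2.3 < wlogp < 6.8:  # approximate white: likely GI absorbed
        return "HIA"
    return "neither"

print(boiled_egg_region(tpsa=30.0, wlogp=2.0))   # BBB
print(boiled_egg_region(tpsa=120.0, wlogp=1.0))  # HIA
```

A very large, very polar molecule like P393 falls outside both regions, matching its position outside the egg in the graph.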
In other words, can we predict if the molecule is passing the blood-brain barrier by the unsupervised clustering? Do these elements correlate with each other in this dataset?
Venn:
– to be filled yet –
Given the extensive amount of data (437 molecules), here I’ll only
provide a snippet for general understanding of what kind of comparisons
I’m able to do.
This specific Venn diagram shows how many shared elements (here,
enzymes/genes) were found for: P212, P266, P343, P366 and P399.
In the upper-right corner we have the number of input genes/enzymes for each of the compared molecules, and the nested Venn diagram shows how much they overlap with each other!
The graph was made with R and the library “venndetail” from the
Bioconductor package family, one of the biggest ecosystems for
visualization of chemoinformatic & bioinformatic data.
# Check how many UniProt IDs are shared between the 5 compared molecules
library(venndetail)  # from Bioconductor
res <- venndetail(list(P212 = P212,
                       P266 = P266,
                       P343 = P343,
                       P366 = P366,
                       P399 = P399))
# Make a Venn-pie graph of the results (log scale for readability)
pie <- vennpie(res, log = TRUE)
# Save the Venn-pie graph to a PNG file
png("pie_test.png", width = 10, height = 8, units = "in", res = 300)
plot(pie)
dev.off()
The graph represents how many total UniProt IDs were acquired for each of the compared molecules (P212, P266, P343, P366, P399).