Leveraging 13C NMR spectrum data derived from SMILES for machine learning-based prediction of a small biomolecule functionality: a case study on human Dopamine D1 receptor antagonists

Ivanova, Mariya; Russo, Nicola; Mihaylov, Gueorgui; Konstantin, Nikolic

Ivanova, Mariya, Russo, Nicola, Mihaylov, Gueorgui and Konstantin, Nikolic ORCID: https://orcid.org/0000-0002-6551-2977 (2025) Leveraging 13C NMR spectrum data derived from SMILES for machine learning-based prediction of a small biomolecule functionality: a case study on human Dopamine D1 receptor antagonists. Advance Intelligent Discovery. (Submitted)

Available under License Creative Commons Attribution.

Abstract

This study contributes to ongoing research on predicting the functionality of small biomolecules from Carbon-13 Nuclear Magnetic Resonance (13C NMR) spectrum data using machine learning (ML). A bioassay on human dopamine D1 receptor antagonists was used to demonstrate the approach. The Simplified Molecular Input Line Entry System (SMILES) notations of the compounds were extracted and converted into spectroscopic data using purpose-built software. These data were then used to train and evaluate ML models with scikit-learn algorithms. The models were trained on 27,756 samples and tested on 5,466. Of the estimators tested (K-Nearest Neighbors, Decision Tree Classifier, Random Forest Classifier, Gradient Boosting Classifier, XGBoost Classifier, and Support Vector Classifier), the Support Vector Classifier was the most effective, achieving 71.5% accuracy and a cross-validation score of 0.749. The methodology can be applied to predict the functionality of any compound, provided relevant data are available. It was also hypothesized that increasing the number of samples would improve accuracy. In addition, a time- and cost-efficient CID_SID ML model was developed, which allows compounds to be checked for D1 receptor antagonist activity using only their PubChem identifiers. This model achieved 80.2% accuracy and a five-fold cross-validation score of 0.8071.
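
The abstract reports two kinds of metric for each model: accuracy on a held-out test set and a five-fold cross-validation score. The sketch below illustrates how those two numbers are typically computed. It is not the authors' code: the data are randomly generated, the feature layout (peak counts per chemical-shift bin) is an assumption made for illustration, and a small hand-written k-nearest-neighbour classifier stands in for the scikit-learn estimators so that the example runs with the Python standard library alone.

```python
import random


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training rows closest to x (squared Euclidean distance)."""
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)),
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def accuracy(train_X, train_y, test_X, test_y, k=3):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y))
    return hits / len(test_y)


def cross_val_accuracy(X, y, folds=5, k=3):
    """Mean accuracy over `folds` contiguous, non-overlapping validation folds."""
    n = len(y)
    scores = []
    for f in range(folds):
        lo, hi = f * n // folds, (f + 1) * n // folds
        scores.append(accuracy(X[:lo] + X[hi:], y[:lo] + y[hi:], X[lo:hi], y[lo:hi], k))
    return sum(scores) / folds


def make_toy_data(n=200, n_bins=8, seed=0):
    """Hypothetical data: each row is a vector of peak counts per chemical-shift bin."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)  # 1 = active, 0 = inactive (made-up labels)
        row = [rng.randint(0, 3) for _ in range(n_bins)]
        row[0] += 6 * label  # artificial class signal, purely for illustration
        X.append(row)
        y.append(label)
    return X, y


X, y = make_toy_data()
split = int(0.8 * len(y))  # 80/20 train/test split, an arbitrary choice
cv_score = cross_val_accuracy(X[:split], y[:split], folds=5)
test_acc = accuracy(X[:split], y[:split], X[split:], y[split:])
print(f"5-fold CV accuracy: {cv_score:.3f}, hold-out accuracy: {test_acc:.3f}")
```

With scikit-learn, the same roles are played by an estimator's `fit` and `score` methods and by `cross_val_score` from `sklearn.model_selection`. The numbers printed here reflect only the synthetic data and say nothing about the results reported in the abstract.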

Item Type:	Article
Keywords:	scikit-learn, drug development, drug discovery, CID-SID ML model, neurotransmitter
Subjects:	Computing > Intelligent systems
Date Deposited:	17 Sep 2025
URI:	https://repository.uwl.ac.uk/id/eprint/14071