Leveraging Machine Learning and IUPAC names to identify TDP1 inhibitors

Ivanova, Mariya, Russo, Nicola, Mihaylov, Gueorgui and Konstantin, Nikolic (ORCID: https://orcid.org/0000-0002-6551-2977) (2025) Leveraging Machine Learning and IUPAC names to identify TDP1 inhibitors. Computational and Structural Biotechnology. (Submitted)

Leveraging Machine Learning and IUPAC names to identify TDP1 inhibitors_IvanovaM_accessible.pdf - Submitted Version
Available under License Creative Commons Attribution.

Abstract

This paper introduces a series of computational approaches to help drug developers discover new inhibitors of human tyrosyl-DNA phosphodiesterase 1 (TDP1), using a PubChem bioassay as a case study. The approaches rest on a custom dataset generated by tokenizing the IUPAC names of compounds into more than 5,000 distinct functional group fragments, with labels derived from the initial TDP1 inhibition data. First, a RandomForestClassifier (RFC) model was developed to predict whether a compound is a TDP1 inhibitor. Trained on more than 94,000 samples, the model achieved an accuracy of 70.9% and an ROC score of 70.8%. Building on this, two further approaches gave deeper insight into the structural characteristics that influence inhibition. Reordering the feature importance list by the proportion of active cases revealed the expected effects of specific functional groups, and this analysis was then refined to pinpoint the most and least desirable fragments for TDP1 inhibition. Separately, a highly efficient CID_SID ML model was developed that uses only compound and substance identifiers from PubChem. This model outperformed the RFC model, with an accuracy of 85.2% and a precision of 94.2%, demonstrating the potential for rapid and effective screening from simplified input data. Collectively, these methods offer valuable computational tools for accelerating the drug discovery process.
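
The fragment-based idea in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the tokenizer below is a crude assumption (split on anything that is not a letter), the names `SAMPLES`, `tokenize_iupac`, `active_proportions` and `rank_fragments` are hypothetical, and the labels are invented toy values rather than bioassay results. The sketch shows only how name fragments could be ranked by the proportion of active compounds that contain them. In the paper, this reordering is applied to the feature importance list of a scikit-learn RandomForestClassifier, which is left out here.

```python
import re
from collections import Counter

# Hypothetical toy data: (IUPAC name, label), 1 = active, 0 = inactive.
# The labels are invented for illustration; they are NOT bioassay results.
SAMPLES = [
    ("2-methylpropan-1-ol", 0),
    ("4-nitrobenzoic acid", 1),
    ("N-phenylacetamide", 0),
    ("2-chloro-4-nitrophenol", 1),
    ("ethyl benzoate", 0),
    ("3-nitrobenzoic acid", 1),
    ("benzoic acid", 0),
]


def tokenize_iupac(name):
    """Assumed tokenizer: split on non-letters, drop pieces of one letter.

    Returns a set, so each fragment counts once per compound.
    """
    return {t for t in re.split(r"[^a-z]+", name.lower()) if len(t) > 1}


def active_proportions(samples):
    """Map each fragment to the fraction of its compounds labelled active."""
    total = Counter()
    active = Counter()
    for name, label in samples:
        for fragment in tokenize_iupac(name):
            total[fragment] += 1
            active[fragment] += label
    return {fragment: active[fragment] / total[fragment] for fragment in total}


def rank_fragments(samples):
    """Sort fragments by active proportion (descending), then by name."""
    proportions = active_proportions(samples)
    return sorted(proportions.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_fragments(SAMPLES)
for fragment, proportion in ranking:
    print(f"{fragment:16s} {proportion:.2f}")
```

On a real dataset, fragments seen in only a handful of compounds would give unstable proportions, so a minimum-count filter would probably be needed before reading anything into the top or bottom of such a ranking.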

Item Type: Article
Keywords: scikit-learn, PubChem, HTS, bioassay, CID_SID ML model
Subjects: Computing
Depositing User: Mariya Ivanova
Date Deposited: 16 Sep 2025 13:49
Last Modified: 16 Sep 2025 14:00
URI: https://repository.uwl.ac.uk/id/eprint/14068
