TY - JOUR
T1 - canSAR chemistry registration and standardization pipeline
AU - Dolciami, Daniela
AU - Villasclaras-Fernandez, Eloy
AU - Kannas, Christos
AU - Meniconi, Mirco
AU - Al-Lazikani, Bissan
AU - Antolin, Albert A.
N1 - Funding Information:
AAA is primarily supported by a Wellcome Trust Sir Henry Wellcome Postdoctoral Fellowship (204735/Z/16/Z); the People Programme (Marie Curie Actions) of the 7th Framework Programme of the European Union (FP7/2007–2013) under REA Grant Agreement No. 600388 (TECNIOspring programme); and the Agency of Business Competitiveness of the Government of Catalonia, ACCIO. BA-L, DD and MM are funded by a Cancer Research UK (CRUK) programme grant to the CRUK Cancer Therapeutics Unit (grant C309/A11566). BA-L receives research funding from a Wellcome Trust biomedical resource and technology Grant to sustain and develop the Chemical Probes Portal (212969/Z/18/Z) and from a Cancer Research UK Drug Discovery Committee strategic award to sustain and develop canSAR (C35696/A23187). DD and EV-F are funded by a CRUK Drug Discovery Committee strategic award to sustain and develop canSAR (C35696/A23187). We acknowledge infrastructure support from CRUK for the ICR CRUK Centre and NHS funding to the NIHR Biomedical Research Centre at the ICR and Royal Marsden NHS Foundation Trust. The open access fees were covered by the ICR Sir John Beckwith Library.
We thank the Heather Beckwith Charitable Settlement and The John L. Beckwith Charitable Trust for their generous support of our High Performance Computing facility. We would like to thank all the members of the canSAR Team and the canSAR Scientific Advisory Board, whose advice, guidance and commitment to the development of canSARchem have been invaluable. We would also like to thank Ian Collins, John Overington and Huabin Hu for their help assessing the pipeline.
Publisher Copyright:
© 2022, The Author(s).
PY - 2022/12
Y1 - 2022/12
N2 - Background: Integration of medicinal chemistry data from numerous public resources is an increasingly important part of academic drug discovery and translational research because it can bring a wealth of important knowledge about compounds together in one place. However, different data sources can report the same or related compounds in various forms (e.g., tautomers, racemates), highlighting the need to organise related compounds in hierarchies that alert the user to potentially relevant bioactivity data. To generate these compound hierarchies, we have developed and implemented canSARchem, a new compound registration and standardization pipeline, as part of the canSAR public knowledgebase. canSARchem builds on previously developed ChEMBL and PubChem pipelines and is developed using KNIME. We describe the pipeline, which we make publicly available, and we provide examples of the strengths and limitations of using hierarchies for bioactivity data exploration. Finally, we identify canonicalization enrichment in FDA-approved drugs, illustrating the benefits of our approach. Results: We created a chemical registration and standardization pipeline in KNIME and made it freely available to the research community. The pipeline consists of five steps to register the compounds and create the compound hierarchy: (1) structure checker, (2) standardization, (3) generation of canonical tautomers and representative structures, (4) salt strip, and (5) generation of the abstract structure used to build the compound hierarchy. Unlike ChEMBL’s RDKit pipeline, and similar to PubChem’s OpenEye pipeline, we carry out compound canonicalization before deriving the parent structure. canSARchem has a lower rejection rate than both PubChem and ChEMBL. We use our pipeline to assess the impact of grouping compounds in hierarchies for bioactivity data exploration. We find that FDA-approved drugs show statistically significant sensitivity to canonicalization compared to the majority of bioactive compounds, which demonstrates the importance of this step. Conclusions: We use canSARchem to standardize all the compounds uploaded to canSAR (> 3 million), enabling efficient data integration and the rapid identification of alternative compound forms with useful bioactivity data. Comparison with the PubChem and ChEMBL pipelines showed comparable performance in compound standardization, but only PubChem and canSAR canonicalize tautomers, and canSAR has a slightly lower rejection rate. Our results highlight the importance of compound hierarchies for bioactivity data exploration. We make canSARchem available under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0) at https://gitlab.icr.ac.uk/cansar-public/compound-registration-pipeline.
KW - Canonicalization
KW - canSAR
KW - Compound hierarchy
KW - FDA-approved drugs
KW - KNIME
KW - Standardization
KW - Tautomerism
UR - http://www.scopus.com/inward/record.url?scp=85130967936&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85130967936&partnerID=8YFLogxK
U2 - 10.1186/s13321-022-00606-7
DO - 10.1186/s13321-022-00606-7
M3 - Article
C2 - 35643512
AN - SCOPUS:85130967936
SN - 1758-2946
VL - 14
JO - Journal of Cheminformatics
JF - Journal of Cheminformatics
IS - 1
M1 - 28
ER -