Characterizing Uncertainty in Machine Learning for Chemistry

Both the literature and social media are full of success stories of machine learning (ML) in drug design. Machine learning has been successfully applied to a wide range of drug design problems: property prediction, force field development, synthesis planning, protein folding, and many more.

However, for each success story where ML has found a practical application, dozens of models have failed when used in real-life scenarios. This shouldn't surprise anyone: chemistry is full of complex problems.

The models and techniques that have proven useful in other fields tend to fail when confronted with cheminformatics, so it seems reasonable to ask two important questions: why is that, and how can we fix it?

Esther Heid and her colleagues try to answer these precise questions in their recent paper "Characterizing Uncertainty in Machine Learning for Chemistry". They analyzed dozens of successful and failed ML attempts in chemistry and performed a series of well-thought-out experiments to expose the sources of error in ML in the chemical domain.

The authors used both artificial and real-life data to train models with a range of architectures and featurization techniques. In their analysis, they separated error sources into three categories:

  1. noise – originating from input data,
  2. bias – originating from architecture and featurization,  
  3. variance – originating from uncertainty of model optimization. 

The article gives hints on how to estimate the contribution of each type of error and which strategies can be applied to reduce it. It is a must-read for anyone working with ML in this field, whether you are just starting out or a hardened veteran.
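The noise/bias/variance split above can be illustrated with a toy experiment (this is my own minimal sketch, not the authors' code): fit a deliberately too-simple model many times on freshly noised data, then compare the squared bias (mismatch between the average prediction and the truth), the variance (spread across retrainings), and the known irreducible noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ground truth with a known, irreducible noise level.
def true_fn(x):
    return np.sin(x)

NOISE_STD = 0.1          # "noise" error: originates from the input data
DEGREE = 1               # misspecified (linear) model -> nonzero "bias"
N_MODELS = 200           # retrainings -> estimate of "variance"

x_train = np.linspace(0.0, 3.0, 50)
x_test = np.linspace(0.0, 3.0, 20)

# Retrain the same architecture on independently noised copies of the data.
preds = []
for _ in range(N_MODELS):
    y_noisy = true_fn(x_train) + rng.normal(0.0, NOISE_STD, x_train.size)
    coefs = np.polyfit(x_train, y_noisy, DEGREE)
    preds.append(np.polyval(coefs, x_test))
preds = np.asarray(preds)                     # shape: (N_MODELS, n_test)

mean_pred = preds.mean(axis=0)
bias_sq = np.mean((mean_pred - true_fn(x_test)) ** 2)  # bias^2 term
variance = np.mean(preds.var(axis=0))                  # variance term
noise = NOISE_STD ** 2                                 # irreducible term

# Classic decomposition: expected test MSE ~ bias^2 + variance + noise
expected_mse = bias_sq + variance + noise
```

With a linear model on a sinusoidal target, bias dominates; switching `DEGREE` to a high value (or shrinking the training set) shifts the budget toward variance, which mirrors the paper's point that architecture/featurization and optimization uncertainty contribute differently.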

What are your tips for reducing error? Please let me know in the comments!

#machinelearning #ml #cheminformatics #cadd #artificialintelligence #ai #sigmaitpoland


Great share. Cells synthesize proteins as linear chains that fold into unique three-dimensional structures, which is crucial for proper cellular function; misfolding or lack of folding can lead to disease. For over 50 years, predicting proper protein folds has been challenging, but computational and structural biologists have made strides and determined folds for 160,000 proteins. However, folds for 4,800 important proteins remained unresolved until Google's DeepMind introduced AlphaFold and AlphaFold2. Trained on vast experimental data, AlphaFold2 successfully identified the folds for most of the unknown proteins, earning its creators a $3 million prize in 2022. RoseTTAFold, another deep learning model built on similar principles, predicted folds for an additional 56 previously unknown proteins. Although AlphaFold2 and RoseTTAFold differ in their approaches and accuracy, they represent significant advancements in understanding protein folding. Indeed, open-source availability is encouraging further research and improvement in these transformative techniques. More about this topic: https://lnkd.in/gPjFMgy7


My view is that many (most?) global models for drug discovery are actually ensembles of local models. This means that you may be extrapolating when you think you're interpolating. This article may be relevant: https://doi.org/10.1021/ci0342472

Karol Bubała

Java developer by heart, python developer by necessity, cheminformatician by accident

1y

Great thanks to the authors of the paper: Esther Heid, Charles McGill, Florence Vermeire and William Green

