DUD-E, Where’s My Accuracy?
Bias in the widely used DUD-E dataset explains much or all of the apparent performance of some convolutional neural networks in predicting drug-target interactions
In computational drug discovery, one of the most important objectives is modelling the interactions of a small-molecule drug with a target protein. Computations from first principles, i.e. with knowledge of the molecular structure but without input from measured drug-target interactions (DTIs), have been the mainstay of computational drug discovery since the seventies (see link). For a variety of reasons, the accuracy of these methods, which range in complexity from molecular docking to molecular mechanics/quantum mechanics (MM/QM) simulation, has remained limited, and efforts to improve that accuracy have yielded diminishing returns in recent years. At Cyclica, we look at interactions of a drug not just with one or a few proteins, but with essentially all proteins in the human body (see link). This makes it especially important for DTI predictions to be fast and accurate, because compute time and the number of false positives both grow with the number of proteins examined.
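To see why false positives become the dominant concern at proteome scale, here is a toy back-of-the-envelope calculation (all numbers are illustrative assumptions, not measurements from any particular method):

```python
# Toy illustration (hypothetical numbers): why screening one molecule
# against the whole proteome demands a very low false-positive rate.
def expected_false_positives(n_proteins, true_hits, fpr):
    """Expected number of spurious hits when screening one drug against
    n_proteins, of which true_hits actually bind."""
    return fpr * (n_proteins - true_hits)

# One molecule screened against ~20,000 human proteins with ~10 true targets:
for fpr in (0.05, 0.01, 0.001):
    fp = expected_false_positives(20_000, 10, fpr)
    print(f"FPR {fpr:.3f}: ~{fp:.0f} false positives vs 10 true hits")
```

Even a 1% false-positive rate, respectable in a single-target setting, would bury the handful of true hits under a couple of hundred spurious ones.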
With the rapid growth of available protein structure data in the PDB now approaching proteome-wide coverage (Somody et al, 2017, a Cyclica paper), and with millions of experimentally known DTIs in databases like ChEMBL, PubChem, and STITCH, there has been intense interest in recent years in applying deep learning models to the prediction of DTIs.
One such approach is the use of convolutional neural networks trained on atomistic 3d structural models of the drug-target complex (3d-CNN). This method was first proposed in a manuscript that has yet to undergo peer review (Wallach et al, 2015, arxiv.org). An open-source implementation of this methodology, named Gnina, was later published and discussed in an excellent and thorough paper by David Koes' group at the University of Pittsburgh (Ragoza et al, 2017).
One major shortcoming of all 3d-CNN methods is the need for high-quality 3d structural data on the protein/ligand complex. The number of protein/ligand complexes with known 3d structure is in the tens of thousands (16,151 as counted in 2018, here). This may seem like a lot, but it is small for a convolutional deep learning approach, and it is dwarfed by the tens of millions of DTI activity data points in ChEMBL (15,504,603 at the time of writing, as counted here). The DTI data, however, comes without a 3d structural representation of the binding complex and cannot by itself be used to train 3d-CNN models. Given the large number of parameters optimized in a 3d-CNN, it is thus clear that the available 3d structural data is insufficient to train accurate models that generalize properly beyond the data they are given.
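The mismatch between model capacity and training data can be made concrete with a back-of-the-envelope parameter count for a small, entirely hypothetical 3d-CNN (the layer sizes below are our own assumptions, not those of any published architecture):

```python
# Hypothetical small 3d-CNN: count trainable parameters and compare with
# the ~16,151 protein/ligand complexes available for training.
def conv3d_params(in_ch, out_ch, k):
    # weights: k^3 per (input channel, output channel) pair, plus one bias
    # per output channel
    return in_ch * out_ch * k**3 + out_ch

layers = [(8, 32, 3), (32, 64, 3), (64, 128, 3)]  # (in_ch, out_ch, kernel)
total = sum(conv3d_params(*layer) for layer in layers)
total += 128 * 4**3 * 256 + 256   # dense layer on a 4x4x4x128 feature map
total += 256 * 1 + 1              # single output neuron

print(f"~{total:,} parameters vs ~16,151 training complexes")
```

Even this modest architecture has on the order of a hundred parameters per available training complex, which is why overfitting, and the learning of dataset artifacts, is such a live concern.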
However, lack of training data is not the worst problem plaguing current efforts on 3d-CNN models. In a recent, well-written and thorough manuscript that has yet to undergo peer review, Tom Kurtzman's group at CUNY, in collaboration with David Koes of UPitt, who is a co-author, goes much further in its analysis of 3d-CNN methods (Chen et al., "Hidden Bias in the DUD-E Dataset Leads to Misleading Performance of Deep Learning in Structure-Based Virtual Screening", 2019, chemrxiv.org). They focus on methods trained on the widely used DUD-E dataset, such as the one described by Wallach et al., above. First, they observe the above-mentioned lack of data:
However, the dearth of reliable protein-ligand x-ray structures and binding affinity data has required the use of constructed datasets for the training and evaluation of CNN molecular recognition models.
Then, they focus specifically on the DUD-E dataset:
Here, we outline various sources of bias in one such widely-used dataset, the Directory of Useful Decoys: Enhanced (DUD-E). We have constructed and performed tests to investigate whether CNN models developed using DUD-E are properly learning the underlying physics of molecular recognition, as intended, or are instead learning biases inherent in the dataset itself.
Their rather devastating conclusion is:
We find that superior enrichment efficiency in CNN models can be attributed to the analogue and decoy bias hidden in the DUD-E dataset rather than successful generalization of the pattern of protein-ligand interactions.
To recapitulate these findings: the UPitt group developed and published the Gnina 3d-CNN method in 2017 and initially found it more accurate than docking. Not content with the reservations already raised in that publication regarding lack of data and insufficient generalization, the group at CUNY proceeded to investigate further doubts about the use of the DUD-E dataset for training and testing 3d-CNN models. After very thorough analysis, they concluded that the apparent gains in accuracy are artifacts produced by hidden bias in the dataset. The wording of their conclusion quoted above indicates that they believe the entirety of the apparent superior enrichment efficiency of DUD-E-trained CNN models is artificial.
They demonstrate this in two ways. First, they show that when the 3d-CNN method is applied to more carefully selected datasets that do not suffer from the identified bias, no increase in accuracy over docking is observed. Second, they show that the DUD-E-based 3d-CNN method retains its apparent high accuracy even when the protein structure is replaced by a single dummy atom, i.e. when the protein is not accounted for in the protein/ligand interaction at all. The latter result shows rather conclusively that the bias hidden in the DUD-E dataset fully accounts for the observed performance, and that such methods will not, in a real setting, provide the predictive accuracy that has been claimed. This is alarming, as companies have been built on the premise that CNN approaches trained on DUD-E data are superior to docking.
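The logic of the dummy-atom result can be illustrated with a tiny synthetic sketch (all numbers invented): if actives and decoys differ systematically in some ligand-only property, a model can separate them without ever looking at the protein.

```python
# Synthetic sketch of decoy bias: "actives" and "decoys" are drawn from
# slightly different distributions of a ligand-only descriptor (think
# molecular weight). A score built from that descriptor alone, with no
# protein information whatsoever, already ranks actives above decoys.
import random

random.seed(0)
actives = [random.gauss(400, 50) for _ in range(500)]
decoys  = [random.gauss(350, 50) for _ in range(500)]

def auc(pos, neg):
    """AUC = probability that a random positive scores above a random
    negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(f"ligand-only AUC: {auc(actives, decoys):.2f}")  # well above 0.5
```

In a well-constructed benchmark this ligand-only AUC would be close to 0.5; anything substantially higher means the negatives are distinguishable without modelling the interaction at all, which is exactly what the dummy-atom experiment exposed.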
None of this means that the DUD-E dataset is useless. It has arguably proven useful in evaluating molecular docking, where a first-principles approach presumably would not be correlated with the decoy and analogue bias, so performance analysis can remain adequate. The problems observed and described by the CUNY group become truly relevant when deep learning models are trained on the DUD-E dataset. The tendency of machine learning models to "cheat", i.e. to pick up on biases in the training data and learn unintended relations detrimental to their purpose, is well known. If, in addition, the same biased dataset is also used to measure the accuracy of the model, the result can be wildly exaggerated claims of performance for models that in fact may have no value at all.
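This train-and-evaluate-on-the-same-bias failure mode can be shown with a minimal synthetic example (all data invented): a model that has learned nothing but a spurious correlation looks perfect on the biased benchmark and collapses on an unbiased one.

```python
# Synthetic illustration: a biased benchmark rewards a model that learned
# the bias; an unbiased test set exposes it as worthless.
import random

random.seed(2)

def make_set(n, biased):
    """Label 1 = active. In the biased set a spurious feature tracks the
    label exactly; in the unbiased set it is independent of the label."""
    data = []
    for _ in range(n):
        label = random.randint(0, 1)
        feature = label if biased else random.randint(0, 1)
        data.append((feature, label))
    return data

def accuracy(model, data):
    return sum(model(f) == y for f, y in data) / len(data)

cheater = lambda feature: feature  # predicts purely from the spurious feature
print("biased benchmark:", accuracy(cheater, make_set(1000, biased=True)))
print("unbiased test:   ", accuracy(cheater, make_set(1000, biased=False)))
```

The "cheater" scores 100% on the biased benchmark and roughly chance on the unbiased one, which is the shape of the discrepancy Chen et al. report for DUD-E-trained CNNs.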
Note that we are also not concluding that all 3d-CNN methods are flawed. With more careful selection of training data and unbiased construction of negatives, it is possible that useful 3d-CNN models can be created, and Chen et al. describe steps in that direction. The dearth of readily available structural data on 3d complexes could possibly be overcome by advanced deep learning techniques designed to reduce data requirements, or by a concerted effort in the structural biology community to increase the amount of such data.
At Cyclica, we are passionate about the opportunity to address areas where molecular docking has fallen short. To date, we have steered clear of 3d-CNN models, especially those trained on the DUD-E dataset. Our MatchMaker deep learning technology, announced in January 2019 (see link), is based on a feature representation of inferred binding pockets that does not require precise 3d structures of protein/ligand complexes, allowing a training set of millions of DTIs to be used. Furthermore, we randomly select negative examples from the exact same distribution as the positives, eliminating the possibility of decoy bias in both training and cross-validation (see link). Lastly, we have field-tested MatchMaker with clients and validated its performance in proteome screening against experimental methods (see link). MatchMaker and its validation are described in a manuscript that we are working to submit for peer review in the coming months.
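One generic way to draw negatives from the same distribution as the positives is to re-pair observed ligands with other proteins, so that ligand-only properties cannot separate the classes. The sketch below is a minimal illustration of that general idea, with invented identifiers; it is not MatchMaker's actual pipeline.

```python
# Minimal sketch (hypothetical data, not Cyclica's actual pipeline):
# build negatives by re-pairing the observed ligands with random other
# proteins, excluding known true interactions. Every ligand appears in
# both classes, so no ligand-only property can distinguish them.
import random

random.seed(1)
positives = [("prot_A", "lig_1"), ("prot_B", "lig_2"),
             ("prot_C", "lig_3"), ("prot_D", "lig_4")]
known = set(positives)
proteins = [p for p, _ in positives]

negatives = []
for _, lig in positives:                  # reuse each positive ligand as-is
    while True:
        pair = (random.choice(proteins), lig)
        if pair not in known:             # avoid known true interactions
            negatives.append(pair)
            break

print(negatives)
```

Because the negative ligands are literally the positive ligands, any signal the model finds must come from the protein-ligand pairing rather than from a biased choice of decoy molecules.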
Overall, we believe that AI innovation is not strictly bound to algorithm development and model building; it also requires creative and relevant solutions for data representation, evaluation, and application. With MatchMaker at hand, we were able to replace our reliance on conventional molecular docking in our flagship proteome screening platform, Ligand Express (see link). MatchMaker also plays a critical role in our newly launched Ligand Design technology for multi-objective drug design (see link). Taken together, Ligand Design and Ligand Express, our first-generation off-target profiling platform, offer a unique end-to-end, AI-augmented drug discovery platform for designing advanced lead-like molecules while minimizing off-target effects.
Naheed Kurji, President and CEO
Andreas Windemuth, Chief Science Officer
With thanks to our awesome team!