An Ecosystem for Machine Learning in Functional Genomics — Kipoi Model Zoo
One example of this concept in action is the Kipoi Model Zoo, developed in collaboration between groups at the European Bioinformatics Institute, the Technical University of Munich, and Stanford University. Kipoi provides an environment where researchers can upload their machine learning models, with a specific focus on models of molecular phenotypes and functional genomics. The model zoo relies heavily on data from ENCODE as a base. Its models allow researchers to make a wide variety of predictions, for example, whether a variant in a dataset disrupts a transcription factor binding site, or what the chromatin state of a region is likely to be.
Sites generating datasets similar to ENCODE (such as new ChIP-Seq, Hi-C, or ATAC-Seq datasets in different disease states, cell types, or organisms) will be able to use these data to train new models in the same way. If these models are contributed back to Kipoi or a similar framework, they will expand the set of methods that can quickly be applied to evaluate new data. At the same time, any researcher will be able to develop new models and methods for the data already available, ensuring that the available tools continually expand. Working with the authors of Kipoi, we built an execution environment that lets its models run on any data within the infrastructure platform. Because the data and models live on a single platform, this cycle can be iterated quickly. In addition, these models can be applied at scale to the increasingly large datasets that live on the platform.
The Power of Combining Data and Models
To demonstrate the power of having data and models on a single platform, we combined the models in Kipoi Model Zoo with some of the extensive cancer genome data on the St. Jude Pediatric Cancer Cloud. Developed by St. Jude Children’s Research Hospital in conjunction with strategic partners, St. Jude Cloud is the world’s largest public repository of pediatric cancer genomics data and offers unique analysis tools and visualizations in a secure cloud-based environment.
Starting from a set of somatic variants called across 700 tumor-normal pairs in multiple cancer types, the more than 500 DeepBind transcription factor models available in Kipoi were applied to predict whether variants disrupt transcription factor binding sites. This method revealed a number of transcription factors that were more or less likely to be disrupted by somatic variants than by matched control positions near the variants. Not only is this discovery scientifically interesting, but the large-scale analysis also completed in less than a day, because both the data and the models are in a single scalable computing environment. Having data in standard formats, models that can read these formats, and an infrastructure that allows their rapid combination can dramatically reduce time to insight.
Furthermore, if these results motivate follow-up experiments to confirm the predictions, those experiments may produce new RNA-Seq and ChIP-Seq data that can be used to further improve the models, leading to the next iteration of the data-model-insight cycle. With new and diverse data types available, model trainers with sufficient domain expertise can rapidly improve deep learning and machine learning models by retraining them on new, biologically relevant data.
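The comparison described above, scoring each variant with a binding model and contrasting the result with matched control positions, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_binding_score` is a simple motif-matching stand-in rather than a DeepBind model or any part of the Kipoi API, and the function names, the example motif, and the disruption threshold are hypothetical.

```python
from typing import Callable, List, Tuple

# A site is (reference sequence, 0-based offset of the variant, alternate base).
Site = Tuple[str, int, str]


def toy_binding_score(seq: str, motif: str = "GATA") -> float:
    """Toy stand-in for a trained binding model (NOT DeepBind).

    Returns the best fraction of motif positions matched by any window of seq.
    """
    best = 0
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        best = max(best, sum(a == b for a, b in zip(window, motif)))
    return best / len(motif)


def apply_snv(seq: str, offset: int, alt: str) -> str:
    """Return seq with the single base at offset replaced by alt."""
    return seq[:offset] + alt + seq[offset + 1:]


def disruption_fraction(
    sites: List[Site],
    score: Callable[[str], float],
    threshold: float = 0.25,  # hypothetical cutoff for calling a site disrupted
) -> float:
    """Fraction of sites whose score drops by at least threshold (alt vs. ref)."""
    if not sites:
        raise ValueError("sites must not be empty")
    disrupted = 0
    for ref_seq, offset, alt in sites:
        delta = score(apply_snv(ref_seq, offset, alt)) - score(ref_seq)
        if delta <= -threshold:
            disrupted += 1
    return disrupted / len(sites)


# Invented example data: variants of interest and matched control positions.
variants: List[Site] = [("CCGATACC", 3, "C"), ("TTTTTTTT", 2, "A")]
controls: List[Site] = [("CCGATACC", 0, "A"), ("TTTTTTTT", 2, "A")]

variant_rate = disruption_fraction(variants, toy_binding_score)
control_rate = disruption_fraction(controls, toy_binding_score)
print(f"variants disrupted: {variant_rate:.2f}, controls disrupted: {control_rate:.2f}")
```

In a real analysis the toy scorer would be replaced by each trained transcription factor model in turn, and the difference between the variant and control rates would be assessed with an appropriate statistical test rather than compared by eye.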
Unlocking the Importance of Domain Expertise — Collaborations to Improve DeepVariant
Deep learning methods are powerful because they can discover features within the data that a human would not be able to consider, either because those features are impossible to enumerate, don’t seem significant, or can’t be easily identified. To do this, however, the deep learning methods must see data that captures as much of the problem's diversity as possible. Model trainers therefore need sufficient domain expertise to identify the diverse data types that are relevant to their field.
Collaborations with Google Brain around DeepVariant have demonstrated how domain expertise drives improvement to deep learning models. Google developed DeepVariant as a deep learning-based variant caller, which performed with the highest overall accuracy in the PrecisionFDA Truth Challenge, beating out other industry-recognized methods such as GATK. After winning the competition, Google wanted to see if DeepVariant could be further assessed and improved by identifying diverse datasets. It was subsequently found that DeepVariant and most other variant callers perform poorly on sequencing preparations that use PCR. DeepVariant had originally been trained only on PCR-free sequencing preparations.
One powerful principle of deep learning is that once a realization like this is made, the method can be improved without needing to write a single additional line of code. Simply by being given training genomes that had been prepared with PCR, the DeepVariant model is able to learn the error profiles characteristic of PCR-prepared data.
The key to driving improvements in deep learning models is domain knowledge about genomics. An expert had to recognize that data representing PCR library preparation might not have been captured in the training datasets considered, and that adding PCR data to the training sets could give deep learning models new insight into the problem. Deep learning approaches may be able to learn quickly from the data, but an expert is needed to decide what should go into their curriculum. Rather than replacing domain experts as a black box application, deep learning has the potential to become a powerful tool for domain experts to drive further insights. As more data is generated, collaboration between domain experts and machine learning experts will be paramount.
From statistical inference to deep learning, artificial intelligence is a powerful set of tools for scientists to use to drive discovery. However, artificial intelligence is no magic bullet: it is reliant on the data that is available and, as more data becomes available for the development of new methods, domain expertise will be critical for knowing which data and methods to use for the best results. In turn, domain experts turned model trainers will be able to rapidly improve methods by supplying larger and more diverse data, creating a self-driving loop between models and data to generate insight. Fully harnessing that power will require establishing an environment where data, infrastructure, and expertise can be combined with as little friction as possible. Such a framework can significantly accelerate both method development and scientific discovery, as domain experts and machine learning experts collaborate on driving the data-model-insight loop.