Advances in machine learning and artificial intelligence (AI) are transforming insight across a variety of scientific fields and industries. In this article, we explore the power of these advances to change the way information and analysis are used in drug discovery and biotechnology. We discuss how data and models work together to accelerate insight, the developing infrastructure that can bring them together, and why domain experts are critical for enabling the application of artificial intelligence to scientific questions.
Generations of Artificial Intelligence: Deep Learning, Machine Learning and Statistics
AI is a term used so broadly in technical literature and popular culture that it is important for us to define what it means in the context of this article. We consider AI to be any automated system that performs tasks that would normally require human intelligence. This definition is intentionally broad, and is designed to communicate that AI is simply another tool that scientists can use to gain insights. We categorize AI approaches into three families: statistical inference, machine learning and deep learning.
The first AI approach, direct statistical inference, is already used widely across the field. Applying, for example, t-tests or regression analysis allows scientists to determine whether data compiled through a series of experiments can directly answer a critical scientific question.
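As a concrete illustration of direct statistical inference, the sketch below computes a two-sample t statistic (Welch's variant, which does not assume equal variances) using only the Python standard library. The function name `welch_t` and the assay readouts are hypothetical, invented purely for illustration; in practice a statistics package would also report a p-value.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each sample mean (sample variance / sample size).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical assay readouts, made up for this example.
control = [1.0, 1.2, 0.9, 1.1, 1.0]
treated = [1.6, 1.8, 1.5, 1.7, 1.9]

t_stat, dof = welch_t(treated, control)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof:.1f}")
```

A large absolute t value relative to the degrees of freedom suggests that the difference between the two group means is unlikely to be due to chance alone.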
Building on statistical inference, machine learning comprises methods such as logistic regression, Bayesian modeling, decision trees, random forests, and support vector machines. These methods use statistics to find relationships among features and labels in the data; the features are identified by scientists and can be detected or computed from the data. The power of machine learning is to assign a statistical weight, or measure of importance, to each feature in order to create a mathematical model that can be used to gain insights from new data. The ability of scientists to combine machine learning and domain expertise to identify useful features is essential to gaining insight.
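The idea of assigning a weight to each feature can be sketched with a minimal logistic regression fitted by gradient descent. This is an illustrative sketch, not a production implementation: the helper names (`fit_logistic`, `predict`), the learning rate, the epoch count, and the toy dataset are all assumptions made for this example. The first feature is constructed to separate the two classes and the second to carry no class signal.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def predict(weights, bias, x):
    """Return the modeled probability that sample x belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent; return (weights, bias)."""
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            err = predict(weights, bias, x) - y  # prediction error for one sample
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        # Move each parameter against the average gradient.
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


# Hypothetical data: [informative feature, uninformative feature].
rows = [[0.1, 0.5], [0.2, 0.1], [0.3, 0.9], [0.4, 0.3],
        [0.6, 0.3], [0.7, 0.9], [0.8, 0.1], [0.9, 0.5]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = fit_logistic(rows, labels)
print("learned weights:", [round(w, 2) for w in weights])
print("P(class 1 | [0.9, 0.5]) =", round(predict(weights, bias, [0.9, 0.5]), 3))
```

On this toy data the fitted weight on the first feature ends up much larger in magnitude than the weight on the second. Weights are only comparable in this way when the features are on similar scales, which is one reason feature preparation by domain experts matters.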
The third generation is deep learning, a powerful approach that applies machine learning methods on a massive scale. Deep learning methods capture meaning from subtle correlations that are difficult for humans to discover, and from complex higher-order interactions that are very difficult to model. Instead of putting emphasis on feature engineering, deep learning methods intrinsically identify the most informative features from the underlying data. As a result, having as much data as possible, drawn from the greatest diversity of real conditions and carrying high-quality labels, is paramount to making these methods work effectively.
All of these generations of AI remain important tools for discovery. Each has its own strengths and weaknesses, and each has domains in which it is most useful. It is more important to have a diverse toolkit for driving scientific discovery than to assume that a single method defines discovery. Adopting a flexible approach can pay significant dividends when it comes to maximizing the value of the data.
A Virtuous Cycle Between Model Trainers, Scientific Users and Data
Historically, informatics software has been separated into producers and consumers. On the producer side, academic and industry developers write tools, while on the consumer side, researchers apply these methods to their own data to drive insight. Two recent developments in AI suggest that a powerful network effect can be used to drive simultaneous development and use of AI methods.
The first development is the emergence of deep learning and machine learning frameworks (such as TensorFlow, Keras, and PyTorch) and infrastructures (platforms like DNAnexus) which allow scientists to tap into complex systems without needing to develop their own.
The second is the increasing use of large and diverse data to drive the quality of newly developed methods and, in turn, the application of the resulting models back to the data. Larger, more diverse, and higher quality datasets can be used to train better machine learning methods, which in turn can be used to generate insight.
One consequence of this is the intriguing possibility of a new type of developer-scientist: the model trainer. The model trainer uses domain knowledge to identify the best, most representative datasets that capture the full scope of a problem. Then, using the frameworks that already exist, they train models that can be applied to datasets. These scientists don’t need the engineering training required to write the frameworks themselves; instead, they can bring their expertise directly to bear on using the frameworks in the best ways for their data.
Another consequence is that methods can continually improve as data accumulates in the environment. The better the models, the more users are drawn to bring their own data in search of insights. This creates an opportunity to pull in that data, and the insights it generates, to train improved models. The models can also generate derived data and predictions that can further improve other methods. This positive feedback cycle suggests a future where methods can rapidly improve directly through their use and adoption by the community, rather than requiring long and costly software development cycles to release new and improved versions of the software.