Machine Learning – “What if” it enabled exploratory analysis in High Content Screening

January 22nd, 2010 by Tim Moran

A roundtable discussion took place near the close of this year’s HCA meeting in San Francisco. The topics of Data Analysis and Management, Image Analysis, and Computational Biology were folded into a single discussion. This roundtable was facilitated by Karel Kozak. Participants included:

Karel Kozak (Swiss Fed. Institute Of Technology)

Lisa Smith (Merck)

Peter Horvath (Swiss Fed. Institute Of Technology)

Achim Kirsch (PE/Evotec)

Ghislain Bonamy (Novartis GNF)

Abhay Kini (GE Healthcare)

Jonathan Sexton (North Carolina Central University)

Mark Bray (Broad Institute)

Chris Wood (Stowers Institute for Medical Research)

Pierre Turpin (Molecular Devices)

Mark Collins (ThermoFisher/Cellomics)

The opening shot from “Schmerck” (Lisa Smith of Schering, now Merck) was fired at the vendors. The bullet in question? “Why have tools for pattern recognition and machine learning on image data not been addressed more rapidly in vendor systems?” Vendors replied with their own question: “Why is this a better approach than algorithmic quantification of a known endpoint?” The upshot of the ensuing discussion was that end users want to extract additional information from their data beyond what the designed analysis algorithm derives, i.e., look for natural classes in the data, spot outliers, correlate to the chemical structure of test compounds, etc. This does not necessarily have to be correlated to known biological endpoints; it can be purely exploratory. Vendors said, “That’s why we need companies like Accelrys and products like Pipeline Pilot.” The marketplace needs a third-party environment that provides turnkey or almost-turnkey access to the data, and an exploratory environment like Pipeline Pilot (PLP) in which users can develop methods to ask “what-if” questions of their data. When users clearly demonstrate that these techniques have merit, they will find their way into the instrument vendors’ products.

One other aspect of the above discussion that became apparent is that many, if not most, HCS users have no idea what the difference is between PCA, classification, support vector machines, genetic algorithms, self-organizing maps, etc., let alone where or when to apply these methods. What they want, and need, is a kind of wizard that walks them through the process of determining what they want to learn from their data and then internally selects the best method to do that. An analogy was drawn to curve-fitting programs that apply hundreds or thousands of models to a data set and tell the user which ones produced the best fit. This idea of “opening up to the wider science community methods previously available only to discipline experts”, specifically in computational biology, is by no means in its infancy (see The Future of Computational Science, Scientific Computing World: May / June 2004).
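The curve-fitting analogy can be made concrete with a small sketch. The Python below is an illustration only, not something shown at the roundtable: it fits three candidate models (constant, straight line, and exponential, all chosen here just for the example) to one data set and ranks them by leave-one-out error. The function names and the toy data are assumptions.

```python
import math

def fit_constant(xs, ys):
    # Baseline model: always predict the mean of the training responses.
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_linear(xs, ys):
    # Ordinary least squares for y = intercept + slope * x.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_exponential(xs, ys):
    # Fit log(y) = a + b * x with least squares; requires every y > 0.
    log_line = fit_linear(xs, [math.log(y) for y in ys])
    return lambda x: math.exp(log_line(x))

def loo_error(fit, xs, ys):
    # Leave-one-out mean squared error: refit without point i, then predict it.
    total = 0.0
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (model(xs[i]) - ys[i]) ** 2
    return total / len(xs)

def rank_models(candidates, xs, ys):
    # Return candidate names ordered from lowest to highest held-out error.
    scored = [(loo_error(fit, xs, ys), name) for name, fit in candidates.items()]
    return [name for _, name in sorted(scored)]

CANDIDATES = {
    "constant": fit_constant,
    "linear": fit_linear,
    "exponential": fit_exponential,
}

# Toy data generated from an exponential curve (an assumption for the demo).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * math.exp(0.8 * x) for x in xs]

ranking = rank_models(CANDIDATES, xs, ys)
print(ranking)
```

Ranking by held-out error rather than by fit to the training data matters here: with hundreds of candidates, the most flexible one would otherwise always win.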

The momentum in machine vision – learning, clustering, modeling, predictive science and ease of use – was foreshadowed at the HCA East conference held in 2009 and will likely continue to be the area that enables researchers in High Content Screening and Analysis to make better-informed decisions earlier in the discovery process.

Special thanks to contributing author Kurt Scudder.


Ensemble Models for Better Predictions

January 22nd, 2010 by Dana Honeycutt, Ph.D.

One approach to building a predictive model is to choose a powerful technique such as a neural network (NN) or support vector machine (SVM) algorithm and then tune the model-building parameters to maximize the predictive performance. Over the past 15 years or so, an increasingly popular alternative has been to combine the predictions of multiple different techniques into a consensus or ensemble model, without necessarily optimizing each individual model within the ensemble. This is the approach that won the million-dollar Netflix Prize last year, as well as the zero-dollar challenge from the November 2009 Pipeline Pilot newsletter. I’ll be talking about the latter; for details on the Netflix Prize solution, go here.

In brief, the Pipeline Pilot Challenge was to find the model-building technique that gives the best ROC score for a particular classification problem. When we formulated the problem, we figured people would apply the various different learner components in Pipeline Pilot, and probably come up with a solution involving an SVM, Bayesian, or recursive partitioning (RP) model.

But winner Lee Herman took a clever alternative approach. He built four different models using four dissimilar techniques: Bayesian, RP (a multi-tree forest), mixture discriminant analysis, and SVM. For making predictions on the test set, he summed the predictions from each of the models to get a composite score. This ensemble model gave a better ROC score than any of the individual models contributing to it. For details, see Lee’s protocol on the Pipeline Pilot forum (registration is free).

Why does this work? In essence, each type of model captures some aspect of the relationship between the descriptors and what we wish to predict, while having its own distinct errors and biases. To the extent that the errors are uncorrelated between models, they cancel rather than reinforce each other. Thus the accuracy of the whole becomes greater than the greatest accuracy of any of its parts. It’s as if many wrongs can make a right.
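A toy simulation illustrates the cancellation argument. This is not Lee’s protocol and uses no Pipeline Pilot components: each “model” is a hypothetical stand-in that scores a sample as its true class label plus independent Gaussian noise, and the ensemble simply sums the four scores. The noise level, sample counts, and function names are all assumptions made for the sketch.

```python
import random

def roc_auc(scores, labels):
    # ROC score as the probability that a random positive outscores
    # a random negative, counting ties as half a win.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def noisy_model(labels, rng, noise_sd=1.0):
    # Hypothetical learner: the true label blurred by its own independent error.
    return [y + rng.gauss(0.0, noise_sd) for y in labels]

rng = random.Random(0)
labels = [1] * 300 + [0] * 300

# Four models whose errors are uncorrelated with one another.
model_scores = [noisy_model(labels, rng) for _ in range(4)]

# Composite score: sum the per-sample predictions across models.
ensemble_scores = [sum(column) for column in zip(*model_scores)]

individual_aucs = [roc_auc(s, labels) for s in model_scores]
ensemble_auc = roc_auc(ensemble_scores, labels)
print(individual_aucs, ensemble_auc)
```

Two caveats apply to real models. Their errors are never fully independent, so the gain is smaller than in this idealized setup, and raw scores from different learners may sit on different scales, in which case they would need to be normalized or converted to ranks before summing.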


DFT Redux

January 14th, 2010 by George Fitzgerald, PhD

I thought I’d start the year with an easy blog, simply following up on my earlier ramblings of 25 October 2009: DFT Goes (Even More) Mainstream. In that article I discussed the success of Density Functional Theory (DFT) and used the annual number of publications as a metric. The numbers show that publications grew by over 25% per annum, but the results for 2009 were naturally incomplete.

Happily the trend continued through 2009 for a total of 4621 DFT references in ACS Journals. Here are a few of my favorite publications, though not all are drawn from the ACS citations. Yes, of course, these use Accelrys DFT packages, but they are still pretty cool articles:

Let me and my readers know what you think are the most interesting DFT articles from 2009.

†Strictly speaking, this was not QSAR (Quantitative Structure-Activity Relationship), because they didn’t actually base predictions on the structure. I use the term here more generally to refer to relationships that predict complex properties, like catalytic activity, on the basis of simpler properties, like work function.


Mad about MAD

January 8th, 2010 by Dana Honeycutt, Ph.D.

Over the past year or so, I have spent a great deal of time working with model applicability domains (MAD). Here I explain some of the what and the why.

When we build a statistical model—whether with linear regression, Bayesian classification, recursive partitioning, or some other method—we want to ensure that the model is a good one. If the goal is to make predictions with the model, then “good” means “able to make accurate predictions.” We usually use cross-validation or test set validation to convince ourselves that a model is good in this sense.

But there’s more to it than this. Even a good model makes poor predictions for samples that are too different from the samples in the training data used to build the model. In other words, the training data define the model’s applicability domain.

For example, suppose we wish to model per-capita crawfish consumption as a function of several variables, including the distance from the Mississippi River. Suppose also that our training and test sets consist solely of Louisiana residents. Even if we find that the model has good predictive ability for the test set, we would not expect it to do a good job predicting crawfish consumption in, say, Oregon (though it might do an OK job for parts of Mississippi). In other words, locations in Oregon lie outside the MAD. (See map.)

This idea appears obvious, yet models in statistical software packages often lack the ability to automatically define their own MAD and flag predictions outside the MAD as questionable. (In linear regression models, confidence and prediction bands serve this role to some extent. The bands become wider as we move away from the center of the training data.) The onus is generally on the user of the model to ensure that it is applied correctly. When the person applying the model is the same one who built it, and is thus familiar with the training data and the model’s limitations, this is not too big a problem. But when the creator and user of a model are two different people separated in space or time, a model’s awareness of its own applicability domain can be critical to the proper use of the model.
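For the simple linear regression case mentioned in the parenthesis above, the widening can be read off the textbook prediction interval for a new observation at x_0. The notation (n training points with mean \bar{x}, residual standard error s, confidence level 1 - \alpha) is introduced here for illustration and is not taken from any particular software package:

```latex
\hat{y}_0 \;\pm\; t_{1-\alpha/2,\,n-2}\; s\,
\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}},
\qquad
s^2 = \frac{1}{n-2}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
```

The (x_0 - \bar{x})^2 term grows as the query point moves away from the center of the training data, which is why the band flares out there.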

In the life sciences, it appears that the need to take the MAD into account when making predictions was first recognized for QSAR models of toxicity such as TOPKAT. TOPKAT introduced the notion of the optimum prediction space (OPS) defined by the ranges of the training set descriptors in principal component space. But the OPS is just one of several MAD measures discussed in the literature (e.g., see here, here, here, and here).

To summarize some of my own recent work in this area: In various numerical experiments, I have reproduced the research results of others who found that the distance from a test sample to samples in the training set correlates well with the model prediction error. (“Distance” can be defined in several different ways, and a lengthy essay could be written on this subject alone. But I’ll spare you for now.) This gives us the potential to estimate MAD-dependent error bars even for learning methods that do not intrinsically support them. A few of the model-building (learner) components in Pipeline Pilot now support OPS and other MAD measures, and we’re working on adding more of these.
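One simple distance-based check can be sketched in a few lines of Python. This is an illustration of the general idea, not the Pipeline Pilot implementation and not the OPS method: it uses Euclidean distance to the nearest training sample and flags a test sample as outside the domain when that distance exceeds what is typical within the training set itself. The choice of Euclidean distance, the 95% quantile cutoff, and all names are assumptions.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two descriptor vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_distance(sample, training_set):
    # Distance from a sample to its closest training sample.
    return min(euclidean(sample, t) for t in training_set)

def distance_threshold(training_set, quantile=0.95):
    # For every training sample, measure the distance to its nearest
    # *other* training sample, then take a high quantile as the cutoff.
    dists = sorted(
        nearest_distance(t, training_set[:i] + training_set[i + 1:])
        for i, t in enumerate(training_set)
    )
    index = min(len(dists) - 1, int(quantile * len(dists)))
    return dists[index]

def in_domain(sample, training_set, threshold):
    # True when the sample is about as close to the training data
    # as the training samples are to each other.
    return nearest_distance(sample, training_set) <= threshold

# Toy two-descriptor training set (made up for the example).
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
threshold = distance_threshold(training)

print(in_domain((0.6, 0.4), training, threshold))  # close to the training data
print(in_domain((5.0, 5.0), training, threshold))  # far from the training data
```

In practice the descriptors would need to be scaled to comparable units before computing distances, and the nearest-neighbor distance itself (rather than a yes/no flag) could be used to widen or narrow an estimated error bar.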

I hope I have convinced you of the importance of paying attention to the applicability domain when making predictions with a model. I’ll have more to say on this in a future posting.


Put some “Pep” in your proteins

January 8th, 2010 by Accelrys Team

Join us at the annual CHI PepTalk 2010 Conference being held at the Hotel Del Coronado in San Diego, January 11-15, 2010. At booth #32, Accelrys Protein Engineering and Antibody Modeling experts will showcase new features and enhancements found in Accelrys Discovery Studio 2.5.5 that both enable and improve the modeling of antibody structure and function. The advanced computational technology supported by the Discovery Studio environment allows scientists to explore the antibody landscape in silico prior to costly experimental implementation, thus greatly reducing the time and expense involved in bringing such products to market, while increasing scientific productivity.

On Monday, January 11th at 3:35 pm, Dr. Shikha Varma-O’Brien of Accelrys will present “Modeling the 3-Dimensional Structures of Antibody and their Interaction Interface to Antigen.” She will discuss and demonstrate how Accelrys Discovery Studio not only contains the tools necessary to construct a modeling framework from antibodies, but also enables structure-based prediction of antibodies by physical properties, with the goal of uncovering novel antibody designs.


“Fueling” the Discovery of New and Alternative Materials

January 6th, 2010 by Accelrys Team

Materials Studio Webinar Series Part V: Exploring New Fuel Cell Materials

There is increasing pressure to deliver lighter, more efficient and less expensive materials more frequently and faster than ever before. Fortunately, the integration of Materials Studio applications such as CASTEP and the Pipeline Pilot platform opens a range of possibilities for the discovery of new materials.

The experts at Accelrys have developed a new framework that screens complex systems and properties across numerous materials and applications. This system is currently being applied to fuel cell catalysts to find alternatives to costly materials such as platinum. Dr. Jacob Gavartin and Dr. Gerhard Goldbeck-Wood will discuss this approach and its application in detail during next week’s webinar:

Exploring New Fuel Cell Materials: High Throughput Calculations and Data Analysis with Materials Studio 5.0 and Pipeline Pilot
January 13, 8am PST / 4pm GMT

Register today!
