Ensemble Models for Better Predictions

January 22nd, 2010 by Dana Honeycutt, Ph.D.

One approach to building a predictive model is to choose a powerful technique such as a neural network (NN) or support vector machine (SVM) algorithm and then tune the model-building parameters to maximize the predictive performance. Over the past 15 years or so, an increasingly popular alternative has been to combine the predictions of multiple different techniques into a consensus or ensemble model, without necessarily optimizing each individual model within the ensemble. This is the approach that won the million-dollar Netflix Prize last year, as well as the zero-dollar challenge from the November 2009 Pipeline Pilot newsletter. I’ll be talking about the latter; for details on the Netflix Prize solution, go here.

In brief, the Pipeline Pilot Challenge was to find the model-building technique that gives the best ROC score for a particular classification problem. When we formulated the problem, we figured people would apply the various learner components in Pipeline Pilot and probably come up with a solution involving an SVM, Bayesian, or recursive partitioning (RP) model.

But winner Lee Herman took a clever alternative approach. He built four different models using four dissimilar techniques: Bayesian, RP (a multi-tree forest), mixture discriminant analysis, and SVM. For making predictions on the test set, he summed the predictions from each of the models to get a composite score. This ensemble model gave a better ROC score than any of the individual models contributing to it. For details, see Lee’s protocol on the Pipeline Pilot forum (registration is free).
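The summing step can be sketched in a few lines of Python. This is only an illustration, not Lee’s protocol: the post says the predictions were summed, and the min-max rescaling here is an added assumption so that models with different output ranges contribute comparably. The function names and the score values are hypothetical.

```python
def min_max_scale(scores):
    """Rescale a list of scores to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # A constant model carries no ranking information.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def composite_scores(per_model_scores):
    """Sum the rescaled scores of several models, sample by sample."""
    scaled = [min_max_scale(scores) for scores in per_model_scores]
    return [sum(sample) for sample in zip(*scaled)]


# Hypothetical scores from three models for the same four test samples.
bayesian = [2.0, -1.0, 0.5, -3.0]
forest = [0.9, 0.2, 0.6, 0.1]
svm = [1.5, -0.5, -0.2, -2.0]

combined = composite_scores([bayesian, forest, svm])

# Rank the samples from most to least likely to be in the positive class.
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
print(ranking)
```

Because a ROC score depends only on how the samples are ranked, the composite score needs no calibration of its own; it only has to order the test samples well.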

Why does this work? In essence, each type of model captures some aspect of the relationship between the descriptors and what we wish to predict, while having its own distinct errors and biases. To the extent that the errors are uncorrelated between models, they tend to cancel rather than reinforce each other. Thus the accuracy of the whole can become greater than the greatest accuracy of any of its parts. It’s as if many wrongs can make a right.
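A toy simulation makes the cancellation argument concrete. It is a sketch under idealized assumptions that real models will not meet exactly: each simulated “model” outputs the true class plus its own independent Gaussian noise, so the errors are fully uncorrelated. The data are synthetic, and the ROC score is computed as the fraction of positive/negative pairs ranked in the right order.

```python
import random


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


rng = random.Random(0)  # fixed seed so the run is repeatable
labels = [i % 2 for i in range(1000)]  # synthetic, balanced classes

# Four simulated models: the true class plus independent noise for each model.
models = [[y + rng.gauss(0.0, 1.0) for y in labels] for _ in range(4)]

# The ensemble score is the plain sum of the four model scores.
ensemble = [sum(sample) for sample in zip(*models)]

individual_aucs = [roc_auc(labels, m) for m in models]
ensemble_auc = roc_auc(labels, ensemble)
print([round(a, 3) for a in individual_aucs], round(ensemble_auc, 3))
```

Summing four scores quadruples the signal but only doubles the standard deviation of the noise, which is why the summed score separates the classes better than any single one. If the models shared the same noise, the sum would gain nothing.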

Good Models Require Good Data

October 1st, 2009 by Dana Honeycutt, Ph.D.

In my last posting, I touted ROC analysis as one of the best ways to evaluate and compare different methods for building classification models. To do a true apples-to-apples comparison, it also helps to have a good reference data set. In this regard, Katja Hansen et al. have done data modelers a favor by publishing a “Benchmark Data Set for in Silico Prediction of Ames Mutagenicity.” Not only did they vet and make available the data, but they also provide data splits for cross-validation to help modelers ensure that their method comparisons have a common basis.

The authors compare several techniques, including the Bayesian classifier in Pipeline Pilot. Data junkie that I am, I couldn’t resist throwing the Ames data at this and a few other Pipeline Pilot learners. Here are the results I got using the ECFP_4 molecular fingerprint as the descriptor:

Method      ROC Score
Bayesian    0.82
RP Tree     0.78
RP Forest   0.82
R SVM       0.72
kNN         0.84

These results show a few things. The best ROC scores in the table are comparable to those reported by Hansen et al. for various classifiers that they investigated. (The best score they obtained was 0.86 for an SVM model.) The results are consistent with the widely known tendency of forest models to give better predictive performance than single-tree models. Finally, they support the view that molecular fingerprints are good descriptors for building classification models.

If you want more of the statistical details, I provide them in a posting on the Pipeline Pilot Forum at the Accelrys Community site. (Registration is free.)

Let’s ROC

September 14th, 2009 by Dana Honeycutt, Ph.D.

In the field of machine learning, a binary classification model is a statistical method for assigning an object to one of two categories (classes): benign vs. malignant, active vs. inactive, crystalline vs. noncrystalline, and so on. We build these models to reduce the number of experiments we need to run or to reduce the human labor required to evaluate experimental data (such as image data). The models are rarely perfect—meaning that they generally assign at least some objects to the wrong category.

In evaluating model quality, the number of metrics we can look at is vast, with names such as accuracy, precision, specificity, sensitivity (a.k.a. recall), positive predictive value, negative predictive value, Cohen’s kappa, F-measure, and more. But if you asked me to judge a binary classifier’s predictive power based on a single number, that number would be the ROC area-under-the-curve (AUC) score on a test data set.

The ROC AUC score comes from a ROC plot, which is simply a plot of the true positive rate (sensitivity) against the false positive rate (1 − specificity). We generate the points on the plot by varying the cutoff value we apply to the model’s output to distinguish between the predicted classes. (Note that most so-called classification models are at root ranking models, which output a numerical score corresponding to the relative likelihood of the object being in one class versus the other.) Here’s a typical ROC plot:

[Figure: ROC plot for Pipeline Pilot Bayesian model of NCI AIDS data]

Each point on the plot tells you this: “For a true positive rate given by the Y axis value, the X axis value is the price you must pay in false positives.” It is then up to you to decide what the best tradeoff is and to set the cutoff accordingly. Or you may decide that none of the points on the curve give you the combination of sensitivity and specificity you need, and that you need a better model.
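To make the cutoff sweep concrete, here is a minimal Python sketch that produces the points of a ROC plot from a list of model scores. The labels and scores are made up for illustration, and `roc_points` is a hypothetical helper, not a Pipeline Pilot component. A sample is predicted positive when its score is at or above the cutoff.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per cutoff."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # A cutoff above every score predicts nothing as positive.
    points = [(0.0, 0.0)]
    # Lower the cutoff step by step through each distinct score value.
    for cutoff in sorted(set(scores), reverse=True):
        predicted = [s >= cutoff for s in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if p and y == 1)
        fp = sum(1 for y, p in zip(labels, predicted) if p and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Made-up test set: 1 = positive class, 0 = negative class.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

for fpr, tpr in roc_points(labels, scores):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Reading down the output shows the tradeoff directly: each time the cutoff drops past a negative sample the false positive rate steps up, and each time it drops past a positive sample the true positive rate steps up.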

As you might infer from the name, the ROC AUC score is just the area under the ROC curve. In practice it ranges in value from 0.5 for a model that’s no better than random guessing to 1.0 for a perfect model; a score below 0.5 is possible, but it means the model ranks objects worse than chance. Unlike the other metrics I mentioned above, the ROC score is independent of any specific cutoff value. Because of this, its value is an intrinsic property of the model (for a given test set). It does not depend on any preference we might have for, say, reducing the number of false negatives at the price of more false positives. It gives us a single value that we can use to easily compare the performance of different classification methods or to tune the performance of a given method.
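The contrast between a cutoff-free score and a cutoff-dependent one can be shown with a small Python sketch. The data are made up, and the function names are hypothetical. The area is computed with the trapezoid rule over all cutoffs, while accuracy is computed at one chosen cutoff.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, by the trapezoid rule over all cutoffs."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    area, prev_fpr, prev_tpr = 0.0, 0.0, 0.0
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        fpr, tpr = fp / n_neg, tp / n_pos
        # Add the trapezoid between the previous ROC point and this one.
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_fpr, prev_tpr = fpr, tpr
    return area


def accuracy(labels, scores, cutoff):
    """Fraction of samples classified correctly at one fixed cutoff."""
    hits = sum(1 for y, s in zip(labels, scores) if (s >= cutoff) == (y == 1))
    return hits / len(labels)


# Made-up test set: 1 = positive class, 0 = negative class.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print("AUC:", roc_auc(labels, scores))
for cutoff in (0.3, 0.5, 0.75):
    print("cutoff", cutoff, "accuracy", accuracy(labels, scores, cutoff))
```

The AUC is a single number for the whole score list, whereas the accuracy changes as the cutoff moves. Any transformation of the scores that preserves their order, such as doubling them, leaves the AUC unchanged.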
