Less-Naïve Bayesian Classification

February 10th, 2010 by Dana Honeycutt, Ph.D.

The Bayesian learner in Pipeline Pilot is a so-called naïve Bayesian classifier. The “naïve” refers to the assumption that any particular feature contributes a specific amount to the likelihood of a sample being assigned to a given class, irrespective of the presence of any other features. For example, the presence of an NH2 group in a compound has the same effect on predicted activity whether or not there is also an OH or COOH group elsewhere in the compound. In other words, a naïve Bayesian classifier ignores interaction effects.

We know that in reality, interaction effects are quite common. Yet, empirically, naïve Bayesian classification models are surprisingly accurate (not to mention that they are lightning-fast to train).

But perhaps there are cases where a model with interactions would be better. How might we make the Bayesian learner less naïve? If we use molecular fingerprints as descriptors, one simple approach is to create a new fingerprint by pairing off the original fingerprint features and adding them to the list. We can then train the model on the new fingerprint with its expanded feature list.

A sparse molecular fingerprint (such as the Accelrys extended-connectivity fingerprints) consists of a list of feature IDs. These IDs are simply integers corresponding to certain substructural units. E.g., “16” might refer to an aliphatic carbon, while “7137126” might refer to an aryl amino group. So if our original fingerprint has the following features:

16
85
784
12662
...

our fingerprint-with-interactions would have the above features with the following ones in addition:

16$85
16$784
16$12662
85$784
85$12662
...

The “$” is just an arbitrary separator between the feature IDs. A Bayesian learner works by simply counting the features present in the two classes of samples (e.g., “active” vs. “inactive”), so the feature labels are unimportant, as long as they are unique.
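
The pairing step can be sketched in a few lines of Python. This is only an illustration of the idea, not the Pipeline Pilot implementation; the function name add_pair_features and its sep parameter are made up for this example.

```python
from itertools import combinations

def add_pair_features(feature_ids, sep="$"):
    """Return the original feature IDs plus one label per unordered pair.

    Hypothetical helper for illustration only.
    """
    singles = sorted(set(feature_ids))
    # combinations() on the sorted list yields each pair once, smaller ID first
    pairs = [f"{a}{sep}{b}" for a, b in combinations(singles, 2)]
    return [str(f) for f in singles] + pairs

expanded = add_pair_features([16, 85, 784, 12662])
print(expanded)
```

Note that a fingerprint with n features gains n(n-1)/2 pair features, so the expanded list grows quadratically with the size of the original fingerprint.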

To test the approach, I applied it to models of the Ames mutagenicity data that I discussed in a previous posting, and to an MAO inhibitor data set. Does it work? The short answer is, “Yes, with caveats.” Read my posting on the Pipeline Pilot forum for details (registration is free).


Lies, damned lies and (Oracle) statistics*

February 3rd, 2010 by Ian Buchan

While investigating the costs of joins between tables in Oracle, I came across the following, seemingly curious, result. I had two tables that were identical in content and layout, each with indexes on the same columns, but when I ran the same search on both tables, the query on one table was consistently more than 25% faster than the same query on the other. “You must have done something differently,” you cry. Well, it wasn’t exactly obvious…

Let’s start at the beginning.  I produced 2 identical tables containing a 10,000 record sample of CAP (Chemicals Available for Purchase) using the same Pipeline Pilot protocol.  The tables differed in name only: one was CapSample, the other CapSample2.  I created indexes on the CLogP and Num_H_Acceptors columns of both tables and then timed the SQL query:

SELECT count(*) FROM CapSample WHERE CLogP>5 and Num_H_Acceptors>10

over 1,000 iterations on each table (replacing CapSample with CapSample2 as appropriate).  My intention was to then measure the time of the search taking CLogP from one table and Num_H_Acceptors from the other table, joining them by the primary key CardRef column.  However the search on CapSample consistently took about 3.85 seconds per 1000 iterations while the same search on CapSample2 consistently took about 2.79 seconds.  I was the only user on the machine and I kept re-running and switching between CapSample and CapSample2 and the results were consistent.  Weird!

The first thing was to examine the execution plans.  Aha! They were different.  Both were using hash joins on the two indexes, but the order of the two index range scan searches was different for the two tables.  Obviously, the CapSample2 order was better.  But why wasn’t it choosing it for CapSample?  At this point, I noticed a note at the end of the explain plan output for CapSample2:

Note
-----
   - dynamic sampling used for this statement

This wasn’t there for CapSample. Why not? Because I’d imported CapSample the day before and only created CapSample2 today! During the night the statistics had been gathered automatically on CapSample. But I’d added the indexes to both tables only after creating CapSample2, so the indexes on CapSample had no statistics, even though the table did.

All I had to do was gather default statistics for both tables again.  Then, being careful to slightly change my SQL so that I didn’t hit any cached plans, I re-explained the queries on both tables and bingo! I got consistent results and they matched those for the fast search of CapSample2.  Running the searches on both tables now gave me the 2.79 seconds I’d seen earlier.

As a final sanity check, I re-timed the search over CapSample using the original SQL and I got the original time of 3.85 seconds again. I was hitting the cached plan: Oracle used it even though the statistics had changed. It seems weird to run two queries that look identical except for an extra space character and find that one runs over 25% faster than the other, but that’s what happens when you have cached plans.

So the moral(s) of this tale are:

1. When you change tables significantly or add indexes, gather table statistics for the changed tables and gather index statistics for changed or new indexes.

2. Oracle’s dynamic sampling can be very good.  However, you might want to gather proper statistics immediately after changes if you are automatically gathering statistics on your tables.  Otherwise, you could find the plan changes later (when cached plans are replaced).

3. Remember to either clear cached plans or change the SQL statement slightly after you have gathered new statistics to avoid hitting old cached plans.

*See http://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statistics


Ensemble Models for Better Predictions

January 22nd, 2010 by Dana Honeycutt, Ph.D.

One approach to building a predictive model is to choose a powerful technique such as a neural network (NN) or support vector machine (SVM) algorithm and then tune the model-building parameters to maximize the predictive performance. Over the past 15 years or so, an increasingly popular alternative has been to combine the predictions of multiple different techniques into a consensus or ensemble model, without necessarily optimizing each individual model within the ensemble. This is the approach that won the million-dollar Netflix Prize last year, as well as the zero-dollar challenge from the November 2009 Pipeline Pilot newsletter. I’ll be talking about the latter; for details on the Netflix Prize solution, go here.

In brief, the Pipeline Pilot Challenge was to find the model-building technique that gives the best ROC score for a particular classification problem. When we formulated the problem, we figured people would apply the various different learner components in Pipeline Pilot, and probably come up with a solution involving an SVM, Bayesian, or recursive partitioning (RP) model.

But winner Lee Herman took a clever alternative approach. He built four different models using four dissimilar techniques: Bayesian, RP (a multi-tree forest), mixture discriminant analysis, and SVM. For making predictions on the test set, he summed the predictions from each of the models to get a composite score. This ensemble model gave a better ROC score than any of the individual models contributing to it. For details, see Lee’s protocol on the Pipeline Pilot forum (registration is free).

Why does this work? In essence, each type of model captures some aspect of the relationship between the descriptors and what we wish to predict, while having its own distinct errors and biases. To the extent that the errors are uncorrelated between models, they cancel rather than reinforce each other. Thus the accuracy of the whole becomes greater than the greatest accuracy of any of its parts. It’s as if many wrongs can make a right.
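
Here is a toy Python sketch of the effect. It is not Lee’s protocol: the three score lists are invented so that each model misranks a different active, and the helper names roc_auc and ensemble are made up for this example. Summing raw scores also assumes the models score on comparable scales.

```python
def roc_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ensemble(score_lists):
    """Sum the per-sample scores of several models."""
    return [sum(sample_scores) for sample_scores in zip(*score_lists)]

# Invented data: 1 = active, 0 = inactive
labels  = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1]  # underrates the third active
model_b = [0.9, 0.3, 0.8, 0.2, 0.4, 0.1]  # underrates the second active
model_c = [0.3, 0.9, 0.8, 0.1, 0.2, 0.4]  # underrates the first active

for name, scores in [("A", model_a), ("B", model_b), ("C", model_c)]:
    print(name, round(roc_auc(scores, labels), 3))
print("ensemble", roc_auc(ensemble([model_a, model_b, model_c]), labels))
```

Each model alone misorders one active/inactive pair, but because the mistakes fall on different samples, the summed score ranks every active above every inactive.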


Mad about MAD

January 8th, 2010 by Dana Honeycutt, Ph.D.

Over the past year or so, I have spent a great deal of time working with model applicability domains (MAD). Here I explain some of the what and the why.

When we build a statistical model—whether with linear regression, Bayesian classification, recursive partitioning, or some other method—we want to ensure that the model is a good one. If the goal is to make predictions with the model, then “good” means “able to make accurate predictions.” We usually use cross-validation or test set validation to convince ourselves that a model is good in this sense.

But there’s more to it than this. Even a good model makes poor predictions for samples that are too different from the samples in the training data used to build the model. In other words, the training data define the model’s applicability domain.

For example, suppose we wish to model per-capita crawfish consumption as a function of several variables, including the distance from the Mississippi River. Suppose also that our training and test sets consist solely of Louisiana residents. Even if we find that the model has good predictive ability for the test set, we would not expect it to do a good job predicting crawfish consumption in, say, Oregon (though it might do an OK job for parts of Mississippi). In other words, locations in Oregon lie outside the MAD. (See map.)

This idea appears obvious, yet models in statistical software packages often lack the ability to automatically define their own MAD and flag predictions outside the MAD as questionable. (In linear regression models, confidence and prediction bands serve this role to some extent. The bands become wider as we move away from the center of the training data.) The onus is generally on the user of the model to ensure that it is applied correctly. When the person applying the model is the same one who built it, and is thus familiar with the training data and the model’s limitations, this is not too big a problem. But when the creator and user of a model are two different people separated in space or time, a model’s awareness of its own applicability domain can be critical to the proper use of the model.

In the life sciences, it appears that the need to take the MAD into account when making predictions was first recognized for QSAR models of toxicity such as TOPKAT. TOPKAT introduced the notion of the optimum prediction space (OPS) defined by the ranges of the training set descriptors in principal component space. But the OPS is just one of several MAD measures discussed in the literature (e.g., see here, here, here, and here).
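
To make the range idea concrete, here is a simplified Python sketch. It checks descriptor ranges directly and leaves out the principal component transformation, so it is not the TOPKAT OPS itself; the function names and the descriptor values are invented for illustration.

```python
def fit_ranges(training_rows):
    """Record the (min, max) of each descriptor column in the training set."""
    columns = list(zip(*training_rows))
    return [(min(col), max(col)) for col in columns]

def in_ranges(sample, ranges):
    """True if every descriptor of the sample lies within the training range."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(sample, ranges))

# Invented training set with two descriptors per sample
training = [(1.0, 200.0), (2.5, 310.0), (4.0, 150.0)]
ranges = fit_ranges(training)

print(in_ranges((3.0, 250.0), ranges))  # inside the range on both descriptors
print(in_ranges((6.0, 250.0), ranges))  # first descriptor exceeds the training maximum
```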

To summarize some of my own recent work in this area: In various numerical experiments, I have reproduced the research results of others who found that the distance from a test sample to samples in the training set correlates well with the model prediction error. (“Distance” can be defined in several different ways, and a lengthy essay could be written on this subject alone. But I’ll spare you for now.) This gives us the potential to estimate MAD-dependent error bars even for learning methods that do not intrinsically support them. A few of the model-building (learner) components in Pipeline Pilot now support OPS and other MAD measures, and we’re working on adding more of these.
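
As one simple illustration of a distance-based measure, the Python sketch below uses the Euclidean distance to the nearest training sample. This is just one of the possible definitions of “distance,” not necessarily the one used in Pipeline Pilot; the function names and the threshold value are hypothetical, and in practice the descriptors would normally be scaled first.

```python
import math

def nearest_training_distance(sample, training_rows):
    """Euclidean distance from a sample to its closest training sample."""
    return min(math.dist(sample, row) for row in training_rows)

def outside_domain(sample, training_rows, threshold):
    """Flag a prediction as questionable when the sample is far from the training data."""
    return nearest_training_distance(sample, training_rows) > threshold

# Invented training data and threshold, for illustration only
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
threshold = 1.0

print(outside_domain((0.5, 0.5), training, threshold))  # close to the training samples
print(outside_domain((3.0, 4.0), training, threshold))  # far from all of them
```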

I hope I have convinced you of the importance of paying attention to the applicability domain when making predictions with a model. I’ll have more to say on this in a future posting.


DTC Genetic Testing: Take 2

October 29th, 2009 by Nancy Miller Latimer, M.S.

I am trying to concentrate on creating a “Biomarkers” poster for the Chemical and Biological Defense Science and Technology Conference that I am attending  next month in Dallas.  However, I have a hard time resisting my email.  Just now, I received GenomeWeb Daily News that contained a blurb about:  Amway to Sell Interleukin Genetics Health Tests, October 29, 2009.

I thought, “Is this really the Amway that my neighbor tried to get me to sell 30 years ago by telling me how great their laundry powder was?”  Yes, it is.  Can anyone have any doubt that the genomic era has arrived?

An excerpt:

“The Weight Management Genetic Test is used in a program to determine if an individual is likely to lose weight more from low-calorie or balanced diets, or from increased exercise based on genotype.

The Heart Health Genetic Test uses variations in the IL1 gene in order to determine predisposition for inflammation, which has been implicated as a risk factor for heart disease, the company said.

The Nutritional Needs Genetic Test uses variations in genes related to B-vitamin metabolism and potential cell damage due to oxidative stress, and the Bone Health Genetic Test, which is expected to be available by the end of 2009, identifies susceptibilities to spine fractures and low bone mineral density associated with osteoporosis.”

This is not necessarily a new phenomenon and there are lots of folks that feel they need to protect the public from spending money on these DTC tests. I find it interesting, however, that no one feels compelled to press the government or FDA to legislate the height of my red-spike high heels or how much my husband should be allowed to pay for them. We know these shoes wreak havoc on my back and knees, yet my husband will happily pay hundreds of dollars if he can only get me to wear them! And what about all those promises about the face cream that will make me look 10 years younger?

I am all for DTC genetic tests. I am still waiting on a few specific SNPs to be incorporated in the report before I send my spit to 23andme. Amway’s tests are very simple and, to me, are a new twist to DTC genetic testing. It is not necessarily about medicine but choices that I as a consumer should be allowed to make. What I want to know is: how much will these tests cost? I don’t gamble but I am certainly into recreational genetic tests. Call me weird, call me Harriet, just make sure you call me eXXcited! Bring on the soap, baby. I’m ready.


BI’s Dirty Little Secret

October 8th, 2009 by Accelrys Team

Business Intelligence has been around for nearly 30 years. So, that means that businesses have pretty much mastered  all of their data mining and management issues. Right? Well, then why do R&D enterprises still struggle with integrating and fully leveraging their scientific data for the knowledge it contains?  Accelrys’ VP of Marketing, Bill Stevens, was recently interviewed by Mary Jo Nott, Executive Editor of the BeyeNetwork, on their Executive Spotlight Program. Bill, a BI industry veteran, exposes the unique characteristics of scientific data, and explains why it has eluded the BI umbrella of solutions for so long.

BeyeNETWORK Spotlight – Bill Stevens, Accelrys


Good Models Require Good Data

October 1st, 2009 by Dana Honeycutt, Ph.D.

In my last posting, I touted ROC analysis as one of the best ways to evaluate and compare different methods for building classification models. To do a true apples-to-apples comparison, it also helps to have a good reference data set. In this regard, Katja Hansen et al. have done data modelers a favor by publishing a “Benchmark Data Set for in Silico Prediction of Ames Mutagenicity.” Not only did they vet and make available the data, but they also provide data splits for cross-validation to help modelers ensure that their method comparisons have a common basis.

The authors compare several techniques, including the Bayesian classifier in Pipeline Pilot. Data junkie that I am, I couldn’t resist throwing the Ames data at this and a few other Pipeline Pilot learners. Here are the results I got using the ECFP_4  molecular fingerprint as the descriptor:

Method      ROC Score
Bayesian    0.82
RP Tree     0.78
RP Forest   0.82
R SVM       0.72
kNN         0.84

These results show a few things. The best ROC scores in the table are comparable to those reported by Hansen et al. for various classifiers that they investigated. (The best score they obtained was 0.86 for an SVM model.) The results are consistent with the widely held view that forest models give better predictive performance than single tree models. Finally, they support the idea that molecular fingerprints are good descriptors for building classification models.

If you want more of the statistical details, I provide them in a posting on the Pipeline Pilot Forum at the Accelrys Community site. (Registration is free.)


Do You Have Your Life Preserver?

September 28th, 2009 by Nancy Miller Latimer, M.S.

Scott Markel’s article, “Drowning Research Scientists, Meet Life Preserver,” found in the Sep 16, 2009 issue of Drug Discovery & Development, makes an impressive case for using pipelining technology in the bioinformatics research community and in the broader biomarker and translational research communities. As he points out, there will never be a one-size-fits-all research approach for these scientific communities. The sheer volume of data sources and of open source and third party integration opportunities just continues to grow, and Pipeline Pilot, a leader in data pipelining, is uniquely capable of handling this challenge.

I loved his conclusion:
Rather than relying on standard templates, users should be able to configure what they want to see and how it is presented. This degree of flexibility leaves room for the innovation so vital to these initiatives, while still providing a framework for faster decision-making and ultimately faster results.

Scott is a Vice-President and member of the Board of Directors of the International Society for Computational Biology. Scott is also the head of ACCL’s talented biosciences R&D team and developer/architect extraordinaire. I get paid to work with him. Lucky me.


Let’s ROC

September 14th, 2009 by Dana Honeycutt, Ph.D.

In the field of machine learning, a binary classification model is a statistical method for assigning an object to one of two categories (classes): benign vs. malignant, active vs. inactive, crystalline vs. noncrystalline, and so on. We build these models to reduce the number of experiments we need to run or to reduce the human labor required to evaluate experimental data (such as image data). The models are rarely perfect—meaning that they generally assign at least some objects to the wrong category.

In evaluating the model quality, the number of metrics we can look at is vast, with names such as: accuracy, precision, specificity, sensitivity (a.k.a. recall), positive predictive value, negative predictive value, Cohen’s kappa, F-measure, and more. But if you asked me to judge a binary classifier’s predictive power based on a single number, that number would be the ROC area-under-the-curve (AUC) score on a test data set.

The ROC AUC score comes from a ROC plot, which is simply a plot of the true positive rate (sensitivity) against the false positive rate (1 − specificity). We generate the points on the plot by varying the cutoff value we apply to the model’s output to distinguish between the predicted classes. (Note that most so-called classification models are at root ranking models, which output a numerical score corresponding to the relative likelihood of the object being in one class versus the other.) Here’s a typical ROC plot:
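
For readers who want to see the mechanics, here is a minimal Python sketch of how the points are generated by sweeping the cutoff. The function name roc_points and the scores and labels are invented for this example; it is not how Pipeline Pilot computes the plot.

```python
def roc_points(scores, labels):
    """Return a (false positive rate, true positive rate) point for each cutoff."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # cutoff above every score: nothing is predicted positive
    for cutoff in sorted(set(scores), reverse=True):
        # Everything scoring at or above the cutoff is predicted positive
        predicted = [s >= cutoff for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Invented model outputs: 1 = positive class, 0 = negative class
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]

for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the cutoff can only add predicted positives, so the curve always runs from (0, 0) up and to the right until it reaches (1, 1).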

ROC plot for Pipeline Pilot Bayesian model of NCI AIDS data

Each point on the plot tells you this: “For a true positive rate given by the Y axis value, the X axis value is the price you must pay in false positives.” It is then up to you to decide what the best tradeoff is and to set the cutoff accordingly. Or you may decide that none of the points on the curve give you the combination of sensitivity and specificity you need, and that you need a better model.

As you might infer from the name, the ROC AUC score is just the area under the ROC curve. It ranges in value from 0.5 for a model that’s no better than random guessing to 1.0 for a perfect model. (A score below 0.5 means the model ranks the two classes in the wrong order more often than not.) Unlike the other metrics I mentioned above, the ROC score is independent of any specific cutoff value. Because of this, its value is an intrinsic property of the model (for a given test set). It does not depend on any preference we might have for, say, reducing the number of false negatives at the price of more false positives. It gives us a single value that we can use to easily compare the performance of different classification methods or to tune the performance of a given method.
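
The area itself can be computed with the trapezoidal rule. The Python sketch below assumes the curve is given as a list of (false positive rate, true positive rate) points; the function name auc_trapezoid and the example curves are invented for illustration.

```python
def auc_trapezoid(points):
    """Area under a curve given as (x, y) points, using the trapezoidal rule."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

random_guess = [(0.0, 0.0), (1.0, 1.0)]               # the diagonal
perfect      = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # straight up, then across
typical      = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]

print(auc_trapezoid(random_guess))  # 0.5
print(auc_trapezoid(perfect))       # 1.0
print(auc_trapezoid(typical))       # 0.75
```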


Next Generation Sequencing—Pass the Pipeline, Please

August 25th, 2009 by Nancy Miller Latimer, M.S.

The drug discovery landscape has morphed into a larger playing field.  Move over small molecule; make room for biologics, diagnostics, biomarkers, and translational research–all key players as medicine gets personal and the division between research and the clinic narrows.  The “one drug, many people” paradigm now has a sibling “one person – just the right drug” paradigm.   Deep sequencing is a technology that figures prominently in the new birth.

As large pharma companies scramble to figure out what the new landscape will mean for them, they (1) have reorganized along therapeutic lines with translational research and biomarker departments and (2) have placed their orders for or taken delivery of the next generation of sequencers. And it is not just pharma that is interested. Biotech, academic and commercial core sequencing facilities, and government research organizations worldwide are actively acquiring the next generation of sequencers; no one wants to be left out. Obama has pumped billions of dollars into US research. At some point, sizable chunks will find their way into sequencing facilities[1].

The deep sequencing technologies have moved very fast, while the price of sequencing a human genome has plummeted.  It is no surprise that this inverse relationship is fueling sequencer sales and some anxiety about analyzing all those reads.  The price of sequencing a human genome will soon be under $1K.  “…the much-discussed goal of the $1,000 genome could be attained in two or three years. That is the cost, experts have long predicted, at which genome sequencing could start to become a routine part of medical practice.”[2] An intense desire for an unprecedented “look” into the genome, coupled with analytic inexperience, has created an unmet need in the marketplace.  Hardware vendors are keen to make NGS[3] data analysis as user-friendly as possible, setting the stage for a perfect Pipeline Pilot application.

Accelrys has been investing in a pipelining solution for NGS for over a year now. Pipeline Pilot is an integration backbone for many hard-working scientific “collections” and third party applications. Many IT and domain scientists are “hooked” on Pipeline Pilot to deploy robust and easily modified “protocols”. I know from firsthand experience about the Pipeline Pilot “addiction”. Oh, and if you aren’t aware, Pipeline Pilot is not just about chemistry anymore. Complementing our released collections for Imaging, Sequence Analysis, Gene Expression, Plate Data Analytics, and Mass Spec for Proteomics, we look forward to releasing the first version of our NGS Collection. You can expect the same drag and drop functionality that you have come to expect and enjoy with the other Pipeline Pilot Collections.

In our first version of our NGS collection, we are making choices about which use cases to support. I am in the process of collecting input from our customers and those awaiting this new product—but I’d like to hear from you, too.  Do your plans for NGS analysis include cloud computing?  If you would like to participate in this survey, please send your name and contact information to me at nlatimer@accelrys.com.  Also, if you are interested in being an alpha or beta tester, let me know!


[1] http://blogs.wsj.com/health/2009/07/08/genetist-francis-collins-nominated-to-head-nih/

[2] http://www.nytimes.com/2009/08/11/science/11gene.html?_r=1&hp

[3] NGS stands for Next Generation Sequencing
