Less-Naïve Bayesian Classification

February 10th, 2010 by Dana Honeycutt, Ph.D.

The Bayesian learner in Pipeline Pilot is a so-called naïve Bayesian classifier. The “naïve” refers to the assumption that any particular feature contributes a specific amount to the likelihood of a sample being assigned to a given class, irrespective of the presence of any other features. For example, the presence of an NH2 group in a compound has the same effect on predicted activity whether or not there is also an OH or COOH group elsewhere in the compound. In other words, a naïve Bayesian classifier ignores interaction effects.

We know that in reality, interaction effects are quite common. Yet, empirically, naïve Bayesian classification models are surprisingly accurate (not to mention that they are lightning-fast to train).

But perhaps there are cases where a model with interactions would be better. How might we make the Bayesian learner less naïve? If we use molecular fingerprints as descriptors, one simple approach is to create a new fingerprint by forming every pair of the original fingerprint features and adding each pair to the feature list as a new feature. We can then train the model on the new fingerprint with its expanded feature list.

A sparse molecular fingerprint (such as the Accelrys extended-connectivity fingerprints) consists of a list of feature IDs. These IDs are simply integers corresponding to certain substructural units. For example, “16” might refer to an aliphatic carbon, while “7137126” might refer to an aryl amino group. So if our original fingerprint has the following features:

16
85
784
12662
...

our fingerprint-with-interactions would have the above features with the following ones in addition:

16$85
16$784
16$12662
85$784
85$12662
...

The “$” is just an arbitrary separator between the feature IDs. A Bayesian learner works by simply counting the features present in the two classes of samples (e.g., “active” vs. “inactive”), so the feature labels are unimportant, as long as they are unique.
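The pairing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Pipeline Pilot implementation: the function names add_pair_features and count_features_by_class are hypothetical, and the counting shown is the raw per-class tally that a Bayesian learner would start from, without any of the weighting a real learner applies.

```python
from collections import Counter
from itertools import combinations


def add_pair_features(feature_ids, sep="$"):
    """Return the original feature IDs plus one label per unordered pair.

    Hypothetical helper for illustration; labels are plain strings.
    """
    singles = sorted(set(feature_ids))
    labels = [str(f) for f in singles]
    # combinations() on the sorted IDs yields each pair once, smaller ID first.
    labels += [f"{a}{sep}{b}" for a, b in combinations(singles, 2)]
    return labels


def count_features_by_class(samples):
    """Count, per class, how many samples contain each expanded feature label.

    `samples` is an iterable of (feature_ids, class_label) tuples.
    """
    counts = {}
    for feature_ids, label in samples:
        counts.setdefault(label, Counter()).update(add_pair_features(feature_ids))
    return counts


# The four feature IDs from the example above.
expanded = add_pair_features([16, 85, 784, 12662])
print(expanded)

# Made-up samples, only to show the counting step.
counts = count_features_by_class([
    ([16, 85], "active"),
    ([16, 85, 784], "active"),
    ([16, 784], "inactive"),
])
print(counts["active"]["16$85"])
```

Note that a fingerprint with n features gains n(n-1)/2 pair features, so the expanded feature list grows quadratically with the size of the original fingerprint.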

To test the approach, I applied it to models of the Ames mutagenicity data that I discussed in a previous posting, and to an MAO inhibitor data set. Does it work? The short answer is, “Yes, with caveats.” Read my posting on the Pipeline Pilot forum for details (registration is free).


Mad about MAD

January 8th, 2010 by Dana Honeycutt, Ph.D.

Over the past year or so, I have spent a great deal of time working with model applicability domains (MAD). Here I explain some of the what and the why.

When we build a statistical model—whether with linear regression, Bayesian classification, recursive partitioning, or some other method—we want to ensure that the model is a good one. If the goal is to make predictions with the model, then “good” means “able to make accurate predictions.” We usually use cross-validation or test set validation to convince ourselves that a model is good in this sense.

But there’s more to it than this. Even a good model makes poor predictions for samples that are too different from the samples in the training data used to build the model. In other words, the training data define the model’s applicability domain.

For example, suppose we wish to model per-capita crawfish consumption as a function of several variables, including the distance from the Mississippi River. Suppose also that our training and test sets consist solely of Louisiana residents. Even if we find that the model has good predictive ability for the test set, we would not expect it to do a good job predicting crawfish consumption in, say, Oregon (though it might do an OK job for parts of Mississippi). In other words, locations in Oregon lie outside the MAD. (See map.)

This idea appears obvious, yet models in statistical software packages often lack the ability to automatically define their own MAD and flag predictions outside the MAD as questionable. (In linear regression models, confidence and prediction bands serve this role to some extent. The bands become wider as we move away from the center of the training data.) The onus is generally on the user of the model to ensure that it is applied correctly. When the person applying the model is the same one who built it, and is thus familiar with the training data and the model’s limitations, this is not too big a problem. But when the creator and user of a model are two different people separated in space or time, a model’s awareness of its own applicability domain can be critical to the proper use of the model.

In the life sciences, it appears that the need to take the MAD into account when making predictions was first recognized for QSAR models of toxicity such as TOPKAT. TOPKAT introduced the notion of the optimum prediction space (OPS) defined by the ranges of the training set descriptors in principal component space. But the OPS is just one of several MAD measures discussed in the literature (e.g., see here, here, here, and here).
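To make the idea of a range-based applicability domain concrete, here is a deliberately simplified sketch. It checks descriptor ranges in the original descriptor space, not in principal component space, so it is not the OPS calculation used by TOPKAT. The function names and the numbers are made up for illustration.

```python
def fit_ranges(training_rows):
    """Record the (min, max) of each descriptor column in the training set."""
    columns = list(zip(*training_rows))
    return [(min(col), max(col)) for col in columns]


def in_domain(sample, ranges):
    """True if every descriptor value lies within its training range."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(sample, ranges))


# Hypothetical training data: three samples, two descriptors each.
training = [(1.0, 10.0), (2.0, 30.0), (3.0, 20.0)]
ranges = fit_ranges(training)

print(in_domain((2.5, 15.0), ranges))  # inside both ranges
print(in_domain((4.0, 15.0), ranges))  # first descriptor exceeds the training maximum
```

A check like this flags only samples that fall outside the bounding box of the training data. It says nothing about sparse regions inside the box, which is one reason other MAD measures exist.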

To summarize some of my own recent work in this area: In various numerical experiments, I have reproduced the research results of others who found that the distance from a test sample to samples in the training set correlates well with the model prediction error. (“Distance” can be defined in several different ways, and a lengthy essay could be written on this subject alone. But I’ll spare you for now.) This gives us the potential to estimate MAD-dependent error bars even for learning methods that do not intrinsically support them. A few of the model-building (learner) components in Pipeline Pilot now support OPS and other MAD measures, and we’re working on adding more of these.
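As one example of a distance-based measure, the sketch below computes the Tanimoto distance from a query fingerprint to its nearest neighbor in the training set. The choice of Tanimoto distance, the function names, and the feature IDs are assumptions made for illustration; this is not a description of how the Pipeline Pilot components compute their MAD measures.

```python
def tanimoto_distance(a, b):
    """1 minus the Tanimoto similarity of two sets of feature IDs."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / len(union)


def nearest_training_distance(sample, training_set):
    """Distance from a sample to the closest training fingerprint."""
    return min(tanimoto_distance(sample, t) for t in training_set)


# Hypothetical training fingerprints (sets of feature IDs).
training = [{16, 85, 784}, {16, 85, 12662}]

near = nearest_training_distance({16, 85, 784, 12662}, training)
far = nearest_training_distance({999, 1000}, training)
print(near, far)
```

A larger nearest-neighbor distance suggests the sample is less like anything the model has seen. Where to draw the line between inside and outside the domain is a separate calibration question that this sketch does not address.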

I hope I have convinced you of the importance of paying attention to the applicability domain when making predictions with a model. I’ll have more to say on this in a future posting.


Good Models Require Good Data

October 1st, 2009 by Dana Honeycutt, Ph.D.

In my last posting, I touted ROC analysis as one of the best ways to evaluate and compare different methods for building classification models. To do a true apples-to-apples comparison, it also helps to have a good reference data set. In this regard, Katja Hansen et al. have done data modelers a favor by publishing a “Benchmark Data Set for in Silico Prediction of Ames Mutagenicity.” Not only did they vet and make available the data, but they also provide data splits for cross-validation to help modelers ensure that their method comparisons have a common basis.

The authors compare several techniques, including the Bayesian classifier in Pipeline Pilot. Data junkie that I am, I couldn’t resist throwing the Ames data at this and a few other Pipeline Pilot learners. Here are the results I got using the ECFP_4 molecular fingerprint as the descriptor:

Method       ROC Score
Bayesian     0.82
RP Tree      0.78
RP Forest    0.82
R SVM        0.72
kNN          0.84

These results show a few things. The best ROC scores in the table are comparable to those reported by Hansen et al. for various classifiers that they investigated. (The best score they obtained was 0.86 for an SVM model.) The results are consistent with the widely held view that forest models give better predictive performance than single tree models (0.82 for RP Forest vs. 0.78 for RP Tree). Finally, they confirm that molecular fingerprints are good descriptors for building classification models.

If you want more of the statistical details, I provide them in a posting on the Pipeline Pilot Forum at the Accelrys Community site. (Registration is free.)


What do the latest animal testing figures tell us?

July 31st, 2009 by Gerhard Goldbeck-Wood, PhD

The UK Home Office released new statistics on animal testing in the UK last week (21 July 2009). As reported by the BBC, the figures show a strong upward jump. In 2008, 3.7 million procedures using animals were carried out in the UK, an increase of 455,000, or 14%, over 2007. Digging a bit deeper into the statistics shows some interesting trends.

The increase in number comes almost entirely from a strong increase in the number of fish procedures (278,000), followed by mice (197,000).

[Chart: animal testing chart 1]

Whereas others have remained stable or declined:

[Chart: animal testing chart 2]

According to the report, the increases seem to be strongly related to growth in biological research, which in turn reflects the increasing attention to, and promise of, personalized medicine and genetic targeting.

It’s also interesting to note that, apart from breeding related activities, medical research and pharmaceuticals safety dominate by far. Procedures relating to ecology, and substances used for example in industry, agriculture and food only sum to about 100,000, or just a few percent of the total. Also, these procedures have seen a decline overall, for example, by 20% for substances used in industry. There have been no tests at all on cosmetic substances, in fact none since 1998.

So what’s it telling us?

There have been great advances over the last 10 years in making use of alternative testing methods. In vitro testing has become more established for cosmetics, and animal testing either has been phased out already or is on its way out very soon in Europe as a result of the 7th Amendment to the Cosmetics Directive.

While similar trends hold for substances used in industry, it will be interesting to follow this trend in 2009 and 2010, as the majority of substance assessments for the submission of dossiers required by the REACH legislation are taking place. By anecdotal evidence, talking to folks during the recent SETAC conference, labs are getting busy, and in some cases are already completely booked with REACH-related work. However, that workload also includes massive amounts of data gathering, literature searching, analytical testing, as well as in-silico methods such as QSAR and read-across. Quote: “Some people will soon get in a panic about closing the data gaps.”

The question is, how are organisations accessing, processing and handling the information, and making best use of data and information in the literature that’s already out there? How about a web-based ‘workbench’ that’s geared up to support toxicologists and other scientists as well as managers in the field to gather, process, share and report that information? We’ve built out a proof-of-concept for anyone to take a look at and try, which includes examples of different functions, from database and document searches and predictive toxicology analysis to facility monitoring. The trick is that, because it is built on Pipeline Pilot protocols, it’s highly configurable and extensible with almost any third-party tool. So it can be built or re-modelled to match the user’s routine practices.
