Theory and Experiment in “Step” on Semiconductors

February 8th, 2010 by George Fitzgerald, PhD

A recent news article by the University of Texas at Dallas (UTD) highlighted joint work by the Department of Materials Science and Engineering and Accelrys on critical surface reactions of silicon. The research points the way to “improve semiconductor devices’ performance in health care and solar power applications in particular.”

Who cares? Anybody who uses chips, solar cells, or any other device containing semiconductors (in other words, all of us).

Insertion of a nitrogen atom is predicted to occur preferentially at the step edge of Si(111).

How does the latest research help? A typical semiconductor device consists of a metal oxide semiconductor layer (e.g., HfO2) deposited on a silicon substrate. As explained by co-author Dr. Mat Halls, formation of an SiO2 interlayer between the silicon substrate and metal oxide can decrease semiconductor performance. One approach to solving this is to introduce a nitride barrier to prevent the growth of interfacial SiO2. The ability to introduce such heteroatoms into the topmost layers of Si affords additional opportunities to tune the surface properties by enhancing chemical reactivity at these sites to form functional surfaces. But how do you get the nitrogen to stick to the surface?

The latest research, published in Nature Materials, used infrared spectroscopy to explore the possible formation mechanisms of nitride on silicon surfaces terminated by hydrogen. Calculations using density functional theory (DFT) demonstrated how step edges are important to the formation of the nitride layers. The reaction mechanism on the stepped surface provides a means of controlling the reaction. As the authors wrote: “The ability to control the reaction … enables the realization of applications … including sensing, electrical and thermal transport, and molecular computing.” This is a beautiful demonstration of the complementarity of theory and experiment. Experiment deals with facts, but requires interpretation. Theory provides detailed explanations at the atomic level, but sometimes requires an anchor to the “real world.” Together they can do more. Wouldn’t it be great if all viewpoints could be reconciled this well?

Lies, damned lies and (Oracle) statistics*

February 3rd, 2010 by Ian Buchan

While investigating the costs of joins between tables in Oracle, I came across the following, seemingly curious, result. I had two tables that were identical in content and layout, each with indexes on the same columns, but when I ran the same search on both tables, the query on one table was consistently more than 25% faster than the same query on the other. “You must have done something differently,” you cry. Well, it wasn’t exactly obvious…

Let’s start at the beginning.  I produced 2 identical tables containing a 10,000 record sample of CAP (Chemicals Available for Purchase) using the same Pipeline Pilot protocol.  The tables differed in name only: one was CapSample, the other CapSample2.  I created indexes on the CLogP and Num_H_Acceptors columns of both tables and then timed the SQL query:

SELECT COUNT(*) FROM CapSample WHERE CLogP > 5 AND Num_H_Acceptors > 10

over 1,000 iterations on each table (replacing CapSample with CapSample2 as appropriate). My intention was then to measure the time of the search taking CLogP from one table and Num_H_Acceptors from the other table, joining them by the primary key CardRef column. However, the search on CapSample consistently took about 3.85 seconds per 1,000 iterations while the same search on CapSample2 consistently took about 2.79 seconds. I was the only user on the machine, and I kept re-running and switching between CapSample and CapSample2, and the results were consistent. Weird!

The first thing was to examine the execution plans.  Aha! They were different.  Both were using hash joins on the two indexes, but the order of the two index range scan searches was different for the two tables.  Obviously, the CapSample2 order was better.  But why wasn’t it choosing it for CapSample?  At this point, I noticed a note at the end of the explain plan output for CapSample2:

Note
-----
   - dynamic sampling used for this statement

This wasn’t there for CapSample. Why not? Because I’d imported CapSample the day before and only created CapSample2 today! During the night the statistics had been gathered automatically on CapSample. I had added the indexes to both tables only after creating CapSample2, so the indexes on CapSample had no statistics, even though the table itself did.

All I had to do was gather default statistics for both tables again.  Then, being careful to slightly change my SQL so that I didn’t hit any cached plans, I re-explained the queries on both tables and bingo! I got consistent results and they matched those for the fast search of CapSample2.  Running the searches on both tables now gave me the 2.79 seconds I’d seen earlier.

As a final sanity check, I re-timed the search over CapSample using the original SQL and I got the original time of 3.85 seconds again. I was hitting the cached plan: Oracle used it even though the statistics had changed. It seems weird running two queries that look identical except for an extra space character and finding that one runs over 25% faster than the other, but that’s what happens when you have cached plans.

So the moral(s) of this tale are:

1. When you change tables significantly or add indexes, gather table statistics for the changed tables and gather index statistics for changed or new indexes.

2. Oracle’s dynamic sampling can be very good.  However, you might want to gather proper statistics immediately after changes if you are automatically gathering statistics on your tables.  Otherwise, you could find the plan changes later (when cached plans are replaced).

3. Remember to either clear cached plans or change the SQL statement slightly after you have gathered new statistics to avoid hitting old cached plans.
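
The third point can be pictured with a toy model. The sketch below is not Oracle's implementation; it only assumes, as the extra-space trick above suggests, that a cached plan is looked up by the exact text of the statement. The index names, row estimates, and the `choose_plan` rule are hypothetical and exist purely for illustration.

```python
def choose_plan(index_stats):
    """Hypothetical optimizer rule: scan the more selective index first.

    index_stats maps an index name to its estimated matching rows.
    An empty mapping stands for "no statistics", in which case a fixed
    default order is used.
    """
    if not index_stats:
        return ("CLOGP_IDX", "HACC_IDX")  # arbitrary default order (assumption)
    return tuple(sorted(index_stats, key=index_stats.get))


class PlanCache:
    """Toy plan cache keyed on the exact SQL text."""

    def __init__(self):
        self._plans = {}

    def plan_for(self, sql_text, index_stats):
        # A plan is only computed the first time a given text is seen.
        if sql_text not in self._plans:
            self._plans[sql_text] = choose_plan(index_stats)
        return self._plans[sql_text]

    def flush(self):
        # Forget all cached plans so the next lookup re-plans.
        self._plans.clear()


sql = "SELECT count(*) FROM CapSample WHERE CLogP>5 and Num_H_Acceptors>10"
cache = PlanCache()

stale_plan = cache.plan_for(sql, {})                     # planned without statistics
fresh_stats = {"CLOGP_IDX": 900, "HACC_IDX": 40}         # made-up row estimates
cached_plan = cache.plan_for(sql, fresh_stats)           # same text: old plan reused
reparsed_plan = cache.plan_for(sql + " ", fresh_stats)   # extra space: new plan
```

In this toy, `cached_plan` is still the plan built without statistics, while the statement with the trailing space is planned afresh. Calling `cache.flush()` before re-running the original text has the same effect as changing the text.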

*See http://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statistics

Machine Learning – “What if” it enabled exploratory analysis in High Content Screening

January 22nd, 2010 by Tim Moran

A roundtable discussion took place near the close of this year’s HCA meeting in San Francisco. The topics of Data Analysis and Management, Image Analysis and Computational Biology were folded into a single discussion. This roundtable was facilitated by Karel Kozak. Participants included:

Karel Kozak (Swiss Fed. Institute Of Technology)

Lisa Smith (Merck)

Peter Horvath (Swiss Fed. Institute Of Technology)

Achim Kirsch (PE/Evotec)

Ghislain Bonamy (Novartis GNF)

Abhay Kini (GE Healthcare)

Jonathan Sexton (North Carolina Central University)

Mark Bray (Broad Institute)

Chris Wood (Stowers Institute for Medical Research)

Pierre Turpin (Molecular Devices)

Mark Collins (ThermoFisher/Cellomics)

The opening shot from Schmerck (Lisa Smith from Schering, now Merck) was fired at the vendors. The bullet in question? “Why tools for pattern recognition and machine learning on image data were not more rapidly addressed for vendor systems?” Vendors replied with their own question, “Why is this a better approach than algorithmic quantification of a known endpoint?” The result of the ensuing discussion was that the end-users want the ability to extract any additional information from their data that is not derived by the designed analysis algorithm, i.e., look for natural classes in the data, spot outliers, correlate to chemical structure of test compounds, etc. This does not necessarily have to be correlated to known biological endpoints; it can be purely exploratory. Vendors said, “That’s why we need companies like Accelrys and products like Pipeline Pilot.” The marketplace needs a third-party environment which provides turnkey or almost-turnkey access to the data, and an exploratory environment like PLP in which users can develop methods to ask “what-if” questions of their data. When users clearly demonstrate that these techniques have merit, they will find their way into the instrument vendors’ products.

One other aspect of the above discussion which became apparent is that many, if not most, HCS users have no idea what the difference is between PCA, Classification, Support Vector Machines, genetic algorithms, Self-organizing maps, etc., let alone where or when to apply these methods. What they want, and need, is a kind of wizard which walks them through a process of determining what they want to learn from their data, and selecting internally the best method to do that. An analogy was drawn to curve-fitting programs which apply hundreds or thousands of models to a data set, and tell the user which ones produced the best fit. This idea of “opening up to the wider science community methods previously available only to discipline experts”, specifically in computational biology, is by no means in its infancy (see The Future of Computational Science, Scientific Computing World: May / June 2004).

The momentum in machine vision (learning, clustering, modeling, predictive science and ease of use) was foreshadowed at the HCA East conference held in 2009 and will likely continue to be the area that enables researchers in High Content Screening and Analysis to make better informed decisions earlier in the discovery process.

Special thanks to contributing author Kurt Scudder.

Ensemble Models for Better Predictions

January 22nd, 2010 by Dana Honeycutt, Ph.D.

One approach to building a predictive model is to choose a powerful technique such as a neural network (NN) or support vector machine (SVM) algorithm and then tune the model-building parameters to maximize the predictive performance. Over the past 15 years or so, an increasingly popular alternative has been to combine the predictions of multiple different techniques into a consensus or ensemble model, without necessarily optimizing each individual model within the ensemble. This is the approach that won the million dollar Netflix Prize last year, as well as the zero dollar challenge from the November 2009 Pipeline Pilot newsletter. I’ll be talking about the latter; for details on the Netflix Prize solution, go here.

In brief, the Pipeline Pilot Challenge was to find the model-building technique that gives the best ROC score for a particular classification problem. When we formulated the problem, we figured people would apply the various different learner components in Pipeline Pilot, and probably come up with a solution involving an SVM, Bayesian, or recursive partitioning (RP) model.

But winner Lee Herman took a clever alternative approach. He built four different models using four dissimilar techniques: Bayesian, RP (a multi-tree forest), mixture discriminant analysis, and SVM. For making predictions on the test set, he summed the predictions from each of the models to get a composite score. This ensemble model gave a better ROC score than any of the individual models contributing to it. For details, see Lee’s protocol on the Pipeline Pilot forum (registration is free).

Why does this work? In essence, each type of model captures some aspect of the relationship between the descriptors and what we wish to predict, while having its own distinct errors and biases. To the extent that the errors are uncorrelated between models, they cancel rather than reinforce each other. Thus the accuracy of the whole becomes greater than the greatest accuracy of any of its parts. It’s as if many wrongs can make a right.
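
The cancellation argument can be checked with a small simulation. This is a sketch under stated assumptions, not Lee’s protocol: the four “models” are simulated as the true class label plus independent Gaussian noise, and the noise level, sample size, and seed are arbitrary choices. Only the idea of summing the individual predictions into a composite score comes from the text above.

```python
import random


def roc_auc(labels, scores):
    """ROC score via the rank-sum (Mann-Whitney) identity; assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum_pos = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(2000)]

# Four simulated models: each sees the true label through its own independent noise.
model_scores = [[y + rng.gauss(0.0, 1.5) for y in labels] for _ in range(4)]

# Composite score: sum the four predictions for each sample.
ensemble_scores = [sum(per_sample) for per_sample in zip(*model_scores)]

individual_aucs = [roc_auc(labels, scores) for scores in model_scores]
ensemble_auc = roc_auc(labels, ensemble_scores)
```

Because the noise terms are independent, summing shrinks the noise relative to the signal, so `ensemble_auc` comes out above every entry of `individual_aucs`. If the models shared most of their errors, the gain would largely disappear.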

DFT Redux

January 14th, 2010 by George Fitzgerald, PhD

I thought I’d start the year with an easy blog, simply following up on my earlier ramblings of 25 October 2009: DFT Goes (Even More) Mainstream. In that article I discussed the success of Density Functional Theory (DFT) and used the annual number of publications as a metric. The numbers show that publications grew by over 25% per annum, but the results for 2009 were naturally incomplete.

Happily, the trend continued through 2009 for a total of 4621 DFT references in ACS Journals. Here are a few of my favorite publications, though not all are drawn from the ACS citations. Yes, of course, these use Accelrys DFT packages, but they are still pretty cool articles:

Let me and my readers know what you think are the most interesting DFT articles from 2009.

†Strictly speaking, this was not QSAR (Quantitative Structure-Activity Relationship), because they didn’t actually base predictions on the structure. I use the term here more generally to refer to relationships that predict complex properties, like catalytic activity, on the basis of simpler properties, like work function.

Mad about MAD

January 8th, 2010 by Dana Honeycutt, Ph.D.

Over the past year or so, I have spent a great deal of time working with model applicability domains (MAD). Here I explain some of the what and the why.

When we build a statistical model—whether with linear regression, Bayesian classification, recursive partitioning, or some other method—we want to ensure that the model is a good one. If the goal is to make predictions with the model, then “good” means “able to make accurate predictions.” We usually use cross-validation or test set validation to convince ourselves that a model is good in this sense.

But there’s more to it than this. Even a good model makes poor predictions for samples that are too different from the samples in the training data used to build the model. In other words, the training data define the model’s applicability domain.

For example, suppose we wish to model per-capita crawfish consumption as a function of several variables, including the distance from the Mississippi River. Suppose also that our training and test sets consist solely of Louisiana residents. Even if we find that the model has good predictive ability for the test set, we would not expect it to do a good job predicting crawfish consumption in, say, Oregon (though it might do an OK job for parts of Mississippi). In other words, locations in Oregon lie outside the MAD. (See map.)

This idea appears obvious, yet models in statistical software packages often lack the ability to automatically define their own MAD and flag predictions outside the MAD as questionable. (In linear regression models, confidence and prediction bands serve this role to some extent. The bands become wider as we move away from the center of the training data.) The onus is generally on the user of the model to ensure that it is applied correctly. When the person applying the model is the same one who built it, and is thus familiar with the training data and the model’s limitations, this is not too big a problem. But when the creator and user of a model are two different people separated in space or time, a model’s awareness of its own applicability domain can be critical to the proper use of the model.
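
To make the remark about prediction bands concrete, here is the standard textbook expression for simple linear regression with a single predictor (a reminder, not something specific to any software package mentioned here). For a new point x_0, with n training samples, training mean \bar{x}, and residual standard error s, the prediction interval is:

```latex
\hat{y}_0 \;\pm\; t_{1-\alpha/2,\,n-2}\; s\,
\sqrt{\,1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,}
```

The term (x_0 - \bar{x})^2 grows as x_0 moves away from the center of the training data, which is why the band widens there.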

In the life sciences, it appears that the need to take the MAD into account when making predictions was first recognized for QSAR models of toxicity such as TOPKAT. TOPKAT introduced the notion of the optimum prediction space (OPS) defined by the ranges of the training set descriptors in principal component space. But the OPS is just one of several MAD measures discussed in the literature (e.g., see here, here, here, and here).
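
A range-based check of this kind can be sketched in a few lines. Note the simplification: the OPS described above uses ranges in principal component space, whereas this sketch skips the rotation and checks ranges on the raw descriptors. The descriptor values and function names are made up for illustration.

```python
def descriptor_ranges(training_set):
    """Return a (min, max) pair for each descriptor column of the training set."""
    columns = zip(*training_set)
    return [(min(column), max(column)) for column in columns]


def inside_ranges(sample, ranges):
    """True if every descriptor of the sample lies within the training range."""
    return all(low <= value <= high for value, (low, high) in zip(sample, ranges))


# Hypothetical training set: three samples, three descriptors each.
training = [(1.2, 3.0, 250.0), (2.5, 5.0, 310.0), (0.8, 2.0, 180.0)]
ranges = descriptor_ranges(training)

similar_sample = (1.0, 4.0, 200.0)   # all descriptors within the training ranges
unusual_sample = (6.0, 4.0, 200.0)   # first descriptor far outside its range
```

Here `inside_ranges(similar_sample, ranges)` is true and `inside_ranges(unusual_sample, ranges)` is false, so a prediction for the second sample would be flagged as questionable.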

To summarize some of my own recent work in this area: In various numerical experiments, I have reproduced the research results of others who found that the distance from a test sample to samples in the training set correlates well with the model prediction error. (“Distance” can be defined in several different ways, and a lengthy essay could be written on this subject alone. But I’ll spare you for now.) This gives us the potential to estimate MAD-dependent error bars even for learning methods that do not intrinsically support them. A few of the model-building (learner) components in Pipeline Pilot now support OPS and other MAD measures, and we’re working on adding more of these.
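
As a minimal sketch of a distance-based measure, the code below uses the Euclidean distance to the nearest training sample, which is only one of the several possible definitions of “distance” mentioned above. The threshold rule (the largest nearest-neighbour distance found inside the training set) and the data are illustrative assumptions, not the method used in Pipeline Pilot.

```python
import math


def nearest_training_distance(sample, training_set):
    """Euclidean distance from a sample to its closest training sample."""
    return min(math.dist(sample, point) for point in training_set)


def domain_threshold(training_set):
    """Largest nearest-neighbour distance within the training set itself."""
    return max(
        min(math.dist(a, b) for j, b in enumerate(training_set) if j != i)
        for i, a in enumerate(training_set)
    )


def in_domain(sample, training_set, threshold):
    """Flag a sample as inside the domain if it is close enough to the training data."""
    return nearest_training_distance(sample, training_set) <= threshold


# Hypothetical two-descriptor training set.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
threshold = domain_threshold(training)

near_sample = (0.6, 0.4)
far_sample = (5.0, 5.0)
```

With these points, `near_sample` falls inside the domain and `far_sample` falls outside it. The same distance could also be used to scale an error bar rather than to make a yes/no decision.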

I hope I have convinced you of the importance of paying attention to the applicability domain when making predictions with a model. I’ll have more to say on this in a future posting.

Put some “Pep” in your proteins

January 8th, 2010 by Accelrys Team

Join us at the annual CHI PepTalk 2010 Conference being held at the Hotel Del Coronado in San Diego, January 11-15, 2010. At booth #32, Accelrys Protein Engineering and Antibody Modeling experts will showcase new features and enhancements found in Accelrys Discovery Studio 2.5.5 that both enable and improve the modeling of antibody structure and function. The advanced computational technology supported by the Discovery Studio environment allows scientists to explore the antibody landscape in silico prior to costly experimental implementation, thus greatly reducing the time and expense involved in bringing such products to market, while increasing scientific productivity.

On Monday, January 11th at 3:35 pm, Dr. Shikha Varma-O’Brien of Accelrys will present “Modeling the 3-Dimensional Structures of Antibody and their Interaction Interface to Antigen.” She will discuss and demonstrate how Accelrys Discovery Studio not only contains the tools necessary to construct a modeling framework from antibodies, but also enables structure-based prediction of antibodies by physical properties with the goal of uncovering novel antibody designs.

“Fueling” the Discovery of New and Alternative Materials

January 6th, 2010 by Accelrys Team

Materials Studio Webinar Series Part V: Exploring New Fuel Cell Materials

There is increasing pressure to deliver lighter, more efficient and less expensive materials more frequently and faster than ever before. Fortunately, the integration of Materials Studio applications such as CASTEP and the Pipeline Pilot platform opens a range of possibilities for the discovery of new materials.

The experts at Accelrys have developed a new framework that screens complex systems and properties across numerous materials and applications. This system is currently being applied to fuel cell catalysts to find alternatives to costly materials such as platinum. Dr. Jacob Gavartin and Dr. Gerhard Goldbeck-Wood will discuss this approach and its application in detail during next week’s webinar:

Exploring New Fuel Cell Materials: High Throughput Calculations and Data Analysis with Materials Studio 5.0 and Pipeline Pilot
January 13, 8am PST / 4pm GMT

Register today!

Cosmetics got Chemistry

December 17th, 2009 by George Fitzgerald, PhD

Things have come a long way since the ancient Egyptians used galena (lead sulphide) as eye makeup. I spent most of last week in New York City at the annual meeting of the Society of Cosmetic Chemists. There is an amazing amount of very sophisticated chemistry going on in cosmetics.

One of the most enjoyable presentations was by Ricardo Diez of Chanel, Inc. Dr. Diez summarized developments in ‘cleansing products’ over the last 50 years. I put the term in quotation marks because what we call ‘soap’ today is quite different from soap of 50 years ago. An excerpt from the New York Times of that era advised women to wash their hair no more often than about every 2 weeks. This stuff was really just your basic soap, i.e., fatty acid salts.

Dr. Diez reported that in the 1920s German chemists created the first ‘soap alternatives’, or detergents, to support the textiles industry. These were the people who filed the patents “behind widely used anionic surfactants” still around today. Surfactants transformed soaps into milder, more effective cleaning agents. Over time, manufacturers made the products gentler (think Johnson’s® “No More Tears”®). Then they added silicones to combine shampoo and conditioner (e.g., Procter & Gamble’s “Pert Plus”®). Finally, manufacturers combined moisturizers with the cleaners.

How did all this come about? Remember the DuPont slogan: Better living through chemistry? Some might find it frivolous to apply this expression to cosmetics. But the advances in ‘soap’ have made real improvements to people’s lives, making it easier and cheaper to practice good hygiene, not just to keep up appearances. As a professional chemist, I’m proud to be associated with the scientists who’ve accomplished that.

Materials Studio 5.0: A particle-ular challenge

December 11th, 2009 by Stephen Todd, PhD

In the last of the series of my blogs on Materials Studio 5.0 functionality, I will be writing about new functionality in the mesoscale area. Back in Materials Studio 4.4, we developed a new module called Mesocite. Mesocite is a module for doing coarse-grained molecular dynamics where the elementary particle is a bead. In coarse-grained molecular dynamics, a bead typically represents several heavy atoms. This has advantages over classical molecular dynamics such as Forcite Plus as you can access longer time and length scales.

In Materials Studio 5.0, we added the capability to do Dissipative Particle Dynamics (DPD) to Mesocite. DPD is a very coarse-grained dynamics approach where the bead can represent many atoms or even molecules. We already have a module which can do DPD in Materials Studio but this has limited ability to be extended. By developing the new DPD in Mesocite, we could take advantage of the underlying MatServer environment to easily extend DPD to run in parallel and work with MaterialsScript amongst other things.

One issue we faced is that the legacy DPD tool works in reduced units whereas MatServer requires physical units. The use of reduced units is fairly standard in DPD; however, it makes it more difficult to relate the results back to experiment. Therefore, we thought that switching to physical units would be a good idea. However, there were still questions as to how customers would work with DPD in physical units. Very early in the release, we asked a small focus group of customers how they would like to parameterize DPD calculations. All agreed that getting the results in physical units was preferable, but they still wanted to set up the calculations in reduced units as they have lots of historical data they want to re-use. So, we have a new user interface which allows setup in either reduced or physical units but then converts to physical units for the calculation!
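
For readers unfamiliar with reduced units, the conversion can be sketched as follows. This is not the Mesocite implementation; it only applies the usual DPD convention in which the bead mass, the cutoff radius, and k_B T are the units of mass, length, and energy, so that the natural time unit is r_c * sqrt(m / k_B T). The bead mass, cutoff, temperature, and reduced input values are illustrative placeholders.

```python
import math

K_B = 1.380649e-23        # Boltzmann constant in J/K
AMU = 1.66053906660e-27   # atomic mass unit in kg


def dpd_unit_system(bead_mass_amu, cutoff_nm, temperature_k):
    """Physical size of one reduced unit of mass, length, energy, and time."""
    mass_kg = bead_mass_amu * AMU
    length_m = cutoff_nm * 1e-9
    energy_j = K_B * temperature_k
    # Natural DPD time unit: r_c * sqrt(m / k_B T)
    time_s = length_m * math.sqrt(mass_kg / energy_j)
    return {"mass_kg": mass_kg, "length_m": length_m,
            "energy_j": energy_j, "time_s": time_s}


# Illustrative placeholder values, not taken from any specific paper.
units = dpd_unit_system(bead_mass_amu=54.0, cutoff_nm=0.65, temperature_k=298.0)

# Convert a setup given in reduced units into physical units.
box_edge_m = 20.0 * units["length_m"]                        # box edge of 20 r_c
time_step_s = 0.05 * units["time_s"]                         # time step of 0.05
repulsion_n = 25.0 * units["energy_j"] / units["length_m"]   # a_ij = 25 k_B T / r_c
```

With these placeholder values the natural time unit comes out at a few picoseconds. In practice the physical time scale of a DPD run is often calibrated against a measured property such as a diffusion constant, so this natural unit should be read as a starting point only.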

When a new piece of functionality is added to Materials Studio, I like to add a tutorial on how to apply the software, what sort of values customers should use, and how to get the most out of the software. For the new DPD functionality, I looked at several papers before settling on an application by Groot and Rabone looking at the effect of non-ionic surfactants on membrane properties. This interested me as it demonstrates the strength of mesoscale in looking at varying concentrations of different components and seeing the effect on morphology. I also realized that I could use some of the Mesocite analysis to really analyze the system for properties such as concentration profiles and examining the diffusivity of the beads across the membrane. This mapped really well to the original results and produced what I hope is an interesting tutorial.

There was another reason I chose this paper too – Groot and Rabone also looked at the effect of strain on the membrane. This wasn’t possible with the old DPD module but, using some MaterialsScript, I could strain the system and then calculate the surface tension. As I like to dabble with MaterialsScript, the lure of this was irresistible and the script I made is available from the Accelrys Community website.

One minor issue was that the system sizes and relative amounts were not clear from the paper. Luckily there are several ex-Unilever people at Accelrys, so one of my colleagues, Dr Neil Spenley, contacted one of the authors, Dr Robert D. Groot. I was also lucky that Dr Groot obviously hoards his old research work and, a day later, I had the original DPD input file in my hands!

The results from the strain calculations also show the same trends as those reported by Groot and Rabone so I was pretty happy with this work.

So that wraps up my blogs on Materials Studio 5.0. I hope to have given you an insight into some of the processes that go into making a new version of Materials Studio.
