Discover New Materials for Batteries Through Modeling

March 12th, 2010 by Accelrys Team

In the 21st century, materials and energy are more topical than ever before. Insights at the atomistic and quantum level help us to design cleaner energy sources, and find less wasteful ways of using energy. Join us on March 16th as Dr. George Fitzgerald presents “High-throughput Quantum Chemistry and Virtual Screening for Lithium Ion Battery Electrolyte Materials.”

Register to learn:

  • How modeling can support the discovery of components to enhance the performance of lithium ion battery formulations
  • How to use Materials Studio components in Pipeline Pilot to analyze and screen a materials structure library for Li-Ion battery additives
  • Results from a collaboration with Mitsubishi Chemical Inc., which were also published in the Journal of Power Sources

This presentation is part of our ongoing webinar series that showcases how Accelrys products and services are transforming materials research. You can download related archived presentations in this series or register for future webinars.

We look forward to sharing our insights with you throughout this webinar series.


Making Sense of the Cloud for Science: Part 3

March 11th, 2010 by Conrad Agramont

Cloud and Managed Services

In the first part of this series, we discussed the basic collection of cloud offerings and what type of value they provide to IT, Developers, and Customers.  The second part explained some of the Business Issues when leveraging the cloud from within your Enterprise environment.  In this post, we’ll focus more on the various service models that are associated with the Cloud.

Even within the Application Service Provider (ASP) category, there will be a range of providers.  Let’s take Accelrys Pipeline Pilot (PP) for example.  It’s a product that provides rich data-flow capabilities and specializes in scientific computing.  Today, PP is mostly deployed on-premise, either by the Research & Development (R&D) Information Technology (IT) department or by a group of scientists.  PP makes using and managing the platform in either of these scenarios extremely easy, yet powerfully scalable.  Regardless of how easy it is to manage PP, there are other concerns you must address when managing any platform or application.  These include maintenance, backup and recovery, security, data management, etc.  Not to mention supporting an ever-growing user base also looking to leverage Pipeline Pilot.  This could result in time being taken away from your main business driver: Science!

Taking the step to move your basic deployment into the “Cloud,” such as Amazon Web Services (AWS), is an interesting first step.  Now you don’t have to worry about the Operating System and everything underneath it (e.g. hardware, cooling, power, etc.), but you’re still left with everything else.  This is where Application Service Providers (ASPs) come into play.  An ASP can come in different packages.  For one, the ASP could actually be a group internal to your business. OK, so they’re not “really” an ASP, but they could function as one as they provide the service for a given cost and they’re not directly tied to your organization.  Hey, could this be Corporate IT?  Sure, or perhaps another scientific group within your business offering their investment to another team and doing cross charging to offset the costs.  And by the way, doing this in the cloud to remove the burden and cost from IT to manage it.  Perhaps this scenario has too many moving parts for your fancy.  I’ll move on.

A more traditional ASP manages the application and perhaps even provides application level support.  Taking Pipeline Pilot as an example again, providing application support really comes in two flavors.  The first is supporting the application platform and tools themselves; for instance, if you’re writing a protocol (a set of tasks in a data pipeline) or running an application built on PP.  The other is more focused on the science itself and relating it to the product.  While there may be many that could help with the PP Platform, Infrastructure, and even the tools, it’s a big leap to also support the science.  The key here is that, when shopping for a cloud vendor or ASP, you should look at the breadth of services you’ll get from them and anticipate your need for science, application, and infrastructure support.  Not to mention the difference in cloud infrastructure that requires a Message Passing Interface (MPI) infrastructure (more on that later).

If you’re in any stage of interest, planning, evaluating, or deploying Accelrys products or other scientific applications in the Cloud, we’d love to hear from you!  As the leading provider of Scientific Informatics Solutions, we’re interested in supporting our customers no matter where their environment is – at home or in the cloud.  Visit our forums to continue the discussion: http://accelrys.org/

To view all Conrad’s Cloud Series posts, please click here.


Making Sense of the Cloud for Science: Part 2

February 26th, 2010 by Conrad Agramont

Business Issues in the Cloud

In the first part of this series, we discussed the basic collection of cloud offerings and what type of value they provide to IT, Developers, and Customers.  In this post, we’ll focus more on the business issues when leveraging the cloud.

One of the biggest hurdles in leveraging Cloud Services is securing and transporting the data.  There’s no single answer or solution to resolve these issues, and there is no shortage of webinars, papers, conferences, etc. that focus on this, so I don’t think I need to dig into that (just yet).  But what’s important to recognize is that all of the Cloud vendors, security experts, and network providers are working to provide an answer that both meets your business and technical requirements and earns your trust.  The best way to get over that hump is to learn more and try it out.

First, try the cloud on non-critical but impactful tasks.  Then start to increase your usage of critical data, connect directly to internal data, and perform tasks that provide real business value.  This isn’t an original approach since it’s pretty much the typical evaluation or Proof of Concept (POC), but that’s exactly the point!  Driving a project like this involves more than just technology, as you’ll most likely involve many people within your organization, such as Legal, Finance, IT, and Security, in order to plan and complete the project.  There will be lots of concerns from these various groups, many reasonable and some that just require lots of education.  So make sure you invest in educating them on the basics of the Cloud first.  This will make the rest of the process much smoother, but not easier.

Second, you’ll need to consider network bandwidth usage and data storage costs.  All of the cloud vendors charge some sort of fee for uploading, downloading, and storing your data.  When you first look at this, it’s pennies per GB, but when dealing with large data volumes and data transactions (e.g. reading and writing across the network) those costs can get pretty high.  So your first thought will be that cloud pricing is extremely high, but what you may not be factoring in is everything the cloud vendor is doing for you beyond just the price of the disk, network, cooling, and power.  The cloud vendors typically offer a high Service Level Agreement (SLA), which includes data replication, de-duplication, resiliency, continuity, and more.  Not to mention the staff, planning, and operations to make all of that happen.  If you compared that to your own infrastructure and added those costs to your internal per-GB cost of storage, you’d most likely see that the Cloud is more affordable, but that assumes you’re meeting the same level of SLA and process as the Cloud vendors, which most organizations are not.  That said, there are some applications and data that may not be a good fit for many of the cloud vendors because of the special nature of the application, massive data size with high volume transactions, high throughput requirements, legal requirements, and more.  But this is starting to be the exception rather than the rule.

Many organizations are making the leap of putting their most trusted data into the cloud, and some are doing it without realizing the significance. Email and sales force automation have been leading the charge in hosted applications and Software as a Service (SaaS) deployments.  Now think of it this way: if you can store all of your communications and customer records on the cloud, why can’t you do more?  By taking this leap, businesses start to build trust in external parties maintaining and operating their business critical services.  In a recent report, Goldman Sachs notes that customers see a “shift towards cloud unstoppable”.  http://news.cnet.com/8301-13846_3-10453066-62.html The trend towards cloud services and applications won’t be a complete rip and replace; businesses will look to the cloud as an extension of their overall enterprise architecture and infrastructure.

When comparing the many Cloud/IaaS vendors in the market today, you’ll see they’re already moving towards mass commodity price points and common functionality.  And that’s great if you want to take a piece of existing traditional on-premise software and simply deploy it to the cloud.  What you have to look out for are pitfalls in the software license, security, deployment architectures, and the fact that you’re still responsible for managing that software in the “Cloud”.  So the next layer to look for is a Services vendor that can deliver you the application.  This can at times come from the vendor directly or through a partner network supported by the vendor.  Each has its own value proposition and differences in how flexible it can be in delivering additional custom services.   Again, this type of application + service model isn’t new, as the Application Service Provider (ASP) model has been around for years.  What’s new is that these ASPs can still provide lots of value and cost reduction to the customer, but now by leveraging computing and storage that is provided by a “Cloud” offering (e.g. AWS).

In the next part of this blog series, we’ll focus more on the various services models that will be available to customers based on a cloud version of Pipeline Pilot.

To view all Conrad’s Cloud Series posts, please visit: http://blog.accelrys.com/author/conrad/


Making Sense of the Cloud for Science: Part 1 continued

February 22nd, 2010 by Conrad Agramont

I last left you with a list of terms and examples surrounding the term “cloud computing;” now it’s time for a little context.  Utility Computing, such as Amazon Web Services (AWS) Elastic Compute Cloud (EC2), provides a customer with the ability to spin up new machines on-demand.  From the customer side, you don’t care what machine it’s on but you do get to define the type of resources you want to consume such as CPU cores and Memory.  So far this sounds just like Hosting, right?  Correct!  What’s different is that you don’t have to sign a long term contract for that resource AND you’re not tied to that actual hardware since in the background it’s really just a Virtual Machine.  Now this is where it gets interesting.  Hosting has been around for a while, but now that Server Virtualization technologies such as Microsoft Hyper-V and VMware vSphere have become mature, they enable the flexibility and architectures of Cloud Computing.  And since this Server Virtualization is available to Enterprises, this is where you hear the term “Private Cloud” being added to the Enterprise mix.

Now let me quickly tackle a common question.  “What’s the difference between Amazon Web Services, Microsoft Azure, and Salesforce?  Aren’t they all the same?”  First off, this is a great question, but it’s really comparing apples, oranges, and tomatoes.  Yes, those are all fruits, but each provides something very different to the consumer.  Where Clouds are different from fruit is that you can layer some of the clouds to deliver a service.  Remember that AWS is a Utility.  Microsoft Azure is a resource targeted towards developers.  Developers are different from IT and therefore have different requirements.  They like to write applications that typically consume some data and provide a User Interface.  They don’t want to be bothered with patch management, monitoring systems, deployment of servers, etc.  Microsoft Azure abstracts this from the developer.  They instead write to the “Fabric” of the Cloud Computing platform that Microsoft manages, which allows the developer to focus on what they do best.  Finally, with Salesforce.com it’s even further abstracted.  You still have developers that can write applications based on Salesforce.com, but the developer is given even more constraints on what they can develop and how it can be implemented.

OK, enough of the Cloud Tutorial, but hopefully you now have an understanding that there are many different types of clouds and how they can be used.  Are there challenges to adoption? You bet!  But there are always challenges when adopting technology.  While the above was about the technology, there are a number of business issues, concerns and questions that need to be addressed as well.  For many organizations, one of the biggest hurdles is securing and transporting the data.

In the coming weeks, we’ll provide an update on our roadmap for leveraging, supporting, and providing guidance on using Cloud Computing and Virtualization technologies.  Accelrys has already been moving forward to partner with a number of Cloud vendors, Service Providers, and third-party software vendors to ensure our customers have the power of choice, delivery models, and a clear path to leverage Accelrys products in the cloud.

If you’re in any stage of interest, planning, evaluating, or deploying Accelrys products or other scientific applications in the Cloud, we’d love to hear from you!  As the leading provider of Scientific Informatics Solutions, we’re interested in supporting our customers no matter where their environment is – at home or in the cloud.  Visit our forums to continue the discussion: http://accelrys.org/

In the next part of this blog series, I’ll focus on the Business Issues found with leveraging the cloud.

To view all Conrad’s Cloud Series posts, please visit: http://blog.accelrys.com/author/conrad/


Making Sense of the Cloud for Science: Part 1

February 19th, 2010 by Conrad Agramont

Cloud Technologies – A Primer

The scientific community is seeing an explosion of outsourcing, collaboration, massive data production and consumption, and financial pressures.  Driven by these challenges, Research & Development Information Technology (R&D IT) and even the scientists themselves are looking to the potential of Cloud Computing to enable an increase in science innovation and allow R&D IT to provide higher valued service along with reduced costs.  Cloud Computing isn’t a “silver bullet” to solve these challenges, but it does provide the tools to address many of these key business drivers.

I’m sure many of you have seen the benefits of the cloud such as cost reduction, cost management, on demand, and scalability.  But what does this mean in the context of a Scientist using a product such as Pipeline Pilot?  Before we can get into the specifics of how Cloud Computing will provide value to a Science organization, let’s first get the terminology straight.  This won’t be a deep dive into each area, but just a quick primer.

First off let’s just all agree that “Cloud Computing” is a pretty generic term and it actually comes in many different forms.  Here are some terms loosely used and thrown around with common examples:

  • Platform Virtualization – Virtualization of computers or operating systems. It hides the physical characteristics of a computing platform from users, instead showing another abstract computing platform.
    • VMware vSphere, Microsoft Hyper-V, Citrix XenServer
  • Grid Computing – Combination of computer resources from multiple administrative domains applied to a common task, usually to a scientific, technical or business problem that requires a great number of computer processing cycles or the need to process large amounts of data.
    • Microsoft HPC, Sun Grid
  • Managed Hosting – A dedicated hosting service, dedicated server, or managed hosting service is a type of Internet hosting in which the client leases an entire server not shared with anyone.
  • Utility Computing (Cloud) – Packaging of computing resources, such as computation and storage, as a metered service similar to a traditional public utility (aka Infrastructure as a Service).
    • Amazon EC2, Rackspace Cloud, GoGrid
  • Platform as a Service (PaaS) – a computing platform and/or solution stack as a service, generally consuming cloud infrastructure and supporting cloud applications.
    • Microsoft Azure Services, Google App Engine, Rackspace Cloud Apps
  • Software as a Service (SaaS) – model of software deployment whereby an Application Service Provider (ASP)  licenses an application to customers for use as a service on demand
    • Salesforce.com
  • Software plus Services (S+S) – combining hosted services with capabilities that are best achieved with locally running software.
    • Microsoft Exchange Hosted Services, Google Message Labs

That’s a pretty quick and dirty listing of terms, so I’ll add a little context next time…

To view all Conrad’s Cloud Series posts, please visit: http://blog.accelrys.com/author/conrad/


Welcome! The Discovery Studio Doors are Wide Open.

February 16th, 2010 by Accelrys Team

A new year, a new website, and a new mechanism to support our users!

We have just launched an initiative to help scientists around the world with support for Discovery Studio and its integration with Pipeline Pilot through the New Discovery Studio Open Hour!

Discovery Studio Open Hour is an open session hosted by an Accelrys scientist to answer any questions or queries you might have with regard to Discovery Studio. These sessions are completely open and FREE to attend and do not require any previous registration at all.

Drop in anytime and stay as long as you like. These sessions are open for 1 hour and you can drop in for 10 minutes or you can stay for the entire hour. Science is never black or white, so if you feel like brainstorming an idea or need to get advice on a workflow, dial in and you’ll be connected to an expert! New to Discovery Studio and don’t really know how to take advantage of this powerful architecture? Our support scientists will help demonstrate how customized solutions can be easily developed. And with the growing number of custom DS scripts and protocols on our Accelrys Community forum, you may just find what you are looking for and the DS Open Hour would be a great time for further discussions!

Have a question about the FREE DSVisualizer? This might be the best forum to get started or ask questions and have fun learning about tips and tricks with the product!

For now, sessions are hosted on the second Tuesday of each month in 2010 (Jan 12, Feb 9, March 9, April 13, May 11, June 8, July 13, Aug 10, Sept 14, Oct 12, Nov 9, Dec 14), and the website has all the WebEx and conference call details you’ll need.  Add the dates to your calendar …  we’ll see you then!


Welcome to the New Accelrys.com!

February 13th, 2010 by Accelrys Team

Well, we think you would probably agree…it was time for a re-model.  We’re excited to open the doors to the new and improved Accelrys.com.

Old Site

New Site

In addition to a new look and feel, we think you’ll find it easier and faster to locate information.  We’ve organized content by a number of different categories – by Area of Science, by Scientific Need, by Industry and by Product – and added helpful ‘Next Steps’ and contextually relevant resources on every page.

One of our favorite additions is the use of video throughout the site.  These videos feature members of our Accelrys team, including our Chief Science Officer, Frank Brown; our head of Research and Development, Matt Hahn; and Lalitha Subramanian, who leads our Contract Research group.  We think video is a great way for you to get to know our products and our team.

Check out the Flash video demonstrating Pipeline Pilot – our scientific informatics platform.  It’s a quick 3 minute overview that sums up the measurable impact that Pipeline Pilot can have on your research process (Homepage – first video on the left).  This library will continue to grow with not just interviews, but product demos, so check back often.  And of course, there is our Blog which features active commentary from our team on a range of topics and trends impacting the scientific community.

We hope you like the new site.  Surf’s Up!


Less-Naïve Bayesian Classification

February 10th, 2010 by Dana Honeycutt, Ph.D.

The Bayesian learner in Pipeline Pilot is a so-called naïve Bayesian classifier. The “naïve” refers to the assumption that any particular feature contributes a specific amount to the likelihood of a sample being assigned to a given class, irrespective of the presence of any other features. For example, the presence of an NH2 group in a compound has the same effect on predicted activity whether or not there is also an OH or COOH group elsewhere in the compound. In other words, a naïve Bayesian classifier ignores interaction effects.
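To make the independence assumption concrete, here is a minimal Python sketch of a generic naïve Bayesian scorer. It is not the Pipeline Pilot learner: the add-one smoothing, the function names, and the toy feature labels are all assumptions made for illustration, and absent features are simply ignored to keep it short.

```python
import math
from collections import Counter

def train(samples):
    """samples: list of (set_of_features, is_active) pairs.
    Returns one log-odds weight per feature (add-one smoothing, an assumption)."""
    active, inactive = Counter(), Counter()
    n_active = n_inactive = 0
    for features, is_active in samples:
        if is_active:
            n_active += 1
            active.update(features)
        else:
            n_inactive += 1
            inactive.update(features)
    vocab = set(active) | set(inactive)
    return {
        f: math.log((active[f] + 1) / (n_active + 2))
           - math.log((inactive[f] + 1) / (n_inactive + 2))
        for f in vocab
    }

def score(weights, features):
    # Naive assumption: each feature adds its own weight, regardless of the others.
    return sum(weights.get(f, 0.0) for f in features)

# Hypothetical toy training set using substructure names as feature labels.
samples = [
    ({"NH2", "OH"}, True),
    ({"NH2"}, True),
    ({"OH", "COOH"}, False),
    ({"COOH"}, False),
]
weights = train(samples)

# The effect of adding NH2 is the same whether OH or COOH is also present.
print(score(weights, {"NH2", "OH"}) - score(weights, {"OH"}))
print(score(weights, {"NH2", "COOH"}) - score(weights, {"COOH"}))
```

Both printed differences equal the NH2 weight (up to floating-point rounding), which is exactly what "ignoring interaction effects" means in practice.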

We know that in reality, interaction effects are quite common. Yet, empirically, naïve Bayesian classification models are surprisingly accurate (not to mention that they are lightning-fast to train).

But perhaps there are cases where a model with interactions would be better. How might we make the Bayesian learner less naïve? If we use molecular fingerprints as descriptors, one simple approach is to create a new fingerprint by pairing off the original fingerprint features and adding them to the list. We can then train the model on the new fingerprint with its expanded feature list.

A sparse molecular fingerprint (such as the Accelrys extended-connectivity fingerprints) consists of a list of feature IDs. These IDs are simply integers corresponding to certain substructural units. E.g., “16” might refer to an aliphatic carbon, while “7137126” might refer to an aryl amino group. So if our original fingerprint has the following features:

16
85
784
12662
...

our fingerprint-with-interactions would have the above features with the following ones in addition:

16$85
16$784
16$12662
85$784
85$12662
...

The “$” is just an arbitrary separator between the feature IDs. A Bayesian learner works by simply counting the features present in the two classes of samples (e.g., “active” vs. “inactive”), so the feature labels are unimportant, as long as they are unique.
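The pairing step can be sketched in a few lines of Python. This is only an illustration of the idea described above, not the Pipeline Pilot protocol itself; the function name and the choice to sort IDs before pairing (so that each pair gets one canonical label) are assumptions.

```python
from itertools import combinations

def with_pair_features(feature_ids, sep="$"):
    """Return the original feature labels followed by one label per unordered pair."""
    ids = sorted(set(feature_ids))          # canonical order: 16$85, never 85$16
    singles = [str(f) for f in ids]
    pairs = [f"{a}{sep}{b}" for a, b in combinations(ids, 2)]
    return singles + pairs

fingerprint = [16, 85, 784, 12662]
expanded = with_pair_features(fingerprint)
print(expanded)
```

Note that a fingerprint with n features gains n(n-1)/2 pair features, so the expanded list grows quadratically: the four features here become ten, and a molecule with 100 features would end up with 5,050.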

To test the approach, I applied it to models of the Ames mutagenicity data that I discussed in a previous posting, and to an MAO inhibitor data set. Does it work? The short answer is, “Yes, with caveats.” Read my posting on the Pipeline Pilot forum for details (registration is free).


Lies, damned lies and (Oracle) statistics*

February 3rd, 2010 by Ian Buchan

While investigating the costs of joins between tables in Oracle, I came across the following, seemingly curious, result.  I had two tables that were identical in content and layout, each with indexes on the same columns but when I ran the same search on both tables, the query on one table was consistently more than 25% faster than the same query on the other.  “You must have done something differently” you cry.  Well, it wasn’t exactly obvious…

Let’s start at the beginning.  I produced 2 identical tables containing a 10,000 record sample of CAP (Chemicals Available for Purchase) using the same Pipeline Pilot protocol.  The tables differed in name only: one was CapSample, the other CapSample2.  I created indexes on the CLogP and Num_H_Acceptors columns of both tables and then timed the SQL query:

SELECT count(*) FROM CapSample WHERE CLogP>5 and Num_H_Acceptors>10

over 1,000 iterations on each table (replacing CapSample with CapSample2 as appropriate).  My intention was to then measure the time of the search taking CLogP from one table and Num_H_Acceptors from the other table, joining them by the primary key CardRef column.  However the search on CapSample consistently took about 3.85 seconds per 1000 iterations while the same search on CapSample2 consistently took about 2.79 seconds.  I was the only user on the machine and I kept re-running and switching between CapSample and CapSample2 and the results were consistent.  Weird!

The first thing was to examine the execution plans.  Aha! They were different.  Both were using hash joins on the two indexes, but the order of the two index range scan searches was different for the two tables.  Obviously, the CapSample2 order was better.  But why wasn’t it choosing it for CapSample?  At this point, I noticed a note at the end of the explain plan output for CapSample2:

Note
-----
   - dynamic sampling used for this statement

This wasn’t there for CapSample.  Why not?  Because I’d imported CapSample the day before and only created CapSample2 today!  During the night the statistics had been gathered automatically on CapSample.  I’d only added the indexes after creating CapSample2, so the indexes on CapSample had no statistics, even though the table did.

All I had to do was gather default statistics for both tables again.  Then, being careful to slightly change my SQL so that I didn’t hit any cached plans, I re-explained the queries on both tables and bingo! I got consistent results and they matched those for the fast search of CapSample2.  Running the searches on both tables now gave me the 2.79 seconds I’d seen earlier.

As a final sanity check, I re-timed the search over CapSample using the original SQL and I got the original time of 3.85 seconds again.  I was hitting the cached plan: Oracle used it even though the statistics had changed.  It seems weird running two queries that look identical except for an extra space character and finding that one runs over 25% faster than the other, but that’s what happens when you have cached plans.

So the moral(s) of this tale are:

1. When you change tables significantly or add indexes, gather table statistics for the changed tables and gather index statistics for changed or new indexes.

2. Oracle’s dynamic sampling can be very good.  However, you might want to gather proper statistics immediately after changes if you are automatically gathering statistics on your tables.  Otherwise, you could find the plan changes later (when cached plans are replaced).

3. Remember to either clear cached plans or change the SQL statement slightly after you have gathered new statistics to avoid hitting old cached plans.
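The cached-plan effect behind the third point can be mimicked with a toy Python model. This is emphatically not how Oracle is implemented; it only assumes, as observed above, that a plan is reused whenever the statement text matches exactly. The plan labels and function names are hypothetical.

```python
# Toy model only: a plan cache keyed by the exact SQL text.
plan_cache = {}

def choose_plan(index_stats_present):
    # Hypothetical labels standing in for the two index-scan orders.
    return "fast_order" if index_stats_present else "slow_order"

def get_plan(sql_text, index_stats_present):
    # A plan is computed once per distinct statement text, then reused.
    if sql_text not in plan_cache:
        plan_cache[sql_text] = choose_plan(index_stats_present)
    return plan_cache[sql_text]

sql = "SELECT count(*) FROM CapSample WHERE CLogP>5 and Num_H_Acceptors>10"

before = get_plan(sql, index_stats_present=False)              # plan built without index statistics
after_same = get_plan(sql, index_stats_present=True)           # same text: old plan is reused
after_changed = get_plan(sql + " ", index_stats_present=True)  # extra space: new plan

print(before, after_same, after_changed)
```

In this model the original statement keeps its old plan after statistics are gathered, while the statement with a trailing space is treated as new and gets the better plan, mirroring the 3.85 vs. 2.79 second timings in the post.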

*See http://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statistics


Ensemble Models for Better Predictions

January 22nd, 2010 by Dana Honeycutt, Ph.D.

One approach to building a predictive model is to choose a powerful technique such as a neural network (NN) or support vector machine (SVM) algorithm and then tune the model-building parameters to maximize the predictive performance. Over the past 15 years or so, an increasingly popular alternative has been to combine the predictions of multiple different techniques into a consensus or ensemble model, without necessarily optimizing each individual model within the ensemble. This is the approach that won the million dollar Netflix Prize last year, as well as the zero dollar challenge from the November 2009 Pipeline Pilot newsletter. I’ll be talking about the latter; for details on the Netflix Prize solution, go here.

In brief, the Pipeline Pilot Challenge was to find the model-building technique that gives the best ROC score for a particular classification problem. When we formulated the problem, we figured people would apply the various different learner components in Pipeline Pilot, and probably come up with a solution involving an SVM, Bayesian, or recursive partitioning (RP) model.

But winner Lee Herman took a clever alternative approach. He built four different models using four dissimilar techniques: Bayesian, RP (a multi-tree forest), mixture discriminant analysis, and SVM. For making predictions on the test set, he summed the predictions from each of the models to get a composite score. This ensemble model gave a better ROC score than any of the individual models contributing to it. For details, see Lee’s protocol on the Pipeline Pilot forum (registration is free).

Why does this work? In essence, each type of model captures some aspect of the relationship between the descriptors and what we wish to predict, while having its own distinct errors and biases. To the extent that the errors are uncorrelated between models, they cancel rather than reinforce each other. Thus the accuracy of the whole becomes greater than the greatest accuracy of any of its parts. It’s as if many wrongs can make a right.
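A tiny Python sketch shows the cancellation effect. The scores below are made up for illustration and are not taken from Lee’s protocol or the challenge data; the ROC score is computed as the fraction of (active, inactive) pairs ranked correctly, and summing raw scores assumes the two models produce values on comparable scales.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]

# Two hypothetical models that each misrank a different pair of samples.
model_a = [0.9, 0.4, 0.5, 0.1]   # ranks the second active below the first inactive
model_b = [0.3, 0.8, 0.1, 0.5]   # ranks the first active below the second inactive

# Ensemble: sum the per-sample predictions of the individual models.
ensemble = [a + b for a, b in zip(model_a, model_b)]

print(roc_auc(labels, model_a), roc_auc(labels, model_b), roc_auc(labels, ensemble))
```

Because the two models make their mistakes on different samples, the errors wash out in the sum and the ensemble ranks every active above every inactive. If both models misranked the same pair, summing would not help.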
