Thursday, September 13, 2012

jSeries on BigCompute: Molecular Dynamics

With all the interest and focus on Big Data these days, it's easy to miss the other side of the computational coin: compute-intensive applications. These applications are widely used across scientific domains for simulation and modeling of the physical world. They can also draw on large amounts of data, so there is overlap with Big Data problems (for example, simulating a wind turbine using historical weather data and predictive algorithms).

So today I am starting a blog series, jSeries on BigCompute, to examine applications of HPC in science, engineering, and commercial fields.

Molecular dynamics (MD) is a computer simulation of the physical movements of atoms and molecules. The atoms and molecules are allowed to interact for a period of time, giving a view of their motion. In the most common version, the trajectories of molecules and atoms are determined by numerically solving Newton's equations of motion for a system of interacting particles, where the forces between the particles and the potential energy are defined by molecular mechanics force fields. The method was originally conceived within theoretical physics in the late 1950s and early 1960s, but is applied today mostly in materials science and the modeling of biomolecules.
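To make the integration concrete, here is a minimal sketch in Python (with NumPy) of the velocity Verlet scheme commonly used to integrate Newton's equations in classical MD. The Lennard-Jones potential and every parameter value below are illustrative stand-ins, not a real molecular mechanics force field:

    import numpy as np

    def lennard_jones_forces(pos, epsilon=1.0, sigma=1.0):
        # Pairwise Lennard-Jones forces; the O(n^2) double loop is the
        # non-bonded cost that dominates real MD runs.
        forces = np.zeros_like(pos)
        n = len(pos)
        for i in range(n):
            for j in range(i + 1, n):
                r_vec = pos[i] - pos[j]
                r2 = np.dot(r_vec, r_vec)
                sr6 = (sigma * sigma / r2) ** 3  # (sigma/r)^6
                # F_ij = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6)/r^2 * r_vec
                f = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r2 * r_vec
                forces[i] += f
                forces[j] -= f
        return forces

    def velocity_verlet(pos, vel, mass, dt, n_steps):
        # Numerically solve Newton's equations of motion, a = F / m,
        # one small timestep at a time.
        acc = lennard_jones_forces(pos) / mass
        for _ in range(n_steps):
            pos = pos + vel * dt + 0.5 * acc * dt * dt   # advance positions
            new_acc = lennard_jones_forces(pos) / mass   # forces at new positions
            vel = vel + 0.5 * (acc + new_acc) * dt       # average old/new acceleration
            acc = new_acc
        return pos, vel

    # Toy three-particle system in reduced units (illustrative values only).
    pos = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.5, 0.0]])
    vel = np.zeros_like(pos)
    pos, vel = velocity_verlet(pos, vel, mass=1.0, dt=0.001, n_steps=1000)

Production packages such as NAMD or GROMACS do essentially this, but with far richer force fields and heavy optimization of the non-bonded loop.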

Applications in the Sciences

Molecular dynamics is used in many fields of science.
  • The first macromolecular MD simulation published (1975) was a folding study of Bovine Pancreatic Trypsin Inhibitor (BPTI), one of the best-studied proteins in terms of folding and kinetics. Its simulation, published in Nature, paved the way for understanding protein motion as essential to function rather than merely accessory.
  • MD is the standard method to treat collision cascades in the heat spike regime, i.e. the effects that energetic neutron and ion irradiation have on solids and solid surfaces.
  • MD simulations were successfully applied to predict the molecular basis of the most common protein mutation, N370S, which causes Gaucher disease. A follow-up publication showed that these blind predictions correlate surprisingly well with experimental work on the same mutant published independently at a later point.

Computational Design

The design of a molecular dynamics simulation should account for the available computational power. Simulation size (n = number of particles), timestep, and total time duration must be selected so that the calculation can finish within a reasonable period.

However, the simulations should be long enough to be relevant to the time scales of the natural processes being studied. To draw statistically valid conclusions from the simulations, the time span simulated should match the kinetics of the natural process. Otherwise, it is analogous to drawing conclusions about how a human walks from less than one footstep. Most scientific publications about the dynamics of proteins and DNA use data from simulations spanning nanoseconds (10⁻⁹ s) to microseconds (10⁻⁶ s).
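A quick back-of-the-envelope calculation shows why the gap between femtosecond timesteps (see below) and microsecond phenomena is so punishing. The per-step cost assumed here is a purely hypothetical figure for a large biomolecular system, not a benchmark:

    # Steps needed to simulate 1 microsecond at a 1 femtosecond timestep:
    simulated_span = 1e-6                 # seconds (1 us)
    timestep = 1e-15                      # seconds (1 fs)
    n_steps = simulated_span / timestep   # = 1e9 force evaluations

    # Assume a hypothetical 0.1 s of CPU time per force evaluation:
    cpu_seconds = n_steps * 0.1
    print(f"{n_steps:.0e} steps, ~{cpu_seconds / (3600 * 24 * 365):.1f} CPU-years")
    # -> 1e+09 steps, ~3.2 CPU-years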

To obtain these simulations, several CPU-days to CPU-years are needed. Parallel algorithms allow the load to be distributed among CPUs; examples include spatial and force decomposition algorithms.
  • During a classical MD simulation, the most CPU-intensive task is the evaluation of the potential (force field) as a function of the particles' internal coordinates. Within that energy evaluation, the most expensive term is the non-bonded or non-covalent part.
  • Another factor that impacts the total CPU time required by a simulation is the size of the integration timestep, the length of time between evaluations of the potential. The timestep must be chosen small enough to avoid discretization errors (i.e. smaller than the period of the fastest vibration in the system). Typical timesteps for classical MD are on the order of 1 femtosecond (10⁻¹⁵ s).
  • For simulating molecules in a solvent, a choice should be made between explicit solvent and implicit solvent. Explicit solvent particles must be calculated expensively by the force field, while implicit solvents use a mean-field approach. Using an explicit solvent is computationally expensive, requiring inclusion of roughly ten times more particles in the simulation. But the granularity and viscosity of explicit solvent is essential to reproduce certain properties of the solute molecules. This is especially important to reproduce kinetics.
  • In all kinds of molecular dynamics simulations, the simulation box size must be large enough to avoid boundary condition artifacts. Boundary conditions are often treated by choosing fixed values at the edges (which may cause artifacts), or by employing periodic boundary conditions, in which one side of the simulation loops back to the opposite side, mimicking a bulk phase (see the sketch after this list). (Source: Wikipedia)
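As a small illustration of those periodic boundaries, the sketch below applies the minimum-image convention, under which each pairwise distance is measured to the nearest periodic copy of the partner particle. The cubic box and the specific numbers are assumptions for the example:

    import numpy as np

    def minimum_image(r_vec, box_length):
        # Wrap a displacement vector to the nearest periodic image, so a
        # particle near one wall "sees" neighbours just past the opposite wall.
        return r_vec - box_length * np.round(r_vec / box_length)

    def wrap_positions(pos, box_length):
        # Map coordinates that drifted out of the box back into it.
        return pos % box_length

    # Two particles near opposite faces of a 10 x 10 x 10 box:
    box = 10.0
    r_naive = np.array([9.5, 0.0, 0.0]) - np.array([0.5, 0.0, 0.0])
    print(minimum_image(r_naive, box))  # [-1. 0. 0.]: just 1.0 apart through the wall

Production codes pair this with the spatial decomposition mentioned above, giving each processor its own region of the box so the expensive non-bonded forces can be evaluated in parallel.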

Example Projects:

The following examples are not run-of-the-mill MD simulations. They illustrate notable efforts to simulate a system of very large size (a complete virus) and to reach very long simulation times (500 microseconds):
  • MD simulation of the complete satellite tobacco mosaic virus (STMV) (2006; size: 1 million atoms; simulation time: 50 ns; program: NAMD). STMV is a small, icosahedral plant virus which worsens the symptoms of infection by Tobacco Mosaic Virus (TMV). Molecular dynamics simulations were used to probe the mechanisms of viral assembly. The entire STMV particle consists of 60 identical copies of a single protein that make up the viral capsid (coating), and a 1063-nucleotide single-stranded RNA genome. One key finding is that the capsid is very unstable when there is no RNA inside. The simulation would take a single 2006 desktop computer around 35 years to complete, so it was instead run on many processors in parallel with continuous communication between them.
  • Folding simulations of the Villin Headpiece in all-atom detail (2006; size: 20,000 atoms; simulation time: 500 µs = 500,000 ns; program: Folding@home). This simulation was run on 200,000 CPUs of participating personal computers around the world. These computers had the Folding@home client installed, part of a large-scale distributed computing effort coordinated by Vijay Pande at Stanford University. The kinetic properties of the Villin Headpiece protein were probed using many independent, short trajectories run on CPUs without continuous real-time communication. One technique employed was Pfold value analysis, which measures the probability that a specific starting conformation folds before it unfolds. Pfold gives information about transition state structures and an ordering of conformations along the folding pathway. Each trajectory in a Pfold calculation can be relatively short, but many independent trajectories are needed (see the sketch after this list).
  • Folding@home initiative. Vijay Pande of Stanford University created the Folding@home initiative based on molecular dynamics simulations. At a recent BioIT World Cloud Summit, Pande said that microsecond timescales are where the field is, but millisecond scales are “where we need to be, and seconds are where we’d love to be.” Using a Markov State Model, Pande’s team is studying amyloid beta aggregation with the idea of helping identify new drugs to treat Alzheimer’s disease. Several candidates that inhibit aggregation have already been identified, he said.
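The Pfold analysis from the Villin example lends itself to a compact sketch: start many short, independent trajectories from the same conformation and count how many reach the folded state before the unfolded one. The 1-D random walk below is a purely illustrative stand-in for a real MD trajectory on a folding landscape:

    import random

    def short_trajectory(start, folded=1.0, unfolded=-1.0, step=0.05, max_steps=10_000):
        # Toy stand-in for one short MD run: a random walk along a reaction
        # coordinate with absorbing folded/unfolded states.
        x = start
        for _ in range(max_steps):
            x += random.choice((-step, step))
            if x >= folded:
                return True    # reached the folded state first
            if x <= unfolded:
                return False   # unfolded first
        return False           # timed out; counted as not folded (a choice)

    def estimate_pfold(start, n_trajectories=10_000):
        # Pfold = fraction of independent trajectories that fold before
        # unfolding. The runs never talk to each other, which is why this
        # style of analysis suits loosely coupled networks like Folding@home.
        folded = sum(short_trajectory(start) for _ in range(n_trajectories))
        return folded / n_trajectories

    print(estimate_pfold(start=0.0))   # ~0.5: halfway between the two states
    print(estimate_pfold(start=0.5))   # ~0.75: biased toward the folded state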

Links:

Update:
  • 2012.07.10 - original post
  • 2012.09.14 - added exemplified projects 
  • 2012.09.18 - added links, updated projects
