Wednesday, October 10, 2012

IBM Launches New PureSystems For Big Data Analytics

IBM introduced its new PureData System, a new member of its family of expert integrated PureSystems, powered by analytics and other technology for wrangling big data.


IBM is expanding its easy-to-implement, high-performance PureSystems platform with the release today (2012.10.09) of platforms for transactions, big data analytics, and operational analytics systems that can monitor more than 1,000 business operations for tasks such as fraud detection in credit card processing, support for customer service representatives (CSRs) in call centers, and predicting demand for utilities.

PureSystems consist of high-density boxes configured by IBM with x86 or Power processors, storage, and network connections all wired and ready to install in a data center in hours rather than months. IBM says the PureSystems family is the result of $2 billion in R&D and acquisitions over four years. One key acquisition was the big data analytics company, Netezza.


IBM calls PureSystems expert integrated systems; other companies refer to similar offerings as engineered systems. In any event, the concept seems relatively simple and overdue: the company manufactures the system and delivers specific types of computing power rather than leaving a client to buy the pieces for do-it-yourself assembly. It’s the difference between buying a car and having UPS deliver a series of crates with parts from a variety of suppliers. It’s an update of the old adage about what clients want: they don’t want a server or a SAN, they want answers.

The three systems launched today are aimed at gaining value from big data. (I covered PureSystems for applications in a story a week or two ago.)

Phil Francisco, IBM vice president of big data product management, said the three systems are meant to help clients deal with data-center applications.

PureData System for Transactions is aimed at retail and credit card processing environments that depend on rapid handling of transactions and interactions. These transactions may be small, but their volume and frequency require fast and efficient environments. Francisco said the new system can provide a 5X performance improvement, partly through work IBM has done in high-performance storage.

Francisco said that a global bank deploying the transaction system has estimated it is saving £2 million in overall costs.

PureData System for Analytics builds on the Netezza acquisition to deliver results in seconds or minutes rather than hours. It has the largest library of database analytical functions in the data warehouse market, said Francisco, and can scale across the terabytes or petabytes running on the system. It can support extremely high-volume, high-speed analytics for clients like mobile phone carriers who want to identify potential churn and provide offers to retain customers.

“It allows very targeted, almost one-on-one marketing with predictive analytics and spatial analytics,” Francisco said.

The Pure brand is also proving successful in developing markets, said Pete McCaffrey, director of marketing for PureSystems.

“We are getting a lot of pickup and exploitation in growth markets; we have clients in Malaysia, India, Brazil and Hong Kong. In many cases they are using the technology to accelerate their ability to get to cloud-based infrastructure.” The company sees software vendors using Pure to provide their applications through a Software as a Service (SaaS) model and service providers setting up cloud models to attract customers who don’t want the capital costs and operational responsibilities of running their own data centers.

“In hyper growth markets they are quickly building out their infrastructures; this is a way for them to get new service running quickly.”


Delivering an assembled chassis designed for rapid deployment of applications or analytics also appeals to markets where technological skills are in short supply.

“We have bundled and integrated the systems and built in a lot of expertise,” added McCaffrey. “When you think about the skills in various parts of the world, they don’t necessarily have the deep technical skills around technology domains, so they like the delivery of something completely integrated.”

In Malaysia, DynaFront Systems has used PureSystems to support insurance carriers that want to reach clients online. The company expects that more than four million insurance users will be on the system within the first year. It estimates that Pure reduced integration complexity by 50 percent and will allow it to launch new products in days rather than months.

Francisco said one client reported it was able to build and deliver more applications in the last 90 days than in the previous three years.

“The operational personnel got out of day-to-day tuning and moved into building value-add applications for clients,” he said. New applications on a platform designed for big data mean that firms can achieve results without employing armies of technologists.

This contrasts with taking a general-purpose solution and trying to adapt it to a very high-performance application, which can require extensive tuning.

“What we have done in the PureData platform is build in the expertise for those three types of workloads.”



Extra-Link:

Thursday, September 27, 2012

jSeries on BigCompute #5: Cosmology

This is the fifth article in the BigCompute jSeries, in which we explore the application of high-performance computing in various fields of science and business.

Cosmology is the most ambitious of the sciences. Its goal, plainly stated, is to understand the origin, evolution, structure and ultimate fate of the entire universe, a universe that is as enormous as it is ancient. Modern cosmology is dominated by the Big Bang theory, which attempts to bring together observational astronomy and particle physics.
As the science of the origin and development of the Universe, cosmology is entering one of its most scientifically exciting phases. Two decades of surveying the sky have culminated in the celebrated Cosmological Standard Model. While the model describes current observations to accuracies of several percent, two of its key pillars, dark matter and dark energy—together accounting for 95% of the mass energy of the Universe—remain mysterious.

Observing and Mapping the Universe

Surprisingly, figuring out what the universe used to look like is the easy part of cosmology. If you point a sensitive telescope at a dark corner of the sky, and run a long exposure, you can catch photons from the young universe, photons that first sprang out into intergalactic space more than ten billion years ago. Collect enough of these ancient glimmers and you get a snapshot of the primordial cosmos, a rough picture of the first galaxies that formed after the Big Bang. Thanks to sky-mapping projects like the Sloan Digital Sky Survey, we also know quite a bit about the structure of the current universe. We know that it has expanded into a vast web of galaxies, strung together in clumps and filaments, with gigantic voids in between.

Building a New Universe

The real challenge for cosmology is figuring out exactly what happened to those first nascent galaxies. Our telescopes don't let us watch them in time-lapse; we can't fast forward our images of the young universe. Instead, cosmologists must craft mathematical narratives that explain why some of those galaxies flew apart from one another, while others merged and fell into the enormous clusters and filaments that we see around us today. Even when cosmologists manage to cobble together a plausible such story, they find it difficult to check their work. If you can't see a galaxy at every stage of its evolution, how do you make sure your story about it matches up with reality? How do you follow a galaxy through nearly all of time? Thanks to the astonishing computational power of supercomputers, a solution to this problem is beginning to emerge: You build a new universe.

Projects & Systems

In October, the world's third-fastest supercomputer, Mira, is scheduled to run the largest, most complex universe simulation ever attempted. The simulation will cram more than 12 billion years' worth of cosmic evolution into just two weeks, tracking trillions of particles as they slowly coalesce into the web-like structure that defines our universe on a large scale. Cosmic simulations have been around for decades, but the technology needed to run a trillion-particle simulation only recently became available. Thanks to Moore's Law, that technology is getting better every year. If Moore's Law holds, the supercomputers of the late 2010s will be a thousand times more powerful than Mira and her peers. That means computational cosmologists will be able to run more simulations at faster speeds and higher resolutions. The virtual universes they create will become the testing ground for our most sophisticated ideas about the cosmos.
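
To give a feel for what such a simulation does at its core, here is a deliberately tiny, direct-summation gravitational N-body sketch in Python. It is not the method used by Mira-class codes, which rely on tree and particle-mesh algorithms to reach trillions of particles; the particle count, units, softening length and timestep below are all illustrative assumptions.

```python
import numpy as np

G = 1.0           # gravitational constant in code units (assumption)
SOFTENING = 0.05  # softening length to avoid singular forces at tiny separations

def accelerations(pos, mass):
    """Direct-summation gravitational acceleration, O(N^2) in the particle count."""
    diff = pos[None, :, :] - pos[:, None, :]          # r_j - r_i for every pair
    dist2 = (diff ** 2).sum(axis=-1) + SOFTENING ** 2
    inv_d3 = dist2 ** -1.5
    np.fill_diagonal(inv_d3, 0.0)                     # no self-interaction
    return G * (diff * inv_d3[:, :, None] * mass[None, :, None]).sum(axis=1)

def leapfrog(pos, vel, mass, dt, steps):
    """Kick-drift-kick integration of the N-body system."""
    acc = accelerations(pos, mass)
    for _ in range(steps):
        vel += 0.5 * dt * acc
        pos += dt * vel
        acc = accelerations(pos, mass)
        vel += 0.5 * dt * acc
    return pos, vel

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 512                                  # a toy particle count
    pos = rng.standard_normal((n, 3))
    vel = np.zeros((n, 3))
    mass = np.full(n, 1.0 / n)
    pos, vel = leapfrog(pos, vel, mass, dt=0.01, steps=100)
    print("center of mass:", (mass[:, None] * pos).sum(axis=0))
```

Production cosmology codes replace the O(N^2) force loop above with approximations that scale closer to O(N log N), which is what makes trillion-particle runs feasible on machines like Mira.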


Extra Link:

Cross-Link:

Update:
  • 2012.09.27 - original post 

Knome Launches Genome Supercomputer

Knome, the informatics company co-founded by George Church that bills itself as the “human genome interpretation” company, is launching a “genome supercomputer” to enhance the interpretation of genome sequences.  The first Knome units will process one genome/day, but with headroom for much higher throughput later on. 

Designed chiefly to run Knome’s kGAP genome interpretation software, the compute system is meant, metaphorically perhaps, to sit next to a sequencing instrument, and has been soundproofed for that purpose. The unit weighs in at two pounds shy of 600 pounds, and comes with a starting price tag of $125,000, according to the BioIT World report.

Notable Quotes:
 
“The advent of fast and affordable whole genome interpretation will fundamentally change the genetic testing landscape,” commented Church, Harvard Medical School professor of Genetics. “The genetic testing lab of the future is a software platform where gene tests are apps.”  

The launch of the so-called genome supercomputer represents “an evolution of our thinking,” says Knome president and CEO Martin Tolar. While the larger genomics research organizations have dedicated teams and datacenters to handle genome data, for the majority of Knome’s clients, Tolar says, “you really want to have integrated hardware and software systems.”  

With some 2,000 next-generation sequencing (NGS) instruments on the market, each close to sequencing a genome a day, Tolar asks: “Why not have a box sitting next to it to do the interpretation?” Ideally, he says, every NGS instrument should have a companion knoSYS 100 nearby. The system is a localized version of Knome’s existing genome analysis and interpretation software. “We’ve localized it, shrunk it, and modified it to work on a local system that sits behind the client’s firewall,” says Tolar. 


Cross-Link:

Wednesday, September 26, 2012

Big Data & Analytics Gone Bad

Nice job from the folks at Radio Free for a series on HPC. This episode focused on Big Data & Analytics, and in particular on how Big Data & Analytics can go bad.

In the video, the hosts talk through a couple of examples of how over-reliance on analytics leads to bad outcomes. The first deals with some high school students who were allegedly found to be cheating by a plagiarism-detection program; the second looks at how a major bank may have lost up to $9 billion due to a lack of proper controls and blind faith in its existing analytic systems.




Update:
  • 2012.09.26 - original post

Cross-Link:

jPage: San Diego Supercomputer Center (SDSC)

Located at University of California San Diego, the San Diego Supercomputer Center (SDSC) enables international science and engineering discoveries through advances in computational science and data-intensive, high-performance computing.

Continuing this legacy into the era of cyberinfrastructure, SDSC is considered a leader in data-intensive computing, providing resources, services and expertise to the national research community, including industry and academia. The mission of SDSC is to extend the reach of scientific accomplishments by providing tools such as high-performance hardware technologies, integrative software technologies, and deep interdisciplinary expertise to these communities.

SDSC was founded in 1985 with a $170 million grant from the National Science Foundation's (NSF) Supercomputer Centers program. From 1997 to 2004, SDSC extended its leadership in computational science and engineering to form the National Partnership for Advanced Computational Infrastructure (NPACI), teaming with approximately 40 university partners around the country. Today, SDSC is an Organized Research Unit of the University of California, San Diego with a staff of talented scientists, software developers, and support personnel.

SDSC is led by Dr. Michael Norman, who was named SDSC interim director in June 2009 and appointed to the position of director in September 2010. Norman is a distinguished professor of physics at UC San Diego and a globally recognized astrophysicist. As a leader in using advanced computational methods to explore the universe and its beginnings, Norman directed the Laboratory for Computational Astrophysics, a collaborative effort between UC San Diego and SDSC.

A broad community of scientists, engineers, students, commercial partners, museums, and other facilities work with SDSC to develop cyberinfrastructure-enabled applications to help manage their extreme data needs. Projects run the gamut from creating astrophysics visualizations for the American Museum of Natural History, to supporting more than 20,000 users per day of the Protein Data Bank, to performing large-scale, award-winning simulations of the origin of the universe or of how a major earthquake would affect densely populated areas such as southern California. Along with these data cyberinfrastructure tools, SDSC also offers users full-time support including code optimization, training, 24-hour help desk services, portal development and a variety of other services.

As one of the NSF's first national supercomputer centers, SDSC served as the data-intensive site lead in the agency's TeraGrid program, a multiyear effort to build and deploy the world's first large-scale infrastructure for open scientific research. SDSC currently provides advanced user support and expertise for XSEDE (Extreme Science and Engineering Discovery Environment), the five-year NSF-funded program that succeeded TeraGrid in mid-2011.

Within just the last two years, SDSC has launched several new supercomputer systems. The Triton Resource, an integrated, data-intensive compute system primarily designed to support UC San Diego and UC researchers, was launched in 2009, along with Dash, the first high-performance compute system to leverage super-sized "flash memory" to accelerate investigation of a wide range of data-intensive science problems. Trestles, a 100-teraflops cluster launched in early 2010 as one of the newest XSEDE resources, is already delivering increased productivity and fast turnaround times to a diverse range of researchers.

In early 2012, SDSC will deploy Gordon, a much larger version of the Dash prototype. With 250 trillion bytes of flash memory and 64 I/O nodes, Gordon will be capable of handling massive databases while providing up to 100 times faster speeds than hard disk drive systems for some queries.

External Link:

Cross-Link:


Supercomputer Gordon for genomics research

I visited the San Diego Supercomputer Center (SDSC) at the University of California, San Diego on Sept 13 (see photo taken at the entrance to the center). It was almost impossible to miss the brochure and poster for a new system that had just come online at SDSC. It was named Gordon.

Gordon is the latest Track 2 system awarded by the National Science Foundation and was built by Appro based on its Xtreme-X architecture.

Gordon entered production in the first quarter of 2012, deploying a vast amount of flash storage to help speed solutions now hamstrung by the slower bandwidth and higher latencies of traditional hard disks. Gordon's "supernodes" exploit virtual shared-memory software to create large shared-memory systems that reduce solution times and yield results for applications that tax even the most advanced supercomputers.

The machine is configured as a 1,024-compute-node (16,384-core) supercomputer cluster, which enables it to perform complex functions for data-intensive applications, including the study of genomics.

The Protein Data Bank (PDB) was among the early entities to apply Gordon's capabilities. The PDB is a worldwide repository of information about the 3D structures of large biological molecules. The PDB group performs predictive science with queries on pair-wise correlations and alignments of protein structures that predict fold space and other properties.

According to an Appro article, in a test configuration of spinning disks and solid-state drives, Gordon helped the PDB determine that SSDs improve query performance by a factor of three compared to spinning disks. Such insight should aid the PDB in its future research on proteins, nucleic acids and other biological molecules.


External Link:

Cross-link:

Monday, September 24, 2012

IBM’s Power 775 wins recent HPC Challenge

Starting out as a government project 10 years ago, IBM Research’s high performance computing project, PERCS (pronounced “perks”), led to one of the world’s most powerful supercomputers, the Power 775. This July, the Power 775 continued to prove its power by earning the top spot in a series of benchmark components of the HPC Challenge suite. 
 
IBM Research scientist Ram Rajamony, who was the chief performance architect for the Power 775, talks about how the system beat this HPC Challenge, as reported in this blog:

"Our response was named PERCS – Productive Easy-to-use Reliable Computing System. From the start, our goal was to combine ease-of-use and significantly higher efficiencies, compared to the state-of-the-art at the time. After four years of research, the third phase of the DARPA project – that started in 2006 – resulted in today’s IBM’s Power 775."


Update:
  • 2012.09.24 - original post

Watson Races to the Cloud

Watson, the Jeopardy-winning supercomputer developed by IBM, could become a cloud-based service that people can consult on a wide range of issues, IBM announced yesterday.

In addition to improving Watson's machine-learning capabilities to increase the range of options the system gives clinicians (including nuancing these to cater for patient preferences, such as choosing chemotherapy that does not cause hair loss), the race is now on at IBM to make the system far more widely available.

"Watson is going to be an advisor and an assistant to all kinds of professional decision-makers, starting in healthcare and then moving beyond. We're already looking at a role for Watson in financial services and in other applications," says John Gordon, Watson Solutions Marketing Manager at IBM in New York.

You can find more about IBM Watson and related topics from the following links:

Links:

Sunday, September 23, 2012

A "Big Data HPC" Initiative: AMPLab

One of the exciting forays into combining big data and HPC is AMPLab, whose vision is to integrate algorithms, machines and people to make sense of big data.

AMP stands for “Algorithms, Machines, and People,” and the AMPLab is a five-year collaborative effort at UC Berkeley, involving students, researchers and faculty from a wide swath of computer science and data-intensive application domains to address the Big Data analytics problem.

AMPLab envisions a world where massive data, cloud computing, communication and people resources can be continually, flexibly and dynamically brought to bear on a range of hard problems by people connected to the cloud via devices of increasing power and sophistication.

Funded by Amazon, Google and SAP, and powered by Berkeley, the group has already established a big data architecture framework and several applications that it has released to open source.

AMPLab also benefits from real-life data center workloads, analyzing the activity logs of front-line production systems with up to thousands of nodes servicing hundreds of petabytes of data.


This post is part of jBook - Big Data HPC

Update:
  • 2012.09.23 - original post

Wednesday, September 19, 2012

A Linkedin Group for Genomic Medicine

I started a LinkedIn group for Genomic Medicine today to begin building an online community for professionals interested in the topic.

Sequencing studies dug deep into lung cancer

Lung cancer is the world's deadliest cancer, causing more deaths than any other form of the disease. About 1.6 million people worldwide are diagnosed with lung cancer each year, with fewer than 20% still alive five years later. Now a trio of genome-sequencing studies published this week in the journal "Nature" is laying the groundwork for more effective personalized treatment of lung cancers, in which patients are matched with therapies that best suit the particular genetic characteristics of their tumors.

Two of the latest studies profiled the genomes of tissue samples from 178 patients with lung squamous cell carcinomas and 183 with lung adenocarcinomas, the largest genomic studies so far performed for these diseases. A third study carried out more in-depth analyses of 17 lung tumours to compare the genomes of smokers and patients who had never smoked.

“For the first time, instead of looking through a keyhole we are getting a penthouse panoramic view,” says Ramaswamy Govindan, an oncologist at Washington University School of Medicine in St Louis, and an author of two of the studies. In the past, he says, researchers studying personalized therapies for lung cancer have mainly focused on a handful of genes, but this week’s studies reveal complex changes across the whole genome.

Govindan says that this first wave of what he calls “cataloguing studies” will help to transform how clinical trials in cancer are performed, with focus shifting to smaller trials in which a greater percentage of patients are expected to benefit from the therapy. Rather than lumping together many patients with diverse mutations, cancer patients will be segregated according to their mutations and treated accordingly. “When you look for more-effective therapies, you don't need larger trials,” he says.

The potential pay-off is clear: targeted therapies designed to address specific mutations can have fewer side effects and be more effective than conventional treatments that simply kill rapidly dividing cells. Several targeted drugs have already been approved for treating adenocarcinoma, which makes up more than 40% of all lung cancers, but none has so far been approved for lung squamous cell carcinoma, another common type, which is currently treated with non-targeted therapies. Among the wide array of mutations that emerged from the study on squamous cell carcinoma are many that could be targeted with drugs that are already on the market or in development for other diseases, says Matthew Meyerson, a genomics researcher at the Dana-Farber Cancer Institute in Boston, Massachusetts, and the Broad Institute in Cambridge, who worked on two of the studies.

The studies reveal new categories of mutations and also show a striking difference between lung cancer in smokers and non-smokers, with smokers’ tumours exhibiting several times the number of mutations as well as different kinds of mutations.

Non-smokers were likely to have mutations in genes such as EGFR and ALK, which can already be specifically targeted with existing drugs. Smokers were particularly likely to have damage in genes involved in DNA repair as well as other characteristic mutations. “These genomes are battle-scarred by carcinogen exposure,” says Govindan.

In addition, the patterns of mutations found in lung squamous cell carcinoma more closely resemble those seen in squamous cell carcinomas of the head and neck than those in other lung cancers. That finding adds further weight to the idea that classifying tumours by their molecular profiles, rather than their sites of origin, will be more effective in picking the right drugs to treat them. Perhaps, for instance, a drug approved for treating breast cancer could be tried in a lung cancer if both carry similar mutations.

And mutations implicated in other cancers did show up in the lung cancers. Overall, these studies reveal lung cancer as an extremely varied disease, says Roy Herbst, chief of medical oncology at the Yale Cancer Center in New Haven, Connecticut. “What amazes me is the heterogeneity,” he says. He foresees the rise of an era of “focused sequencing” over the next year or so, in which clinicians could profile 400 or 500 genes to help guide the course of therapy. Profiling all the genes or all of a patient's genome would provide more data than oncologists could use. But to do this well, he says, mutations need to be linked with more information, such as when and where metastases occurred and how effective the drugs were. Meyerson agrees. “The data that are really going to be informative is when you combine genomic data with outcomes of targeted therapies,” he says.

But lung cancer will still be tough to beat, he warns. For example, tumours usually become resistant to targeted therapies, and picking the best drug to try next would probably require a second genomic analysis.


Source: Nature

Friday, September 14, 2012

Science Pushes the Limit of HPC & Cloud

Cutting-edge scientific research, from high-energy physics to genomic medicine, continues to push the frontier of high-performance computing and, increasingly, of Big Data and cloud computing. This week, a collection of cloud computing and HPC resources and scientific applications were in the spotlight at the Bio-IT World Cloud Summit in San Francisco.

The event was covered in this article from BioIT World.

Research Using Supercomputing
  • Miron Livny - discussed the Open Science Grid (OSG) and its application in the search for the Higgs boson. Future challenges, Livny said, included what he called the “portability challenge” and the “provisioning challenge.” The former was how to make sure a job running on a desktop can also run on as many “foreign” resources as possible. The latter was being addressed by using targeted spot instances in the Amazon cloud, with prices dropping below 2 cents/hour. “Use it when the price is right, get out as fast as possible when the price is wrong,” Livny advised. (A simple price-threshold sketch follows this list.)
  • Jason Stowe (Cycle Computing) reviewed Cycle’s successes in spinning up high-performance computers with 50,000 cores on Amazon, such as a project with Schrodinger and Nimbus Discovery to screen a cancer drug target.  
  • Victor Ruotti (Morgridge Institute) is about halfway through his ambitious experiment using the cloud to conduct an extensive pairwise comparison of RNAseq signatures from 124 embryonic stem cell samples. By performing a total of some 15,000 alignments, Ruotti intends to create a sequence-based index to facilitate the precise identification of unknown ES cell samples. 
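
As an illustration of Livny's "use it when the price is right" advice, here is a toy provisioning policy in Python. The thresholds and decision labels are hypothetical assumptions of mine; OSG's actual provisioning logic and any real cloud API calls are not shown.

```python
from dataclasses import dataclass

@dataclass
class SpotPolicy:
    """Illustrative spot-instance policy: grow while cheap, bail out fast when not."""
    max_price: float   # $/hour we are willing to pay to add capacity (assumption)
    exit_price: float  # $/hour above which we checkpoint and release nodes (assumption)

    def decide(self, current_price: float, running_nodes: int, desired_nodes: int) -> str:
        if current_price <= self.max_price and running_nodes < desired_nodes:
            return "request-more"            # price is right: grow the pool
        if current_price > self.exit_price and running_nodes > 0:
            return "checkpoint-and-release"  # price is wrong: get out fast
        return "hold"

policy = SpotPolicy(max_price=0.02, exit_price=0.05)
print(policy.decide(current_price=0.015, running_nodes=10, desired_nodes=50))  # request-more
print(policy.decide(current_price=0.08, running_nodes=50, desired_nodes=50))   # checkpoint-and-release
```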

HPC Cloud for Research
  • Mirko Buholzer (Complete Genomics) presented a new “centralized cloud solution,” fulfilled by Amazon, that Complete Genomics is developing to expedite the digital delivery of genome sequence data to customers, rather than the current system of shipping hard drives via FedEx or UPS. 100 genomes sequenced to 40x coverage consume about 35 TB of data, or a minimum of 12 hard drives, said Buholzer. The ability to download those data was appealing in principle, but to where exactly? Who would have access? Complete plans to give customers direct access to their data in the cloud, providing information such as sample ID, quality control metrics, and a timeline or activity log. For a typical genome, the reads and mappings make up about 90% of the total data, or 315 GB. (Evidence and variants make up 31.5 GB and 3.5 GB, respectively.) Customers will be able to download the data or push it to an Amazon S3 bucket. The system is currently undergoing select testing, but Buholzer could not say whether anyone had agreed to forgo their hard drives just yet. (A quick arithmetic check on those figures follows below.)
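
A quick arithmetic check on Buholzer's figures; the 3 TB-per-drive capacity is my assumption, chosen because it matches the quoted "minimum of 12 hard drives":

\[
\underbrace{315\ \text{GB}}_{\text{reads/mappings}}
+ \underbrace{31.5\ \text{GB}}_{\text{evidence}}
+ \underbrace{3.5\ \text{GB}}_{\text{variants}}
\approx 350\ \text{GB per genome},
\qquad
100 \times 350\ \text{GB} = 35\ \text{TB},
\qquad
\left\lceil \frac{35\ \text{TB}}{3\ \text{TB/drive}} \right\rceil = 12\ \text{drives}.
\]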

Dealing with Data Challenge
  • Gary Stiehr (The Genome Institute at Washington University) described the construction of The Genome Institute’s new data center, required because of the unrelenting growth of next-generation sequencing data. “The scale of HPC wasn’t the challenge—but the time scale was caused by rapid, unrelenting growth,” said Stiehr. The new data center required more power and cooling capacity, with data transfers reaching 1 PB/week. The issue, said Stiehr, was whether to move the data to the compute nodes, or to analyze the data in place using the nodes' internal storage.

State of the Art in Supercomputing
  • Robert Sinkovits (San Diego Supercomputer Center) described Gordon, the supercomputer that makes extensive use of flash memory and is available to all academic users on a competitive basis. “It’s very good for lots of I/O,” said Sinkovits.  A great Gordon application, said Sinkovits, will, among other things, make good use of the flash storage for scratch/staging; require the large, logical shared memory (approximately 1 TB DRAM); should be a threaded app that scales to a large number of cores; and need a high-bandwidth, low-latency inter-processor network. The Gordon team will turn away applications that don’t fully meet these requirements, he said, but singled out computational chemistry as one particularly good match.  Gordon features 64 dual-socket I/O nodes (using Intel Westmere processors) and a total of 300 TB of flash memory. Other features include a dual-rail 3D torus InfiniBand (40 Gbit/s) network and a 4-PB Lustre-based parallel file system, capable of delivering up to 100 GB/s into the computer.
  • Weijia Xu (Texas Advanced Computing Center/TACC) introduced the Stampede supercomputer, which should be online early next year. It features 100,000 conventional Intel processor cores and a total of 500,000 cores, along with 14 PB of disk, 272 TB+ of RAM, and a 56-Gbit/s FDR InfiniBand interconnect.
  • Nan Li (National Supercomputer Center, Tianjin) described Tianhe-1A (TH-1A), the top-ranked supercomputer in China, with a peak performance of 4.7 PFlops. The machine is housed at the National Supercomputer Center in Tianjin. (It was ranked the fastest in the world two years ago.) Applications range across geology, video rendering, and engineering, and include a number of biomedical research functions. Among the users are BGI and a major medical institute in Shanghai. Li indicated this resource could also be made available to the pharmaceutical industry.
  • Makoto Taiji (RIKEN) highlighted Japan’s K Computer. The project began in 2006; the computer is located in Kobe, Japan, and its cost has been estimated at $1.25 billion. For that, one gets 80,000 nodes (640,000 cores), memory capacity exceeding 1 PB (16 GB/node) and 10.51 PetaFlops (3.8 PFlops sustained performance). Its 3D-torus network provides 6 GB/s of bidirectional bandwidth in each of six directions. The system draws about 20 MW, with power efficiency about half that of Blue Gene. Taiji said the special features of the K Computer include high bandwidth and low latency. Anyone can use the K computer—academics and industry—for free if results are published. Life sciences applications make up about 25% of K Computer usage, with applications including protein dynamics in cellular environments, drug design, large-scale bioinformatics analysis, and integrated simulations for predictive medicine.

Update:
  • 2012.09.14 - original post 

Thursday, September 13, 2012

jSeries on BigCompute: Molecular Dynamics

With all the interest and focus on Big Data these days, it's easy to miss the other side of the computational challenge: compute-intensive applications. These applications are widely used across scientific domains for simulation and modeling of the physical world. They can also draw upon a lot of data, so there is overlap with the Big Data problem (for example, simulating a wind turbine using historical data and predictive algorithms for weather).

So today I will start a series of blog posts called jSeries on BigCompute to examine applications of HPC in science, engineering and commercial fields.

Molecular dynamics (MD) is a computer simulation of the physical movements of atoms and molecules. The atoms and molecules are allowed to interact for a period of time, giving a view of the motion of the atoms. In the most common version, the trajectories of molecules and atoms are determined by numerically solving Newton's equations of motion for a system of interacting particles, where forces between the particles and the potential energy are defined by molecular mechanics force fields. The method was originally conceived within theoretical physics in the late 1950s and early 1960s, but is applied today mostly in materials science and the modeling of biomolecules.
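
For reference, the equations being integrated are Newton's second law for each atom, with the force on atom i obtained from the gradient of the force-field potential energy U:

\[
m_i \, \frac{d^2 \mathbf{r}_i}{dt^2} \;=\; \mathbf{F}_i \;=\; -\nabla_{\mathbf{r}_i}\, U(\mathbf{r}_1, \ldots, \mathbf{r}_N),
\]

where U typically sums bonded terms (bonds, angles, dihedrals) and non-bonded terms (van der Waals and electrostatics).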

Application in Sciences

Molecular dynamics is used in many fields of science.
  • The first macromolecular MD simulation published (1975) was a folding study of Bovine Pancreatic Trypsin Inhibitor (BPTI). This is one of the best-studied proteins in terms of folding and kinetics. Its simulation, published in Nature, paved the way for understanding protein motion as essential in function and not just accessory.
  • MD is the standard method to treat collision cascades in the heat spike regime, i.e. the effects that energetic neutron and ion irradiation have on solids and solid surfaces.
  • MD simulations were successfully applied to predict the molecular basis of the most common protein mutation, N370S, causing Gaucher disease. In a follow-up publication it was shown that these blind predictions show a surprisingly high correlation with experimental work on the same mutant, published independently at a later point.

Computation Design

Design of a molecular dynamics simulation should account for the available computational power. Simulation size (n=number of particles), timestep and total time duration must be selected so that the calculation can finish within a reasonable time period.

However, the simulations should be long enough to be relevant to the time scales of the natural processes being studied. To draw statistically valid conclusions from the simulations, the time span simulated should match the kinetics of the natural process. Otherwise, it is analogous to drawing conclusions about how a human walks from less than one footstep. Most scientific publications about the dynamics of proteins and DNA use data from simulations spanning nanoseconds (10−9 s) to microseconds (10−6 s).

To obtain these simulations, several CPU-days to CPU-years are needed. Parallel algorithms allow the load to be distributed among CPUs; an example is the spatial or force decomposition algorithm.
  • During a classical MD simulation, the most CPU intensive task is the evaluation of the potential (force field) as a function of the particles' internal coordinates. Within that energy evaluation, the most expensive one is the non-bonded or non-covalent part.
  • Another factor that impacts the total CPU time required by a simulation is the size of the integration timestep. This is the time length between evaluations of the potential. The timestep must be chosen small enough to avoid discretization errors (i.e. smaller than the period of the fastest vibration in the system). Typical timesteps for classical MD are on the order of 1 femtosecond (10−15 s). (A minimal integration-loop sketch illustrating these choices follows this list.)
  • For simulating molecules in a solvent, a choice should be made between explicit solvent and implicit solvent. Explicit solvent particles must be calculated expensively by the force field, while implicit solvents use a mean-field approach. Using an explicit solvent is computationally expensive, requiring inclusion of roughly ten times more particles in the simulation. But the granularity and viscosity of explicit solvent is essential to reproduce certain properties of the solute molecules. This is especially important to reproduce kinetics.
  • In all kinds of molecular dynamics simulations, the simulation box size must be large enough to avoid boundary condition artifacts. Boundary conditions are often treated by choosing fixed values at the edges (which may cause artifacts), or by employing periodic boundary conditions in which one side of the simulation loops back to the opposite side, mimicking a bulk phase. (Source: Wikipedia)
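
To make the cost structure above concrete, here is a minimal velocity-Verlet MD loop in Python with a Lennard-Jones potential in reduced units. It is an illustrative sketch only: the particle count, lattice spacing and timestep are arbitrary assumptions, and production codes such as NAMD or GROMACS add neighbor lists, periodic boundary conditions, constraints, thermostats and parallel decomposition.

```python
import numpy as np

def lennard_jones_forces(pos, epsilon=1.0, sigma=1.0):
    """Pairwise Lennard-Jones forces: the non-bonded part that dominates the cost."""
    n = len(pos)
    forces = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):
            rij = pos[i] - pos[j]
            r2 = np.dot(rij, rij)
            inv_r6 = (sigma * sigma / r2) ** 3
            # F_i = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r^2 * r_ij
            f = 24.0 * epsilon * (2.0 * inv_r6 ** 2 - inv_r6) / r2 * rij
            forces[i] += f
            forces[j] -= f
    return forces

def velocity_verlet(pos, vel, mass, dt=0.001, steps=1000):
    """Minimal MD loop: one (expensive) force evaluation per small timestep."""
    forces = lennard_jones_forces(pos)
    for _ in range(steps):
        vel += 0.5 * dt * forces / mass[:, None]
        pos += dt * vel
        forces = lennard_jones_forces(pos)
        vel += 0.5 * dt * forces / mass[:, None]
    return pos, vel

if __name__ == "__main__":
    # 64 particles on a cubic lattice spaced 1.2 sigma apart (an arbitrary, stable start)
    grid = np.arange(4, dtype=float) * 1.2
    pos = np.array([[x, y, z] for x in grid for y in grid for z in grid])
    vel = np.zeros_like(pos)
    mass = np.ones(len(pos))
    pos, vel = velocity_verlet(pos, vel, mass)
    print("final kinetic energy:", 0.5 * (mass[:, None] * vel ** 2).sum())
```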

Exemplified Projects:

The following examples are not run-of-the-mill MD simulations. They illustrate notable efforts to produce simulations of a system of very large size (a complete virus) and very long simulation times (500 microseconds):
  • MD simulation of the complete satellite tobacco mosaic virus (STMV) (2006; size: 1 million atoms; simulation time: 50 ns; program: NAMD). This virus is a small, icosahedral plant virus that worsens the symptoms of infection by Tobacco Mosaic Virus (TMV). Molecular dynamics simulations were used to probe the mechanisms of viral assembly. The entire STMV particle consists of 60 identical copies of a single protein that make up the viral capsid (coating), and a 1,063-nucleotide single-stranded RNA genome. One key finding is that the capsid is very unstable when there is no RNA inside. The simulation would take a single 2006 desktop computer around 35 years to complete. It was thus done on many processors in parallel with continuous communication between them.
  • Folding simulations of the Villin Headpiece in all-atom detail (2006; size: 20,000 atoms; simulation time: 500 µs = 500,000 ns; program: Folding@home). This simulation was run on 200,000 CPUs of participating personal computers around the world. These computers had the Folding@home program installed, a large-scale distributed computing effort coordinated by Vijay Pande at Stanford University. The kinetic properties of the Villin Headpiece protein were probed by using many independent, short trajectories run by CPUs without continuous real-time communication. One technique employed was the Pfold value analysis, which measures the probability of folding before unfolding for a specific starting conformation. Pfold gives information about transition state structures and an ordering of conformations along the folding pathway. Each trajectory in a Pfold calculation can be relatively short, but many independent trajectories are needed. (A toy Pfold-counting sketch follows this list.)
  • Folding@home Initiative. Vijay Pande of Stanford University created the Folding@home initiative based on molecular dynamics simulations. At a recent Bio-IT World Cloud Summit, Pande said that microsecond timescales are where the field is, but millisecond scales are “where we need to be, and seconds are where we’d love to be.” Using a Markov State Model, Pande’s team is studying amyloid beta aggregation with the idea of helping identify new drugs to treat Alzheimer’s disease. Several candidates have already been identified that inhibit aggregation, he said.
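
To illustrate the Pfold idea without any real MD, here is a toy sketch in Python: the "trajectory" is just an unbiased random walk on a one-dimensional reaction coordinate (a stand-in assumption of mine, not Folding@home's actual simulation), and Pfold is simply the fraction of independent short runs that reach the folded end before the unfolded end.

```python
import random

def run_short_trajectory(x, rng, step=0.05, max_steps=10_000):
    """Hypothetical stand-in for a short MD trajectory on a 1-D reaction
    coordinate: 0.0 is the unfolded basin, 1.0 is the folded basin."""
    while 0.0 < x < 1.0 and max_steps > 0:
        x += rng.choice([-step, step])
        max_steps -= 1
    return "folded" if x >= 1.0 else "unfolded"

def estimate_pfold(start, n_trajectories=2000, seed=0):
    """Pfold = fraction of independent short trajectories that reach the folded
    state before the unfolded state, all launched from the same conformation."""
    rng = random.Random(seed)
    folded = sum(run_short_trajectory(start, rng) == "folded"
                 for _ in range(n_trajectories))
    return folded / n_trajectories

# Conformations 'closer' to the folded basin should give Pfold nearer to 1.
for start in (0.2, 0.5, 0.8):
    print(start, estimate_pfold(start))
```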

Links:

Update:
  • 2012.07.10 - original post
  • 2012.09.14 - added exemplified projects 
  • 2012.09.18 - added links, updated projects

Wednesday, September 12, 2012

jSeries on BigCompute: Weather Forecast

This blog post is part of the jSeries on BigCompute covering use cases for high-performance computing.

Supercomputers are essential assets in the meteorological community. They are a tool used by meteorologists and climatologists to uncover patterns in weather. As part of a suite of weather instruments, supercomputers can save lives by providing advance warnings of storms such as cyclones and hurricanes. They can also give researchers an inside look at the way a storm works.

Cyclone Nargis

Using a supercomputer, NASA recently uncovered new information about the way storms work by reproducing the birth of Cyclone Nargis. Bo-wen Shen, a researcher at the University of Maryland-College Park, used NASA's Pleiades supercomputer to produce the first 5-day advance model of the birth of a tropical cyclone. The result is a leap forward in tropical cyclone research.

The real accomplishment is not about the Myanmar cyclone so much as the advance warnings that may result. The researchers fed known facts about the storm's wind speeds, atmospheric pressure, and ocean temperatures into the model. The results were then compared to the known path taken by the cyclone. If the model matches, it is a winner. This means other storms that have yet to make landfall can be predicted in advance with more accuracy. Unfortunately, more work needs to be done: while the approach worked for this storm, it may not work for others. You can see the full video simulation at NASA.

From Katrina to Isaac

Thanks to advances in computing power and storm surge modeling systems, Louisiana officials bracing for Hurricane Isaac's arrival last month had more detailed data about the storm's potential impact than they had seven years earlier when they were preparing for Hurricane Katrina.

Researchers at university supercomputing centers in Texas and Louisiana used real-time data to inform emergency workers about what would happen once the hurricane sent water into canals, levees and neighborhoods.

When Katrina hit in 2005, tools for modeling storm surges, while good, were rudimentary compared with what's available today. Back then, Louisiana used computer models with up to 300,000 "nodes," and it took six hours to run a simulation.

For each node, which represents a particular location on a map, algorithms run computations to determine what will happen during a hurricane. The number of nodes represented is roughly analogous to the number of dots per square inch in a photograph: The higher the number, the more detail that's available.

Today, simulations with some 1.5 million nodes can be completed in an hour and a half, said Robert Twilley, an oceanographer and executive director of the Louisiana Sea Grant Program.

Louisiana is using an unstructured grid. To provide neighborhood-level details about potential flooding, nodes can be concentrated in areas that are most vulnerable. The system also helped identify the best staging areas for recovery efforts.


Forecasting Weather Hyper-locally

I have written about how supercomputers can make hyper-local, near real-time weather forecasting possible. In that blog series, I described how a 2011 New Year's Eve tornado hit our subdivision in Sunset Hills, Missouri, and what impact a hyper-local weather forecasting system (e.g., IBM Deep Thunder) can have on people's lives in the face of natural disaster.

Source:

jSeries on BigCompute: Financial Risk Analysis

The worldwide financial sector is highly regulated, with risk analysis management solutions being an essential requirement of any trading organisation. Financial risk analysis uses Monte-Carlo simulation, a complex process of data sampling and stochastic analysis, to determine the uncertainty associated with a given set of financial assets.  Risk analysis is both computationally and data intensive, and is one of the most rapidly growing application areas for high performance computing.

Stochastic analysis processes are random: the input variables are determined at run time using a probability distribution and a random number seed. They are typically repeated many thousands or even millions of times. Time series analysis and Monte Carlo simulations are good examples of this and are used widely in predicting complex events whose interactions may not be fully understood.
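
As a minimal sketch of the kind of Monte Carlo value-at-risk calculation described above, the Python below simulates correlated daily returns for a small hypothetical equity book and reads VaR off the simulated P&L distribution. The normal-returns assumption, position sizes, volatilities and correlations are all illustrative; production risk engines revalue full option books and model fat tails, which is where the heavy compute goes.

```python
import numpy as np

def monte_carlo_var(positions, mu, cov, horizon_days=1, n_sims=100_000,
                    confidence=0.99, seed=42):
    """Value-at-risk for a portfolio of equity positions via Monte Carlo
    simulation of correlated returns (normality is an assumption)."""
    rng = np.random.default_rng(seed)
    # draw correlated return scenarios: shape (n_sims, n_assets)
    returns = rng.multivariate_normal(mu * horizon_days, cov * horizon_days, size=n_sims)
    pnl = returns @ positions                 # portfolio P&L in each scenario
    # VaR is the loss at the chosen percentile of the simulated P&L distribution
    return -np.percentile(pnl, 100 * (1 - confidence))

if __name__ == "__main__":
    # a hypothetical 3-stock book: market values of the positions
    positions = np.array([1_000_000.0, -500_000.0, 750_000.0])
    mu = np.array([0.0003, 0.0002, 0.0004])          # expected daily returns
    vols = np.array([0.02, 0.015, 0.03])             # daily volatilities
    corr = np.array([[1.0, 0.3, 0.2],
                     [0.3, 1.0, 0.4],
                     [0.2, 0.4, 1.0]])
    cov = np.outer(vols, vols) * corr
    print(f"99% 1-day VaR: ${monte_carlo_var(positions, mu, cov):,.0f}")
```

Scaling this from three assets to the 200,000 positions and billions of scenarios mentioned below is exactly why risk analysis pushes both the compute and the data side of HPC.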

Example Projects/Use Cases
  • IBM Netezza - A financial institution must calculate value-at-risk for an equity options desk. The IBM Netezza platform was able to run a Monte Carlo simulation on 200,000 positions with 1,000 underlying stocks (2.5 billion simulations) in under three minutes. Leveraging an in-database analytics approach allowed the financial institution to analyze the data where it resides, rather than build a parallel data-processing platform to run the simulation. Faster query response time—and eliminating the time required to move data between two platforms—allowed the company to add variables to investment strategy simulations, and to run the risk analysis more frequently.
  • Murex MX-3 - Headquartered in Paris, France, and with offices throughout Europe, the USA and Asia-Pacific, Murex is one of Europe’s largest software developers, and one of the world’s leading providers of software and services to the financial sector. Its flagship product, Murex MX-3, is used for risk analysis in financial market trading, and has over 36,000 users at 200 institutions in 65 countries worldwide. Murex’s new HPC capability now allows the management of complex financial products in high precision and in near real time, compared to a previous capability of computing analytics only once or twice a day.

Update:
  • 2012.09.24 - Updated into jSeries

A new entrant into the HPC Cloud market - ProfitBricks

This week, a new entrant into the high-performance computing cloud market named ProfitBricks is coming forth with impressive network speed capabilities.

ProfitBricks uses InfiniBand, a wired fabric that allows more than triple the data transfer rate between servers as compared to two of the industry's biggest players, Amazon and Rackspace. According to its website, ProfitBricks uses two network interface cards per server device so the transmission speed can reach 80 Gbit/sec at present. InfiniBand network/communication technology offers high throughput, low latency, quality of service and failover.

ProfitBricks is a European company founded in 2010 by Achim Weiss and Andreas Gauger, whose former managed hosting company, 1&1, was sold to United Internet and now is a leading international Web hosting company.

ProfitBricks allows users to customize their public cloud to their heart's content, providing a range of computing options that it says is wider than most of the big players in the market.

Melanie Posey, an IDC researcher, says ProfitBricks is carving out a niche for itself around HPC.
"Since ProfitBricks is coming into the IaaS up against established giants like Amazon and Rackspace (and now Google and Microsoft), they need to establish some differentiation in the market -- the technology is one of those differentiators," she says. Given the increased role analyzing large amounts of data could play in the future, she expects HPC offerings may continue to be an area that service providers look to up their offerings in. IBM, for example, she says is making a big push to provide data analytics software for use either on-site or as a cloud-based service.

Links: 

Update:
  • 2012.09.12 - original post

Tuesday, September 11, 2012

My First Industry-standard Professional Certification

Through a joint program between IBM and Open Group, I received my first industry-standard professional certification today - Open Group Distinguished IT Specialist.

According to the Open Group website on certification:

The Open Group provides Certification Programs for people, products and services that meet our standards. For enterprise architects and IT specialists, our certification programs provide a worldwide professional credential for knowledge, skills and experience. For IT products, Open Group Certification Programs offer a worldwide guarantee of conformance.

Intel Makes Move to Tackle Big Data & Cloud Challenge

Intel on Monday said it is developing high-performance server chips that in the future will serve up faster results from cloud services or data-intensive applications like analytics, all while cutting electricity bills in data centers. An integrated fabric controller—currently fabric controllers are found outside the processor—will result in fewer components in the server node itself, reduced power consumption by getting rid of the system I/O interface and greater efficiency and performance.

Intel has quietly been making a series of acquisitions to boost its interconnect and networking portfolio. Intel bought privately held networking company Fulcrum for an undisclosed price in July last year, acquired InfiniBand assets from QLogic for $125 million in January this year, and then purchased interconnect assets from Cray for $140 million in April.

The chip maker will integrate a converged fabric controller inside future server chips, according to Raj Hazra, vice president of the Intel Architecture Group.

Fabric virtualizes I/O and ties together storage and networking in data centers, and an integrated controller will provide a wider pipe to scale performance across distributed computing environments.

The integrated fabric controller will appear in the company's Xeon server chips in a few years, Hazra said. He declined to provide a specific date, but said the company has the manufacturing capability in place to bring the controller to the transistor layer.

The controller will offer bandwidth of more than 100 gigabytes per second, which will be significantly faster than the speed offered by today's networking and I/O interfaces. The chips have enough transistors to accommodate the controllers, which will only add a few watts of power draw, Hazra said.

Companies with huge Web-serving needs like Google, Facebook and Amazon buy servers in large volumes and are looking to lower energy costs while scaling performance. Fabrics connect and facilitate low-latency data movement between processors, memory, servers and endpoints like storage and appliances. Depending on the server implementation and system topology, fabrics are flexible and can organize traffic patterns in an energy-efficient way, Hazra said.

For example, analytics and databases demand in-memory processing, and cloud services rely on a congregation of low-power processors and shared components in dense servers. An integrated controller will help fabrics intelligently reroute or pre-fetch data and software packets so shared endpoints work together to serve up faster results. HPC or high-end server environments may use a fabric with a mix of InfiniBand, Ethernet networking and proprietary interconnect technologies, while a cloud implementation may have microservers with fabric based on standard Ethernet and PCI-Express technologies.

Fabric controllers currently sit outside the processor, but integration at the transistor level also reduces the amount of energy burned in fetching data from the processor and memory, Hazra said. The integrated controller in the CPU will be directly connected to the fabric, and will also make servers denser with fewer boards, cables and power supplies, which could help cut power bills, Hazra said.

Intel for decades has been integrating computing elements at the transistor level to eke out significant power savings and better performance from processors. Intel has integrated the memory controller and graphics processor, and the fabric controller is next, Hazra said.

"That's the path we're on with fabrics," Hazra said. "Integration is a must."

Intel's processor business is weakening partly due to a slowdown in PC sales, and the company's profits are now being driven by the higher-margin data center business. Intel's server processors already dominate data centers, and integration of the fabric controller is a key development in the company's attempts to bring networking and storage closer to servers.

Saturday, September 8, 2012

Journey Stories: How Science and Technology Beats Cancer

The NY Times recently published a wonderful and amazing story on how doctors from Washington University in St. Louis (WashU) applied next-gen sequencing technologies to identify the genetic cause of a leukemia and defeat it with modern medicine and passion.

Genetic Gamble: In Treatment for Leukemia, Glimpses of the Future

On a related story from WashU Genome Institute website, the first application of NGS towards discovery of cancer genes was detailed:

Cancer Genomics: Acute Myeloid Leukemia (AML)



Challenge for Cloud: NGS Big Data

C. Titus Brown wrote a thoughtful blog post about the challenge of using cloud computing to tackle DNA sequencing analysis. The biggest obstacle remains the large amount of data, which would claim a lot of storage (and dollars) in the cloud.

"I think one likely answer to the Big Data conundrum in biology is that we'll come up with cleverer and cleverer approaches for quickly throwing away data that is unlikely to be of any use. Assuming these algorithms are linear in their application to data, but have smaller constants in front of their big-O, this will at least help stem the tide. (It will also, unfortunately, generate more and nastier biases in the results...) But I don't have any answers for what will happen in the medium term if sequencing continues to scale as it does."


Update:
  • 2012.09.08 - original post

Thursday, September 6, 2012

IBM on Top in High Performance Computing

IBM, HP and Dell lead the worldwide high performance computing (HPC) market, though sales were essentially flat in the second quarter.

While they battle it out for market share in the worldwide server space, technology giants IBM and Hewlett-Packard (HP) are also in close contention for worldwide market leadership in high performance computing (HPC), capturing 32.7 percent and 29.8 percent of overall revenue share, respectively, according to IT research firm IDC's Worldwide High-Performance Technical Server QView.

Overall, worldwide factory revenue for the HPC technical server market was essentially flat year over year in the second quarter of 2012 (2Q12). According to the report, revenue in the second quarter dipped slightly (-0.9 percent) to $2.4 billion, down from $2.5 billion in the same period of 2011. Despite the 2Q12 numbers, IDC said it still expects HPC technical server market revenues to expand by 7.1 percent year over year to reach $11 billion, exceeding 2011's record-breaking revenues of $10.3 billion.

Although the report noted average selling prices continue to grow, thanks to an ongoing, multi-year shift to large system sales, 2Q12 unit sales declined by more than 21 percent to 22,998 compared to the second quarter of 2011. During the first half of 2012, the HPC technical server market declined by 1 percent, with a decline of 11 percent in unit shipments, compared to the same period in 2011, the report noted. Revenue in the high-end Supercomputers segment (HPC systems sold for $500,000 and up) was the strongest performer in the market, jumping 21.8 percent over 1Q12 to reach $1.17 billion.

The high-end Supercomputers segment accounted for 48.6 percent of worldwide HPC technical server revenue in 2Q12, while the Divisional segment ($250,000 to $499,000 price band) captured 13.4 percent of overall revenue. At the other end of the price spectrum, revenue for Workgroup HPC systems (sold for below $100,000) declined by 12.5 percent in the first half of 2012 compared to the first half of 2011.

On the vendor side, behind IBM and HP came Dell, which maintained its strong third-place position with 14.2 percent of global revenue, while Cray (+43.7 percent), Fujitsu (+33.5 percent), and SGI (+10.3 percent) all made impressive year-over-year revenue gains during the second quarter of 2012. IDC said it expects the HPC technical server market to grow at a healthy 7.3 percent compound annual growth rate (CAGR) over the five-year forecast period to reach revenues of $14 billion by 2016.

"HPC technical servers, especially Supercomputers, have been closely linked not only to scientific advances but also to industrial innovation and economic competitiveness. For this reason, nations and regions across the world are increasing their investments in supercomputing even in today's challenging economic conditions," Earl Joseph, program vice president for technical computing at IDC, said in prepared remarks. "We expect the global race for HPC leadership in the petascale-exascale era to continue heating up during this decade."

Links:
Update:
  • 2012.09.05 - original post

Friday, August 31, 2012

Building Effective HPC System - Big Data Challenge for Big Compute

As HPC systems and applications grow larger and more complex, the monitoring and analysis of performance is itself becoming a Big Data challenge. Millions of data points need to be collected rapidly and analyzed constantly to enable optimal placement of workloads and problem determination.

NSF just funded a pair of universities to evaluate the effectiveness of research HPC systems, and below is the news release from one of them.

University at Buffalo, TACC Receive Funding to Evaluate XSEDE Clusters
 
AUSTIN, TX, Aug. 30 -- A National Science Foundation (NSF) grant is funding the University at Buffalo and the Texas Advanced Computing Center (TACC) at The University of Texas at Austin to evaluate the effectiveness of high-performance computing (HPC) systems in the NSF Extreme Science and Engineering Discovery Environment (XSEDE) program and HPC systems in general.

Today's high-performance computing systems are a complex combination of software, processors, memory, networks, and storage systems characterized by frequent disruptive technological advances. In this environment, service providers, users, system managers and funding agencies find it difficult to know if systems are realizing their optimal performance, or if all subcomponents are functioning properly.

Through the "Integrated HPC Systems Usage and Performance of Resources Monitoring and Modeling (SUPReMM)" grant, the University at Buffalo and TACC will develop new tools and a comprehensive knowledge base to improve the ability to monitor and understand performance for diverse applications on HPC systems.

The close to $1 million grant will build on and combine work that has been underway at the University at Buffalo under the Technology Audit Service (TAS) for XSEDE and at TACC as part of the Ranger Technology Insertion effort.

"Obtaining reliable data without efficient data management is impossible in today's complex HPC environment," said Barry Schneider, program director in the NSF's Office of Cyberinfrastructure. "This collaborative project will enable a much more complete understanding of the resources available through the XSEDE program and will increase the productivity of all of the stakeholders, service providers, users and sponsors in our computational ecosystem."

"Ultimately, it will advance our goals of providing open source tools for the entire science community to effectively utilize all HPC resources being deployed by the NSF for open science research in the academic community," Schneider said.

Working with the XSEDE TAS team at Buffalo, TACC staff members are running data gathering tools on the Ranger and Lonestar supercomputers to evaluate data that is relevant to application performance.
"We gather data on every system node at the beginning and end of every job, and every 10 minutes during the job—that's a billion records per system each month," said Bill Barth, director of high-performance computing at TACC. "It's going to end up being a Big Data problem in the end."

The tools will present various views on existing XSEDE usage data from the central database, according to Barth. This data will include how individual user jobs and codes are performing on a system at a detailed level. In the coming year, the research and development effort will gather data and evaluate performance on all XSEDE systems, including Stampede, which will launch in January 2013.

"HPC resources are always at a premium," said Abani Patra, principal investigator of the University at Buffalo project. "Even a 10 percent increase in operational efficiency will save millions of dollars. This is a logical extension of the larger XSEDE TAS effort."

TAS, through the XSEDE Metric on Demand (XDMoD) portal, provides quantitative and qualitative metrics of performance rapidly to all stakeholders, including NSF leadership, service providers, and the XSEDE user community.

Work on the grant began on July 1, 2012, and will continue for two years.
-----
Source: Texas Advanced Computing Center

Update
  •  2012.08.31 - original post

Thursday, August 30, 2012

HPC Rulebook (#1) - Scaled Speedup

People sometimes ask me what high-performance computing is and how it differs from other types of computing such as desktop, handheld or cloud. In this series, called HPC Rulebook, I will attempt to summarize a few unique and possibly defining characteristics of high-performance computing.

The first one is called Scaled Speedup.

High-performance computing is accomplished through a scalable architecture that speeds up computation. The problem is broken down into numerous digestible chunks and dispatched to dozens or even thousands of workhorses (compute nodes), and the outcomes are then aggregated into a result. Speedup is a measure of scalability obtained by comparing the run time on many systems versus a single system. In a well-architected system (balanced CPU, network I/O and memory), a well-developed parallel application can scale to a large number of nodes.
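
To make the chunk-and-dispatch pattern concrete, here is a minimal sketch in Python; it uses the multiprocessing module on a single machine as a stand-in for compute nodes, whereas a real HPC application would typically use MPI or a batch scheduler across many hosts.

# Minimal scatter-compute-gather sketch: split the work, farm it out, aggregate.
# multiprocessing stands in for compute nodes; real HPC codes would use MPI.
from multiprocessing import Pool

def compute_chunk(chunk):
    # placeholder for the real per-node computation
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 8
    size = len(data) // n_workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]

    with Pool(n_workers) as pool:
        partials = pool.map(compute_chunk, chunks)   # dispatch to the workers

    print(sum(partials))                             # aggregate the outcomes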

In real-life scenarios, the computing problem is often fixed in size, so the speedup is eventually limited: there is only so much data to be computed, and the overhead of dividing up the task eventually outweighs the benefit of the extra machines being added to the system. This limit on fixed-size problems is commonly described by Amdahl's Law.

However, at the frontier of science, the research problem shouldn't stay fixed, so a new way of thinking is needed to fully explore and extend the power of HPC. Hence the notion of scaled speedup, in which the problem size scales along with the computing power so that a much larger problem can be completed in a fixed amount of time. Scaled speedup is also known as Gustafson's Law.
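
To contrast the two views, here is a small sketch comparing fixed-size speedup (Amdahl's Law) with scaled speedup (Gustafson's Law); the 5 percent serial fraction is an illustrative assumption, not a measured value.

# Fixed-size speedup (Amdahl) vs. scaled speedup (Gustafson).
# The serial fraction of 0.05 is an illustrative assumption.
def amdahl(n, serial):
    # fixed problem size: speedup flattens out as nodes are added
    return 1.0 / (serial + (1.0 - serial) / n)

def gustafson(n, serial):
    # problem grows with the machine: speedup keeps scaling
    return serial + (1.0 - serial) * n

for n in (16, 256, 4096):
    print(n, round(amdahl(n, 0.05), 1), round(gustafson(n, 0.05), 1))
# Amdahl tops out near 1 / 0.05 = 20x, while the scaled speedup grows with n.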



Tuesday, August 28, 2012

IBM Strengthens SmartCloud with PaaS Offering

IBM announced today the pre-release of IBM SmartCloud Application Services, coming in September.

IBM SmartCloud Application Services (SCAS) is the new Platform as a Service (PaaS) offering that will become accessible to all existing commercial SmartCloud Enterprise (SCE) clients within the next few weeks and to new SCE clients as they sign up. Please note that clients who sign up for SCE via the 2012 Fall Promotion will receive different services; see Fall Promo materials for details.

The IBM SmartCloud Application Services pre-release services will be enabled in existing SmartCloud Enterprise accounts during the month of September. IBM will roll out the services in waves and we expect all accounts to be enabled by the end of the month.

The benefits of IBM SmartCloud Application Services:

  • Self-service, instant access to an application development suite of tools, middleware and databases (available via pattern-based technology), optimized to run on a virtual infrastructure. IBM builds expertise into these pre-integrated patterns, which accelerate the development and delivery of new applications, eliminate manual errors, and drive consistent results.
  • An application environment that simplifies deployment and management of applications and automatically scales by adding capacity based on load.
  • Accelerated time to value, leveraging these rapidly deployed, flexible and scalable resources to enable enterprise application development and deployment in the cloud.
  • On-demand, pay-as-you-go model -- an alternative to a fixed-cost model. 

Links:

Amazon Reshapes Computing by Cloud

NYT has an in-depth article today on Amazon Web Services and how this pioneer of Cloud computing is reshaping the industry.

Here is the link to the article.

Monday, August 27, 2012

IBM plugs another Cloud-based software provider

IBM said today it had agreed to buy Wayne, Pa., human resources software firm Kenexa Corp. for $1.3 billion or $46 per share.

The acquisition "brings a unique combination of Cloud-based technology and consulting services that integrate both people and processes," IBM said in a statement.

Kenexa provides recruiting and talent management software and services that help companies hire and retain workers. The acquisition "bolsters IBM's leadership in helping clients embrace social business capabilities, while gaining actionable insights from the enormous streams of information generated from social networks every day," the statement said.
"Every company, across every business operation, is looking to tap into the power of social networking to transform the way they work, collaborate and out innovate their competitors," said Alistair Rennie, IBM's general manager for social business.

"The customer is the big winner in all this because the combination of our two organizations will deliver more business outcomes than ever before," said Kenexa Chief Executive Officer Rudy Karsan.
IBM also gains access to data related to Kenexa's 8,900 customers, which include business giants in financial services, pharmaceuticals and retail, including half of the Fortune 500 firms.

Cloud & Big Data Can Float or Sink Many

I came across this insightful article from AP today on how Cloud and Big Data could make or break HP and Dell as they float or sink in the wave of tech innovation brought on by Apple and Google.

SAN FRANCISCO (AP) — Hewlett-Packard Co. used to be known as a place where innovative thinkers flocked to work on great ideas that opened new frontiers in technology. These days, HP is looking behind the times.

Coming off a five-year stretch of miscalculations, HP is in such desperate need of a reboot that many investors have written off its chances of a comeback.

Consider this: Since Apple Inc. shifted the direction of computing with the release of the iPhone in June 2007, HP's market value has plunged by 60 percent to $35 billion. During that time, HP has spent more than $40 billion on dozens of acquisitions that have largely turned out to be duds so far.

"Just think of all the value that they have destroyed," ISI Group analyst Brian Marshall said. "It has been a case of just horrible management."

Marshall traces the bungling to the reign of Carly Fiorina, who pushed through an acquisition of Compaq Computer a decade ago despite staunch resistance from many shareholders, including the heirs of HP's co-founders. After HP ousted Fiorina in 2005, other questionable deals and investments were made by two subsequent CEOs, Mark Hurd and Leo Apotheker.

HP hired Meg Whitman 11 months ago in the latest effort to salvage what remains of one of the most hallowed names in Silicon Valley 73 years after its start in a Palo Alto, Calif., garage.

The latest reminder of HP's ineptitude came last week when the company reported an $8.9 billion quarterly loss, the largest in the company's history. Most of the loss stemmed from an accounting charge taken to acknowledge that HP paid far too much when it bought technology consultant Electronic Data Systems for $13 billion in 2008.

HP might have been unchallenged for the ignominious title as technology's most troubled company if not for one of its biggest rivals, Dell Inc.

Like HP, Dell missed the trends that have turned selling PCs into one of technology's least profitable and slowest growing niches. As a result, Dell's market value has also plummeted by 60 percent, to about $20 billion, since the iPhone's release.

That means the combined market value of HP and Dell – the two largest PC makers in the U.S. – is less than the $63 billion in revenue Apple got from iPhones and various accessories during just the past nine months.
The hand-held, touch-based computing revolution unleashed by the iPhone and Apple's 2010 introduction of the iPad isn't the only challenge facing HP and Dell.

They are also scrambling to catch up in two other rapidly growing fields – "cloud computing" and "Big Data."
Cloud computing refers to the practice of distributing software applications over high-speed Internet connections from remote data centers so that customers can use them on any device with online access. Big Data is a broad term for hardware, storage and other services that help navigate the sea of information flowing in from the increasing amount of work, play, shopping and social interaction happening online.
Both HP and Dell want a piece of the action because cloud computing and Big Data boast higher margins and growth opportunities than the PC business.

It's not an impossible transition, as demonstrated by the once-slumping but now-thriving IBM Corp., a technology icon even older than HP. But IBM began its makeover during the 1990s under Louis Gerstner and went through its share of turmoil before selling its PC business to Lenovo Group in 2005. HP and Dell are now trying to emulate IBM, but they may be making their moves too late as they try to compete with IBM and Oracle Corp., as well as a crop of younger companies that focus exclusively on cloud computing or Big Data.

A revival at HP will take time, something that HP CEO Meg Whitman has repeatedly stressed during her first 11 months on the job.

"Make no mistake about it: We are still in the early stages of a turnaround," Whitman told analysts during a conference call last week.

The problems Whitman is trying to fix were inherited from Apotheker and Hurd.

HP hired Apotheker after he was dumped by his previous employer. He lasted less than a year as HP's CEO – just long enough to engineer an $11 billion acquisition of business software maker Autonomy, another poorly performing deal that is threatening to saddle HP with another huge charge.

Before Apotheker, Hurd won praise for cutting costs during his five-year reign at HP, but Marshall believes HP was too slow to respond to the mobile computing, cloud computing and Big Data craze that began to unfold under Hurd's watch. HP also started its costly shopping spree while Hurd was CEO.
How much further will HP and Dell fall before they hit bottom?

HP's revenue has declined in each of the past four quarters, compared with the same period a year earlier, and analysts expect the trend to extend into next year. The most pessimistic scenarios envision HP's annual revenue falling from about $120 billion this year to $90 billion toward the end of this decade.
The latest projections for PC sales also paint a grim picture. The research firm IDC now predicts PC shipments this year will increase by less than 1 percent, down from its earlier forecast of 5 percent.

Whitman is determined to offset the crumbling revenue by trimming expenses. She already is trying to lower annual costs by $3.5 billion during the next two years, mostly by eliminating 27,000 jobs, or 8 percent of HP's work force.

Marshall expects Whitman's austerity campaign to enable HP to maintain its annual earnings at about $4 per share, excluding accounting charges, for the foreseeable future.

If HP can do that, Marshall believes the stock will turn out to be a bargain investment, even though he isn't expecting the business to grow during the next few years. The shares were trading around $17.50 Monday, near their lowest level since 2004.

One of the main reasons Marshall still likes HP's stock at this price is the company's quarterly dividend of 13.2 cents per share. That translates into a dividend yield of about 3 percent, an attractive return during these times of puny interest rates.

Dell's stock looks less attractive, partly because its earnings appear to still be dropping. The company, which is based in Round Rock, Texas, signaled its weakness last week, when it lowered its earnings projection for the current fiscal year by 20 percent.

Dell executives also indicated that the company is unlikely to get a sales lift from the Oct. 26 release of Microsoft Corp.'s much-anticipated makeover of its Windows operating system. That's because Dell focuses on selling PCs to companies, which typically take a long time before they decide to switch from one version of Windows to the next generation.

Dell shares slipped to a new three-year low of $11.15 during Monday's trading.

As PC sales languish, both HP and Dell are likely to spend more on cloud computing, data storage and technology consulting.

Although those look like prudent bets now, HP and Dell probably should be spending more money trying to develop products and services that turn into "the next new thing" in three or four years, said Erik Gordon, a University of Michigan law and business professor who has been tracking the troubles of both companies.
"It's like they are both standing on the dock watching boats that have already sailed," Gordon said. "They are going to have to swim very fast just to have chance to climb back on one of the boats."

Evaluating Networking Options for HPC & Cloud

InfiniBand (IB) and High-Speed Ethernet (HSE) interconnects are generating a lot of excitement for building next-generation scientific, enterprise and cloud computing systems. The OpenFabrics stack is emerging to encapsulate both IB and Ethernet in a unified manner, and hardware technologies such as Virtual Protocol Interconnect (VPI) and RDMA over Converged Enhanced Ethernet (RoCE) are converging the hardware solutions.

In this video recorded at Hot Interconnects 2012 in Santa Clara last week, Jerome Vienne from Ohio State University presents: Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing Systems.


"We evaluate various high performance interconnects over the new PCIe Gen3 interface with HPC as well as cloud computing workloads. Our comprehensive analysis, done at different levels, provides a global scope of the impact these modern interconnects have on the performance of HPC applications and cloud computing middlewares. The results of our experiments show that the latest InfiniBand FDR interconnect gives the best performance for HPC as well as cloud computing applications."

Update:
  • 2012.08.27 - original post

Thursday, August 23, 2012

Reaching a Milestone at IBM - Senior Certification

After a 3-month journey with a 50-page package, 8 references and 3 interviews, I finally crossed a milestone at IBM today: my package for the thought leader level in actualizing IT solutions was approved by the review board.

Achieving this level qualifies me for IBM IT Specialist Senior Certification and Open Group Level Three IT Specialist Certification.

I want to thank all the board reviewers, mentors, coaches, reference supporters and colleagues over the last ten years. It is a great honor to earn this check mark from IBM, and I'd like to share the joy with you all!

Wednesday, August 22, 2012

Amazon Offers Up Data Archiving in Cloud

Amazon.com recently announced a cloud storage solution from Amazon Web Services (AWS), further expanding its cloud offerings. It is interesting to note that, as an archival service, its pricing model discourages frequent access with surcharges on retrieval frequency and bandwidth.

This new service is named Amazon Glacier and is a low-cost solution for data archiving, backups and other long-term storage projects where data is not accessed frequently but needs to be retained for future reference.

The cost of the service starts from one cent per gigabyte per month, with upload and retrieval requests costing five cents per thousand requests and outbound data transfer (i.e. moving data from one AWS region to another) costing 12 cents per gigabyte.
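
Using the prices quoted above, here is a quick sketch of what a cold archive might cost per month; the archive size, request count and retrieval volume are made-up examples, and the frequency surcharges mentioned earlier are ignored.

# Rough monthly cost of a hypothetical archive at the quoted Glacier prices.
# The archive size, request count and retrieval volume are made-up examples.
STORAGE_PER_GB = 0.01        # one cent per gigabyte per month
REQUESTS_PER_1000 = 0.05     # five cents per thousand upload/retrieval requests
TRANSFER_OUT_PER_GB = 0.12   # twelve cents per gigabyte transferred out

archive_gb = 50 * 1024       # a hypothetical 50 TB archive
requests = 10_000            # hypothetical requests this month
retrieved_gb = 100           # hypothetical data pulled back out

cost = (archive_gb * STORAGE_PER_GB
        + (requests / 1000) * REQUESTS_PER_1000
        + retrieved_gb * TRANSFER_OUT_PER_GB)
print(f"${cost:,.2f} per month")   # roughly $524 for this example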

Companies usually incur significant costs for data archiving. They initially make an expensive upfront payment, after which they end up purchasing additional storage space in anticipation of growing backup demand, leading to under-utilized capacity and wasted money. With Amazon Glacier, companies will be able to keep costs in line with actual usage, allowing managers to know the exact costs of their storage systems at all times.

Cloud storage came into prominence in 2009, with Nirvanix and Amazon's Simple Storage Service (S3) being two of the major pioneers. Since then, Amazon has continued to dominate the space, with other players like Rackspace (RAX) and Microsoft (MSFT) offering their own solutions.


Tracking NGS IT Technologies

On my Smarter NGS website, I added a section today to follow and track the development of Next-Gen Sequencing IT technologies such as GPU, Hadoop and clusters. The first area to start with is Hadoop.

 

jPage: HPC Cloud Providers

I will start a post to compile a list of HPC cloud service providers. These are vendors that provide HPC Platform as a Service, such that users can sign up and run HPC workloads in a public cloud without owning any infrastructure on premises.

Amazon EC2
Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.  (EC2 HPC Feature)

ServedBy.net
Cloud HPC Infrastructure from ServedBy the Net can sustain the most intensive HPC applications with a predictable cost based on your needs. We manage the expensive hardware and provide ready-to-run OS templates, which saves you time and money and allows you to focus on the needs of your application.

GreenButton
GreenButton™ is an award-winning global software company that specializes in high-performance computing (HPC) in the cloud. The company provides a cloud platform for the development and delivery of software and services that enables independent software vendors (ISVs) to move to the cloud and their users to access cloud resources. GreenButton is Microsoft Corp's 2011 Windows Azure ISV Partner of the Year, and the company has offices in New Zealand, Palo Alto and Seattle.

ProfitBricks
ProfitBricks is the first HPC cloud provider to introduce InfiniBand as the interconnect fabric for its cloud infrastructure, which the company says makes its offering the most capable HPC system available from the Cloud.


Nimbix
Nimbix is a provider of cloud-based High Performance Computing infrastructure and applications. Nimbix offers HPC applications as a service through the Nimbix Accelerated Compute Cloud ™, dramatically speeding up data processing for Life Sciences, Oil & Gas and Rendering applications. Nimbix operates unique high performance hybrid systems and accelerated servers in its Dallas, Texas datacenter.


Update:
  • 2012.08.22 - original post
  • 2012.08.24 - added GreenButton
  • 2012.09.12 - added ProfitBricks
  • 2012.09.18 - added Nimbix

eMagazine Explores Convergence of HPC, Big Data and Cloud

The latest issue of Journey to Cloud, Intel’s cloud computing eMagazine, is hot off the presses and ready to download. In this issue, writers explore key topics like alternative solutions for big data scale-out storage, Hadoop, next-generation cloud management, cloud security, HPC on demand, and more.

You can access or download the eMagazine for free from Intel.



jTool: i2b2 for Genomic Medicine

Informatics for Integrating Biology and the Bedside (i2b2) is one of seven projects sponsored by the NIH Roadmap National Centers for Biomedical Computing (http://www.ncbcs.org).

Its mission is to provide clinical investigators with the tools necessary to integrate medical record and clinical research data in the genomics age: a software suite to construct and integrate the modern clinical research chart. i2b2 software may be used by an enterprise's research community to find sets of interesting patients from electronic patient medical record data, while preserving patient privacy through a query tool interface.

Project-specific mini-databases ("data marts") can be created from these sets to make highly detailed data available on these specific patients to the investigators on the i2b2 platform, as reviewed and restricted by the Institutional Review Board.

The current version of this software has been released into the public domain.

Published use cases:
Links:

Update:
  • original post: 2012.08.22

Wednesday, August 15, 2012

Running HPC Workload on Cloud - A Real-life Case Study

Running high-performance computing (HPC) workloads in the public cloud has been an interesting yet challenging issue for many researchers and research institutes. The prospect of having large computational power at one's fingertips on a moment's notice is alluring, yet the potential pitfalls of poor performance and high cost can be scary.

I came upon a thoughtful and insightful blog post from "Astronomy Computing Today". It presents research by Ewa Deelman of USC and G. Bruce Berriman of Caltech on the interesting topic of running HPC workloads on the Cloud, in this case running astronomical applications on the Amazon EC2 HPC Cloud.

The conclusions of the research are twofold: 1) the cost-performance of running on the Cloud depends on application requirements and needs to be carefully evaluated; 2) for now, avoid workloads that demand mass storage, as storage is still the most expensive aspect of running on a public cloud such as EC2.
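
In the spirit of the first conclusion, here is a toy cost-performance comparison; every number below is a hypothetical placeholder meant to show the kind of evaluation the authors recommend, not a figure from the study.

# Toy cost-performance comparison for one workload, on-premise vs. public cloud.
# Every number below is a hypothetical placeholder, not data from the study.
def cost_per_run(hours, nodes, rate_per_node_hour, storage_gb, storage_rate_gb_month):
    compute = hours * nodes * rate_per_node_hour
    storage = storage_gb * storage_rate_gb_month   # keep the results for a month
    return compute + storage

# assumed on-premise figures (amortized hardware and power as an hourly rate)
onprem = cost_per_run(10, 32, 0.40, 2000, 0.02)

# assumed cloud figures (slightly slower nodes, pricier storage)
cloud = cost_per_run(12, 32, 0.50, 2000, 0.10)

print(f"on-premise: ${onprem:.2f}  cloud: ${cloud:.2f}")
# In this made-up case the storage term is the largest piece of the cloud bill,
# echoing the second conclusion about mass storage.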

You can access the post here.

Update:
  • 2012.08.15: original post

Monday, August 13, 2012

Bigger, Faster and Cheaper - PetaStore Case Study

In 2010, I architected the Petascale Active Archive solution for the University of Oklahoma, which was implemented as the PetaStore system in 2011. It is a combined disk-tape active archive solution. The system has been in production for about a year, and a case study was published by IBM today.

Since 1890, the University of Oklahoma (OU) has provided higher-level education and valuable research through its academic programs. With involvement in science, technology, engineering and mathematics, the university has increased its focus on high performance computing (HPC) to support data-centric research. In service of OU’s education and research mission, the OU Supercomputing Center for Education & Research (OSCER), a division of OU Information Technology, provides support for research projects, providing an HPC infrastructure for the university.
 
Rapid data growth in academic research
 
In a worldwide trend that spans the full spectrum of academic research, explosive data growth has directly affected university research programs. Including a diverse range of data sources, from gene sequencing to astronomy, datasets have rapidly grown to, in some cases, multiple petabytes (millions of gigabytes).
 
One ongoing research project that produces massive amounts of data is conducted by OU’s Center for Analysis and Prediction of Storms. Each year, this project becomes one of the world’s largest storm forecasting endeavors, frequently producing terabytes of data per day. Much of this real-time data is shared with professional forecasters, but a large amount is stored for later analysis. Long-term storage of this data holds strong scientific value, and in many cases, is required by research funding agencies. Understandably, storage space had become a major issue for the university.
 
Need for an onsite storage system
 
In the past, for projects like storm forecasting, OU did not have the capability to store large amounts of data on campus—much of the data had to be stored offsite at national supercomputing centers. This not only created issues for performance and management at the university, it also forced researchers to reduce the amounts of data for offsite storage, creating potential for loss of information that could be valuable for future analysis.
 
Henry Neeman, director of OSCER, realized that to continue supporting many of the university’s research projects—and to retain funding—OU would need a large scale archival storage system that enabled long term data storage while containing costs for deployment and operations.
 
With a clear vision for the new storage system, OU began reviewing bids from multiple vendors. Neeman noticed that while most proposed solutions were technically capable, the IBM solution was able to meet technical requirements and stay within budget. Ultimately, it offered the best value to the university and would go on to establish a powerful new business model for storage of research data.
 
High-capacity, cost-effective data archive
 
Implementing a combination of disk- and tape-based storage, OU was able to establish a storage system known as the Oklahoma PetaStore, which is capable of handling petabytes (PB) of data. For high-capacity disk storage, the IBM System Storage DCS9900 was selected—which is scalable up to 1.7 PB. For longer-term data storage, OU chose the System Storage TS3500 Tape Library—with an initial capacity up to 4.3 PB and expandable to over 60 PB. To run these storage systems, six IBM System x3650 class servers were selected, running IBM General Parallel File System (GPFS™) on the disk system and IBM Tivoli Storage Manager on the tape library to automatically move or copy data to tape.
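
Conceptually, the disk pool serves as a fast landing zone while the tape library holds the deep archive. The sketch below illustrates that kind of age-driven migration from disk to tape in Python; it is not the actual GPFS or Tivoli Storage Manager policy used in the PetaStore, and the paths and threshold are hypothetical.

# Illustrative age-driven migration from a disk tier to a tape staging area.
# Not the PetaStore's actual GPFS/Tivoli Storage Manager policy; the paths
# and threshold below are hypothetical.
import os, shutil, time

DISK_POOL = "/petastore/disk"    # hypothetical disk landing zone
TAPE_STAGE = "/petastore/tape"   # hypothetical staging area for the tape library
MAX_AGE_DAYS = 30                # migrate files untouched for a month

def migrate_cold_files():
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for name in os.listdir(DISK_POOL):
        path = os.path.join(DISK_POOL, name)
        if os.path.isfile(path) and os.path.getatime(path) < cutoff:
            # copy to the archive tier first, then release the disk copy
            shutil.copy2(path, os.path.join(TAPE_STAGE, name))
            os.remove(path)

if __name__ == "__main__":
    migrate_cold_files()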
 
Neeman says one of the main reasons they chose IBM was the cost effectiveness of the tape solution. Unlike the TS3500 and Tivoli Storage Manager, many other tape solutions impose additional cost, such as tape cartridge slot activation upcharges and per-capacity software upcharges—demands that could be prohibitive to researchers. The TS3500 Tape Library offers a flexible upgrade path, enabling users to easily and affordably expand the initial capacity. These savings even enabled OU to implement a mechanism to access and manage backup data through extensible interfaces.

OU has adopted an innovative business model under which storage costs are shared among stakeholders. In this model, a grant from the National Science Foundation pays for the hardware, software and initial support; OU covers the space, power, cooling, labor and longer-term support costs; and the researchers purchase storage media (tape cartridges and disk drives) to archive their datasets, which OSCER deploys and maintains without usage upcharges.
 
Storage that impresses on many levels
 
The PetaStore provides research teams with a hugely expandable archive system, allowing data to be stored through several duplication policy choices that are set by the researchers. The connectivity capabilities allow data to be accessible not only to the university, but to other institutions and collaborators.
 
Although capacity was more of a priority than speed when designing the PetaStore, this IBM solution has shown strong performance, with tape drives operating close to peak speed. Another key benefit to the solution is its cost-effectiveness—not only for hardware, but for the reduction of labor costs for the researchers. These benefits have been noticed by Neeman, who says, “Without the PetaStore, several very large scale, data-centric research projects would be considerably more difficult, time consuming and expensive to undertake—some of them so much so as to be impractical.”
 
Continued innovation with IBM
 
By choosing the IBM solution for the PetaStore project, the University of Oklahoma has ensured a future of continued innovation in academic research. The system not only facilitates storage for the entire lifecycle of research data, it ensures that the PetaStore can continue operating and expanding at very low cost. This is critical for the university to continue to receive funding—the solution’s built-in cost efficiency proves to research funding agencies that the university can continue to operate the storage system within budget.

Overall, the university and research teams have seen numerous advantages to the IBM solution, and plan for it to seamlessly expand along with their storage needs. According to Neeman, "We only needed three things: bigger, faster and cheaper," and the IBM solution was able to deliver on all fronts. Neeman predicts that data storage solutions like the Oklahoma PetaStore will become increasingly common at research institutions across the country and worldwide.

Source: