Thursday, December 25, 2014

History of IBM Reference Architecture for Genomics

About two years ago, I initiated an IBM project with MD Anderson Cancer Center to develop a research computing infrastructure for cancer genomics.

The reference architecture was born out of necessity from that project -- there were so many applications, workflows, and pipelines to handle, with even more choices of infrastructure and informatics technologies, both of which were also evolving quickly. We called the project PureGene -- "Pure" coming from the PureFlex System hardware platform, and "Gene" representing its connection to genes, genetics, and genomics.

As the project developed and matured, the reference architecture took root and gained more traction inside IBM. Starting in 2014, a new term, "PowerGene," was adopted, as "Power" better represents an IBM brand, while PureFlex was being moved to Lenovo as part of the divestiture.

Throughout 2014, PowerGene continued to gain adoption across the biomedical research community; 21 institutions and companies now use PowerGene as their enterprise architecture for research.

The idea of a reference architecture is for it to be both a point-in-time design for a system or platform and a blueprint for future expansion and growth. Doesn't that sound like exactly what genes and genomes are capable of? With that "organic" connection, I will start using the suffix "Gene" in naming all reference architectures, with PowerGene being the first.

Monday, November 17, 2014

GATK on POWER

Today, the Broad Institute announced the release of GATK on the IBM POWER system. The HaplotypeCaller analysis can take a long time, and this native implementation leverages the new POWER8 systems for performance optimization. The optimized code runs on Ubuntu 14 and RHEL 7 platforms.

For more information and to download the code, visit:

A big thanks to the IBM PowerGene development and ISV team for their optimization work: Yinhe Cheng and Kathy Tzeng.

Also thanks to my friend Mauricio Carneiro, GATK development team manager, for allowing IBM to take on this challenge when we hatched the idea together during a genomics conference in Doha in June.

More Power to Genomics - scale, speed and SMARTS.



Friday, November 14, 2014

IBM Moves Big Irons for Next-Gen HPC

Earlier today the US Department of Energy announced that IBM has been awarded a $325 million contract to design and build two supercomputing systems, at Lawrence Livermore and Oak Ridge National Laboratories, both based on IBM's OpenPOWER technology.

As part of the announcement, Secretary of Energy Ernest Moniz and Sen. Lamar Alexander (R- Tenn.) hosted a press briefing at the US Capitol Building with John Kelly in attendance, along with multiple government leaders.

Notable quotes:

IBM can still move big iron. -Forbes

The systems to be installed at Lawrence Livermore and Oak Ridge will use future versions of IBM Power chip line to handle basic computation chores. -- Wall Street Journal

Both systems will be based on next-generation IBM POWER servers with NVIDIA's GPU accelerators and Mellanox's interconnect technologies "to advance key research initiatives for national nuclear deterrence, technology advancement and scientific discovery," the DOE said. -- Business Standard

Optimized GATK Pipeline

We are wrapping up work and preparing to showcase the optimized GATK on POWER8 next week at SC14. Besides the SOAP pipeline, this will be another landmark for the PowerGene::Genomics platform in supporting mission-critical genomics applications.



Wednesday, September 17, 2014

Big Data Analytics Meets Molecular Tumor Profiling

Today IBM published a press story on how Caris Life Sciences built its advanced cancer molecular profiling platform on the IBM HPC and Genomics solution.

I was quoted in the story:
“Precision cancer diagnosis and treatment requires the ability to process and analyze staggering amounts of genomic and other clinical data with scale and speed, and to accelerate this work, we need to provide clinicians and analysts with technical computing platforms,” said Frank N. Lee, Ph.D., Lead Architect, Genomic Medicine, IBM. “IBM technology has helped Caris handle this data volume with greater speed, efficiency and scalability, and at the same time maintain the required security and reliability essential for handling medical and health data.”
The press release can be found on IBM.com.

In a companion blog post, Dr. George Poste of Caris Life Sciences explains the details of cancer molecular tumor profiling.


Friday, August 8, 2014

Speed, Scale, SMARTS - High Performance Genomics

The last time I had writing published was during the academic years of my career -- two papers in top journals, the EMBO Journal (Europe) and PNAS (US). Fourteen years later, I return to published writing with this editorial article in the current issue of the IBM Systems Journal (August 2014).

In the article, I offer the first descriptive view of what I envision as the key capabilities of a genomics computing architecture (code-named "PowerGene"): Datahub for data storage and management; Orchestrator for workload and workflow management; and AppCenter for managing, sharing, and accessing data and workloads. Together, the three capabilities enable a data-centric, software-defined, and application-ready infrastructure.
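To make that a bit more concrete, here is a rough sketch of the three capability layers as a simple data structure. This is my own illustration in Python; the example components listed under each layer are hypothetical, not a definitive product mapping.

# Hypothetical sketch of the three PowerGene capability layers described above.
# The example components are illustrative only.
POWERGENE_CAPABILITIES = {
    "Datahub": {
        "purpose": "data storage and management",
        "example_components": ["parallel file system", "tiered storage", "metadata catalog"],
    },
    "Orchestrator": {
        "purpose": "workload and workflow management",
        "example_components": ["job scheduler", "workflow engine", "provenance tracking"],
    },
    "AppCenter": {
        "purpose": "managing, sharing, and accessing data and workloads",
        "example_components": ["application portal", "pipeline templates", "access control"],
    },
}

for layer, details in POWERGENE_CAPABILITIES.items():
    print(layer, "-", details["purpose"])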

The digital version of the editorial can be accessed online or downloaded as a PDF. Readers can also view a brief bio page of mine at the Systems Journal here.

In this issue of the Systems Magazine, there is also an interview with Lynda Chin, MD, chair of Genomic Medicine at MD Anderson Cancer Center, on how her team and IBM worked together to transform healthcare.



Thursday, June 19, 2014

Beyond Big Data - Building Genomics Hub (Talk at 4th GTC NGS Conference)

In my featured speaker talk at the 4th GTC NGS conference, I listed four reasons why building a genomics hub is a greater challenge than dealing with big data. They all have to do with the unique characteristics of genomic data.

1. data carries speed, as it comes down the pipe in varying shapes, forms, and velocities
2. data has temperature, as it cycles through the pipeline (see the sketch after this list)
3. data needs an address, as it now must be shared locally and globally
4. data lives forever -- through workloads and workflows, accessed by users anytime and anywhere
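As a toy illustration of point 2, here is how a "data temperature" rule might look in code. This is my own sketch with hypothetical tier names and thresholds; in a real deployment this kind of rule is usually expressed as a storage placement/ILM policy rather than application code.

import time

# Toy illustration of "data has temperature": pick a storage tier from how
# recently a file was accessed. Tier names and thresholds are hypothetical.
TIER_THRESHOLDS_DAYS = [
    (7,   "hot: flash / fast scratch"),
    (90,  "warm: capacity disk"),
    (365, "cold: tape or object archive"),
]

def pick_tier(last_access_epoch, now=None):
    now = now or time.time()
    age_days = (now - last_access_epoch) / 86400
    for threshold, tier in TIER_THRESHOLDS_DAYS:
        if age_days <= threshold:
            return tier
    return "frozen: offline archive"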



The talk was titled "Beyond Big Data - Challenges & Strategies for Building Genomics Hub". The abstract of the talk follows:
Dr. Frank N. Lee will describe IBM’s open and extensible reference architecture for genomic medicine, including integration of genomic data within a translational platform. The architecture describes a unique, converged platform for high-throughput genomics and analytics through on-premise, cloud and hybrid delivery. It includes a scalable data repository, powerful workload engine and genomics application center for commonly used applications. Finally, we will describe how IBM can help deliver individualized patient care. 

The talk was given at the 4th GTC Next-Gen Sequencing Conference in San Diego on June 19, 2014.


Thursday, May 22, 2014

HPDA Orchestrator - Platform LSF

With the success of applying the Platform Computing stack to genomics, a growing number of scientific disciplines, and other fields such as big data analytics, I am frequently asked about the scope and effort it takes to migrate to or adopt Platform LSF. Below I compile some documentation and best practices as a reference point:

PBS/Torque to LSF Migration Manual
This is a useful document showing how Durham University migrated from PBS/Torque to LSF. The manual was originally from here. A rough sketch of a few typical directive conversions follows.
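For orientation, the sketch below (my own illustration, not taken from the Durham manual) pairs a few common PBS/Torque directives with approximate LSF equivalents. Exact options and semantics should be verified against the migration documents and your site configuration.

# Rough, illustrative mapping of common PBS/Torque directives to approximate
# LSF (bsub) equivalents. Core/node layout and time formats differ between the
# schedulers, so treat these as starting points, not drop-in conversions.
PBS_TO_LSF = {
    "#PBS -N myjob":              "#BSUB -J myjob",    # job name
    "#PBS -q batch":              "#BSUB -q batch",    # queue
    "#PBS -l walltime=02:00:00":  "#BSUB -W 2:00",     # wall-clock limit
    "#PBS -l nodes=1:ppn=8":      "#BSUB -n 8",        # slots/cores (layout differs)
    "#PBS -o out.log":            "#BSUB -o out.log",  # stdout file
    "#PBS -e err.log":            "#BSUB -e err.log",  # stderr file
    "qsub job.sh":                "bsub < job.sh",     # submission command
}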

LSF Migration Guide
This is a quick reference to assist system administrators or application support staff in converting MOAB scripts to LSF scripts.

Download LSF DRMAA

The latest Platform LSF version has support for DRMAA, which can be downloaded from the link above. A minimal submission example using the Python DRMAA bindings is sketched below.
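Here is a minimal sketch of submitting a job through DRMAA from Python. It assumes the drmaa-python bindings and the LSF DRMAA library are installed (with DRMAA_LIBRARY_PATH pointing at the library); the command, arguments, queue, and output path are placeholders.

import drmaa  # drmaa-python bindings; requires an installed DRMAA library for LSF

s = drmaa.Session()
s.initialize()

jt = s.createJobTemplate()
jt.remoteCommand = "bwa"                    # placeholder command
jt.args = ["mem", "ref.fa", "reads.fq"]     # placeholder arguments
jt.outputPath = ":/tmp/bwa.out"             # DRMAA output paths start with ':'
jt.nativeSpecification = "-q normal"        # placeholder queue, passed to the scheduler

job_id = s.runJob(jt)
print("Submitted job:", job_id)

info = s.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
print("Job finished with exit status:", info.exitStatus)

s.deleteJobTemplate(jt)
s.exit()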
 
If you have produced similar documentation and would like to share it with the scientific community, please send it to me at drfranknlee@gmail.com.

Wednesday, May 7, 2014

Sequence the City

Starting today at the IBM Almaden Research Center, scores of scientists and researchers are gathering for a two-day symposium called "Sequence the City: Metagenomics in the Era of Big Data."

The goal of the symposium? To explore how pervasive DNA sequencing across a city (or factory, farm, air travel system, etc.) could support a robust framework for safeguarding human and environmental health. The overarching aim is to articulate the promise and the challenge of genomics in the era of big data.

Metagenomics is the study of mixtures of genetic material corresponding to communities of bacteria, fungi, viruses, and other living things, often co-existing in a microenvironment like on a countertop or in the human gastrointestinal tract. Results from metagenomics research are leading to a new appreciation for the critical role that microbes play in the health of people and the environment. Continued advances in DNA sequencing technology and laboratory sample preparation make it possible to track microbes as they live amongst us, from the typical species and strains, to the malevolent needles in the haystack.

Genomic analysis of food, water, waste, soil, crops, insects, and swabbed surfaces from natural and built environments could provide advanced warning of threats to public health. Biosurveillance and diagnostics applications using DNA sequencing technology will require new algorithms and computational methods. Built upon new I/T services, software applications that digitally identify microbes may replace labor-intensive and time-consuming biochemical assays, effectively moving detection from the wet bench to the computer.
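As a toy illustration of what "digitally identifying microbes" can mean computationally -- my own sketch, not a method presented at the symposium -- a naive approach counts the k-mers a sequencing read shares with each candidate reference genome; production classifiers use indexed data structures and far more rigorous statistics.

# Toy k-mer matching sketch: score each reference organism by how many k-mers
# it shares with a sequencing read, and report the best-scoring organism.
def kmers(seq, k=21):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify(read, references):
    """references: dict mapping organism name -> genome sequence (toy data)."""
    read_kmers = kmers(read)
    scores = {name: len(read_kmers & kmers(genome))
              for name, genome in references.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None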

The symposium will focus on a broad spectrum of science and computational topics including the human microbiome, bacterial and viral evolution, agriculture, food safety, and the microbial ecology of buildings (e.g., factories) and even entire cities. The researchers will review some of the computational methods and analyses that are applied to genomic data today, and explore how new genomic data and services in the cloud could enable entirely new applications in the future.

Here is a report on the event from CNET.
 

Wednesday, April 30, 2014

A New Era for Genomics - My Talk at BioIT World

In a talk today at BioIT World, to an audience of roughly 100, I gave an overview of PowerGene v7.5 and the latest updates on what we are working on: scaling to petabytes of storage, speeding up the NGS pipeline to complete in hours, and getting smarter with the data by linking and analyzing it.

It's an exciting time, as we are entering a new era for genomics - Genomics 2.0, converging genomics with clinical and real-world data to deliver insight and knowledge.

You can download the presentation here.

Sunday, April 13, 2014

Releasing IBM Reference Architecture for Genomics v7.5.0

I am releasing IBM Reference Architecture for Genomics version 7.5.0 today.

The reference architecture covers three key areas of genomic medicine: 1) a genomics platform to handle next-gen sequencing and data analysis, 2) translational research for integrating and analyzing data from clinical and genomic studies, and 3) personalized medicine for analyzing and applying targeted (precision) medical and therapeutic treatments based on individual profiles.

The updates include adding a Provenance box to PG-Orchestrator, unifying PG-Datahub and PG-Access across all three platforms, and renaming the top platform to Personalized Medicine (from Watson in the previous version).

You can download the pdf version for free here.

Wednesday, April 9, 2014

How about a True Workflow Engine? Introducing PowerGene Orchestrator

While doing my post-doc research at the WashU Genome Center and Genetics Department in the early days of the Human Genome Project, I developed a computational bioinformatics pipeline for zebrafish genomic analysis. I used Perl scripting to build an elaborate workflow with dependencies, subflows, and so on, and triggered the flow manually on the Genome Center's HPC cluster.

Sound familiar? If you are taking on a similar challenge, or have already done so and moved on to related tooling such as Galaxy, I have some good news for you.

How about a truly easy-to-use, yet powerful-to-the-core solution: the PowerGene Orchestrator?

As a footnote: PowerGene (PG) is the codename for the IBM Genomic Medicine Reference Architecture that I started in 2011 during a project with MD Anderson. It had no name for about a year, until the team started calling it PureGene for the length of 2013. The IBM core team finally settled on the name PowerGene early this year and launched it at a team event in NYC.

So here is how the PG-Orchestrator works, from an end-user point of view (a minimal conceptual sketch of the flow logic follows the steps):

1. The user uses Flow Editor (a Windows desktop app) to create a flow definition, say for a genomic pipeline running CASAVA, BWA, and Samtools through GATK, and submits it to the PG-Orchestrator server

2. PG-Orchestrator server stores the flow definition in its working directory.

3. When the flow is triggered, PG-Orchestrator server manages the dependencies within the flow. When a job is ready to be run, PG-Orchestrator server submits it to HPC master host.

4. The HPC master host manages any resource dependencies the job may have, and dispatches the job to an appropriate compute host (physical server, virtual machine, cluster or cloud)

5. When the job runs, the compute host sends the status of the job to the HPC  master host, which writes the job status to an event log

6. The PG-Orchestrator server reads the event log periodically to obtain the status of the jobs it submitted.

7. The PG-Orchestrator server uses the status of the job to determine the next appropriate action in the flow.

8. During the job/flow run, users can monitor the status through a web portal or the command line. If the workflow stops for whatever reason, the user can open a web browser to evaluate the status or investigate the cause.
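To make the flow logic above concrete, here is a minimal conceptual sketch in Python -- my own illustration, not PG-Orchestrator code -- of dispatching steps once their dependencies have completed. The step names and echo commands are placeholders standing in for real pipeline stages, and a real orchestrator would submit to the scheduler and read status back from its event log rather than running commands directly.

import subprocess
import time

# Conceptual sketch of steps 3-7 above: walk a flow definition, run each step
# once its dependencies are done, and record completion. Echo commands stand in
# for real pipeline stages such as CASAVA, BWA, Samtools, and GATK.
flow = {
    "align":         {"cmd": "echo aligning reads",      "deps": []},
    "sort":          {"cmd": "echo sorting alignments",  "deps": ["align"]},
    "call_variants": {"cmd": "echo calling variants",    "deps": ["sort"]},
}

done = set()
while len(done) < len(flow):
    for name, step in flow.items():
        if name not in done and all(d in done for d in step["deps"]):
            print("dispatching step:", name)
            subprocess.run(step["cmd"], shell=True, check=True)
            done.add(name)
    time.sleep(0.1)  # in practice, poll the scheduler's event log for job status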

So that's the conceptual use case for the PowerGene Orchestrator. If you are interested, just drop me a note and we can follow up. I will also show a demo of the Workflow Orchestrator at BioIT World, coming up in Boston at the end of April.



Tuesday, March 18, 2014

CU Peta Library Project

Dr. Thomas Hauser from the University of Colorado Boulder explains how over 1,500 researchers creating massive amounts of data are supported by the Peta Library Project, with the help of IBM GPFS and its ability to easily move data from tier to tier.

 

Monday, February 24, 2014

IBM Adds Cloudant to Cloud Ecosystem


IBM announced today a definitive agreement to acquire Boston, MA-based Cloudant, Inc., a privately held database-as-a-service (DBaaS) provider that enables developers to easily and quickly create next generation mobile and web apps.  

Cloudant will extend IBM's Big Data and Analytics, Cloud Computing and Mobile offerings by further helping clients take advantage of these key growth initiatives.

http://www-03.ibm.com/press/us/en/pressrelease/43238.wss

Wednesday, January 29, 2014

Launching IBM PowerGene

I am launching IBM PowerGene v1.0 today in New York City. It is an open, scalable, and end-to-end system framework based on our reference architecture for omics research, translational science, and personalized medicine. There will be three flavors initially - Gateway, Cluster and Cloud.