Wednesday, April 30, 2014

A New Era for Genomics - My Talk at BioIT World

In a talk today at BioIT World to an audience of roughly 100, I gave an overview of PowerGene v7.5 and the latest update on what we are working on: scaling to petabytes (PB) of storage, speeding up the NGS pipeline so it completes in hours, and making the data smarter by linking and analyzing it.

It's an exciting time as we are entering a new era for genomics - Genomic 2.0, in which genomics converges with clinical and real-world data to deliver insight and knowledge.

You can download the presentation here

Sunday, April 13, 2014

Releasing IBM Reference Architecture for Genomics v7.5.0

I am releasing IBM Reference Architecture for Genomics version 7.5.0 today.

The reference architecture covers three key areas of genomic medicine: 1) a genomics platform to handle next-gen sequencing and data analysis, 2) translational research for integrating and analyzing data from clinical and genomic studies, and 3) personalized medicine for analyzing and applying targeted (precision) medical and therapeutic treatment based on an individual's profile.

The updates include adding a Provenance box to PG-Orchestrator and unifying PG-datahub and PG-access across all three platforms. The top platform has also been renamed to Personalized Medicine (from Watson in the previous version).

You can download the PDF version for free here.

Wednesday, April 9, 2014

How about a True Workflow Engine? Introducing PowerGene Orchestrator

While doing my post-doc research at the WashU Genome Center and Genetics Department in the early days of the Human Genome Project, I wrote a computational bioinformatics pipeline for Zebrafish genomic analysis. I used Perl scripting to build an elaborate workflow with dependencies, subflows, and so on, and triggered the flow manually on the Genome Center's HPC cluster.

Sound familiar? If you are taking on a similar challenge, or have already done so and moved on to related tooling such as Galaxy, I have some better news for you.

How about a truly-easy-to-use, yet powerful-to-the-core solution in the PowerGene Orchestrator?

As a footnote: PowerGene (PG) is the codename for the IBM Genomic Medicine Reference Architecture that I started in 2011 during a project with MD Anderson. There was no name for it for about a year, until the team started calling it PureGene throughout 2013. The IBM core team finally settled on the name PowerGene early this year and launched it at a team event in NYC.

So here is how the PG-Orchestrator works, from an end-user point of view:

1. The user uses Flow Editor (a Windows desktop app) to create a flow definition, say for a genomic pipeline running from CASAVA through BWA and SAMtools to GATK, and submits it to the PG-Orchestrator server.

2. The PG-Orchestrator server stores the flow definition in its working directory.

3. When the flow is triggered, the PG-Orchestrator server manages the dependencies within the flow. When a job is ready to run, the PG-Orchestrator server submits it to the HPC master host.

4. The HPC master host manages any resource dependencies the job may have, and dispatches the job to an appropriate compute host (physical server, virtual machine, cluster or cloud).

5. When the job runs, the compute host sends the status of the job to the HPC master host, which writes the job status to an event log.

6. The PG-Orchestrator server reads the event log periodically to obtain the status of the jobs it submitted.

7. The PG-Orchestrator server uses the status of the job to determine the next appropriate action in the flow.

8. During the job/flow run, users can monitor the status through a web portal or the command line. If the workflow stops for whatever reason, the user can open the web browser to evaluate the status or investigate the cause of the stoppage.
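
To make steps 3 to 7 more concrete, here is a minimal Python sketch of the control loop: submit every job whose dependencies are done, read new entries from the event log, and use the statuses to decide what to do next. This is not the PG-Orchestrator code or API. The `FLOW` dictionary, the `FakeScheduler` class (a stand-in for the HPC master host), the `run_flow` function and the status names `WAITING`, `SUBMITTED`, `DONE` and `EXIT` are all assumptions made up for illustration, and jobs here finish instantly rather than running on a real cluster.

```python
# Hypothetical flow definition: each job maps to the jobs it depends on.
FLOW = {
    "casava": [],
    "bwa": ["casava"],
    "samtools": ["bwa"],
    "gatk": ["samtools"],
}


class FakeScheduler:
    """Stand-in for the HPC master host (an assumption, not a real API).

    It "runs" each submitted job immediately and appends the outcome
    to an event log, which the orchestrator loop reads later.
    """

    def __init__(self, failing=()):
        self.event_log = []           # list of (job, status) tuples
        self.failing = set(failing)   # jobs that should end with EXIT

    def submit(self, job):
        status = "EXIT" if job in self.failing else "DONE"
        self.event_log.append((job, status))


def run_flow(flow, scheduler):
    """Drive a flow to completion, or until it stalls on a failed job."""
    status = {job: "WAITING" for job in flow}
    offset = 0  # number of event-log entries already read

    while True:
        # Step 3: find jobs whose dependencies are all done and submit them.
        ready = [
            job for job, deps in flow.items()
            if status[job] == "WAITING"
            and all(status[dep] == "DONE" for dep in deps)
        ]
        for job in ready:
            status[job] = "SUBMITTED"
            scheduler.submit(job)

        # Step 6: read only the event-log entries added since the last poll.
        new_events = scheduler.event_log[offset:]
        offset = len(scheduler.event_log)

        # Step 7: record the statuses; they drive the next iteration.
        for job, job_status in new_events:
            status[job] = job_status

        # Nothing submitted and nothing new in the log: finished or stalled.
        if not ready and not new_events:
            return status


if __name__ == "__main__":
    print(run_flow(FLOW, FakeScheduler()))
    print(run_flow(FLOW, FakeScheduler(failing={"bwa"})))
```

In the second call the hypothetical `bwa` job fails, so `samtools` and `gatk` stay in `WAITING`. That is the kind of stalled flow a user would investigate through the web portal in step 8.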

So that's the conceptual use case for the PowerGene Orchestrator. If you are interested, just drop me a note and we can follow up. I will also show a demo of Workflow Orchestrator at BioIT World, coming up in Boston at the end of April.