While doing my post-doc research at the WashU Genome Center and Genetics Department in the early days of the Human Genome Project, I wrote a computational bioinformatics pipeline for zebrafish genomic analysis. Using Perl scripts, I built an elaborate workflow with dependencies, subflows, and so on, and triggered the flow manually on the Genome Center's HPC cluster.
Sound familiar? If you are taking on a similar challenge, or have already done so and moved on to related tooling such as Galaxy, I have some good news for you.
How about a truly easy-to-use, yet powerful-to-the-core solution: the PowerGene Orchestrator?
As a footnote: PowerGene (PG) is the codename for the IBM Genomic Medicine Reference Architecture that I started in 2011 during a project with MD Anderson. It went unnamed for about a year, until the team started calling it PureGene throughout 2013. The IBM core team finally settled on the name PowerGene early this year and launched it at a team event in NYC.
So here is how the PG-Orchestrator works, from an end-user point of view:
1. The user uses the Flow Editor (a Windows desktop app) to create a flow definition, say for a genomic pipeline running CASAVA, BWA, SAMtools, and GATK, and submits it to the PG-Orchestrator server.
2. PG-Orchestrator server stores the flow definition in its working directory.
3. When the flow is triggered, the PG-Orchestrator server manages the dependencies within the flow. When a job is ready to run, the PG-Orchestrator server submits it to the HPC master host.
4. The HPC master host manages any resource dependencies the job may have, and dispatches the job to an appropriate compute host (a physical server, virtual machine, cluster, or cloud).
5. When the job runs, the compute host sends the status of the job to the HPC master host, which writes the job status to an event log.
6. The PG-Orchestrator server reads the event log periodically to obtain the status of the jobs it submitted.
7. The PG-Orchestrator server uses the status of the job to determine the next appropriate action in the flow.
8. During the job/flow run, users can monitor the status through a web portal or the command line. If the workflow stops for any reason, the user can open a web browser to evaluate the status or investigate the cause of the stoppage.
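The dependency-driven dispatch loop in steps 1–7 can be sketched in a few lines. This is an illustrative toy, not the actual PG-Orchestrator code: the flow definition format, and the `submit` and `poll_events` hooks (stand-ins for handing a job to the HPC master host and reading its event log), are all assumptions made for the sake of the example.

```python
# Hypothetical flow definition: each job maps to the jobs it depends on,
# mirroring the CASAVA -> BWA -> SAMtools -> GATK pipeline above.
FLOW = {
    "casava": [],
    "bwa": ["casava"],
    "samtools": ["bwa"],
    "gatk": ["samtools"],
}

def ready_jobs(flow, done, submitted):
    """Jobs whose dependencies are all complete and that are not yet submitted."""
    return [job for job, deps in flow.items()
            if job not in done and job not in submitted
            and all(d in done for d in deps)]

def run_flow(flow, submit, poll_events):
    """Drive the flow to completion (steps 3, 6, and 7 of the description).

    submit(job)   -- hand a ready job to the HPC master host (step 3)
    poll_events() -- read newly completed jobs from the event log (step 6)
    """
    done, submitted = set(), set()
    while len(done) < len(flow):
        for job in ready_jobs(flow, done, submitted):
            submit(job)
            submitted.add(job)
        done |= poll_events()  # step 7: completed jobs unlock the next action
    return done
```

In the real system the submit and poll hooks would talk to the HPC master host and its event log; here they could just as easily be local function stubs, which is what makes the dependency logic easy to test in isolation.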
So that's the conceptual use case for the PowerGene Orchestrator. If you are interested, just drop me a note and we can follow up. I will also show a demo of the Workflow Orchestrator at BioIT World, coming up in Boston at the end of April.