Tuesday, May 26, 2015

Published! My First eBook on Genomics

My first ebook on genomics was published by IBM Redbooks last week. It was based on, and expanded from, a few pieces I wrote in 2014 (an editorial, a solution brief, and a whitepaper) on the subject of a genomics reference architecture (@Genetecture) and its applicability to building IBM genomics infrastructure that leverages the architecture and its ecosystem.


Below is the abstract of the publication titled "IBM Reference Architecture for Genomics: Speed, Scale, Smarts":
Genomic medicine promises to revolutionize biomedical research and clinical care. By investigating the human genome in the context of biological pathways, drug interactions, and environmental factors, it is now possible for genomic scientists and clinicians to identify individuals at risk of disease, provide early diagnoses based on biomarkers, and recommend effective treatments.
However, the field of genomics has been caught in a flood of data as huge amounts of information are generated by next-generation sequencers and rapidly evolving analytical platforms such as high-performance computing clusters. 
This data must be quickly stored, analyzed, shared, and archived, but many genome, cancer, and medical research institutions and pharmaceutical companies are now generating so much data that it can no longer be processed in a timely manner, properly stored, or even transmitted over regular communication lines. Often they resort to shipping disk drives to external computing centers for processing and storage, creating an obstacle to speedy access and analysis of the data.
In addition to scale and speed, it is also important for all the genomics information to be linked based on data models and taxonomies, and to be annotated with machine or human knowledge. This smart data can then be factored into the equation when dealing with genomic, clinical, and environmental data, and be made available to a common analytical platform. 
To address the challenging needs for speed, scale, and smarts in genomic medicine, an IBM® end-to-end reference architecture has been created that defines the most critical capabilities for genomics computing: data management (Datahub), workload orchestration (Orchestrator), and enterprise access (AppCenter).
The IBM Reference Architecture for Genomics can be deployed with various infrastructure and informatics technologies. IBM has also been working with a growing ecosystem of customers and partners to enrich the portfolio of solutions and products that can be mapped onto the architecture.
This IBM Redpaper™ publication describes the following topics:
  • Overview of IBM Reference Architecture for Genomics
  • Datahub for data management
  • Orchestrator for workload management
  • AppCenter for managing the user interface

You can access and download the ebook here.



Monday, May 4, 2015

Airport vs Helipad - A Few Thoughts on Datahub

I am often asked to compare and contrast software-defined storage (like GPFS/Spectrum Scale) with a closed-in solution like Panasas, BlueArc, or Isilon. Instead of providing a product- or offering-level analysis, I'd rather talk about the differences at the architectural level.

GPFS/Spectrum Scale is part of IBM's software-defined storage family. It is also one of the solutions under the PowerGene Datahub -- a software-based abstraction layer for storage and data management. Scalability, extensibility, and flexibility are the top three hallmarks of Datahub. It will be extremely challenging for a closed-in storage solution to match even two of these three criteria.

1) Datahub defines a storage management capability that is extremely scalable in terms of capacity and performance. We are talking about possibly exabytes of data, GB/sec-class throughput, and trillions of metadata entries. The I/O should be well managed based on policy and metadata so that linear scalability can be accomplished. As a real example: if we start with one building block delivering 10 GB/sec, I will challenge any closed-in solution to deliver 300 GB/sec with 30 building blocks connected together -- essentially 100% linear scalability. We actually proved this could be done with a federal lab using a Datahub-based GPFS solution.

2) Datahub defines capabilities beyond I/O and storage management for functions such as data movement (policy-based tiering), sharing (policy-based caching or copying), and metadata, each of which can be extended seamlessly from a local storage cluster to a grid and to the public cloud (a concrete policy sketch follows this list). Given the breakneck speed of technological and research advancement in genomic medicine and high-performance computing, any R&D institution should expect and demand this level of flexibility -- matching use/business cases as much as possible at the software/architecture level to minimize lock-in to specific hardware (disks and processors) and vendors (including my own company, IBM).

3) As a software-defined architectural element, Datahub can be leveraged for many other useful infrastructure building blocks -- a) a flash-based building block for high-performance metadata management (scanning 600 million files in under 10 minutes at a leading research center in NYC); b) a GPFS/tape active archive that can drastically reduce the cost of storage explosion while providing quick and easy access to data under a single global namespace for archive and protection -- on this front, a Datahub solution can be on par with, or even beat, cloud-based cold storage.
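To make the "policy-based" idea in point 2 concrete, here is a minimal sketch of what tiering on a GPFS/Spectrum Scale-based Datahub could look like, driven from a small script. The pool names ("system", "nearline"), the 30-day threshold, and the file system name ("gpfs0") are hypothetical placeholders; treat this as an illustration of the ILM policy style, not a production policy.

```python
#!/usr/bin/env python3
"""Sketch: policy-based tiering on a GPFS/Spectrum Scale "Datahub".
Pool names, the threshold, and the file system name are hypothetical."""
import subprocess
import tempfile

# ILM-style rules: place new files on the fast "system" pool, and migrate
# files untouched for 30 days down to a cheaper "nearline" pool.
POLICY_RULES = """
RULE 'placement' SET POOL 'system'
RULE 'tier_down' MIGRATE FROM POOL 'system'
     TO POOL 'nearline'
     WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 30
"""

def apply_tiering(filesystem: str = "gpfs0") -> None:
    """Write the rules to a temporary file and hand them to mmapplypolicy."""
    with tempfile.NamedTemporaryFile("w", suffix=".rules", delete=False) as f:
        f.write(POLICY_RULES)
        rules_path = f.name
    # "-I yes" asks mmapplypolicy to execute the migration, not just test it.
    subprocess.run(
        ["mmapplypolicy", filesystem, "-P", rules_path, "-I", "yes"],
        check=True,
    )

if __name__ == "__main__":
    apply_tiering()
```

The same rule language extends to external pools (for example, a tape tier behind the active archive in point 3), which is what makes the tiering policy-based rather than hard-wired into the storage hardware.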

If PowerGene helps build (or lease) airports (Datahub), planes and engines (Orchestrator), and traveler portals (AppCenter) that can extend from the Ground into the Cloud, these closed-in or NAS-like solutions may be seen as pre-assembled helipads -- easy to get to, but hard to go far and beyond.


Tuesday, April 7, 2015

Fast and Furious Engine for Computing

As we design and build PowerGene Pipeline v2 (PG-P2), I have started to document the features and functions that define a true workflow engine, one that can empower the world of scientific and analytical computing. If our data scientists, researchers, and even technologists desire a next-generation race car to take them to the next level of competition, then there should be minds thinking about inventing and building a fast and furious engine.

So here is my top-10 list for a software-defined workflow engine (SDWE):

Abstraction - Of workloads from their physical implementation, thus decoupling a resource from its consumer. Abstraction enables the definition of logical models of an application or workflow (flow definitions) that can be instantiated at provisioning time, thus enforcing standardization and enabling reusability (see the sketch after this list).

Orchestration - As applied to workflows, going beyond a single server or cluster, such that workloads with various architectural requirements can be optimally matched to available resources that become transparent to users and applications.

Automation - Beyond script-based automation, enabling automation of tasks, jobs, and workflows across resource domains, with built-in policy management for enforcement and optimization.

Standardization - Of workflows through a common set of naming standards, version control, runtime logging, and provenance tracking.

Customization - Of workloads into functional building blocks, then connecting them into a logical network, thus enabling workflows to be quickly composed or recomposed from proven workloads or subflows.

Visualization - Of the runtime environment through a graphical user interface, as well as of the final output using third-party visualization engines.

Scalability - Leveraging world-class software-defined storage infrastructure for extreme scalability: supporting hundreds of pipelines running in parallel and scaling to hundreds of thousands of concurrent jobs in a shared resource pool.

Manageability - The ability to start, suspend, restart, and completely terminate a workflow manually (by the user), as well as policy-based management of pipeline events (job success, failure, branching, convergence, etc.).

Reusability - The ability to rerun the same pipeline (manually or by policy) or redeploy it as an embedded element in a higher-level workflow.

Accessibility - Fine-grained, role-based access to make the solution available only to those who need it.
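To ground the Abstraction and Customization items above, here is a minimal sketch of a flow definition that is decoupled from its physical execution. Everything here (the Task and Flow classes, the toy pipeline steps) is a hypothetical illustration of the idea, not actual PG-P2 code.

```python
"""Sketch: a logical flow definition decoupled from execution.
All names here are hypothetical illustrations, not PG-P2 code."""
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Task:
    """A logical unit of work; the command is bound to real resources
    only when the flow is instantiated (provisioned)."""
    name: str
    command: str
    depends_on: List[str] = field(default_factory=list)

@dataclass
class Flow:
    """A reusable flow definition: a named DAG of tasks that can itself
    be embedded in a higher-level workflow (Reusability)."""
    name: str
    tasks: Dict[str, Task] = field(default_factory=dict)

    def add(self, task: Task) -> "Flow":
        self.tasks[task.name] = task
        return self

    def run_order(self) -> List[Task]:
        """Topologically sort the tasks so dependencies run first."""
        order: List[Task] = []
        seen: set = set()
        def visit(name: str) -> None:
            if name in seen:
                return
            seen.add(name)
            for dep in self.tasks[name].depends_on:
                visit(dep)
            order.append(self.tasks[name])
        for name in self.tasks:
            visit(name)
        return order

# A toy secondary-analysis pipeline composed from proven building blocks.
flow = (
    Flow("toy_dna_pipeline")
    .add(Task("align", "bwa mem ref.fa sample.fq > sample.sam"))
    .add(Task("sort", "samtools sort sample.sam -o sample.bam",
              depends_on=["align"]))
    .add(Task("call", "bcftools mpileup ... | bcftools call ...",
              depends_on=["sort"]))
)

for task in flow.run_order():
    # A real SDWE would dispatch each task to an orchestrator; we just print.
    print(f"would submit: {task.name} -> {task.command}")
```

Because the flow is a logical model, the same definition could be instantiated on a laptop, a cluster, or a cloud; it is the orchestrator's job to decide where each task actually runs.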

Saturday, March 21, 2015

Sharing the Journey

I was fortunate enough to be one of the 500 Best of IBM winners this year. As a self-introduction to my fellow BOI class, I wrote the following on the group forum.

Hi Everyone and Congratulations!

What an honor, and how exciting it is to be among this distinguished group of IBMers. You know that you are making a difference every day on your journey, and I look forward to crossing paths with you and celebrating our little milestone on the island of Maui.

My journey started 25 years ago when I took off from Shanghai, flew over Hawaii, and landed in St. Louis as a foreign student. My studies and then my research took me through the experiments, theories, and discoveries of molecular biology and genetics. The turning point came in 1999, when the Human Genome Project was in full swing and the computer entered the stage. The waves of scientific and technological innovation kept coming and drew me to the course. I spent the next 12 years exploring the frontier of high-performance computing technologies at IBM, first as an industry SME, then a solution BDE, and finally the lead HPC architect for the US. Along the way, I designed probably 50 supercomputers that entered the world's Top 500 ranking. Starting in 2012, I noticed that many of our clients and partners were using the word "genomics" in their RFPs or tenders. The era of genomic medicine had finally arrived, and I was so excited that my two paths had now merged into one.

Today, I am part of the Software Defined Infrastructure (SDI) WW Sales team and its genomics solution initiative, code-named PowerGene. Our mission is very simple: when you or a family member goes to a hospital one day to have a genome sequenced for a diagnostic test or treatment monitoring, we want your genomic data to be processed, analyzed, and stored based on PowerGene technologies, all done with Speed, Scale, and Smarts.

My BOI nomination is based on my work with one such hospital. It is called Sidra, and it is being built as a shining star of genomic medicine for the Middle East and the world. Baked in the desert sun of Qatar, yet armed with the coolest PowerGene technologies, Sidra is on a mission to sequence the full genome of every citizen of Qatar!

If you know of a hospital, research center, university, or agricultural, biotech, or pharmaceutical company that is struggling with, or becoming interested in, the technology, give me a shout and I will show up at your local airport before we meet in Maui. You might work in a totally unrelated brand, field, or organization, but you can help in ways you might not even think of.

I am also "recruiting" volunteers as developers, architects, and engineers to build many PowerGene solutions, such as the automated workflow pipeline, the massive metadata management system, and a hybrid cloud for elastic computing. Currently, our startup-ish dev team has six members, and we hold a TGIF call every Friday to which I can invite you as a guest or observer. You won't need any credentials as a member of the 2015 BOI class :)

Let me borrow a Chinese saying to end my introduction: 志同者道也合 -- those who share the course, share the journey.

Safe travel to Hawaii!

Finally, my call for the like-minded to join the course extends to those reading this blog post :)

Thursday, March 5, 2015

Bad Days

It was a bad day at New York's LaGuardia Airport today. The snow was falling hard, then blowing sideways fast. Every canceled flight drove panicked passengers from gate to gate, and in my case from one concourse to another. The despair descended into desperation as a Delta plane skidded off the runway and shut down the airport. I made many calls to the American Airlines service desk for rebooking, and each call took longer to get through. It was clear that the surge of calls was overwhelming the airline's computer systems and service resources.

While on hold, I suddenly recalled a meeting I had a few days ago with a healthcare researcher. He told me about the frustration of getting sufficient and timely computing resources to run cancer genomics analytical pipelines. The problem was more pronounced when there were real patients whose genomic data needed to be analyzed at clinical grade (high depth) and with fast turnaround (hours versus days).

Looking through the window at the snow plows and trucks struggling to fight the blizzard, I asked myself: as an information technologist, can you do more to help this battle? This is a battle raging in every hospital around the world, against a much worse natural enemy, one that inflicts pain on millions of families, including my own.

What if we could build a more capable, faster, and smarter computer to aid the battle against cancer? With that, maybe we would be able to turn many bad days for a suffering family into good days. To get there, I won't mind suffering a bad day like today along the journey to the frontier.




#NoMoreBadDays

#J2F_PowerGene

#abw4_et1509


Written at 12:30 AM in a New York hotel while waiting for another flight home in 12 hours.