Tuesday, May 26, 2015

Published! My First eBook on Genomics

My first ebook on genomics was published by IBM Redbooks last week. It is based on, and expands, several pieces I wrote in 2014 (an editorial, a solution brief, and a whitepaper) on the subject of a genomics reference architecture (@Genetecture) and its applicability to building IBM genomics infrastructure that leverages the architecture and its ecosystem.


Below is the abstract of the publication titled "IBM Reference Architecture for Genomics: Speed, Scale, Smarts":
Genomic medicine promises to revolutionize biomedical research and clinical care. By investigating the human genome in the context of biological pathways, drug interactions, and environmental factors, it is now possible for genomic scientists and clinicians to identify individuals at risk of disease, provide early diagnoses based on biomarkers, and recommend effective treatments.
However, the field of genomics has been caught in a flood of data as huge amounts of information are generated by next-generation sequencers and rapidly evolving analytical platforms such as high-performance computing clusters. 
This data must be quickly stored, analyzed, shared, and archived, but many genome, cancer, and medical research institutions and pharmaceutical companies now generate so much data that it can no longer be processed, stored, or even transmitted over regular communication lines in a timely manner. Often they resort to shipping physical disk drives to external computing centers for processing and storage, creating an obstacle to speedy access and analysis of the data.
In addition to scale and speed, it is also important for all the genomics information to be linked based on data models and taxonomies, and to be annotated with machine or human knowledge. This smart data can then be factored into the equation when dealing with genomic, clinical, and environmental data, and be made available to a common analytical platform. 
To address these challenging needs for speed, scale, and smarts in genomic medicine, IBM® has created an end-to-end reference architecture that defines the most critical capabilities for genomics computing: data management (Datahub), workload orchestration (Orchestrator), and enterprise access (AppCenter).
The IBM Reference Architecture for Genomics can be deployed with various infrastructure and informatics technologies. IBM has also been working with a growing ecosystem of customers and partners to enrich the portfolio of solutions and products that can be mapped into the architecture.
This IBM Redpaper™ publication describes the following topics:
  • Overview of IBM Reference Architecture for Genomics
  • Datahub for data management
  • Orchestrator for workload management
  • AppCenter for managing the user interface

You can access and download the ebook here.



Monday, May 4, 2015

Airport vs Helipad - A Few Thoughts on Datahub

I am often asked to compare and contrast software-defined storage (such as GPFS/Spectrum Scale) with closed-in solutions like Panasas, BlueArc, or Isilon. Instead of providing a product- or offering-level analysis, I prefer to talk about the difference at the architectural level.

GPFS/Spectrum Scale is part of IBM's software-defined storage family. It is also one of the solutions under the PowerGene Datahub -- a software-based abstraction layer for storage and data management. Scalability, extensibility, and flexibility are the three hallmarks of Datahub. It would be extremely challenging for a closed-in storage solution to match even two of these three criteria.

1) Datahub defines a storage management capability that is extremely scalable in both capacity and performance: think exabytes of data, multi-GB/sec throughput, and trillions of metadata entries. I/O should be managed by policy and metadata so that linear scalability can be achieved. As a real example, if we start with one building block delivering 10 GB/sec, I will challenge any closed-in solution to deliver 300 GB/sec with 30 building blocks connected together -- almost 100% linear scalability. We actually proved this could be done at a federal lab using a Datahub-based GPFS solution.
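To make that arithmetic concrete, here is a minimal Python sketch of the linear-scalability calculation; the 295 GB/sec aggregate figure is an illustrative assumption, not a measured result from that lab.

```python
def scaling_efficiency(per_block_gbps: float, blocks: int,
                       measured_aggregate_gbps: float) -> float:
    """Measured aggregate throughput as a fraction of the ideal
    (perfectly linear) aggregate across identical building blocks."""
    ideal_gbps = per_block_gbps * blocks
    return measured_aggregate_gbps / ideal_gbps

# One building block delivers 10 GB/sec, so 30 blocks should ideally
# deliver 300 GB/sec; a measured 295 GB/sec would be ~98% linear.
print(f"{scaling_efficiency(10.0, 30, 295.0):.0%}")  # -> 98%
```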

2) Datahub defines capabilities beyond I/O and storage management, covering functions such as data movement (policy-based tiering), sharing (policy-based caching or copying), and metadata, each of which can be extended seamlessly from a local storage cluster to a grid and on to the public cloud. Given the breakneck pace of technological and research advancement in genomic medicine and high-performance computing, any R&D institution should expect and demand this level of flexibility -- matching use and business cases as closely as possible at the software/architecture level to minimize lock-in to specific hardware (disks and processors) and vendors (including my own company, IBM).
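As a rough illustration of what policy-based tiering means, here is a toy Python sketch; the tier names and age thresholds are hypothetical, and a real Datahub deployment would express such rules in the storage system's own policy engine (for example, GPFS ILM rules) rather than in application code.

```python
import os
import time

# Hypothetical tiers and age thresholds, for illustration only.
TIER_RULES = [
    (7 * 86400, "flash"),    # accessed within a week  -> flash pool
    (90 * 86400, "disk"),    # accessed within 90 days -> disk pool
    (float("inf"), "tape"),  # anything older          -> tape archive
]

def choose_tier(path: str) -> str:
    """Pick a target tier for a file based on time since last access."""
    age_seconds = time.time() - os.stat(path).st_atime
    for max_age, tier in TIER_RULES:
        if age_seconds <= max_age:
            return tier
    return TIER_RULES[-1][1]  # defensive; the inf rule always matches
```

The point of the sketch is that placement decisions live in a policy layer rather than in the application or any one hardware tier, which is what lets the same rules stretch from a local cluster out to a grid or cloud.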

3) As a software-defined architectural element, Datahub can be leveraged to build many other useful infrastructure building blocks: a) a flash-based building block for high-performance metadata management (scanning 600 million files in under 10 minutes at a leading research center in NYC); b) a GPFS/tape active archive that can drastically reduce the cost of storage growth while providing quick, easy access to data under a single global namespace for archive and protection -- on this front, a Datahub solution can match or even beat cloud-based cold storage.
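As a quick sanity check on that scan figure, the implied rate works out to about a million file records per second (my own arithmetic, not a number from the benchmark):

```python
# 600 million files in under 10 minutes implies a sustained scan rate
# of at least one million file records per second.
files = 600_000_000
seconds = 10 * 60
print(f"{files / seconds:,.0f} files/sec")  # -> 1,000,000 files/sec
```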

If PowerGene helps build (or lease) airports (Datahub), planes and engines (Orchestrator), and traveler portals (AppCenter) that can extend from the ground into the cloud, then these closed-in, NAS-like solutions may be seen as pre-assembled helipads -- easy to get there, but hard to go far and beyond.