Monday, May 4, 2015

Airport vs Helipad - A Few Thoughts on Datahub

I am often asked to compare and contrast software-defined storage (like GPFS/Spectrum Scale) with a closed-in solution like Panasas, BlueArc or Isilon. Instead of providing a product/offering-level analysis, I'll just talk about the differences at the architectural level.

GPFS/Spectrum Scale is part of IBM's software-defined storage family. It is also one of the solutions under PowerGene Datahub -- a software-based abstraction layer for storage and data management. Scalability, extensibility and flexibility are the top three hallmarks of Datahub. It will be extremely challenging for a closed-in storage solution to come close to meeting even two of these three criteria.

1) Datahub defines a storage management capability that is extremely scalable in both capacity and performance. We are talking about potentially exabytes of data, GB/sec-class throughput and trillions of metadata records. I/O should be managed by policy and metadata so that linear scalability can be achieved. As a real example, if we start with one building block delivering 10GB/sec, I challenge any closed-in solution to deliver 300GB/sec with 30 building blocks connected together -- almost 100% linear scalability. We actually proved this could be done with a Federal Lab using a Datahub-based GPFS solution.

2) Datahub defines capabilities beyond I/O and storage management, for functions such as data movement (policy-based tiering), sharing (policy-based caching or copying) and metadata, each of which can be extended seamlessly from a local storage cluster to a grid and to the public cloud. Given the break-neck speed of technological and research advancement in genomic medicine and high-performance computing, any R&D institution should expect and demand this level of flexibility -- matching use/business cases as much as possible at the software/architecture level to minimize lock-in to specific hardware (disk & processor) and vendors (including my own company, IBM).

3) As a software-defined architectural element, Datahub can be leveraged for many other useful infrastructure building blocks -- a) a Flash-based building block for high-performance metadata management (scanning 600 million files in under 10 minutes at a leading research center in NYC); b) a GPFS/tape active archive that can drastically reduce the cost of the storage explosion while providing quick and easy access to data under a single global namespace for archive and protection. On this front, a Datahub solution can be on par with or even beat cloud-based cold storage.
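
As a rough sanity check on the figures quoted in points 1) and 3a), here is a minimal Python sketch. The function names are mine and purely illustrative, and the inputs are simply the numbers from the text. The 300GB/sec figure is treated as the aggregate actually delivered, which the text only describes as "almost" linear, so the result is an upper bound on the efficiency.

```python
def scaling_efficiency(per_block_gb_s, blocks, measured_gb_s):
    """Fraction of ideal linear throughput actually achieved."""
    # Ideal linear scaling: every block contributes its full throughput.
    ideal_gb_s = per_block_gb_s * blocks
    return measured_gb_s / ideal_gb_s


def min_scan_rate(files, max_seconds):
    """Lower bound on files scanned per second if the scan ends within max_seconds."""
    return files / max_seconds


# Point 1: 30 building blocks at 10GB/sec each, 300GB/sec in aggregate (assumed measured value).
print(scaling_efficiency(10, 30, 300))      # 1.0

# Point 3a: 600 million files scanned in under 10 minutes.
print(min_scan_rate(600_000_000, 10 * 60))  # 1000000.0
```

In other words, the metadata scan in 3a) implies at least one million files per second, and the Federal Lab result in 1) corresponds to an efficiency close to 1.0.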

If PowerGene helps build (or lease) airports (Datahub), planes/engines (Orchestrator) and a traveller portal (AppCenter) that can extend from the Ground into the Cloud, these closed-in or NAS-like solutions may be seen as a pre-assembled helipad -- it's easy to get there but hard to go far and beyond.

