Dr Frank Blog: 2020

Biomedical research institutes and healthcare providers are dealing with an enormous growth of data, mainly unstructured, that is flowing from many sources – faster and faster. All types of data need to be captured, labeled, cleaned, stored, managed, analyzed, cataloged, protected and archived. The disparate data sources, types, ownership and governance create silos that impede data access, drive down efficiency, drive up costs, and slow times to insight and discovery.

The volume and complexity of data are also driving the adoption of modern analytical frameworks such as big data (Hadoop and Spark) and AI (machine learning and deep learning), which are being applied to thousands of research and business applications (e.g. genomics, bioinformatics, imaging, translational and clinical). Supporting these rapidly evolving frameworks and workloads, along with the collaborative nature of biomedical research, requires comprehensive storage and compute capabilities.

Because of these challenges, it is imperative for the infrastructure and underlying architecture to become agile, data-driven and application-optimized: in short, ready to advance precision medicine.

The Art of Possible

In 2014, I created and made public the IBM Reference Architecture for Genomics to take on these challenges. It has since evolved by taking on more workloads (e.g. medical imaging and clinical analytics) and expanded to include AI and Cloud. In 2018, we renamed the architecture IBM High Performance Data & AI (HPDA). It grew out of healthcare and life sciences industry use cases and leveraged IBM’s history of delivering best practices in high-performance computing, artificial intelligence and hybrid multicloud. In fact, the basic HPDA framework was used to construct Summit and Sierra, currently two of the world’s most powerful supercomputers designed for data and AI. The architecture is designed to help life science organizations easily scale and expand compute and storage resources independently as demand grows, to ensure maximum performance and business continuity. It supports the wide range of development frameworks and applications required for industry innovation with optimized hardware as a foundation, without unnecessary re-investments in technology.

The HPDA reference architecture is deployable into software-defined infrastructure (SDI) that offers advanced orchestration and management capabilities. Currently it can support major computing paradigms such as traditional HPC, data lakes, large-scale analytics, machine learning and deep learning.

These capabilities then become the foundation for developing and deploying applications for fields such as genomics, imaging, clinical, real-world evidence and Internet-of-things. The HPDA architecture can be implemented on-premises in a local data center or off-premises in a private or public cloud. Our team and clients have also demonstrated and deployed advanced use cases and platforms in hybrid cloud.

The architecture has a “Datahub” layer designed to manage the ocean of unstructured data that is siloed in disparate systems, using advanced tiering functions, peering, and cataloging. These capabilities allow the data to be captured very rapidly, stored safely, accessed easily and shared globally in a secure and regulation-compliant way, wherever and whenever the data is needed.
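To make the idea of tiering concrete, here is a minimal sketch of a policy that picks a storage tier from a file's last-access age. The tier names, the age thresholds and the `FileRecord` type are all assumptions made for illustration; they are not taken from HPDA or any IBM product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical tier names and age thresholds, chosen only for illustration.
TIER_RULES = [
    (timedelta(days=30), "flash"),   # accessed within the last 30 days
    (timedelta(days=365), "disk"),   # accessed within the last year
]
ARCHIVE_TIER = "tape"                # everything older than that


@dataclass
class FileRecord:
    path: str
    last_access: datetime


def choose_tier(record: FileRecord, now: datetime) -> str:
    """Return the storage tier for a file based on how recently it was accessed."""
    age = now - record.last_access
    for max_age, tier in TIER_RULES:
        if age <= max_age:
            return tier
    return ARCHIVE_TIER


now = datetime(2020, 6, 7)
files = [
    FileRecord("run42/sample.bam", datetime(2020, 6, 1)),
    FileRecord("run17/sample.bam", datetime(2019, 12, 1)),
    FileRecord("run03/sample.bam", datetime(2017, 3, 15)),
]
placement = {f.path: choose_tier(f, now) for f in files}
print(placement)
```

A real policy engine would also weigh file size, project, and compliance rules; the point here is only that placement is decided by a declared rule rather than by hand.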

The second layer is the “Orchestrator”, which brings efficient, scalable computational capabilities based on a shared infrastructure. It schedules millions of jobs and applies policy-driven resource management, with critical functions like parallel computing and pipelining, for faster time to insights and better outcomes. With the advancement of Cloud technologies such as containers and container orchestration, the Orchestrator is now fully capable of deploying and managing Cloud-ready or Cloud-native workloads.
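Pipelining and parallel computing can be illustrated with a small dependency graph. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to group the steps of a made-up genomics pipeline into stages whose steps could run in parallel. The step names are hypothetical, and this is not how the Orchestrator itself is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "align_sample_a": set(),
    "align_sample_b": set(),
    "call_variants_a": {"align_sample_a"},
    "call_variants_b": {"align_sample_b"},
    "joint_genotype": {"call_variants_a", "call_variants_b"},
}


def parallel_stages(graph):
    """Group steps into stages; every step in a stage has all dependencies met."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps that can start right now
        stages.append(ready)
        sorter.done(*ready)                 # mark them finished to unlock the next stage
    return stages


stages = parallel_stages(pipeline)
for number, steps in enumerate(stages, start=1):
    print(f"stage {number}: {steps}")
```

The two alignment steps land in the first stage, the two variant-calling steps in the second, and the joint step runs last, once both branches are done.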

The Datahub and Orchestrator were designed as two separate abstraction layers, and the capabilities of both can be extended all the way to the cloud. They work together, moving data and balancing the dispatch of workloads to avoid bottlenecks that may cause jobs to run slower. In the latest edition of HPDA, we introduced Unified Data Catalog use case support to further improve the integration of the two layers. Imagine now that every application will have its data, metadata, provenance and results recorded and tracked automatically. This meta-information will then be fed into a governance catalog or platform to facilitate data exchange, ensure privacy compliance and secure access to data and information.
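As a rough illustration of automatic tracking, the sketch below wraps an application call so that its parameters and checksums of its input and output are recorded on every run. The `catalog` list, the record fields and the `run_tracked` helper are hypothetical stand-ins; they do not describe the Unified Data Catalog's actual schema or API.

```python
import hashlib
import json
from datetime import datetime, timezone

catalog = []  # stand-in for a governance catalog; a real one would be a service


def checksum(data: bytes) -> str:
    """Return a SHA-256 fingerprint so data can be identified without storing it."""
    return hashlib.sha256(data).hexdigest()


def run_tracked(app_name, func, input_data: bytes, **params):
    """Run func on input_data and append a provenance record to the catalog."""
    started = datetime.now(timezone.utc).isoformat()
    result = func(input_data, **params)
    catalog.append({
        "application": app_name,
        "parameters": params,
        "input_sha256": checksum(input_data),
        "output_sha256": checksum(result),
        "started_utc": started,
    })
    return result


def repeat_upper(data: bytes, repeat: int = 1) -> bytes:
    """Toy application: upper-case the input and repeat it."""
    return data.upper() * repeat


output = run_tracked("toy-normalizer", repeat_upper, b"acgt", repeat=2)
print(json.dumps(catalog[-1], indent=2))
```

Because the record is written by the wrapper rather than by the application, every tool launched this way is tracked in the same form without changing the tool itself.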

The logic behind this reference architecture is to free workloads from hardware constraints: assign the optimal resources needed (CPU, GPU, FPGA, VM) and address unpredictable workload requirements. The architecture is portable to different infrastructure providers and deployable to different hardware technologies, and it allows workloads to become reusable through validated and hardened platforms.

This truly data-driven, cloud-ready, AI-capable solution is based on deep industry experience and constant feedback from leading organizations that are at the forefront of precision medicine.
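One simple way to picture "assign the optimal resources needed" is a workload that states which resource kinds it prefers, in order, and a pool that is searched for the first fit. The sketch below assumes a toy slot-based model; the `Resource` and `Workload` types, node names and slot counts are invented for illustration and are not HPDA's scheduling logic.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    kind: str          # "CPU", "GPU", "FPGA" or "VM"
    free_slots: int


@dataclass
class Workload:
    name: str
    preferred_kinds: list  # ordered from most to least suitable
    slots: int


def assign(workload, pool):
    """Reserve slots on the first resource that fits, honoring the preference order."""
    for kind in workload.preferred_kinds:
        for resource in pool:
            if resource.kind == kind and resource.free_slots >= workload.slots:
                resource.free_slots -= workload.slots
                return resource.name
    return None  # nothing fits right now; a real scheduler would queue the job


pool = [Resource("cpu-node-1", "CPU", 32), Resource("gpu-node-1", "GPU", 4)]
jobs = [
    Workload("deep-learning-training", ["GPU", "CPU"], 4),
    Workload("image-inference", ["GPU", "CPU"], 2),
    Workload("variant-calling", ["CPU"], 16),
]
assignments = {job.name: assign(job, pool) for job in jobs}
print(assignments)
```

Here the training job takes the whole GPU node, so the inference job falls back to its second choice and runs on CPU instead of waiting, which is the kind of flexibility the paragraph above describes.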

The Values to Reality

Users and infrastructure providers are achieving valuable results and significant benefits from the HPDA solution.

The key values for users:

Ease-of-use: self-service App Center with a graphical user interface based on advanced catalog and search engines that allows users to manage the data in real-time with maximum flexibility.
High-performance: cloud-scale data management and multicloud workload orchestration allow users to place data where it makes sense and provision the required environment for peak demand periods in the cloud, dynamically and automatically, for as long as needed, to maximize performance.
Low cost: policy-based data management that can reduce storage costs up to 90% by automatically migrating file and object medical data to the optimal storage tier.
Global collaboration: multi-tenant access and data sharing that span storage systems and geographic locations, enabling research initiatives around the globe to establish strategic partnerships and collaborate on a common reference architecture.

The key values for infrastructure providers:

Easy to install: a blueprint that compiles best practices and enables IT architects to quickly deploy an end-to-end solution architecture that is designed specifically to match different use cases and requirements.
Fully tested: an IT architecture based on a solid roadmap of future-ready, proven infrastructure that can easily be integrated into the existing environment, protecting investments already made, especially in hardware and cloud services.
Global Industry Ecosystem: wide ecosystem to align with the latest technologies for hybrid multicloud, big data analytics and AI to optimize data for cost, compliance and performance expected by end users.

Today, there are over 100 enterprise deployments of HPDA in the world's largest cancer centers, genome centers, precision medicine projects, and research hospitals. Some large pharma and biotech companies have also started to adopt the architecture in their multicloud infrastructure.

To learn more about HPDA and its use cases and adoption in precision medicine and drug discovery, please download and read my IBM Redbook at http://www.redbooks.ibm.com/abstracts/redp5481.html

Sunday, June 7, 2020

Getting Your Data & App Ready for Precision Medicine