Biomedical research institutes and healthcare providers are dealing with an enormous
growth of data, mainly unstructured, that is flowing from many sources – faster
and faster. All types of data need to be captured, labeled, cleaned, stored,
managed, analyzed, cataloged, protected and archived. The disparate data
sources, types, ownership and governance create silos that impede data access,
drive down efficiency, drive up costs, and slow times to insight and discovery.
The volume and complexity of data also drive the adoption of modern analytical
frameworks such as big data (Hadoop and Spark) and AI (machine learning and deep
learning), which are applied across thousands of research and business
applications (e.g., genomics, bioinformatics, imaging, translational and
clinical). Supporting these rapidly evolving frameworks and workloads, along
with the collaborative nature of biomedical research, requires comprehensive
storage and compute capabilities.
Because of these challenges, it is imperative for the infrastructure and
underlying architecture to become agile, data-driven and application-optimized:
in short, ready to advance precision medicine.
The Art of Possible
In 2014, I created and made public the IBM Reference Architecture for Genomics
to take on these challenges. It has since evolved by taking on more workloads
(e.g., medical imaging and clinical analytics) and has expanded to include AI
and Cloud. In 2018, we renamed the architecture IBM High Performance Data & AI
(HPDA). It grew out of healthcare and life sciences industry use cases and
leveraged IBM’s history of delivering best practices in high-performance
computing, artificial intelligence and hybrid multicloud. In fact, the basic
HPDA framework was used to construct Summit and Sierra, currently two of the
world’s most powerful supercomputers designed for data and AI. The architecture
is designed to help life science organizations scale and expand compute and
storage resources independently as demand grows, to ensure maximum performance
and business continuity. It supports the wide range of development frameworks
and applications required for industry innovation, with optimized hardware as a
foundation and without unnecessary re-investments in technology.
The HPDA reference architecture is deployable into software-defined
infrastructure (SDI) that offers advanced orchestration and management
capabilities. Currently it can support major computing paradigms such as
traditional HPC, data lakes, large-scale analytics, machine learning and deep
learning.
These capabilities then become the foundation for developing and deploying
applications for fields such as genomics, imaging, clinical, real-world evidence
and Internet-of-things. The HPDA architecture can be implemented on-premises in
a local data center or off-premises in a private or public cloud. Our team and
clients have also demonstrated and deployed advanced use cases and platforms in
hybrid cloud.
The architecture has a
“Datahub” layer designed to manage the ocean of unstructured data that is
siloed in disparate systems using advanced tiering functions, peering, and
cataloging. These advanced capabilities allow the data to be captured very
rapidly, stored safely, accessed easily and shared globally in the most
secure and regulation-compliant way, wherever and whenever the data is needed.
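As an illustration of the tiering idea only, here is a minimal sketch of an age-based placement policy. It is not the Datahub's actual interface: the `FileRecord` type, the `choose_tier` function, the tier names and the day thresholds are all hypothetical and chosen for readability.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRecord:
    """Hypothetical description of one file or object known to the data layer."""
    path: str
    size_bytes: int
    last_access: datetime


def choose_tier(record: FileRecord, now: datetime,
                hot_days: int = 30, warm_days: int = 180) -> str:
    """Pick a storage tier from the time since last access.

    The thresholds and tier names are assumptions for this sketch,
    not values taken from HPDA.
    """
    age = now - record.last_access
    if age <= timedelta(days=hot_days):
        return "flash"      # recently used data stays on the fastest tier
    if age <= timedelta(days=warm_days):
        return "disk"       # cooler data moves to capacity storage
    return "archive"        # rarely used data moves to the cheapest tier


if __name__ == "__main__":
    now = datetime(2024, 1, 1)
    sample = FileRecord("/data/run42/sample.bam", 50_000_000_000,
                        now - timedelta(days=200))
    print(sample.path, "->", choose_tier(sample, now))
```

A real policy would typically weigh more than access age, for example file type, project ownership or regulatory retention rules, but the shape is the same: a rule evaluated per file that returns a target tier.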
The second layer is the “Orchestrator”, which brings efficient, scalable
computational capabilities based on a shared infrastructure to schedule millions
of jobs and deploy policy-driven resource management, with critical functions
like parallel computing and pipelining for faster time to insights and better
outcomes. With the advancement of Cloud technologies such as containers and
container orchestration, the Orchestrator is now fully capable of deploying and
managing Cloud-ready or Cloud-native workloads.
The capabilities of the Datahub and Orchestrator, which were designed as two
separate abstraction layers, can be extended all the way to the cloud. They work
together, moving data to balance and control the dispatching of workloads and
avoiding bottlenecks that may cause jobs to run slower. In the latest edition of
HPDA, we introduced Unified Data Catalog use case support to further improve the
integration of the two layers. Imagine now that every application will have its
data, metadata, provenance and results recorded and tracked automatically. This
meta-information will then be fed into a governance catalog or platform to
facilitate data exchange, ensure privacy compliance and secure access to data
and information.
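To make the idea of automatically recorded provenance more concrete, the following is a small sketch of what one catalog entry and a lineage lookup could look like. It is an assumption-laden illustration, not the Unified Data Catalog itself: `ProvenanceRecord`, `Catalog`, their fields and the example file names are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class ProvenanceRecord:
    """Hypothetical record of one application run: what went in, what came out."""
    application: str
    inputs: list
    outputs: list
    parameters: dict

    def record_id(self) -> str:
        # A content-derived ID: identical runs map to the same identifier.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


class Catalog:
    """Hypothetical in-memory stand-in for a governance catalog."""

    def __init__(self) -> None:
        self._records = {}

    def register(self, record: ProvenanceRecord) -> str:
        rid = record.record_id()
        self._records[rid] = record
        return rid

    def lineage(self, path: str) -> list:
        """Return every recorded run that produced the given output path."""
        return [r for r in self._records.values() if path in r.outputs]


if __name__ == "__main__":
    catalog = Catalog()
    run = ProvenanceRecord(
        application="variant-caller",           # illustrative name only
        inputs=["sample.bam", "reference.fa"],
        outputs=["sample.vcf"],
        parameters={"min_quality": 30},
    )
    catalog.register(run)
    print([r.application for r in catalog.lineage("sample.vcf")])
```

The point of the sketch is the data flow: each run emits a structured record, and a catalog can later answer "which run produced this file, from which inputs and with which parameters".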
The logic behind this
reference architecture was to free workloads from hardware constraints: assign
the optimal resources needed (CPU, GPU, FPGA, VM) and address unpredictable
workload requirements. This architecture is portable to different
infrastructure providers, deployable to different hardware technologies and
allows the workloads to become reusable through validated and hardened
platforms.
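The matching of workloads to resource types can be pictured with a short sketch. This is a simplified illustration under stated assumptions, not the Orchestrator's scheduling logic: the `Resource` and `Job` types, the `assign` function and the "most free slots wins" rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Resource:
    """Hypothetical pool entry; kind is one of "CPU", "GPU", "FPGA", "VM"."""
    name: str
    kind: str
    free_slots: int


@dataclass(frozen=True)
class Job:
    """Hypothetical job with an ordered preference over resource kinds."""
    name: str
    preferred_kinds: tuple
    slots: int


def assign(job: Job, pool: Sequence[Resource]) -> Optional[Resource]:
    """Return the first resource that fits, honoring the job's preference order.

    Within one kind, the resource with the most free slots is chosen.
    Returns None when nothing in the pool can host the job.
    """
    for kind in job.preferred_kinds:
        candidates = [r for r in pool
                      if r.kind == kind and r.free_slots >= job.slots]
        if candidates:
            return max(candidates, key=lambda r: r.free_slots)
    return None


if __name__ == "__main__":
    pool = [
        Resource("gpu-node-1", "GPU", 2),
        Resource("gpu-node-2", "GPU", 4),
        Resource("cpu-node-1", "CPU", 16),
    ]
    training = Job("train-model", ("GPU", "CPU"), 3)
    chosen = assign(training, pool)
    print(training.name, "->", chosen.name if chosen else "queued")
```

Because the job states preferences rather than a fixed machine, the same job description can run wherever a suitable resource happens to be available, which is the sense in which workloads are freed from hardware constraints.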
This truly data-driven,
cloud-ready, AI-capable solution is based on deep industry experience and
constant feedback from leading organizations that are at the forefront of
precision medicine.
The Values to Reality
Users and infrastructure
providers are achieving valuable results and significant benefits from the HPDA
solution.
The key values for users:
- Ease of use: a self-service App Center with a graphical user interface,
based on advanced catalog and search engines, that allows users to manage
the data in real time with maximum flexibility.
- High performance: cloud-scale data management and multicloud workload
orchestration allow users to place data where it makes sense and provision
the required environment for peak demand periods in the cloud, dynamically
and automatically, for as long as needed, to maximize performance.
- Low cost: policy-based data management that can reduce storage costs by up
to 90% by automatically migrating file and object medical data to the
optimal storage tier.
- Global collaboration: multi-tenant access and data sharing that span
storage systems and geographic locations, enabling research initiatives
around the globe to use a common reference architecture to establish
strategic partnerships and collaborate.
The key values for infrastructure providers:
- Easy to install: a blueprint that compiles best practices and enables IT
architects to quickly deploy an end-to-end solution architecture designed
specifically to match different use cases and requirements.
- Fully tested: an IT architecture based on a solid roadmap of future-ready,
proven infrastructure that can easily be integrated into the existing
environment, protecting investments already made, especially in hardware
purchases and cloud services.
- Global industry ecosystem: a wide ecosystem to align with the latest
technologies for hybrid multicloud, big data analytics and AI, optimizing
data for the cost, compliance and performance expected by end users.
Today, there are over 100 enterprise deployments of HPDA, in the world's
largest cancer center and genome center as well as in precision medicine
projects and research hospitals. Some large pharma and biotech companies have
also started to adopt the architecture in their multi-Cloud infrastructure.
To learn more about HPDA
and its use cases and adoption in precision medicine and drug discovery, please
download and read my IBM Redbook at http://www.redbooks.ibm.com/abstracts/redp5481.html