June 5, 2019

A Metadata-Driven Clinical Data Management System & Analysis for Application in Scalable Clinical Data Lakes

In clinical research and development, data is central to optimizing clinical trials, especially the way studies are designed and executed. To meet the demands of scientific, operations, and project management teams, life sciences companies have moved their data collection systems to cloud-based solutions.

Even though traditional ETL is useful when preparing data for use in business-critical applications, it often falls short of providing information in time. For this reason, novel Clinical Data Lakes (CDL) have been designed to store, cleanse, and harmonize data rapidly and specifically for the type of study design. Using parameterized and repeatable data pipelines in a metadata-driven approach to CDL can help scale data lakes by rapidly integrating data from a variety of study designs and enabling more use cases, without having to rebuild data pipelines or redesign data models.

Because it comes from disparate source systems, the data ingested by clinical data lakes can be complementary or conflicting. Data pipelines in the CDL must therefore handle the complexity of data coming from systems such as:

  • Clinical Trial Management Systems (CTMS)
  • Electronic Data Capture (EDC)
  • Third-party data sources:
    • Files from Application Programming Interface (API)-based connections
    • Trial Master File (TMF) systems
    • Electronic Health Records (EHR)
    • Lab data
    • Wearable devices
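
As a minimal sketch of what "complementary or conflicting" can mean in practice, the Python snippet below merges one subject's records from several source systems. The function name, field names, and the rule that the first source seen wins are all illustrative assumptions, not part of LSAC or any AWS service.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
Conflict = Tuple[str, Tuple[str, str], Tuple[str, str]]


def merge_sources(records_by_source: Dict[str, Record]) -> Tuple[Record, List[Conflict]]:
    """Merge one subject's records from several sources (hypothetical helper).

    Fields seen in only one source are complementary and are simply added.
    Fields whose values differ between sources are reported as conflicts;
    the value from the first source seen is kept (an assumed precedence rule).
    """
    merged: Dict[str, Tuple[str, str]] = {}  # field -> (source, value)
    conflicts: List[Conflict] = []
    for source, record in records_by_source.items():
        for field, value in record.items():
            if field not in merged:
                merged[field] = (source, value)
            elif merged[field][1] != value:
                conflicts.append((field, merged[field], (source, value)))
    return {field: value for field, (_, value) in merged.items()}, conflicts


merged, conflicts = merge_sources({
    "CTMS": {"subject_id": "S1", "site_id": "101"},
    "EDC": {"subject_id": "S1", "site_id": "102", "visit_date": "2019-06-05"},
})
```

Here `visit_date` is complementary (only the EDC record carries it), while `site_id` is conflicting and would be routed to whatever review or precedence logic the pipeline defines.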

In addition to the data ingestion burden, data pipelines must be able to perform logic-based operations for the most complex and nuanced of study designs. For example, a Phase I dose-finding study has very different needs when transforming data from the Study Data Tabulation Model (SDTM) into the Analysis Data Model (ADaM) than a large-scale, outcomes-based Phase IV study. Moreover, these variations multiply with the type of disease being studied and treated and the type of investigational strategy used.

Biostatisticians must know these differences intuitively and repeat the various processing tasks, with no means of sharing what data processing elements were designed and implemented, why, and how. Therefore, repeatable data pipelines that leverage metadata structures mapped to processing tasks specific to disease and intervention are game-changers in the trial data management space.
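
One way to picture metadata mapped to processing tasks is a small registry of reusable steps plus a metadata table that lists which steps apply to each study design. The Python sketch below is illustrative only: the step functions, study-design keys, and field names are assumptions made for the example and do not describe how LSAC is implemented.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Step = Callable[[List[Record]], List[Record]]


def drop_incomplete(records: List[Record]) -> List[Record]:
    """Remove records that have no subject identifier."""
    return [r for r in records if r.get("subject_id")]


def normalize_units(records: List[Record]) -> List[Record]:
    """Convert doses recorded in grams to milligrams."""
    out = []
    for r in records:
        r = dict(r)  # copy so the input is not mutated
        if r.get("dose_unit") == "g":
            r["dose"] = r["dose"] * 1000
            r["dose_unit"] = "mg"
        out.append(r)
    return out


def flag_dose_escalation(records: List[Record]) -> List[Record]:
    """Mark records whose dose exceeds the same subject's previous dose."""
    last_dose: Dict[object, object] = {}
    out = []
    for r in records:
        r = dict(r)
        prev = last_dose.get(r["subject_id"])
        r["escalated"] = prev is not None and r["dose"] > prev
        last_dose[r["subject_id"]] = r["dose"]
        out.append(r)
    return out


# Reusable steps, addressed by name.
STEP_REGISTRY: Dict[str, Step] = {
    "drop_incomplete": drop_incomplete,
    "normalize_units": normalize_units,
    "flag_dose_escalation": flag_dose_escalation,
}

# Hypothetical metadata: which steps apply to which study design.
PIPELINE_METADATA: Dict[str, List[str]] = {
    "phase1_dose_finding": ["drop_incomplete", "normalize_units", "flag_dose_escalation"],
    "phase4_outcomes": ["drop_incomplete", "normalize_units"],
}


def run_pipeline(study_design: str, records: List[Record]) -> List[Record]:
    """Run the steps that the metadata lists for this study design, in order."""
    for step_name in PIPELINE_METADATA[study_design]:
        records = STEP_REGISTRY[step_name](records)
    return records


records = [
    {"subject_id": "S1", "dose": 0.5, "dose_unit": "g"},
    {"subject_id": "S1", "dose": 750, "dose_unit": "mg"},
    {"subject_id": None, "dose": 10, "dose_unit": "mg"},
]
phase1 = run_pipeline("phase1_dose_finding", records)
phase4 = run_pipeline("phase4_outcomes", records)
```

Under this design, onboarding a new study design means adding a metadata entry rather than writing a new pipeline, and the metadata itself documents what was done to the data and in which order.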

Saama and Amazon Web Services (AWS) have collaborated to optimize metadata-driven clinical data lakes for the life sciences industry. Key components of clinical data lakes using a metadata-driven clinical data management system are:

  • Metadata repository
  • Metadata identification and parsing service
  • AI/ML models as a service for inference
  • Workflow automation service

By overlaying Saama’s Life Science Analytics Cloud (LSAC) on AWS cloud computing services such as Amazon S3, Amazon EC2, Amazon EMR, AWS Lambda, and AWS Identity and Access Management, multiple pipelines for operational, clinical, real-world evidence (RWE) and commercial data can be run on a clinical data lake to cater to a variety of use cases, including:

  • Planning
  • Conduct
  • Review and monitoring
  • Report outcomes

Such an architecture enables the ingestion, standardization, transformation, and reading of datasets without creating redundancy or support and maintenance overhead. Furthermore, the aggregation of derived information to drive safety, clinical science, health economics, and other use cases is part of the pipelines within the architecture.


Recently, Saama successfully implemented a metadata-driven approach to a clinical data lake for a customer that needed to integrate multiple data sources for operations and patient analytics but was unable to parse key information from each study because of varied data formats. Saama used LSAC for a metadata-driven approach that integrated the different data sources and automated the integration and integrity checks, rather than relying on the incoming format. Saama’s implementation of this solution, which was three times faster than industry benchmarks, enabled our customer to reach their insights more quickly. The customer can also continue to use the pipelines Saama built to onboard new studies and source systems without needing to profile sources or build additional pipelines.
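
To illustrate integration and integrity checks that do not depend on the incoming format, the Python sketch below keeps one column mapping per source as metadata and validates every record against a single canonical schema. The source names, column names, and required fields are hypothetical and are not taken from the customer project.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]

# Canonical fields every harmonized record must carry (assumed for the example).
CANONICAL_REQUIRED = ("subject_id", "site_id", "visit_date")

# Hypothetical metadata: raw column name -> canonical field, per source.
SOURCE_METADATA: Dict[str, Dict[str, str]] = {
    "edc_vendor_a": {"SUBJID": "subject_id", "SITE": "site_id", "VISITDT": "visit_date"},
    "ctms_export": {"patient": "subject_id", "site_number": "site_id", "date_of_visit": "visit_date"},
}


def harmonize(source: str, row: Row) -> Row:
    """Rename a raw row's columns to canonical names using the source's mapping."""
    mapping = SOURCE_METADATA[source]
    return {canonical: row.get(raw) for raw, canonical in mapping.items()}


def integrity_issues(record: Row) -> List[str]:
    """Return the required canonical fields that are missing or empty."""
    return [field for field in CANONICAL_REQUIRED if record.get(field) in (None, "")]


def ingest(source: str, rows: List[Row]) -> Tuple[List[Row], List[Tuple[Row, List[str]]]]:
    """Harmonize rows and split them into accepted and rejected records."""
    accepted, rejected = [], []
    for row in rows:
        record = harmonize(source, row)
        issues = integrity_issues(record)
        if issues:
            rejected.append((record, issues))
        else:
            accepted.append(record)
    return accepted, rejected


edc_ok, edc_bad = ingest("edc_vendor_a", [
    {"SUBJID": "S1", "SITE": "101", "VISITDT": "2019-06-05"},
    {"SUBJID": "S2", "SITE": "", "VISITDT": "2019-06-06"},
])
ctms_ok, ctms_bad = ingest("ctms_export", [
    {"patient": "S1", "site_number": "101", "date_of_visit": "2019-06-05"},
])
```

In this sketch, a new source system is onboarded by adding one entry to `SOURCE_METADATA`; the harmonization and integrity-check code stays unchanged.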

Saama also extended these pipelines to include machine learning components such as safety signaling, outlier detection, biomarker response, and others.

In summary, a world-class solution like LSAC ensures the realization of a metadata-driven approach to clinical data lakes through five components:

  • Using best-in-breed technology to deal with large and varied data sets
  • Blending that technology with appropriate domain centricity
  • Partnering with the right ecosystem for a holistic view of the data enterprise
  • Leveraging artificial intelligence and deep learning in a targeted way to address meaningful and difficult issues
  • Aligning with partners like AWS to enhance our security and compliance posture

Such a metadata-driven approach to clinical data lakes will help companies in clinical development focus on improving patient outcomes with faster access to processed information and machine analysis, without worrying about data management-related activities.

Watch the webinar, Metadata-Driven Approach for Clinical Data Lakes, on demand at Xtalks.

Saama can put you on the fast track to clinical trial process innovation.