Sufficient sample clinical data is often unavailable, for various reasons, yet such data is essential for running analytics during development and testing. Rajeev Gangal discusses how the team tackled the need to generate simulated data for a project. Read on to know more.
To paraphrase a recent statement by The Economist, ‘Data is the New Oil.’ Unfortunately, companies often have limited access to it, and the amount of real data available is frequently not enough to run typical analytics and data science algorithms and obtain KPI-based insights.
Simulation is a methodology that addresses this gap between data and actionable insights by generating data whose statistical properties are similar to those of the original sample data.
Clinical data, especially lab data, is unavailable in the public domain, as it is critical intellectual property for the sponsor and sensitive data from a participating patient’s viewpoint. Sometimes, very limited anonymized samples (e.g., 50 subjects) may be available, and simulating data for more subjects helps the development and testing of Patient (Clinical) Data repositories. It may be argued that data can also be curated manually; however, that can only be done for small datasets, which may miss the statistical nature of actual data.
Simulation Use Cases
Let’s look at ways to simulate data.
- If, for a particular study and condition, sample patient lab data is available for a particular lab test (e.g., BP/Pulse rate) or procedure, but not for demographic and other variables, the data can be simulated using univariate statistical methods and machine learning.
- If sample patient lab data is available for a particular lab test along with demographic and other variables, multivariate statistical methods, ensemble, and ML methods can be used.
- If no sample data is available, a user can supply mean/median, lower/upper bounds and appropriate statistical distribution for simulation.
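The last case above, where a user supplies summary values instead of a sample, can be sketched as follows. This is a minimal Python illustration, not the R code used in the project: the helper name `simulate_from_summary`, the choice of a normal distribution, the standard deviation argument, and the example numbers are all assumptions made for the sketch.

```python
import random

def simulate_from_summary(n, mean, sd, lower, upper, seed=None):
    """Draw n values from a normal(mean, sd) restricted to [lower, upper].

    Hypothetical helper: uses simple rejection sampling, which is adequate
    when the bounds cover most of the distribution's mass.
    """
    if not lower < upper:
        raise ValueError("lower must be less than upper")
    rng = random.Random(seed)
    values = []
    while len(values) < n:
        x = rng.gauss(mean, sd)
        if lower <= x <= upper:  # discard draws outside the plausible range
            values.append(x)
    return values

# Illustrative numbers only (not taken from any study)
pulse = simulate_from_summary(1000, mean=75, sd=10, lower=50, upper=110, seed=1)
```

Rejection sampling keeps the sketch short. If the bounds cut off a large part of the distribution, an inverse-CDF approach would waste fewer draws.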
The focus, in this case, is on univariate lab data simulation for parameters such as BPSYS and pulse rate, measured over multiple visits. The requirement of simulating data over multiple visits for each patient increases complexity, since independence between subsequent visits cannot be assumed. There has to be a relationship between different time points, because patients are recruited, randomized, treated, and then adapt to the treatment.
Our algorithm thus used two different approaches for simulation followed by a meta-approach to generate the final data.
A. Simulation of Independent Visit data
- For each visit, several continuous statistical distributions are fitted independently to the pulse rate of the sample subjects.
- The best-fitting distribution is selected using goodness-of-fit measures such as AIC (Akaike information criterion), BIC (Bayesian information criterion), and log-likelihood values. In the present case, the best distribution was logistic, while the normal model was nearly as good.
- Population parameters for the best-fitting distribution are estimated with the bootstrap method.
- To generate the data, these population parameters are used with a truncated version of the best-fitting distribution.
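The steps of Approach A can be sketched roughly as below. This is an illustrative Python version, not the project's R implementation, and it makes several assumptions: only the normal and logistic candidates are compared, the logistic scale is moment-matched from the standard deviation rather than obtained by maximum likelihood (so its AIC is approximate), the truncation bounds are the observed minimum and maximum, and all function names and the toy sample are hypothetical.

```python
import math
import random
import statistics

def normal_loglik(xs, mu, sigma):
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

def logistic_loglik(xs, mu, s):
    total = 0.0
    for x in xs:
        z = abs((x - mu) / s)  # the logistic density is symmetric in z
        total += -z - math.log(s) - 2 * math.log1p(math.exp(-z))
    return total

def aic_table(xs):
    """AIC = 2k - 2*loglik with k = 2 parameters; lower is better."""
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    s = sd * math.sqrt(3) / math.pi  # moment-matched logistic scale (assumption)
    return {
        "normal": 4 - 2 * normal_loglik(xs, mu, sd),
        "logistic": 4 - 2 * logistic_loglik(xs, mu, s),
    }

def bootstrap_logistic_params(xs, n_boot=500, rng=random):
    """Average location and scale over bootstrap resamples of the sample."""
    locs, scales = [], []
    for _ in range(n_boot):
        resample = rng.choices(xs, k=len(xs))  # sample with replacement
        locs.append(statistics.fmean(resample))
        scales.append(statistics.pstdev(resample) * math.sqrt(3) / math.pi)
    return statistics.fmean(locs), statistics.fmean(scales)

def truncated_logistic(n, mu, s, lower, upper, rng=random):
    """Inverse-CDF sampling from a logistic restricted to [lower, upper]."""
    def cdf(x):
        return 1 / (1 + math.exp(-(x - mu) / s))
    lo, hi = cdf(lower), cdf(upper)
    out = []
    for _ in range(n):
        u = lo + rng.random() * (hi - lo)
        x = mu + s * math.log(u / (1 - u))
        out.append(min(max(x, lower), upper))  # guard against rounding error
    return out

rng = random.Random(42)
sample = [rng.gauss(72, 9) for _ in range(144)]  # toy stand-in for one visit
aic = aic_table(sample)
mu, s = bootstrap_logistic_params(sample, rng=rng)
simulated = truncated_logistic(1000, mu, s, min(sample), max(sample), rng=rng)
```

Sampling through the inverse CDF between the two bound probabilities gives an exact draw from the truncated distribution, with no rejected values.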
B. Simulation of Difference between subsequent visits
Similar to the above-mentioned case:
- For each pair of subsequent visits, several continuous statistical distributions are fitted to the differences in pulse rate for the same sample subjects. Thus, we fit four distributions to account for 5 visits: V2-V1, V3-V2, V4-V3, and V5-V4. Visit V1 is handled as in Approach A.
- The best-fitting distribution is selected using goodness-of-fit measures such as AIC, BIC, and log-likelihood values. In the present case, the best distribution was logistic.
- With this distribution, pulse rate differences between visits are generated using bootstrapped population parameters. The pulse rate on the original scale is then obtained by adding this difference to the mean.
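A rough Python sketch of Approach B follows. It is an illustration under stated assumptions rather than the project's R code. In particular, it chains each simulated difference onto the previous simulated visit value, which is one possible reading and differs from the wording above (adding the difference to the mean). It also uses a moment-matched logistic scale, skips the bootstrap step, and uses hypothetical function names and toy data.

```python
import math
import random
import statistics

def logistic_draw(mu, s, rng):
    """One draw from a logistic(mu, s) via the inverse CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return mu + s * math.log(u / (1 - u))

def fit_differences(visits):
    """visits: list of per-subject lists [V1, V2, ...].

    Returns a (location, scale) pair for each consecutive difference
    V(k+1) - V(k), computed within the same subject.
    """
    n_visits = len(visits[0])
    params = []
    for k in range(n_visits - 1):
        diffs = [subj[k + 1] - subj[k] for subj in visits]
        loc = statistics.fmean(diffs)
        scale = statistics.pstdev(diffs) * math.sqrt(3) / math.pi
        params.append((loc, scale))
    return params

def simulate_subject(v1, diff_params, rng):
    """Build a visit trajectory by adding simulated differences in sequence."""
    values = [v1]
    for loc, scale in diff_params:
        values.append(values[-1] + logistic_draw(loc, scale, rng))
    return values

rng = random.Random(7)
# Toy sample: 20 subjects, 5 visits each (values are made up)
toy = []
for _ in range(20):
    v = [rng.gauss(72, 9)]
    for _ in range(4):
        v.append(v[-1] + rng.gauss(-1, 3))
    toy.append(v)

params = fit_differences(toy)            # four difference distributions
trajectory = simulate_subject(72.0, params, rng)
```

Working with within-subject differences is what carries the dependence between visits into the simulated data. Independent per-visit draws alone would lose it.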
Typically, hundreds or thousands of bootstrap samples were created from a model to get a distribution of the statistic.
Rather than directly generating a simulated dataset for a thousand or ten thousand patients, we first assessed our models with the same sample size as the original data. Results are shown in image 1 for all approaches. Neither of the two approaches gave accurate estimates on its own, but each accounted for part of the variation in the pulse rate distributions.
At this juncture, we decided to use linear regression and other methods to combine the two approaches. The training set was composed of original lab values repeated 10 times with the simulation models also run 10 times to estimate these values. However, the results were still not very accurate. Increasing the estimation space of the models seemed like a good idea.
We increased the training set size to 100x by repeating the original pulse rate values 100 times as the dependent variable. We likewise estimated these values 100 times using Approaches A and B. The regression gave us results that were quite similar to the original dataset. These final results are shared below.
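The regression step that combines the two approaches might look like the sketch below. It is a minimal Python illustration of ordinary least squares with two predictors, not the project's R code. The article does not state how simulated values were paired with original values, so the sketch simply takes three equal-length lists; the function names and the toy data (built with known coefficients so that the fit can be checked) are assumptions.

```python
import random
import statistics

def fit_meta_model(y, a, b):
    """OLS fit of y ~ b0 + b1*a + b2*b using the two-predictor closed form.

    y: original values (repeated as needed); a, b: estimates from
    Approaches A and B, aligned element by element with y.
    """
    ybar, abar, bbar = statistics.fmean(y), statistics.fmean(a), statistics.fmean(b)
    saa = sum((ai - abar) ** 2 for ai in a)
    sbb = sum((bi - bbar) ** 2 for bi in b)
    sab = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b))
    say = sum((ai - abar) * (yi - ybar) for ai, yi in zip(a, y))
    sby = sum((bi - bbar) * (yi - ybar) for bi, yi in zip(b, y))
    det = saa * sbb - sab ** 2
    if det == 0:
        raise ValueError("predictors are collinear")
    b1 = (sbb * say - sab * sby) / det
    b2 = (saa * sby - sab * say) / det
    b0 = ybar - b1 * abar - b2 * bbar
    return b0, b1, b2

def predict(coefs, a_value, b_value):
    b0, b1, b2 = coefs
    return b0 + b1 * a_value + b2 * b_value

# Toy check with known coefficients (made-up data, not study results)
rng = random.Random(0)
a = [rng.gauss(72, 9) for _ in range(200)]
b = [rng.gauss(72, 9) for _ in range(200)]
y = [2.0 + 0.6 * ai + 0.4 * bi for ai, bi in zip(a, b)]
coefs = fit_meta_model(y, a, b)
```

Once fitted, `predict` can be applied to fresh draws from the two approaches to produce the combined estimate for new simulated subjects.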
We then used this meta-model to simulate pulse rate for 1,000 and 10,000 subjects. Judging by the density curves and other measures of estimation error, the simulated data closely reproduced the statistical nature of this lab data.
Probability density for 10K simulated patients vs. grey area for the 144 original patients
Image 1: Ratio of simulated population (10K) below/above key point compared to original sample
The simulation was performed using custom code in R and some R libraries like “fitdistrplus” and “boot”. We believe that it accurately captures the nature of multi-visit lab data for clinical trial subjects and can be reliably used to generate synthetic data when the original sample is not sufficient.
We also used this data to calculate KPIs related to pulse rate, such as the number of patients below the mean and the 2nd quartile. It gave consistent results, supporting its value as a simulation tool. We are internally using this tool to simulate data for our Patient Data Repository solution, which is a part of our CDaaS infrastructure.
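KPIs of this kind can be compared between the original and simulated data with a few lines of code. The Python sketch below is only an illustration: the helper names and the ten toy values are made up, and the quartiles use the default method of Python's `statistics.quantiles`, which may differ slightly from the quantile definition used in the R code.

```python
import statistics

def share_below(values, cut):
    """Fraction of values strictly below a cut point."""
    return sum(v < cut for v in values) / len(values)

def kpi_summary(values, reference_mean):
    """Share below a reference mean, plus the three quartiles."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "share_below_mean": share_below(values, reference_mean),
        "q1": q1,
        "median": q2,
        "q3": q3,
    }

# Toy values standing in for an original sample (not real patient data)
original = [60, 64, 68, 70, 72, 75, 78, 80, 85, 90]
kpis = kpi_summary(original, statistics.fmean(original))
```

Running the same summary on the simulated subjects, with the original sample's mean as `reference_mean`, gives a direct side-by-side check.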
Simulation is very useful in what-if analyses of clinical trials. Several commercial tools, like EAST™, evaluate the impact of protocol changes, site characteristics, and effectiveness of design. Unlike these tools, our approach focuses on lab data generation to enable development and validation of high-quality clinical data analytics solutions.
Thanks to Sanjay Sane for the statistical inputs and Manasi Shrotri for developing the code and contributing to the statistical thought process.
Write to firstname.lastname@example.org to receive a copy of this code to generate simulated data from your own datasets.