2021
DOI: 10.1136/bmjopen-2020-043497

Can synthetic data be a proxy for real clinical trial data? A validation study

Abstract: Objectives: There are increasing requirements to make research data, especially clinical trial data, more broadly available for secondary analyses. However, data availability remains a challenge due to complex privacy requirements. This challenge can potentially be addressed using synthetic data. Setting: Replication of a published stage III colon cancer trial secondary analysis using synthetic data generated by a machine learning method. Participants: There were 1543 patients in the control arm that were included in …

Cited by 56 publications (54 citation statements)
References 41 publications (35 reference statements)
“…Some research suggests that synthetic data can be used as a proxy for the real dataset in analyses [35,40]. We believe that the synthetic data constructed here is accurate enough to make predictions consistent with the real data, but do not advise that it currently be used in place of the real data as a proxy.…”
Section: Discussion (mentioning)
confidence: 82%
“…Previous literature regarding simulation of survival data tends to focus on the creation of entirely new data under a range of set parameters [1,2,31-33]. Many simulation studies make use of either exponential or Weibull distributions [34,35]; however, these are often not flexible enough to fully capture the shape of underlying hazard functions found in real-world clinical trial or population-based data, where at least one turning point is observed in the hazard function [21]. Making use of the flexibility in a Royston-Parmar model provides a good solution for replicating real-world survival data.…”
Section: Discussion (mentioning)
confidence: 99%
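
The limitation described in the statement above can be illustrated with a minimal sketch. It is not the cited authors' method; the function names and parameter values are hypothetical and chosen only for illustration. It draws survival times from a Weibull distribution by inverse-transform sampling and evaluates the Weibull hazard, which is monotone for any shape parameter and so cannot reproduce a hazard with a turning point.

```python
import math
import random


def weibull_hazard(t, shape, scale):
    """Weibull hazard h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def simulate_weibull_times(n, shape, scale, seed=0):
    """Draw n survival times by inverting S(t) = exp(-(t / scale) ** shape)."""
    rng = random.Random(seed)
    times = []
    for _ in range(n):
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
        times.append(scale * (-math.log(u)) ** (1.0 / shape))
    return times


if __name__ == "__main__":
    # Hypothetical parameters, for illustration only.
    grid = [0.5, 1.0, 2.0, 4.0]
    for shape in (0.7, 1.0, 1.5):
        hazards = [round(weibull_hazard(t, shape, 2.0), 3) for t in grid]
        print(f"shape={shape}: hazard on grid = {hazards}")
    sample = simulate_weibull_times(5, shape=1.5, scale=2.0)
    print("five simulated survival times:", [round(t, 2) for t in sample])
```

For a shape below 1 the hazard falls throughout, for a shape of 1 it is constant (the exponential case), and for a shape above 1 it rises throughout. A hazard that rises and then falls needs a more flexible form, such as the spline-based Royston-Parmar model the statement refers to.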
“…The process involves generating synthetic data from real data using a machine learning model that captures the patterns in real data and then generates new data from that model. The generated non-identifiable data closely match the statistical properties and patterns in the original dataset, offering very similar results and leading to the same conclusions, all while preserving individuals’ privacy and without the legislative need for additional consent [4,5]. This method is further discussed in detail in the next section.…”
Section: Privacy and Confidentiality-preserving Solutions For Geolocation (mentioning)
confidence: 80%
“…This is known as synthetic data and it is created using machine learning generative models [204]. It produces entirely new datasets as a proxy from the collected data, which are then tested and validated.…”
Section: Privacy (mentioning)
confidence: 99%