Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata

Hu, Wei; Zaveri, Amrapali; Qiu, Honglei; Dumontier, Michel

doi:10.1186/s12859-017-1832-4

Cited by 11 publications

(14 citation statements)

References 16 publications


“…Eligibility criteria are stored as semi-structured text; they are recommended to be formatted as a bulleted list of individual criteria, but nearly 49% of values fail to parse according to the expected format. In NCBI's BioSample repository and Gene Expression Omnibus (GEO), and in EBI's BioSamples repository, the two issues most impeding data reuse are non-standardized field names and malformed values that failed to conform to the expected type for their field 41,42. Apart from minor irregularities in some fields with enumerated values, ClinicalTrials.gov metadata were entirely free from these issues.…”

Section: Discussion (mentioning)

confidence: 99%

“…ClinicalTrials.gov records, like metadata records from other widely used biomedical data repositories 41,42, are plagued by quality issues. Several studies have analyzed ClinicalTrials.gov records for missing fields required by the Food and Drug Administration Amendments Act of 2007, which governs US trial registries, and the World Health Organization (WHO) minimum data set, which provides guidelines for registries internationally 43–46.…”

Section: Introduction (mentioning)

confidence: 99%


Obstacles to the reuse of study metadata in ClinicalTrials.gov

2020


Metadata that are structured using principled schemas and that use terms from ontologies are essential to making biomedical data findable and reusable for downstream analyses. The largest source of metadata that describes the experimental protocol, funding, and scientific leadership of clinical studies is ClinicalTrials.gov. We evaluated whether values in 302,091 trial records adhere to expected data types and use terms from biomedical ontologies, whether records contain fields required by government regulations, and whether structured elements could replace free-text elements. Contact information, outcome measures, and study design are frequently missing or underspecified. Important fields for search, such as condition and intervention, are not restricted to ontologies, and almost half of the conditions are not denoted by MeSH terms, as recommended. Eligibility criteria are stored as semi-structured free text. Enforcing the presence of all required elements, requiring values for certain fields to be drawn from ontologies, and creating a structured eligibility criteria element would improve the reusability of data from ClinicalTrials.gov in systematic reviews, meta-analyses, and matching of eligible patients to trials.
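The kind of type-adherence check this abstract describes can be illustrated with a minimal Python sketch. It is not the authors' code: the field names (`has_dmc`, `enrollment`, `start_date`, `study_type`), the date format, and the value set are all hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Hypothetical value set for illustration; not the registry's actual enumeration.
STUDY_TYPES = {"Interventional", "Observational"}


def is_boolean(value: str) -> bool:
    """Accept a small set of common Boolean spellings (an assumption)."""
    return value.strip().lower() in {"yes", "no", "true", "false"}


def is_integer(value: str) -> bool:
    """True if the whole value parses as an integer."""
    try:
        int(value.strip())
        return True
    except ValueError:
        return False


def is_date(value: str, fmt: str = "%Y-%m-%d") -> bool:
    """True if the value is a real calendar date in the assumed format."""
    try:
        datetime.strptime(value.strip(), fmt)
        return True
    except ValueError:
        return False


# Hypothetical mapping from field name to the check for its expected type.
CHECKS = {
    "has_dmc": is_boolean,
    "enrollment": is_integer,
    "start_date": is_date,
    "study_type": lambda v: v in STUDY_TYPES,
}


def invalid_fields(record: dict) -> list:
    """Return the names of present fields whose values fail their type check."""
    return [
        name
        for name, check in CHECKS.items()
        if name in record and not check(record[name])
    ]


record = {
    "has_dmc": "Yes",
    "enrollment": "about 40",    # free text in an integer field
    "start_date": "2019-13-01",  # month 13 does not exist
    "study_type": "Interventional",
}
print(invalid_fields(record))  # ['enrollment', 'start_date']
```

Running such a check over every record and counting failures per field would give a per-field rate of malformed values; absent fields are skipped here and would need a separate required-field check.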


“…Several empirical studies have shown the need for better practices in curating scientific data [2][3][4][5]. Community efforts to improve metadata quality include various minimum metadata standards such as Minimum Information about a Next-Generation Sequencing Experiment (MINSEQE) [6] or broader principles such as the FAIR guidelines.…”

Section: Introduction (mentioning)

confidence: 99%

Ten simple rules for annotating sequencing experiments

et al. 2020


“…ClinicalTrials.gov records, like metadata records from other widely used biomedical data repositories, 41,42 are plagued by quality issues. Several studies have analyzed ClinicalTrials.gov records for missing fields required by the Food and Drug Administration Amendments Act of 2007, which governs US trial registries, and the World Health Organization (WHO) minimum data set, which provides guidelines for registries internationally [43–46].…”

Section: Introduction (mentioning)

confidence: 99%

Obstacles to the reuse of study metadata in ClinicalTrials.gov

Miron; Gonçalves; Musen

2019

Preprint


Objective: ClinicalTrials.gov is a registry of clinical-trial metadata whose use is required by many funding agencies and scientific publishers. Metadata are essential to the reuse of data, but issues such as heterogeneous metadata schemas, inconsistent values, and usage of free text instead of controlled terms pervade many metadata repositories. Our objective is to evaluate the quality of metadata about clinical studies in ClinicalTrials.gov and to document strategies to improve metadata accuracy.

Methods: Using 302,091 metadata records, we evaluated whether values adhere to type expectations for Boolean, integer, date, age, and value-set fields, and whether records contain fields required by the Food and Drug Administration. We tested whether values for condition and intervention use terms from biomedical ontologies, and whether values for eligibility criteria follow the recommended format.

Results: For simple fields, records contain correctly typed values, but there are anomalies in value-set fields. Contact information, outcome measures, and study design are frequently missing or underspecified. Important fields for search, such as condition and intervention, are not restricted to ontology terms, and almost half of the values for condition are not from MeSH, as recommended. Eligibility criteria are stored as unstructured free text.

Conclusions: ClinicalTrials.gov's data-entry system enforces a schema with type restrictions, freeing records from common issues in metadata repositories. However, lack of ontology restrictions or structure for the condition, intervention, and eligibility criteria elements significantly impairs reusability. Searchability of the database depends on infrastructure that maps free-text values to terms from UMLS ontologies.

Metadata are the lifeblood of biomedical data. At the simplest level, metadata are data that describe other data. In practice, we expect metadata to be structured and standardized, and to be useful in making the underlying data findable and reusable. High-quality metadata enhance scientific reproducibility and transparency, allow researchers to pool studies to increase the statistical power of inferences,[1] and enable the use of "big data" machine learning techniques. International metadata repositories such as the National Center for Biotechnology Information's (NCBI) BioSample and the European Bioinformatics Institute's (EBI) BioSamples repositories encourage data reuse through the availability of comprehensive metadata. They each gather metadata from several different repositories of biological data into a centralized, searchable database. Ideally, they also ensure that metadata follow unified standards and schema regardless of the author, source, and format of the original data.

Unfortunately, biomedical metadata are plagued by numerous quality issues. Hu et al. examined the quality of the metadata that accompany data records in the Gene Expression Omnibus (GEO) and found that they suffered from type inconsistency (e.g., numerical fields populated with non-numerica...
