2017
DOI: 10.1186/s12859-017-1832-4

Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata

Abstract: Background: The ability to efficiently search and filter datasets depends on access to high-quality metadata. While most biomedical repositories require data submitters to provide a minimal set of metadata, some, such as the Gene Expression Omnibus (GEO), allow users to specify additional metadata in the form of textual key-value pairs (e.g. sex: female). However, since there is no structured vocabulary to guide the submitter regarding the metadata terms to use, the 44,000,000+ key-value pairs in GE…

Cited by 11 publications
(14 citation statements)
References 16 publications
“…Eligibility criteria are stored as semi-structured text; they are recommended to be formatted as a bulleted list of individual criteria, but nearly 49% of values fail to parse according to the expected format. In NCBI's BioSample repository and Gene Expression Omnibus (GEO), and in EBI's BioSamples repository, the two issues most impeding data reuse are non-standardized field names and malformed values that failed to conform to the expected type for their field 41,42. Apart from minor irregularities in some fields with enumerated values, ClinicalTrials.gov metadata were entirely free from these issues.…”
Section: Discussion (mentioning)
confidence: 99%
“…ClinicalTrials.gov records, like metadata records from other widely used biomedical data repositories 41,42, are plagued by quality issues. Several studies have analyzed ClinicalTrials.gov records for missing fields required by the Food and Drug Administration Amendments Act of 2007, which governs US trial registries, and the World Health Organization (WHO) minimum data set, which provides guidelines for registries internationally 43–46.…”
Section: Introduction (mentioning)
confidence: 99%
“…Several empirical studies have shown the need for better practices in curating scientific data [2][3][4][5]. Community efforts to improve metadata quality include various minimum metadata standards such as Minimum Information about a Next-Generation Sequencing Experiment (MINSEQE) [6] or broader principles such as the FAIR guidelines.…”
Section: Introduction (mentioning)
confidence: 99%
“…ClinicalTrials.gov records, like metadata records from other widely used biomedical data repositories, 41,42 are plagued by quality issues. Several studies have analyzed ClinicalTrials.gov records for missing fields required by the Food and Drug Administration Amendments Act of 2007, which governs US trial registries, and the World Health Organization (WHO) minimum data set, which provides guidelines for registries internationally [43][44][45][46].…”
Section: Introduction (mentioning)
confidence: 99%