A community effort to identify and correct mislabeled samples in proteogenomic studies

Yoo, Seungyeul; Shi, Zhiao; Wen, Bo; Kho, Soon Jye; Pan, Renke; Feng, Hanying; Chen, Hong; Carlsson, Anders; Edén, Patrik; Ma, Wanli; Raymer, Michael L.; Maier, Ezekiel J.; Težak, Živana; Johanson, Elaine; Hinton, Denise; Rodriguez, Henry; Zhu, Jun; Boja, Emily S.; Wang, Pei; Zhang, Bing

doi:10.1016/j.patter.2021.100245

Cited by 6 publications

(3 citation statements)

References 33 publications


“…Thus, Figure 7c,d shows that the performance of PLS-DA on textiles was poor, and the robustness of XGBoost was better. Last, since manual labeling was used in this work, as shown in Figure 7b, mislabeling was inevitable [33] owing to limited experience and knowledge. In the last line of image Test3, there were some strings interspersed in the middle of the bamboo board that were inaccurately labeled as a type of wood.…”

Section: Testing Result

mentioning

confidence: 99%

Application of XGBoost for Fast Identification of Typical Industrial Organic Waste Samples with Near-Infrared Hyperspectral Imaging

Lan et al. 2023

ACS EST Eng.

Waste material identification is an essential part of waste recycling and treatment. Hyperspectral imaging (HSI) enables fast, accurate, nondestructive, and non-invasive identification of waste materials. In this study, HSI-based classification of typical industrial organic waste that cannot be sorted via traditional methods has been explored, namely, leather, paper, plastic, rubber, textile, and wood. The extreme gradient boosting (XGBoost) algorithm, a supervised machine learning algorithm that has never been investigated for waste identification-related fields, was adopted. The results show that XGBoost obtained a higher pixelwise weighted average F1-score of 82.72% and a faster prediction time of 270 ms for the tested images compared with the commonly used partial least squares-discriminant analysis (77.83% and 444 ms). XGBoost was more effective and efficient in aiding HSI identification and classification of industrial organic waste. The technique can be a significant advancement in the development of an online sorting or identification platform, affording significant labor cost reduction, time savings, and the provision of a stable, accurate, and rapid method for waste intelligent identification.

“…A recent study by Yoo et al. (26), for instance, reported a community effort to address sample mislabelling issues in proteogenomic and multi-omics studies, and found 7.5% and 3.5% mislabelled samples in two datasets. To our best knowledge, tissue heterogeneity has not been addressed on a large scale by such a community effort.…”

Section: Discussion

mentioning

confidence: 99%

Tissue heterogeneity is prevalent in gene expression studies

Sturm, List, Zhang 2021

NAR Genomics and Bioinformatics

Lack of reproducibility in gene expression studies is a serious issue being actively addressed by the biomedical research community. Besides established factors such as batch effects and incorrect sample annotations, we recently reported tissue heterogeneity, a consequence of unintended profiling of cells of other origins than the tissue of interest, as a source of variance. Although tissue heterogeneity exacerbates irreproducibility, its prevalence in gene expression data remains unknown. Here, we systematically analyse 2 667 publicly available gene expression datasets covering 76 576 samples. Using two independent data compendia and a reproducible, open-source software pipeline, we find a prevalence of tissue heterogeneity in gene expression data that affects between 1 and 40% of the samples, depending on the tissue type. We discover both cases of severe heterogeneity, which may be caused by mistakes in annotation or sample handling, and cases of moderate heterogeneity, which are likely caused by tissue infiltration or sample contamination. Our analysis establishes tissue heterogeneity as a widespread phenomenon in publicly available gene expression datasets, which constitutes an important source of variance that should not be ignored. Consequently, we advocate the application of quality-control methods such as BioQC to detect tissue heterogeneity prior to mining or analysing gene expression data.

“…Therefore, it is essential to integrate real-life data from communities with complementary technical strengths and complex performances. A paradigm model is the crowdsourced precisionFDA challenges, which leverages the power of community participants to identify the QC tools with high accuracy and robustness [17], and to upgrade benchmarks for easy- and difficult-to-map genomics regions [18], etc. This exemplary model deserves to be extended to more dimensions with other types of omic studies to help researchers gain the knowledge and resources to ensure data quality and thus improve the reliability of omics-based biological discoveries.…”

mentioning

confidence: 99%

The Quartet Data Portal: integration of community-wide resources for multiomics quality control

Yang, Liu, Shang et al. 2022

Preprint

The implementation of quality control for multiomic data requires the widespread use of well-characterized reference materials, reference datasets, and related resources. The Quartet Data Portal was built to facilitate community access to such rich resources established in the Quartet Project. A convenient platform is provided for users to request the DNA, RNA, protein, and metabolite reference materials, as well as multi-level datasets generated across omics, platforms, labs, protocols, and batches. Interactive visualization tools are offered to assist users to gain a quick understanding of the reference datasets. Crucially, the Quartet Data Portal continuously collects, evaluates, and integrates the community-generated data of the distributed Quartet multiomic reference materials. In addition, the portal provides analysis pipelines to assess the quality of user-submitted multiomic data. Furthermore, the reference datasets, performance metrics, and analysis pipelines will be improved through periodic review and integration of multiomic data submitted by the community. Effective integration of the evolving technologies via active interactions with the community will help ensure the reliability of multiomics-based biological discoveries. The Quartet Data Portal is accessible at https://chinese-quartet.org.
