With the growth of metabolomics research, more and more studies are conducted on large numbers of samples. Due to technical limitations of the liquid chromatography-mass spectrometry (LC/MS) platform, samples often need to be processed in multiple batches, and across batches we often observe differences in data characteristics. In this work, we focus specifically on data generated in multiple batches on the same LC/MS instrument. Traditional preprocessing methods treat all samples as a single group. This practice can result in errors in peak alignment that cannot be corrected by post hoc application of batch effect correction methods. We developed a new approach that addresses the batch effect issue in the preprocessing stage, resulting in better peak detection, alignment, and quantification. It can be combined with downstream batch effect correction methods to further correct for between-batch intensity differences. The method is implemented in the existing workflow of the apLCMS platform. Analyzing multi-batch data, generated both from standardized quality control (QC) plasma samples and from real biological studies, the new method produced feature tables with better consistency, as well as better downstream analysis results. The method can be a useful addition to the tools available for large studies involving multiple batches. It is available as part of the apLCMS package; the download link and instructions are at https://mypage.cuhk.edu.cn/academics/yutianwei/apLCMS/.

Metabolomics using liquid chromatography-mass spectrometry (LC/MS) is widely used in identifying disease biomarkers, finding drug targets, and unravelling complex biological networks. A high-resolution LC/MS profile from a complex biological sample contains thousands of features, and different LC/MS platforms yield profiles of different characteristics.
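The peak alignment step mentioned above matches features across samples by their m/z and retention time. As a minimal illustration of the tolerance-based matching idea (this is a hypothetical sketch, not the apLCMS implementation; the function name and the tolerance values are arbitrary assumptions), consider:

```python
# Hypothetical sketch of tolerance-based feature matching between two batches.
# Each feature is a (m/z, retention time in seconds) pair. Tolerances are
# illustrative defaults, not apLCMS parameters.

def match_features(batch_a, batch_b, mz_ppm=10.0, rt_tol=30.0):
    """Return index pairs (i, j) where feature i of batch_a and feature j
    of batch_b agree within mz_ppm parts-per-million in m/z and within
    rt_tol seconds in retention time."""
    matches = []
    for i, (mz_a, rt_a) in enumerate(batch_a):
        for j, (mz_b, rt_b) in enumerate(batch_b):
            ppm_diff = abs(mz_a - mz_b) / mz_a * 1e6
            if ppm_diff < mz_ppm and abs(rt_a - rt_b) < rt_tol:
                matches.append((i, j))
    return matches

batch1 = [(180.0634, 120.5), (255.2330, 310.2)]
batch2 = [(180.0639, 125.1), (300.1000, 200.0)]
print(match_features(batch1, batch2))  # → [(0, 0)]
```

If between-batch drift in m/z or retention time exceeds these tolerances, matches are missed unless the thresholds are loosened, which in turn raises the risk of false matches; this trade-off is what motivates handling batch structure during preprocessing.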
There are a number of computational pipelines that conduct the necessary steps to preprocess LC/MS data, including peak detection, peak quantification, retention time (RT) correction, feature alignment, and weak signal recovery 1-13. Some methods provide utilities to group features that are potentially derived from the same metabolite 14-17. Other data servers and packages are available to annotate features to known metabolites based on m/z and RT information 18-21. When the sample size is large, it is often necessary for the samples to be processed in batches. Across batches, even when the data are generated on the same machine, we often observe different data characteristics. With traditional preprocessing approaches, we either treat all samples as a single batch, or preprocess each batch individually and then seek to merge the feature tables. As we discuss below, both approaches have issues. If we treat all samples as a single batch, between-batch changes in data characteristics are treated as random noise. More lenient thresholds then have to be used in feature alignment and weak signal recovery, in ...