Human untargeted metabolomics studies annotate only ~10% of molecular features. We introduce reference-data-driven analysis to match metabolomics tandem mass spectrometry (MS/MS) data against metadata-annotated source data as a pseudo-MS/MS reference library. Applying this approach to food source data, we show that it increases MS/MS spectral usage 5.1-fold over conventional structural MS/MS library matches and allows empirical assessment of dietary patterns from untargeted data.

Interpreting complex sequence data from metagenomic (see Box 1 for definitions of terms) or metatranscriptomic experiments requires both databases of curated genes and reference data, such as whole genomes or other sequence data with carefully curated metadata (developmental stage, tissue location, phenotype, etc.) [1][2][3][4]. Such reference-data-driven (RDD) analysis increases understanding of complex communities by using matches between genes or transcripts of known and unknown origin. The RDD strategy is essential for the successful analysis of most metatranscriptomics or metagenomics data. By analogy, liquid chromatography-tandem mass spectrometry (LC-MS/MS)-based untargeted metabolomics data are interpreted by searching structural MS/MS libraries. However, reference data with curated, structured, controlled-vocabulary metadata are not yet leveraged to improve the insights obtainable from untargeted MS/MS-based metabolomics.

RDD analysis uses not only annotated MS/MS spectra but also all unannotated spectra. The gas chromatography-mass spectrometry (GC-MS) BinBase resource has made a step in the direction of RDD. With BinBase, one can annotate whether a matching spectrum has been observed in a non-public GC-MS dataset. However, the metadata are not well controlled, and the resource lacks the ability to add contextualized metadata [5,6]. In addition, as we have previously demonstrated, the source can be determined from structural annotations by literature mining [7].
However, owing to the above-mentioned limitations and/or the inability to link related spectra in the case of metabolism, these strategies for annotating unknowns cannot be used to systematically interpret source information at the dataset level. We therefore introduce the RDD approach for metabolomics (Fig. 1), followed by a use case demonstrating empirical food readouts from untargeted human data (Fig. 2).

Untargeted MS/MS-based metabolomics experiments have involved searching MS/MS structural libraries since the late 1970s [8,9] or, more recently, investigating the distribution of an MS/MS spectrum across public untargeted data [10]. Instead of only leveraging a single MS/MS spectrum to obtain an annotation, RDD metabolomics uses all MS/MS spectra from untargeted metabolomics files, which con-