We present Nubeam (nucleotide be a matrix), a novel reference-free approach to analyzing short sequencing reads. Nubeam represents nucleotides by matrices, transforms a read into a product of matrices, and assigns numbers to reads based on that product.
Nubeam capitalizes on the non-commutative property of matrix multiplication, such that different reads are assigned different numbers, and similar reads similar numbers.
A sample, which is a collection of reads, becomes a collection of numbers that form an empirical distribution. We demonstrate that the genetic difference between samples can be quantified by the distance between empirical distributions. Nubeam can account for GC bias and nucleotide quality, and is computationally efficient; the K-mer method is a special case of Nubeam, but without those benefits. As a reference-free approach, Nubeam avoids reference bias and mapping bias and can work with organisms without reference genomes. Thus, Nubeam is well suited to analyzing datasets from metagenomic whole-genome sequencing, where the number of unmapped reads is substantial. When applied to human microbiome sequencing, Nubeam recapitulated findings made by mapping-based methods, and shed light on the contributions of unmapped reads. In particular, body habitats dominate the clustering of unmapped pseudo-samples; there are more outliers among the skin whole samples than among the skin mapped pseudo-samples; and analysis of unmapped reads suggested that the sequencing depth is far from sufficient for urogenital samples.

Introduction

When identifying variants is not a must and the primary interest is to quantify genetic differences between samples (Ravel et al., 2011; Nayfach and Pollard, 2016), it can be beneficial to analyze short sequencing reads without reference genomes. First, it avoids reference bias and mapping bias. Both biases can be alleviated but never overcome, because they are intrinsic to the mapping-based approach. Second, it avoids the uncertainty related to variant calling, particularly when the sequencing coverage is low. Third, it becomes possible to analyze organisms that have no reference genomes, or whose reference genomes are incomplete or of low quality.
The prominent reference-free approach is the K-mer method (Jiang et al., 2012; Subramanian and Schwartz, 2015; Lu et al., 2017). Simply put, the K-mer method calculates the frequency of each K-mer (K consecutive nucleotides) present in the reads of a sample, and infers differences between samples by comparing K-mer frequencies. In practice, however, the K-mer method has several difficulties. First, it implicitly assumes that reads are error-free, and it is difficult, if not impossible, to account for nucleotide quality (Comin et al., 2015; Comin and Schimd, 2016). Second, choosing K is delicate: a K that is too small or too large makes the K-mer frequencies less informative. Third, some pairs of K-mers only differ by one nucleotide and other pairs of K-mers differ by K...