Background
Nanopore sequencing is a powerful single-molecule DNA sequencing technology that provides high throughput and long sequence reads. Nevertheless, its relatively high native error rate limits the direct detection of point mutations in individual reads of amplicon libraries, as these mutations are difficult to distinguish from the sequencing noise.

Results
We propose a computational method to reduce noise in nanopore detection of point variations. Our approach uses the fact that all reads are expected to be very similar to a wild type sequence, for which we experimentally characterize the position-specific systematic sequencing error pattern. We then use this information to reweight, in individual reads from the variant library, the confidence given to nucleotide calls that do not match the wild type. We tested this method on two sets of known variants of Klen Taq, where the true mutation rate was 3.3 mutations per kb, well below the sequencing noise. We observed that the actual mutations became more distinguishable from sequencing noise after correction. This approach can be used, for example, to help the clustering of variants, or to decrease the number of reads necessary to call a consensus.
Conclusions
The computational method is simple to implement and requires only a few thousand reads of the wild type sequence of interest, which can easily be obtained by multiplexing in a single minION run. The approach does not require any modification of the experimental protocol for sequencing and can simply be implemented downstream of standard base calling.

Keywords
minION, nanopore sequencing, next generation sequencing, amplicons, SNP detection, logistic regression.

In this paper, we propose a computational protocol to improve variant detection in individual reads from libraries for which a reference gene is known, using standard 1D protocol minION sequencing. We base our method on two observations made during the sequencing of many (identical) copies of the parent sequence. First, the confidence or quality scores (Q score) assigned by the base calling process to each nucleotide are usually low when a wrong nucleotide is assigned (Suppl. figure S2), as expected. Second, the errors are not homogeneously distributed: they are more frequent at some positions of the DNA (Suppl. figure S1).
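The two observations can be made concrete with a short Python sketch. It assumes that the Q scores are Phred-scaled, as is standard in FASTQ output, so that a score Q corresponds to a nominal error probability of 10^(-Q/10). It also assumes, purely for illustration, that the reference reads have already been aligned to the wild type and contain no insertions or deletions. The function names are hypothetical and are not part of any base calling software.

```python
def phred_to_error_prob(q):
    """Nominal error probability implied by a Phred-scaled quality score."""
    return 10.0 ** (-q / 10.0)


def per_position_error_rate(reads, reference):
    """Fraction of reads whose called base differs from the wild type at each position.

    Assumption of this sketch: every read is already aligned to the reference
    and has exactly the same length (substitutions only, no indels).
    """
    if not reads:
        raise ValueError("at least one read is required")
    mismatches = [0] * len(reference)
    for read in reads:
        if len(read) != len(reference):
            raise ValueError("reads must be aligned to the reference length")
        for i, (called, expected) in enumerate(zip(read, reference)):
            if called != expected:
                mismatches[i] += 1
    return [count / len(reads) for count in mismatches]
```

A non-uniform profile returned by `per_position_error_rate` on wild type reads would correspond to the second observation, while comparing `phred_to_error_prob` of the reported scores with the observed error frequency would indicate how informative the Q scores are at each position.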
These observations suggest that it should be possible to reduce the non-random part of the sequencing errors using the information contained in the Q score. The method we propose has two steps. The first one uses the reference reads to build a statistical model of the error pattern; here we used a position- and nucleotide-specific logistic regression.
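As an illustration of this first step, the following Python sketch fits one logistic regression per position on wild type reads, modelling the probability that a call is erroneous as a function of its Q score. This is a minimal sketch under stated assumptions, not the implementation used in this work. The Q score is taken as the only predictor, and observations are grouped by position only, because how the nucleotide-specific part enters the model is not detailed in this section. The input format (tuples of position, Q score, and error flag), the small ridge penalty, and all function names are hypothetical.

```python
import math
from collections import defaultdict


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(q_scores, is_error, n_iter=25, ridge=1e-3):
    """Fit P(error | Q) = sigmoid(a + b * Q) by Newton's method.

    The ridge term is an assumption of this sketch: it keeps the Hessian
    invertible and the coefficients finite for small groups.
    """
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        # Gradient and Hessian of the penalized negative log-likelihood.
        g_a, g_b = ridge * a, ridge * b
        h_aa, h_ab, h_bb = ridge, 0.0, ridge
        for q, y in zip(q_scores, is_error):
            p = sigmoid(a + b * q)
            w = p * (1.0 - p)
            g_a += p - y
            g_b += (p - y) * q
            h_aa += w
            h_ab += w * q
            h_bb += w * q * q
        # Newton update: solve the 2x2 system H * step = g explicitly.
        det = h_aa * h_bb - h_ab * h_ab
        a -= (h_bb * g_a - h_ab * g_b) / det
        b -= (h_aa * g_b - h_ab * g_a) / det
    return a, b


def fit_error_models(observations):
    """Fit one model per position.

    observations: iterable of (position, q_score, is_error) tuples taken from
    wild type reads, where is_error is 1 if the call differs from the wild type.
    Returns a dict mapping position -> (intercept, slope).
    """
    grouped = defaultdict(lambda: ([], []))
    for position, q, err in observations:
        grouped[position][0].append(q)
        grouped[position][1].append(err)
    return {pos: fit_logistic(qs, errs) for pos, (qs, errs) in grouped.items()}


def error_probability(models, position, q):
    """Modelled probability that a call with score q at this position is an error."""
    a, b = models[position]
    return sigmoid(a + b * q)
```

Under these assumptions, a negative slope at a given position means that low Q scores are associated with erroneous calls there, and the intercept absorbs the position-specific error frequency.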
In the second step, this information is used to re-analyze minION base calls for the variant library and to update the confidence value of each nucleotide read in this dataset. We tested our method us...