Plants are valuable resources for a variety of products in modern societies. Plant species identification is an integral part of both research on plants and their practical application. In parallel with high-throughput sequencing technology, high-throughput screening of species is in high demand. Highly accurate and efficient DNA-based marker identification is essential for the effective analysis of plant species, and of the biological constituents of plant mixtures as well. Therefore, it is of general interest and significance to generate a comprehensive and accurate DNA-based marker sequence resource, and to build efficient sequence search engines, for the accurate and fast identification of plant species.
In this work, we first established a high-quality ITS2 sequence database of plant species containing more than 150,000 entries, through the systematic collection and manual collation of published ITS2 sequencing data of plant species, data quality control, and representative sequence refinement based on a clustering method. Second, we constructed an accurate and efficient plant species identification system based on ITS2 sequences, which combines sequence search algorithms including BLAST and Kraken. Deployed as a high-performance and frequently updated web service, it is expected to serve a wide range of researchers working on the taxonomic classification of plant species, as well as on deciphering mixed plant systems, including herbal materials in traditional Chinese medicine (TCM) preparations.
The Holmes-ITS2 web service is freely accessible at: http://its2.tcm.microbioinformatics.org/. The input to this web service can be multiple sequences in a single FASTA-format file, which are searched against the ITS2 biomarker sequences already annotated in the database. This sequence-based search relies on two engines: BLAST and the k-mer-based Kraken. Alternatively, users can search directly by species name for the corresponding ITS2 biomarker sequences. The web service has been tested by more than 50 experts from China, Denmark and the US, and the average running time for a search ranges from 3 to 30 seconds for a batch query of up to 100 sequences.
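As an illustration of preparing such a batch query, the following minimal Python sketch parses multi-sequence FASTA text and splits it into batches of at most 100 sequences. The function names (`parse_fasta`, `batches`, `to_fasta`) are hypothetical and are not part of the Holmes-ITS2 service. The batch size of 100 is taken from the timing figure quoted above and is assumed here as a convenient chunk size, not as a documented limit of the service.

```python
from typing import Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse multi-sequence FASTA text into (header, sequence) pairs."""
    records: List[Record] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are joined and upper-cased.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def batches(records: List[Record], size: int = 100) -> Iterator[List[Record]]:
    """Yield consecutive batches of at most `size` records (assumed size)."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def to_fasta(records: List[Record], width: int = 60) -> str:
    """Serialize records back to FASTA text, wrapping sequences at `width`."""
    lines: List[str] = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

Each batch produced by `batches` can be written out with `to_fasta` and submitted as one query; how the submission itself is made (web form or otherwise) is not specified here.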