Chen Sun scite author profile

Accurate typing of short tandem repeats from genome-wide sequencing data and its applications

Short tandem repeats (STRs) are implicated in dozens of human genetic diseases and contribute significantly to genome variation and instability. Yet profiling STRs from short-read sequencing data is challenging because of their high sequencing error rates. Here, we developed STR-FM (short tandem repeat profiling using flank-based mapping), a computational pipeline that can detect the full spectrum of STR alleles from short-read data, can adapt to emerging read-mapping algorithms, and can be applied to heterogeneous genetic samples (e.g., tumors, viruses, and genomes of organelles). We used STR-FM to study STR error rates and patterns in publicly available human and in-house generated ultradeep plasmid sequencing data sets. We discovered that STRs sequenced with a PCR-free protocol have up to ninefold fewer errors than those sequenced with a PCR-containing protocol. We constructed an error correction model for genotyping STRs that can distinguish heterozygous alleles containing STRs with consecutive repeat numbers. Applying our model and pipeline to Illumina sequencing data with 100-bp reads, we could confidently genotype several disease-related long trinucleotide STRs. Utilizing this pipeline, for the first time we determined the genome-wide STR germline mutation rate from a deeply sequenced human pedigree. Additionally, we built a tool that recommends minimal sequencing depth for accurate STR genotyping, depending on repeat length and sequencing read length. The required read depth increases with STR length and is lower for a PCR-free protocol. This suite of tools addresses the pressing challenges surrounding STR genotyping, and thus is of wide interest to researchers investigating disease-related STRs and STR evolution.
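The idea of flank-based profiling can be illustrated with a minimal Python sketch. It is not the STR-FM implementation: the regular expression, the motif length limit of 1 to 4 bp, the minimum of four copies, and the names `find_strs` and `min_flank` are all assumptions made for illustration. The sketch only locates an STR inside a read and returns the flanking sequences that a read mapper would then align to the reference.

```python
import re

# Assumed definition of an STR for this sketch: a 1-4 bp motif repeated at
# least four times in a row. The lazy quantifier prefers the shortest motif.
STR_PATTERN = re.compile(r"([ACGT]{1,4}?)\1{3,}")

def find_strs(read, min_flank=5):
    """Return STRs found in a read together with their flanking sequences.

    Only STRs with at least `min_flank` bases on both sides are kept,
    because the flanks (not the repeat itself) would be used for mapping.
    """
    results = []
    for match in STR_PATTERN.finditer(read):
        motif = match.group(1)
        left, right = read[:match.start()], read[match.end():]
        if len(left) >= min_flank and len(right) >= min_flank:
            results.append({
                "motif": motif,
                "copies": len(match.group(0)) // len(motif),
                "left_flank": left,
                "right_flank": right,
            })
    return results

# Hypothetical read: ten flanking bases, six copies of CAG, ten flanking bases.
read = "GATTACAGGC" + "CAG" * 6 + "TTGACCATGA"
strs = find_strs(read)
```

In this toy read the repeat length is read directly from the sequence between the two flanks, which is why an STR must be fully contained in a read to be typed this way.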

The complete chloroplast genome provides insight into the evolution and polymorphism of Panax ginseng

Zhao et al. 2015

Panax ginseng C.A. Meyer (P. ginseng) is an important medicinal plant and is often used in traditional Chinese medicine. With next generation sequencing (NGS) technology, we determined the complete chloroplast genome sequences for four Chinese P. ginseng strains: Damaya (DMY), Ermaya (EMY), Gaolishen (GLS), and Yeshanshen (YSS). The total chloroplast genome sequence length for DMY, EMY, and GLS was 156,354 bp, while that for YSS was 156,355 bp. Comparative genomic analysis of the chloroplast genome sequences indicates that gene content, GC content, and gene order in DMY are quite similar to those of related species, and that the nucleotide sequence diversity of the inverted repeat region (IR) is lower than that of its counterparts, the large single copy region (LSC) and the small single copy region (SSC). A comparison among these four P. ginseng strains revealed that the chloroplast genome sequences of DMY, EMY, and GLS were identical and that YSS had a 1-bp insertion at base 5472. To further study the heterogeneity in the chloroplast genome during domestication, high-resolution reads were mapped to the genome sequences to investigate the differences at the minor allele level; 208 minor allele sites with minor allele frequencies (MAF) of ≥0.05 were identified. The numbers of polymorphic sites per kb of chloroplast genome sequence for DMY, EMY, GLS, and YSS were 0.74, 0.59, 0.97, and 1.23, respectively. All the minor allele sites were located in the LSC and IR regions, and the four strains showed the same variation types (base substitution or indel) at all identified polymorphic sites. Comparison of heterogeneity in the chloroplast genome sequences showed that the minor allele sites on the chloroplast genome were undergoing purifying selection to adapt to a changing environment during the domestication process. A study of the P. ginseng chloroplast genome with particular focus on minor allele sites would aid in investigating the dynamics of chloroplast genomes and in typing different P. ginseng strains.
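The MAF ≥ 0.05 criterion mentioned above can be sketched in Python as follows. This is an illustrative sketch, not the authors' pipeline: the pileup representation (a dictionary from position to the string of bases observed in mapped reads) and the function names are assumptions, and indels are ignored.

```python
from collections import Counter

def minor_allele_frequency(bases):
    """Return (minor_allele, frequency) for one pileup column.

    The second most common base is treated as the minor allele.
    Returns (None, 0.0) when fewer than two distinct bases are observed.
    """
    counts = Counter(b for b in bases.upper() if b in "ACGT")
    if len(counts) < 2:
        return None, 0.0
    total = sum(counts.values())
    minor, minor_count = counts.most_common(2)[1]
    return minor, minor_count / total

def minor_allele_sites(pileup, threshold=0.05):
    """Keep positions whose minor allele frequency reaches the threshold."""
    sites = {}
    for pos, bases in pileup.items():
        allele, freq = minor_allele_frequency(bases)
        if allele is not None and freq >= threshold:
            sites[pos] = (allele, freq)
    return sites

# Hypothetical pileup columns at three positions.
pileup = {
    100: "A" * 95 + "G" * 5,   # minor allele G at frequency 0.05
    101: "C" * 99 + "T" * 1,   # minor allele T at frequency 0.01
    102: "T" * 50,             # no minor allele
}
sites = minor_allele_sites(pileup)
```

Only position 100 passes the threshold in this example. Dividing the number of retained sites by the genome length in kb would give a per-kb density comparable in form to the figures quoted in the abstract.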

Dissecting pattern unlock: The effect of pattern strength meter on pattern selection

Sun, Wang, Zheng 2014

Journal of Information Security and Applications

Quantitative Security Risk Assessment of Android Permissions and Applications

Wang, Zheng, Sun et al. 2013

The rapid growth of the Android platform in recent years has attracted the attention of malware developers. However, the permission-based model used in the Android system to prevent the spread of malware has been shown to be ineffective. In this paper, we propose DroidRisk, a framework for quantitative security risk assessment of both Android permissions and applications (apps) based on permission request patterns from benign apps and malware, which aims to improve the efficiency of the Android permission system. Two data sets with 27,274 benign apps from Google Play and 1,260 Android malware samples were used to evaluate the effectiveness of DroidRisk. The results demonstrate that DroidRisk can generate a more reliable risk signal for warning of potential malicious activities than existing methods. We show that DroidRisk can also be used to alleviate the overprivilege problem and improve user attention to the risks of Android permissions and apps.

AllSome Sequence Bloom Trees

Sun, Harris, Chikhi et al. 2016

Preprint

The ubiquity of next generation sequencing has transformed the size and nature of many databases, pushing the boundaries of current indexing and searching methods. One particular example is a database of 2,652 human RNA-seq experiments uploaded to the Sequence Read Archive (SRA). Recently, Solomon and Kingsford proposed the Sequence Bloom Tree data structure and demonstrated how it can be used to accurately identify SRA samples in which a transcript of interest is potentially expressed. In this paper, we propose an improvement called the AllSome Sequence Bloom Tree. Results show that our new data structure significantly improves performance, reducing the tree construction time by 52.7% and query time by 39-85%. Notably, it can process a batch of 198,074 queries in under 8 hours (compared to around two days previously) and a whole set of k-mers from a sequencing experiment (about 27 million k-mers) in under 11 minutes.

AllSome Sequence Bloom Trees

Sun, Harris, Chikhi et al. 2017

The ubiquity of next generation sequencing has transformed the size and nature of many databases, pushing the boundaries of current indexing and searching methods. One particular example is a database of 2,652 human RNA-seq experiments uploaded to the Sequence Read Archive (SRA). Recently, Solomon and Kingsford proposed the Sequence Bloom Tree data structure and demonstrated how it can be used to accurately identify SRA samples in which a transcript of interest is potentially expressed. In this paper, we propose an improvement called the AllSome Sequence Bloom Tree. Results show that our new data structure significantly improves performance, reducing the tree construction time by 52.7% and query time by 39-85%, at the cost of up to 3x memory consumption during queries. Notably, it can process a batch of 198,074 queries in under 8 hours (compared to around two days previously) and a whole set of k-mers from a sequencing experiment (about 27 million k-mers) in under 11 minutes.
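For readers unfamiliar with the underlying data structure, the following Python sketch shows a plain Sequence Bloom Tree query in its simplest form. It does not implement the AllSome variant described in the abstract, and every detail here is an assumption chosen for illustration: the filter size, the SHA-256 based hashing, the pairwise tree construction, the value of k, and the threshold name `theta`. Each leaf holds a Bloom filter of one experiment's k-mers, each internal node holds the union of its children, and a query descends only into subtrees whose filter contains a sufficient fraction of the query k-mers.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional

class BloomFilter:
    """Fixed-size Bloom filter; a Python int serves as the bit vector."""
    def __init__(self, size=1 << 16, num_hashes=3, bits=0):
        self.size, self.num_hashes, self.bits = size, num_hashes, bits

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other):
        return BloomFilter(self.size, self.num_hashes, self.bits | other.bits)

@dataclass
class Node:
    bloom: BloomFilter
    name: Optional[str] = None            # set only on leaves
    children: list = field(default_factory=list)

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_leaf(name, seq, k):
    bloom = BloomFilter()
    for km in kmers(seq, k):
        bloom.add(km)
    return Node(bloom, name=name)

def build_tree(leaves):
    """Merge nodes pairwise until one root remains (assumes a non-empty list)."""
    level = list(leaves)
    while len(level) > 1:
        merged = []
        for i in range(0, len(level), 2):
            group = level[i:i + 2]
            if len(group) == 1:
                merged.append(group[0])
            else:
                merged.append(Node(group[0].bloom.union(group[1].bloom), children=group))
        level = merged
    return level[0]

def query(node, query_kmers, theta=0.8):
    """Return names of leaves containing at least a theta fraction of the k-mers."""
    hits = sum(1 for km in query_kmers if km in node.bloom)
    if hits < theta * len(query_kmers):
        return []   # prune: a child can never have more hits than its parent
    if not node.children:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(query(child, query_kmers, theta))
    return found

# Hypothetical experiments; the query is a substring of exp_a.
K = 5
experiments = {
    "exp_a": "ACGTACGTTAGCCGATAGGCT",
    "exp_b": "TTTTGGGGCCCCAAAATTTTGG",
    "exp_c": "GATTACAGATTACAGATTACA",
}
root = build_tree([build_leaf(n, s, K) for n, s in experiments.items()])
found = query(root, kmers("TAGCCGATAGG", K), theta=1.0)
```

Because Bloom filters have no false negatives, `exp_a` is always reported for this query; other experiments could in principle appear through false positives, which is why results are described as samples where the transcript is potentially expressed.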

VarMatch: robust matching of small variant datasets using flexible scoring schemes

Sun, Medvedev 2016

Toward fast and accurate SNP genotyping from whole genome sequencing data for bedside diagnostics

Sun, Medvedev 2017

Preprint

Motivation: Genotyping a set of variants from a database is an important step for identifying known genetic traits and disease-related variants within an individual. The growing size of variant databases as well as the high depth of sequencing data pose an efficiency challenge. In clinical applications, where time is crucial, alignment-based methods are often not fast enough. To fill the gap, Shajii et al. (2016) proposed LAVA, an alignment-free genotyping method which is able to genotype SNPs more quickly; however, there remains considerable room for improvement in running time and accuracy.

Results: We present the VarGeno method for SNP genotyping from Illumina whole genome sequencing data. VarGeno builds upon LAVA by improving the speed of k-mer querying as well as the accuracy of the genotyping strategy. We evaluate VarGeno on several read datasets using different genotyping SNP lists. VarGeno performs 7-13 times faster than LAVA with similar memory usage, while improving accuracy.

Availability: VarGeno is freely available at: https://github.com/medvedevgroup/vargeno.
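The general idea of alignment-free, k-mer based SNP genotyping can be sketched in a few lines of Python. This is a toy illustration and not the VarGeno or LAVA algorithm: the function names, the value of k, and the allele-fraction thresholds in `het_band` are assumptions, and reverse complements and sequencing errors are ignored. For each known SNP, the sketch collects the k-mers that cover the site with the reference allele and with the alternate allele, counts how often each set appears in the reads, and calls a genotype from the fraction of alternate support.

```python
from collections import Counter

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def snp_kmers(reference, pos, allele, k):
    """All k-mers covering 0-based position `pos` with `allele` substituted."""
    seq = reference[:pos] + allele + reference[pos + 1:]
    start = max(0, pos - k + 1)
    end = min(len(seq), pos + k)
    return set(kmers(seq[start:end], k))

def genotype(reference, pos, ref_allele, alt_allele, reads, k=7, het_band=(0.2, 0.8)):
    """Call 0/0, 0/1, 1/1, or ./. from k-mer support, without aligning reads."""
    ref_set = snp_kmers(reference, pos, ref_allele, k)
    alt_set = snp_kmers(reference, pos, alt_allele, k)
    counts = Counter(km for read in reads for km in kmers(read, k))
    ref_support = sum(counts[km] for km in ref_set - alt_set)
    alt_support = sum(counts[km] for km in alt_set - ref_set)
    total = ref_support + alt_support
    if total == 0:
        return "./."                      # no informative k-mers observed
    alt_fraction = alt_support / total
    low, high = het_band
    if alt_fraction < low:
        return "0/0"
    if alt_fraction > high:
        return "1/1"
    return "0/1"

# Hypothetical reference with a known A>G SNP at 0-based position 12.
reference = "ACGTTGCATGCCAGTACGGATCCA"
alt_reference = reference[:12] + "G" + reference[13:]
ref_read, alt_read = reference[4:20], alt_reference[4:20]

hom_ref = genotype(reference, 12, "A", "G", [ref_read, ref_read])
het = genotype(reference, 12, "A", "G", [ref_read, alt_read])
hom_alt = genotype(reference, 12, "A", "G", [alt_read, alt_read])
```

A practical tool would precompute the SNP k-mer sets into an index so that each read k-mer is looked up once, which is where query speed becomes the dominant cost.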
