Mingfu Shao scite author profile

Accurate assembly of transcripts through phase-preserving graph decomposition

We introduce Scallop, an accurate reference-based transcript assembler that improves reconstruction of multi-exon and lowly expressed transcripts. Scallop preserves long-range phasing paths extracted from reads, while producing a parsimonious set of transcripts and minimizing coverage deviation. On 10 human RNA-seq samples, Scallop produces 34.5% and 36.3% more correct multi-exon transcripts than StringTie and TransComb, respectively, and identifies 67.5% and 52.3% more lowly expressed transcripts. Scallop achieves higher sensitivity and precision than previous approaches over a wide range of coverage thresholds.


An Exact Algorithm to Compute the Double-Cut-and-Join Distance for Genomes with Duplicate Genes

Shao

Lin

Moret

2015

Journal of Computational Biology


Computing the edit distance between two genomes is a basic problem in the study of genome evolution. The double-cut-and-join (DCJ) model has formed the basis for most algorithmic research on rearrangements over the last few years. The edit distance under the DCJ model can be computed in linear time for genomes without duplicate genes, while the problem becomes NP-hard in the presence of duplicate genes. In this article, we propose an integer linear programming (ILP) formulation to compute the DCJ distance between two genomes with duplicate genes. We also provide an efficient preprocessing approach to simplify the ILP formulation while preserving optimality. Comparison on simulated genomes demonstrates that our method outperforms MSOAR in computing the edit distance, especially when the genomes contain long duplicated segments. We also apply our method to assign orthologous gene pairs among human, mouse, and rat genomes, where once again our method outperforms MSOAR.


Theory and A Heuristic for the Minimum Path Flow Decomposition Problem

Shao

Kingsford

2019

IEEE/ACM Trans. Comput. Biol. and Bioinf.


Motivated by multiple genome assembly problems and other applications, we study the following minimum path flow decomposition problem: given a directed acyclic graph $G = (V, E)$ with source $s$ and sink $t$ and a flow $f$, compute a set of $s$-$t$ paths $P$ and assign weight $w(p)$ for $p \in P$ such that $f(e) = \sum_{p \in P : e \in p} w(p)$ for all $e \in E$, and $|P|$ is minimized. We develop some fundamental theory for this problem, upon which we design an efficient heuristic. Specifically, we prove that the gap between the optimal number of paths and a known upper bound is determined by the nontrivial equations within the flow values. This result gives rise to the framework of our heuristic: to iteratively reduce the gap through identifying such equations. We also define an operation on certain independent substructures of the graph, and prove that this operation does not affect the optimality but can transform the graph into one with a desired property that facilitates reducing the gap. We apply and test our algorithm on both simulated random instances and perfect splice graph instances, and also compare it with the existing state-of-the-art algorithm for flow decomposition. The results illustrate that our algorithm can achieve very high accuracy on these instances, and also that our algorithm significantly improves on the previous algorithms. An implementation of our algorithm is freely available at https://github.com/Kingsford-Group/catfish.
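As a rough illustration of the problem statement in this abstract, the Python sketch below decomposes a flow on a directed acyclic graph into weighted source-to-sink paths. It is not the heuristic developed in the paper and is unrelated to the catfish implementation. It uses the simple greedy-width baseline, which repeatedly peels off the path with the largest bottleneck flow and does not guarantee a minimum number of paths. The function names, the dictionary-based edge representation, and the assumption of integer flows that satisfy conservation are illustrative choices, not taken from the source.

```python
from collections import defaultdict


def widest_path(flow, source, sink):
    """Return (path, width) for the source-to-sink path with the largest
    bottleneck over edges that still carry positive flow, or (None, 0).
    Assumes the graph is acyclic and source != sink."""
    adj = defaultdict(list)
    for (u, v), f in flow.items():
        if f > 0:
            adj[u].append(v)

    memo = {}

    def best(u):
        # best(u) = (largest bottleneck from u to sink, next vertex on that path)
        if u == sink:
            return float("inf"), None
        if u not in memo:
            choice = (0, None)
            for v in adj[u]:
                width = min(flow[(u, v)], best(v)[0])
                if width > choice[0]:
                    choice = (width, v)
            memo[u] = choice
        return memo[u]

    width, _ = best(source)
    if width == 0:
        return None, 0
    path = [source]
    while path[-1] != sink:
        path.append(best(path[-1])[1])
    return path, width


def greedy_width_decomposition(edges, source, sink):
    """Greedy-width baseline (hypothetical helper, not the paper's method).
    edges maps (u, v) -> integer flow; returns a list of (path, weight)."""
    flow = dict(edges)  # work on a copy so the input is left untouched
    paths = []
    while True:
        path, width = widest_path(flow, source, sink)
        if path is None:
            break
        for edge in zip(path, path[1:]):
            flow[edge] -= width  # remove the peeled path from the remaining flow
        paths.append((path, width))
    return paths
```

For example, with edges `{("s","a"): 5, ("a","b"): 2, ("a","c"): 3, ("b","t"): 2, ("c","t"): 3}`, the sketch returns two paths, `s-a-c-t` with weight 3 and `s-a-b-t` with weight 2, whose weights sum back to the original flow on every edge.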


SQUID: Transcriptomic Structural Variation Detection from RNA-seq

Shao

Kingsford

2017

Preprint


Transcripts are frequently modified by structural variations, which lead to a fused transcript of either multiple genes (known as a fusion gene) or a gene and a previously non-transcribed sequence. Detecting these modifications (called transcriptomic structural variations, or TSVs), especially in cancer tumor sequencing, is an important and challenging computational problem. We introduce SQUID, a novel algorithm to accurately predict both fusion-gene and non-fusion-gene TSVs from RNA-seq alignments. SQUID unifies both concordant and discordant read alignments into one model, and doubles the accuracy on simulation data compared to other approaches. With SQUID, we identified novel non-fusion-gene TSVs on TCGA samples.


An Exact Algorithm to Compute the DCJ Distance for Genomes with Duplicate Genes

Shao

Lin

Moret

2014


The SIB Swiss Institute of Bioinformatics’ resources: focus on curated databases

Bultet

Aguilar-Rodríguez

Ahrens

et al.

2015

Nucleic Acids Res


The SIB Swiss Institute of Bioinformatics (www.isb-sib.ch) provides world-class bioinformatics databases, software tools, services and training to the international life science community in academia and industry. These solutions allow life scientists to turn the exponentially growing amount of data into knowledge. Here, we provide an overview of SIB's resources and competence areas, with a strong focus on curated databases and SIB's most popular and widely used resources. In particular, SIB's Bioinformatics resource portal ExPASy features over 150 resources, including UniProtKB/Swiss-Prot, ENZYME, PROSITE, neXtProt, STRING, UniCarbKB, SugarBindDB, SwissRegulon, EPD, arrayMap, Bgee, SWISS-MODEL Repository, OMA, OrthoDB and other databases, which are briefly described in this article.


Approximating the edit distance for genomes with duplicate genes under DCJ, insertion and deletion

Shao

Lin

2012

BMC Bioinformatics


Computing the edit distance between two genomes under certain operations is a basic problem in the study of genome evolution. The double-cut-and-join (DCJ) model has formed the basis for most algorithmic research on rearrangements over the last few years. The edit distance under the DCJ model can be easily computed for genomes without duplicate genes. In this paper, we study the edit distance for genomes with duplicate genes under a model that includes DCJ operations, insertions and deletions. We prove that computing the edit distance is equivalent to finding the optimal cycle decomposition of the corresponding adjacency graph, and give an approximation algorithm with an approximation ratio of 1.5 + ε.


Context-aware seeds for read mapping

Xin

Shao

Kingsford

2020

Algorithms Mol Biol


Motivation: Most modern seed-and-extend NGS read mappers employ a seeding scheme that requires extracting t non-overlapping seeds in each read in order to find all valid mappings under an edit distance threshold of t. As t grows, this seeding scheme forces mappers to use more and shorter seeds, which increases the seed hits (seed frequencies) and therefore reduces the efficiency of mappers. Results: We propose a novel seeding framework, context-aware seeds (CAS). CAS guarantees finding all valid mappings but uses fewer (and longer) seeds, which reduces seed frequencies and increases the efficiency of mappers. CAS achieves this improvement by attaching a confidence radius to each seed in the reference. We prove that all valid mappings can be found if the sum of the confidence radii of the seeds is greater than t. CAS generalizes the existing pigeonhole-principle-based seeding scheme in which this confidence radius is implicitly always 1. Moreover, we design an efficient algorithm that constructs the confidence radius database in linear time. We evaluate CAS on the E. coli genome and show that CAS significantly reduces seed frequencies when compared with the state-of-the-art pigeonhole-principle-based seeding algorithm, the Optimal Seed Solver.
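The pigeonhole-principle-based seeding scheme that this abstract takes as its baseline can be sketched in a few lines of Python. This is only an illustration of the classical idea, not of CAS or of the Optimal Seed Solver. If a read is cut into k non-overlapping seeds and it differs from its true mapping location in fewer than k positions, at least one seed must match the reference exactly. To keep the sketch short it assumes substitutions only (Hamming distance rather than edit distance) and uses a naive string search instead of an index. All function names are hypothetical.

```python
def split_into_seeds(read, k):
    """Partition the read into k non-overlapping seeds of near-equal length.
    Returns a list of (offset_in_read, seed) pairs. Assumes len(read) >= k."""
    base, extra = divmod(len(read), k)
    seeds, start = [], 0
    for i in range(k):
        length = base + (1 if i < extra else 0)
        seeds.append((start, read[start:start + length]))
        start += length
    return seeds


def candidate_positions(read, reference, k):
    """Collect reference start positions implied by exact seed hits."""
    candidates = set()
    for offset, seed in split_into_seeds(read, k):
        hit = reference.find(seed)
        while hit != -1:
            start = hit - offset
            # keep only candidates where the whole read fits in the reference
            if start >= 0 and start + len(read) <= len(reference):
                candidates.add(start)
            hit = reference.find(seed, hit + 1)
    return candidates


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, max_mismatches):
    """Seed with max_mismatches + 1 seeds, then verify each candidate.
    Substitution-only simplification; indels are not handled here."""
    k = max_mismatches + 1
    return sorted(
        p for p in candidate_positions(read, reference, k)
        if hamming(read, reference[p:p + len(read)]) <= max_mismatches
    )
```

For instance, `map_read("AGGTTTGA", "ACGTACGTTTGACCA", 1)` returns `[4]`: the first seed `AGGT` contains the mismatch and has no hit, but the second seed `TTGA` matches exactly and recovers the mapping. The sketch also makes the cost described in the abstract visible, since allowing more mismatches means more and shorter seeds, each of which tends to hit more reference positions.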




