orfipy: a fast and flexible tool for extracting ORFs

Singh, Urminder; Wurtele, Eve Syrkin

doi:10.1101/2020.10.20.348052

Cited by 4 publications

(10 citation statements)

References 13 publications


“…Long ORFs are often used, along with other evidence, to initially identify candidate protein-coding regions or functional RNA-coding regions in a given DNA sequence, but the presence of an ORF does not necessarily mean that the region is always translated [24]. As BLAST and BLAT, the web-based ORF Finder (https://www.ncbi.nlm.nih.gov/orffinder/), ORF Predictor (http://bioinformatics.ysu.edu/tools/OrfPredictor.html) and command-line tools (ORF Investigator [25] and orfipy [13]) offer a range of ORF searches, but its usage can be challenging for biologists due to lack of computer programming literacy and limited query sequence length. To maximise the flexibility, the easyfm ORF provides a fast and efficient approach for all possible translation and extraction of ORFs from nucleotide sequences (FASTA format of nucleotide and protein output from six-frame translation) (Fig 4).…”

Section: Results (mentioning)

confidence: 99%
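
The six-frame translation mentioned in the statement above can be illustrated with a minimal sketch. It is plain Python using the standard genetic code and is not the implementation used by easyfm or orfipy; the function names (`reverse_complement`, `translate`, `six_frame_translation`) and the frame labels (`+1` to `-3`) are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Translate three forward and three reverse-complement reading frames."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


frames = six_frame_translation("ATGAAATAG")
print(frames["+1"])  # MK*
```

Each stop codon appears as `*` in the output, so candidate ORFs are the stretches between stops in any of the six frames.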

“…Conveniently, FASTQ files can also be converted to FASTA files, the most commonly used file format for NGS data that enables direct sequencing of target genes. Many available tools (easySEARCH [10]; BlasterJS [11]; Sequenceserver [12]; orfipy [13]; Samtools and BCFtools [14] including easyfm) have not surprisingly focused on manipulating (analyse, collect, organise, interpret, and present data in meaningful ways) the FASTA file format to generate biologically relevant insights.…”

Section: Introduction (mentioning)

confidence: 99%


easyfm: An easy software suite for file manipulation of Next Generation Sequencing data on desktops

Jung, Jeon, Ortiz‐Barrientos (2021). Preprint.


Storing and manipulating Next Generation Sequencing (NGS) file formats is an essential but difficult task in biological data analysis. The easyfm (easy file manipulation) toolkit (https://github.com/TaekAndBrendan/easyfm) makes manipulating commonly used NGS files more accessible to biologists. It enables them to perform end-to-end reproducible data analyses using a free standalone desktop application (available on Windows, Mac and Linux). Unlike existing tools (e.g. Galaxy), the Graphical User Interface (GUI)-based easyfm is not dependent on any high-performance computing (HPC) system and can be operated without an internet connection. This specific benefit allows easyfm to seamlessly integrate visual and interactive representations of NGS files, supporting a wider scope of bioinformatics applications in the life sciences.


“…Protein sequence used as evidence in MAKER 30 were generated in one of two ways: 1) For Arabidopsis, yeast, and rice, RNA-Seq reads were assembled using Trinity (v2.6.6) 65 , followed by open reading frame (ORF) prediction and translation using orfipy 29 or TransDecoder (v3.0.1) 65 . 2) For Arabidopsis only, data was downloaded from Phytozome 66 as predicted protein sequences for nine species: Arabidopsis thaliana, (Glycine max, Populus trichocarpa, Arabidopsis lyrata, Conradina grandiflora, Setaria italica, Oryza sativa, Physcomitrella patens, Chlamydomonas reinhardtii, and Brassica rapa).…”

Section: RNA-Seq Genome and Protein Input Data (mentioning)

confidence: 99%

“…The BAM file generated by mapping reads to the Araport11-annotated indexed genome using HiSat2 (v2.1.0) 67 was provided as training for the assemblers. The resultant assembled transcripts were used to predict ORFs using Transdecoder 65 or orfipy 29 . We selected those complete ORFs over 150 nt.…”

Section: Evidence-based Annotation Of Genes By Direct Inference (mentioning)

confidence: 99%
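
The filtering step described in the statement above (keeping complete ORFs over 150 nt) could look roughly like the following sketch. It does not use orfipy or TransDecoder; the function name `complete_orfs`, the parameter `min_nt`, and the definition of "complete" (an ATG start codon through an in-frame stop codon, with length counted including the stop) are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}  # standard stop codons (assumed)


def complete_orfs(seq, min_nt=150, start="ATG"):
    """Return complete ORFs (start codon through stop codon) longer than min_nt.

    Each hit is (strand, frame, begin, end, nucleotides); begin and end are
    0-based, end-exclusive coordinates on the strand that was scanned.
    """
    seq = seq.upper()
    strands = (("+", seq), ("-", seq.translate(COMPLEMENT)[::-1]))
    orfs = []
    for strand, s in strands:
        for frame in range(3):
            orf_start = None  # position of the first start codon since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None and codon == start:
                    orf_start = i
                elif codon in STOPS:
                    if orf_start is not None and i + 3 - orf_start > min_nt:
                        orfs.append((strand, frame + 1, orf_start, i + 3, s[orf_start:i + 3]))
                    orf_start = None
    return orfs


# Hypothetical transcript: ATG, 50 lysine codons, then a stop (156 nt in total).
transcript = "ATG" + "AAA" * 50 + "TAA"
for strand, frame, begin, end, _ in complete_orfs(transcript):
    print(strand, frame, begin, end)  # + 1 0 156
```

Using the first in-frame ATG after a stop yields the longest complete ORF per stop codon; ORFs that run off the end of a transcript without a stop are treated as incomplete and skipped.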

“…This approach is key for predicting non-coding RNAs 28 and young genes 3. However, it has been less widely adopted to annotate protein coding genes, in part because of the challenge of distinguishing "noise" from true genetic signal 26,27 . One approach to reduce "noise" and other false positive predictions is to combine direct inference of genes with sequence similarity 24,26 ; this approach excludes orphan genes 29 .…”

mentioning

confidence: 99%


Foster thy young: Enhanced prediction of orphan genes in assembled genomes

Singh, Bhandary, et al. (2019). Preprint.

Self Cite


The evolutionary rapid emergence of new genes gives rise to "orphan genes" that share no sequence homology to genes in closely related genomes. These genes provide organisms with a reservoir of genetic elements to quickly respond to changing selection pressures. Gene annotation pipelines that combine ab initio machine-learning with sequence homology-based searches are efficient in identifying basal genes with a long evolutionary history. However, their ability to identify orphan genes and other young genes has not been systematically evaluated. Here, we classify the phylostrata of curated Arabidopsis thaliana genes and use these to assess the ability of two of the most prevalent annotation pipelines, MAKER and BRAKER, to predict orphans and other young genes. MAKER predictions are highly dependent on the RNA-Seq evidence, predicting between 11% and 60% of the orphan-genes and 95% to 98% of basal-genes in the annotated genome of Arabidopsis. In contrast, BRAKER consistently predicts 33% of orphan-genes and 98% of basal-genes. A less used method to identify genes is by directly aligning RNA-Seq data to the genome sequence. We present a Findable, Accessible, Interoperable and Reusable (FAIR) approach, called BIND, that mitigates the under-prediction of orphan genes. BIND combines BRAKER predictions with direct evidence-based inference of transcripts based on RNA-Seq alignments to the genome. BIND increases the number and accuracy of orphan gene predictions, identifying 68% of Araport11-annotated orphan genes and 99% of the conserved genes.


A pan-tissue, pan-disease compendium of human orphan genes

Singh, Haltom, Guarnieri, et al. (2024). Preprint.

Self Cite


Species-specific genes are ubiquitous in evolution, with functions ranging from prey paralysis to survival in subzero temperatures. Because they are typically expressed under limited conditions and lack canonical features, such genes may be vastly under-identified, even in humans. Here, we leverage terabytes of human RNA-Seq data to identify thousands of highly-expressed transcripts that do not correspond to any Gencode-annotated gene. Many may be novel ncRNAs although 80% of them contain ORFs that have the potential of encoding proteins unique to Homo sapiens (orphan genes). We validate our findings with independent strand-specific and single-cell RNA-seq datasets. Hundreds of these novel transcripts overlap with deleterious genomic variants; thousands show significant association with disease-specific patient survival. Most are dynamically regulated and accumulate selectively in particular tissues, cell-types, developmental stages, tumors, COVID-19, sex, and ancestries. As such, these transcripts hold potential as diagnostic biomarkers or therapeutic targets. To empower future discovery, we provide a compendium of these huge RNA-Seq expression data, and RiboSeq data, with associated metadata. Further, we supply the gene models for the novel genes as UCSC Genome Browser tracks.
