Y2H-SCORES: A statistical framework to infer protein-protein interactions from next-generation yeast-two-hybrid sequence data

Velásquez-Zapata, Valeria; Elmore, James M.; Banerjee, Sagnik; Dorman, Karin S.; Wise, Roger P.

doi:10.1101/2020.09.08.288365

Cited by 5 publications

(2 citation statements)

References 52 publications

“…Genome annotation is the process of identifying transcriptionally active regions of the genome and defining gene structures. Decoding the correct structures of genes is essential since several downstream applications rely on accurate annotations: detecting interactions between proteins [6][7][8][9][10][11][12][13][14], identifying post-translational modifications [15][16][17][18][19][20][21][22][23], mining effectors [24][25][26][27][28], and determining protein structure [29][30][31][32]. Although we have seen a significant improvement in genome sequencing technology, annotation methods continue to underperform [33,34].…”

Section: Introduction (mentioning)

confidence: 99%

FINDER: An automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences

Banerjee, Bhandary, Woodhouse, et al. 2021

Preprint

Self Cite

Background: Gene annotation in eukaryotes is a non-trivial task that requires meticulous analysis of accumulated transcript data. Challenges include transcriptionally active regions of the genome that contain overlapping genes, genes that produce numerous transcripts, transposable elements and numerous diverse sequence repeats. Currently available gene annotation software applications depend on pre-constructed full-length gene sequence assemblies which are not guaranteed to be error-free. The origins of these sequences are often uncertain, making it difficult to identify and rectify errors in them. This hinders the creation of an accurate and holistic representation of the transcriptomic landscape across multiple tissue types and experimental conditions. Therefore, to gauge the extent of diversity in gene structures, a comprehensive analysis of genome-wide expression data is imperative.

Results: We present FINDER, a fully automated computational tool that optimizes the entire process of annotating genes and transcript structures. Unlike current state-of-the-art pipelines, FINDER automates the RNA-Seq pre-processing step by working directly with raw sequence reads and optimizes gene prediction from BRAKER2 by supplementing these reads with associated proteins. The FINDER pipeline (1) reports transcripts and recognizes genes that are expressed under specific conditions, (2) generates all possible alternatively spliced transcripts from expressed RNA-Seq data, (3) analyzes read coverage patterns to modify existing transcript models and create new ones, and (4) scores genes as high- or low-confidence based on the available evidence across multiple datasets. We demonstrate the ability of FINDER to automatically annotate a diverse pool of genomes from eight species.

Conclusions: FINDER takes a completely automated approach to annotate genes directly from raw expression data. It is capable of processing eukaryotic genomes of all sizes and requires no manual supervision, which makes it ideal for bench researchers with limited experience in handling computational tools.

“…In plants, the prediction of protein-protein interactions (PPIs) provides important information for understanding the molecular mechanisms underlying biological processes. Recently, a large number of high-throughput experimental approaches have been developed to identify PPIs, such as affinity purification coupled to mass spectrometry (AP-MS) [1] and yeast two-hybrid (Y2H) [2][3][4][5] screens. Although we have accumulated a large amount of plant PPI data [6][7][8], these experimental approaches also have some inevitable drawbacks: they are not only costly, but also laborious and time-consuming.…”

Section: Introduction (mentioning)

confidence: 99%

Computational Prediction of Protein-Protein Interactions in Plants Using Only Sequence Information

Pan et al. 2021

Intelligent Computing Theories and Application

Protein-protein interactions (PPIs) in plants play a significant role in plant biology and the functional organization of cells. Although a large amount of plant PPI data has been generated by high-throughput techniques, the complexity of the plant cell means that the PPI pairs currently obtained by experimental methods cover only a small fraction of the complete plant PPI network. In addition, the experimental approaches for identifying PPIs in plants are laborious, time-consuming, and costly. Hence, it is highly desirable to develop more efficient approaches to detect PPIs in plants. In this study, we present a novel computational model combining a weighted sparse representation-based classifier (WSRC) with a novel inverse fast Fourier transform (IFFT) representation scheme, which was applied to the position-specific scoring matrix (PSSM) to extract features from plant protein sequences. When the proposed method was applied to the plant PPI datasets of Maize, Rice and Arabidopsis thaliana (Arabidopsis), we achieved excellent results with high accuracies of 89.12%, 84.72% and 71.74%, respectively. To further assess the prediction performance of the proposed approach, we compared it with the state-of-the-art support vector machine (SVM) classifier. To the best of our knowledge, we are the first to employ protein sequence information to predict PPIs in plants. Experimental results demonstrate that the proposed method has great potential to become a powerful tool for exploring plant cell function.

The area under the Receiver Operating Characteristic curve (AUC) is calculated to demonstrate the quality of the prediction model.

Assessment of Prediction Ability. In this article, we used 5-fold cross-validation to evaluate the predictive ability of our model on three plant datasets involving Maize, Rice and Arabidopsis. In this way, we can prevent overfitting and test the stability of the proposed method. More specifically, the whole dataset is partitioned into five roughly equal parts; four of them are used to construct a training set and the remaining one is adopted as a testing set. Thus, five models can be generated for the five sets of data. Cross-validation has the advantage that it can minimize the impact of data dependency and improve the reliability of the results.

The five-fold cross-validation results of the proposed approach on the three plant datasets are listed in Tables 1-3. From Table 1, we can observe that when applying the proposed method to the Maize dataset, we obtained the best prediction results, with average accuracy, precision, sensitivity, and MCC of 89.12%, 87.49%, 91.32%, and 80.59%, and corresponding standard deviations of 0.59%, 1.38%, 0.64%, and 0.94%, respectively. When applying the proposed method to the Rice dataset, we obtained good results, with average accuracy, precision, sensitivity, and MCC of 84.72%, 85.04%, 84.44% and 84.10%, respectively. The standard deviations of these criteria values are 0.73%, 0.85%, 0.65% and 1.00%, respectively. When predicting PPIs of Arabidopsis dataset, the proposed approach obtain...
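
The five-fold partitioning and the four reported criteria (accuracy, precision, sensitivity, MCC) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `five_fold_indices` and `binary_metrics`, the shuffling seed, and the toy labels are all assumptions made for illustration, and the classifier itself (WSRC) is not reproduced here.

```python
import math
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once.

    Hypothetical helper: the shuffle seed and the striding scheme are
    illustrative choices, not taken from the cited paper.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding gives folds whose sizes differ by at most one sample.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, sensitivity and MCC for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + tn + fp + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # MCC is conventionally set to 0 when any marginal count is zero.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


if __name__ == "__main__":
    for train, test in five_fold_indices(10):
        print(len(train), len(test))
    print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

In a full evaluation, a model would be trained on each `train` split, scored on the matching `test` split with `binary_metrics`, and the five results averaged to obtain means and standard deviations of the kind quoted above.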

NGPINT: a next-generation protein–protein interaction software

Banerjee, Velásquez-Zapata, Fuerst, et al. 2020

Briefings in Bioinformatics

Mapping protein–protein interactions at a proteome scale is critical to understanding how cellular signaling networks respond to stimuli. Since eukaryotic genomes encode thousands of proteins, testing their interactions one-by-one is a challenging prospect. High-throughput yeast-two hybrid (Y2H) assays that employ next-generation sequencing to interrogate complementary DNA (cDNA) libraries represent an alternative approach that optimizes scale, cost and effort. We present NGPINT, a robust and scalable software to identify all putative interactors of a protein using Y2H in batch culture. NGPINT combines diverse tools to align sequence reads to target genomes, reconstruct prey fragments and compute gene enrichment under reporter selection. Central to this pipeline is the identification of fusion reads containing sequences derived from both the Y2H expression plasmid and the cDNA of interest. To reduce false positives, these fusion reads are evaluated as to whether the cDNA fragment forms an in-frame translational fusion with the Y2H transcription factor. NGPINT successfully recognized 95% of interactions in simulated test runs. As proof of concept, NGPINT was tested using published data sets and it recognized all validated interactions. NGPINT can process interaction data from any biosystem with an available genome or transcriptome reference, thus facilitating the discovery of protein–protein interactions in model and non-model organisms.
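
The in-frame test mentioned in the abstract can be pictured with a small Python sketch. It is not NGPINT's implementation: the function `is_in_frame`, its parameters, and the coordinate convention (a 0-based offset of the first cDNA base relative to the annotated start codon, plus the number of bases left over from an incomplete codon at the end of the vector sequence) are assumptions chosen only to show the modular arithmetic behind a frame check.

```python
def is_in_frame(cds_offset, vector_phase):
    """Return True if a prey fragment continues the vector's reading frame.

    Hypothetical model, not NGPINT's actual logic:
      cds_offset   -- 0-based position of the first cDNA base relative to the
                      first base of the annotated start codon (negative values
                      mean the fragment starts in the 5' UTR).
      vector_phase -- number of bases (0, 1 or 2) of an incomplete codon left
                      at the end of the vector sequence before the junction.
    The fusion is in frame when the first cDNA base occupies the same codon
    position in the vector's frame as in the gene's native frame.
    """
    return cds_offset % 3 == vector_phase % 3


if __name__ == "__main__":
    print(is_in_frame(0, 0))   # fragment starts exactly at the start codon
    print(is_in_frame(4, 1))   # second codon, second base, matching phase
    print(is_in_frame(4, 0))   # same fragment, but the vector frame is shifted
```

A real pipeline would also need to handle details this sketch ignores, such as stop codons in an upstream untranslated region and the strand of the aligned fragment.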
