Feature selection for gene prediction in metagenomic fragments

Al-Ajlan, Amani; El Allali, Achraf

doi:10.1186/s13040-018-0170-z

Cited by 13 publications

(5 citation statements)

References 29 publications


“…This can safely be done for redundant features, i.e., features that do not give meaningful information about the class, are correlated, or are derived from other features in the data set. Amani Al-Ajlan and Achraf El Allali proposed a methodology [72] for feature selection using maximum Relevance Minimum Redundancy (mRMR) to find the most relevant features. The feature extraction algorithm has shown good results in improving classification results from Support Vector Machine (SVM)-based models.…”

Section: Discussion (mentioning)

confidence: 99%

Literature on Applied Machine Learning in Metagenomic Classification: A Scoping Review

Tonkovic, Kalajdziski, Zdravevski, et al. 2020

Biology


Applied machine learning in bioinformatics is growing as computer science slowly invades all research spheres. With the arrival of modern next-generation DNA sequencing algorithms, metagenomics is becoming an increasingly interesting research field as it finds countless practical applications exploiting the vast amounts of generated data. This study aims to scope the scientific literature in the field of metagenomic classification in the time interval 2008–2019 and provide an evolutionary timeline of data processing and machine learning in this field. This study follows the scoping review methodology and PRISMA guidelines to identify and process the available literature. Natural Language Processing (NLP) is deployed to ensure efficient and exhaustive search of the literary corpus of three large digital libraries: IEEE, PubMed, and Springer. The search is based on keywords and properties looked up using the digital libraries’ search engines. The scoping review results reveal an increasing number of research papers related to metagenomic classification over the past decade. The research is mainly focused on metagenomic classifiers, identifying scope specific metrics for model evaluation, data set sanitization, and dimensionality reduction. Out of all of these subproblems, data preprocessing is the least researched with considerable potential for improvement.
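The citation statement above refers to feature selection with mRMR. As a rough illustration only, the sketch below shows a greedy mRMR selection over discrete features, scoring each candidate by its mutual information with the class minus its mean mutual information with the features already chosen. The function names, the difference-based score, and the toy data are assumptions made for this example; they are not taken from the cited paper.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(features, labels, k):
    """Greedy mRMR sketch (hypothetical helper).

    features: dict mapping a feature name to a list of discrete values.
    labels:   list of class labels, same length as each feature column.
    k:        number of features to pick.
    """
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    selected = []
    remaining = set(features)

    def score(name):
        # Relevance minus mean redundancy with the already selected features.
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (invented): the class is a + b, and a_dup duplicates a.
toy_features = {
    "a":     [0, 0, 1, 1, 0, 0, 1, 1],
    "a_dup": [0, 0, 1, 1, 0, 0, 1, 1],
    "b":     [0, 1, 0, 1, 0, 1, 0, 1],
}
toy_labels = [0, 1, 1, 2, 0, 1, 1, 2]
selected = mrmr_select(toy_features, toy_labels, k=2)
print(selected)
```

In this toy case the duplicated column is as relevant as the original but fully redundant with it, so the redundancy term is what keeps it out of the selected pair.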


“…In addition, strategies such as principal component analysis (PCA) [37], n-Grams, minimal-redundance maximum-Relevance (mRMR) [38] are widely used in order to select a subset of the features. Studies show that applying feature selection algorithms produce better performance than using the extracted features directly or applying a multi-layer machine learning approach [39], [40]. However, feature selection should be performed on a different dataset than the training to avoid biases in the performance analysis during testing.…”

Section: Machine Learning (mentioning)

confidence: 99%

Machine learning applications in RNA modification sites prediction

Allali, Elhamraoui, Daoud 2021

Computational and Structural Biotechnology Journal
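The statement above recommends running feature selection on data that is separate from the training data. One simple way to arrange this, shown here only as a sketch, is to partition the samples into three disjoint subsets: one for feature selection, one for training, and one for testing. The function name, the default fractions, and the fixed seed are assumptions for illustration.

```python
import random


def three_way_split(samples, select_frac=0.2, test_frac=0.2, seed=0):
    """Split samples into disjoint (selection, train, test) lists.

    Hypothetical helper: the fractions and the seed are illustrative defaults.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)  # reproducible shuffle
    n_select = int(len(items) * select_frac)
    n_test = int(len(items) * test_frac)
    selection = items[:n_select]
    test = items[n_select:n_select + n_test]
    train = items[n_select + n_test:]
    return selection, train, test


selection_set, train_set, test_set = three_way_split(range(10))
```

Features would then be ranked on `selection_set` only, the model fitted on `train_set` using the chosen features, and performance reported on `test_set`.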


“…Other tools, often used for metagenomics analysis, are based on vectorization of sequence features of ORF-candidates, which is an efficient conversion of nucleotide sequence into a vector of sequence features (Al-Ajlan and El Allali, 2018; Al-Ajlan and El Allali, 2019; El Allali and Rose, 2013; Hoff, et al, 2008; Trimble, et al, 2012; Zhang, et al, 2017). These features can be evaluated by a machine learning models in order to select true ORFs.…”

Section: Introduction (mentioning)

confidence: 99%

“…A promising way that could wave the main drawbacks of existing algorithms, is to select the informative parameters describing candidate ORF fragments using advanced algorithms of nucleotide sequence vectorization (Bao, et al, 2014; Mao, et al, 2014), and then to apply an optimal prediction algorithm and identify the most probable or true ORF sequence among candidates. This approach needs only a limited set of input features (here we used 104), which is significantly less than 4000-5000 considered earlier (Al-Ajlan and El Allali, 2018). The use of a random forest classifier (Breiman, 2001) has several advantages over more complex deep learning techniques (Al-Ajlan and El Allali, 2019; Wen, et al, 2019).…”

Section: Introduction (mentioning)

confidence: 99%

ORFhunteR: an accurate approach for the automatic identification and annotation of open reading frames in human mRNA molecules

Grinev et al. 2021

Preprint


Motivation: Modern methods of whole transcriptome sequencing accurately recover nucleotide sequences of RNA molecules present in cells and allow for determining their quantitative abundances. The coding potential of such molecules can be estimated using open reading frames (ORF) finding algorithms, implemented in a number of software packages. However, these algorithms show somewhat limited accuracy, are intended for single-molecule analysis and do not allow selecting proper ORFs in the case of long mRNAs containing multiple ORF candidates. Results: We developed a computational approach, corresponding machine learning model and a package, dedicated to automatic identification of the ORFs in large sets of human mRNA molecules. It is based on vectorization of nucleotide sequences into features, followed by classification using a random forest. The predictive model was validated on sets of human mRNA molecules from the NCBI RefSeq and Ensembl databases and demonstrated almost 95% accuracy in detecting true ORFs. The developed methods and pre-trained classification model were implemented in a powerful ORFhunteR computational tool that performs an automatic identification of true ORFs among large set of human mRNA molecules. Availability and implementation: The developed open-source R package ORFhunteR is available for the community at GitHub repository (https://github.com/rfctbio-bsu/ORFhunteR), from Bioconductor (https://bioconductor.org/packages/devel/bioc/html/ORFhunteR.html) and as a web application (http://orfhunter.bsu.by).
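The statements and abstract above describe converting nucleotide sequences into feature vectors before classification. As a generic illustration, the sketch below turns a sequence into normalized k-mer frequencies. This is an assumed, simplified stand-in: it is not the 104-feature set used by ORFhunteR, and the function name and the default `k=3` are chosen only for this example.

```python
from collections import Counter
from itertools import product


def kmer_vector(sequence, k=3):
    """Return the frequency of every A/C/G/T k-mer in a fixed lexicographic order.

    Hypothetical helper: windows containing other symbols (e.g. N) are ignored.
    """
    seq = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count all overlapping windows of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


vec = kmer_vector("ATGATG", k=3)
```

The resulting vector has 4^k entries (64 for `k=3`) regardless of sequence length, which is what lets fragments of different sizes be passed to the same classifier.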
