2017
DOI: 10.1186/s13040-017-0156-2
Cluster ensemble based on Random Forests for genetic data

Abstract: Background: Clustering plays a crucial role in several application domains, such as bioinformatics, where it has been extensively used as an approach for detecting interesting patterns in genetic data. One application is population structure analysis, which aims to group individuals into subpopulations based on shared genetic variations, such as single nucleotide polymorphisms. Advances in DNA sequencing technology have made it possible to obtain genetic datasets of exceptional size. G…


Cited by 14 publications (24 citation statements)
References 44 publications
“…In a non-model-based clustering method, a lasso-type penalty on selected features is used in so-called sparse clustering [45]. As another option, random forests provide a proximity measure that can capture different levels of co-occurring relationships between variables; they can be converted into an unsupervised learning method, and the derived proximity measure can then be combined with a clustering approach [46]. However, there is no proof that methods integrating feature selection into clustering outperform a two-stage approach, in which the first stage screens for relevant features and the second stage applies conventional clustering methods to the pre-selected features [47].…”
Section: Challenges and Pitfalls
confidence: 99%
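The two-stage approach contrasted in this excerpt can be illustrated with a minimal hypothetical pipeline. The variance screen and k-means are illustrative stand-ins chosen for the sketch, not the methods of the cited papers:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import VarianceThreshold

# Toy data: 100 individuals, 500 features, with cluster signal confined
# to the first 10 features (half of the samples are mean-shifted).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))
X[:50, :10] += 3.0

# Stage 1: screen for relevant features (a simple variance filter here,
# standing in for whatever relevance screen is used in practice).
X_screened = VarianceThreshold(threshold=1.5).fit_transform(X)

# Stage 2: conventional clustering on the pre-selected features only.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_screened)
```

The screen retains roughly the ten informative features, so the conventional clustering in stage two operates on a far smaller, signal-rich matrix.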
“…These counts were then divided by the total number of trees in the forest. To create a dissimilarity matrix, each frequency was subtracted from one and the square root of the result was taken [8, 13, 74]. A principal coordinates analysis (PCoA) of the dissimilarities was used for visualizing and analyzing these differences.…”
Section: Methods
confidence: 99%
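A minimal sketch of this construction, assuming a proximity matrix of co-occurrence frequencies is already available; PCoA is implemented here as classical metric MDS, which is an assumption about the exact variant used:

```python
import numpy as np

def proximity_to_dissimilarity(P):
    """D_ij = sqrt(1 - p_ij): subtract each co-occurrence frequency from
    one, then take the square root, as described in the excerpt."""
    return np.sqrt(1.0 - P)

def pcoa(D, n_components=2):
    """Principal coordinates analysis (classical MDS) on a dissimilarity matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1]        # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    keep = vals[:n_components] > 0        # drop non-positive axes
    return vecs[:, :n_components][:, keep] * np.sqrt(vals[:n_components][keep])
```

Samples that always share a terminal node (frequency 1) get dissimilarity 0; samples that never co-occur get dissimilarity 1, the maximum.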
“…This co-occurrence, S(x_i, x_j), is a similarity and can be found using the following equation: S(x_i, x_j) = (1/N) Σ_{t=1}^{N} I(x_{i,t} = x_{j,t}), where x_i and x_j are the vector representations of all terminal node positions of samples x_i and x_j in the forest, and N is the total number of trees in the forest. The similarity matrix, S, is then converted into a dissimilarity matrix, D (Equation Two) (17). This dissimilarity measure, while not a metric such as the Jaccard distance (29), can be used to investigate beta-diversity and can be constructed using either a supervised or an unsupervised approach (17).…”
Section: Methods
confidence: 99%
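A hedged sketch of this co-occurrence similarity for the supervised case, using scikit-learn's RandomForestClassifier, whose `apply()` method returns each sample's terminal (leaf) node index in every tree:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy labeled data; the construction only needs terminal node positions.
X, y = make_classification(n_samples=60, n_features=12, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# leaves[i, t] = terminal node that sample i reaches in tree t
leaves = forest.apply(X)
N = forest.n_estimators  # total number of trees in the forest

# S[i, j] = fraction of the N trees in which samples i and j land
# in the same terminal node (the co-occurrence frequency)
S = (leaves[:, None, :] == leaves[None, :, :]).sum(axis=2) / N
```

S is symmetric with ones on the diagonal, since every sample trivially shares all of its own terminal nodes.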
“…The similarity matrix, S, is then converted into a dissimilarity matrix, D (Equation Two) (17). This dissimilarity measure, while not a metric such as the Jaccard distance (29), can be used to investigate beta-diversity and can be constructed using either a supervised or an unsupervised approach (17). To use decision tree ensembles in an unsupervised manner, a second dataset is created such that the columns (ASVs) are randomly permuted.…”
Section: Methods
confidence: 99%
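A sketch of this unsupervised construction, under the usual formulation of the trick: a synthetic dataset is built by independently permuting each column of the real data, a forest is trained to separate real from synthetic rows, and terminal-node co-occurrence among the real rows then serves as an unsupervised similarity. The function below is illustrative, not code from the cited work:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def unsupervised_rf_similarity(X, n_trees=200, random_state=0):
    """Terminal-node co-occurrence similarity from an unsupervised random forest."""
    rng = np.random.default_rng(random_state)
    # Permute each column independently: marginals are preserved,
    # but between-column structure is destroyed in the synthetic data.
    X_perm = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
    X_all = np.vstack([X, X_perm])
    y_all = np.r_[np.ones(len(X)), np.zeros(len(X_perm))]  # 1 = real, 0 = synthetic
    forest = RandomForestClassifier(n_estimators=n_trees,
                                    random_state=random_state).fit(X_all, y_all)
    leaves = forest.apply(X)  # terminal nodes of the real samples only
    # Fraction of trees in which each pair of real samples shares a leaf
    return (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
```

The resulting similarity matrix can then be converted to a dissimilarity matrix and fed to PCoA or a clustering algorithm, as in the supervised case.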