Abstract: Our goal was to find new diagnostic and prognostic biomarkers in bladder cancer (BCa) and to predict the molecular mechanisms and processes involved in BCa development and progression. Data collection is an unavoidable and time-consuming step, and identifying complementary results required considerable literature retrieval. Here, we provide detailed information on the datasets used, the study design, and the data mining. We analyzed differentially expressed genes (DEGs…
“…Data was organized by [8] and released under creative commons license. Authors of the data-set provided N = 406 anonymized clinical samples containing gene expression values of 14 hub genes related to bladder cancer.…”
Section: Methods (citation type: mentioning, confidence: 99%)
“…Nonetheless, these groups do not match perfectly the hub and seed sets, suggesting that genes among these two groups do not contribute equally to the informative content of the data-set. In the original paper [8], Dr. Zhang described an opposite behavior of CRYAB, TPM1, and CASQ2 genes compared to other hub genes. The negative correlation between these three genes and the other hub genes appears on the dendrogram, with their inclusion in the seed group.…”
Section: Original Data Assessment (citation type: mentioning, confidence: 99%)
“…The negative correlation between these three genes and the other hub genes appears on the dendrogram, with their inclusion in the seed group. Moreover, gene expression data was pre-selected by medical doctors that authored the original data-set [8], and feature selection techniques based on variance or correlation may not consider all the intuitions underlying their research. To preserve all the knowledge present in the data-set and reduce the feature space removing redundant information, dimensionality reduction was usually preferred to feature selection because it creates new synthetic features by combining the original ones.…”
Bioinformatic techniques targeting gene expression data require specific analysis pipelines to study properties, adaptation, and disease outcomes in a sample population. The present investigation compared the results of four numerical experiments modeling survival rates from bladder cancer genetic profiles. A sequence of two discretization phases produced remarkable results compared with the classic approach of discretizing gene expression data once. The two-phase analysis consisted of a primary discretizer followed by a refinement step, or of pre-binning the input values before the main discretization scheme. Among all tests, the best model comprises a data transformation to compensate for skewness, a discretization phase using the class-attribute interdependence maximization algorithm, and a final classification by voting feature intervals, a classifier that also optimizes the discrete intervals.
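The two-phase idea in the abstract above (a skewness-compensating transform, then pre-binning, then a main discretization) can be illustrated with a minimal sketch. It is not the authors' pipeline. The class-attribute interdependence maximization algorithm and the voting feature intervals classifier are not implemented here; as stand-ins, the sketch assumes a `log1p` transform, equal-width pre-binning, and an equal-frequency second phase. All function names and default bin counts are hypothetical.

```python
import math
from bisect import bisect_right


def log_transform(values):
    """Compress right skew with log1p (assumes non-negative expression values)."""
    return [math.log1p(v) for v in values]


def equal_width_edges(values, n_bins):
    """Interior cut points of n_bins equal-width intervals over the observed range."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + step * i for i in range(1, n_bins)]


def equal_frequency_edges(codes, n_bins):
    """Interior cut points that put roughly equal numbers of samples in each bin."""
    ordered = sorted(codes)
    n = len(ordered)
    # A set removes duplicate cut points that arise when many samples share a code.
    cuts = {ordered[(n * i) // n_bins] for i in range(1, n_bins)}
    return sorted(cuts)


def assign(values, edges):
    """Map each value to the index of the interval it falls into."""
    return [bisect_right(edges, v) for v in values]


def two_phase_discretize(values, pre_bins=8, final_bins=3):
    """Phase 1: fine equal-width pre-binning. Phase 2: coarser equal-frequency bins.

    Both discretizers are illustrative stand-ins, not the methods used in the paper.
    """
    transformed = log_transform(values)
    pre_codes = assign(transformed, equal_width_edges(transformed, pre_bins))
    final_edges = equal_frequency_edges(pre_codes, final_bins)
    return assign(pre_codes, final_edges)
```

For example, `two_phase_discretize([0.0, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0])` assigns each expression value a code between 0 and 2. A supervised discretizer would replace the second phase in a pipeline closer to the one the abstract describes.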
Introduction
Bladder cancer assessment with non-invasive gene expression signatures facilitates the detection of patients at risk and the surveillance of their status while avoiding the discomfort of cystoscopy. To estimate cancer outcomes accurately, analysis pipelines for gene expression data (GED) may combine several machine learning and biostatistical techniques to model the complex characteristics of pathological patterns.
Methods
Numerical experiments tested the combination of GED preprocessing by discretization with tree ensemble embeddings and nonlinear dimensionality reductions to categorize oncological patients comprehensively. Modeling aimed to identify tumor stage and distinguish survival outcomes in two situations: complete and partial data embedding. This latter experimental condition simulates the addition of new patients to an existing model for rapid monitoring of disease progression. Machine learning procedures were employed to identify the most relevant genes involved in patient prognosis and test the performance of preprocessed GED compared to untransformed data in predicting patient conditions.
Results
Data embedding paired with dimensionality reduction produced prognostic maps with well-defined clusters of patients, suitable for medical decision support. A second experiment simulated the addition of new patients to an existing model (partial data embedding): Uniform Manifold Approximation and Projection (UMAP) with uniform data discretization led to better outcomes than the other pipelines analyzed. Further exploration of the parameter space for UMAP and t-distributed stochastic neighbor embedding (t-SNE) showed that UMAP requires tuning more parameters than t-SNE. Moreover, two machine learning experiments identified a group of genes valuable for partitioning patients (gene relevance analysis) and showed that preprocessed data predicted tumor outcomes for cancer stage and survival rate with higher precision (six-class prediction).
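The partial data embedding condition relies on preprocessing that is learned on the existing cohort and then reused unchanged for new patients. The sketch below illustrates only that fit-then-transform split, for the uniform (equal-width) discretization step. The UMAP projection itself is omitted. The class name, the default number of bins, and the toy values are assumptions for illustration, not details from the paper.

```python
from bisect import bisect_right


class UniformDiscretizer:
    """Equal-width binning per gene; bin edges are learned once and then reused."""

    def __init__(self, n_bins=5):
        self.n_bins = n_bins
        self.edges_ = None

    def fit(self, rows):
        """Learn per-gene cut points from the existing cohort (rows = patients)."""
        self.edges_ = []
        for column in zip(*rows):
            lo, hi = min(column), max(column)
            step = (hi - lo) / self.n_bins
            self.edges_.append([lo + step * i for i in range(1, self.n_bins)])
        return self

    def transform(self, rows):
        """Bin patients with the stored edges; out-of-range values fall in the end bins."""
        if self.edges_ is None:
            raise RuntimeError("call fit() before transform()")
        return [
            [bisect_right(edges, value) for edges, value in zip(self.edges_, row)]
            for row in rows
        ]


# Toy cohort: three existing patients, two genes (values are made up).
existing = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
discretizer = UniformDiscretizer(n_bins=5).fit(existing)

# New patients are binned with the edges learned above, without refitting.
new_patients = [[12.0, 5.0], [3.0, 25.0]]
new_codes = discretizer.transform(new_patients)
```

Keeping the edges fixed means a new patient's codes are comparable with those of the cohort used to build the embedding, which is what lets a downstream projection place the patient on an existing map.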
Conclusions
The present investigation proposed new analysis pipelines for modeling disease outcomes from bladder cancer-related biomarkers. Complete and partial data embedding experiments suggested that pipelines employing UMAP had more accurate predictive ability, in line with recent literature trends on this methodology. However, several UMAP parameters were also found to influence the experimental results, so researchers are advised to tune them carefully. Machine learning procedures further demonstrated the effectiveness of the proposed preprocessing in predicting patients’ conditions and identified a subgroup of biomarkers significant for forecasting bladder cancer prognosis.