Feature selection (FS) techniques generally require repeatedly training and evaluating models to assess the importance of each feature for a particular task. However, due to the increasing size of currently available databases, distributed processing has become a necessity for many tasks. In this context, the Apache Spark ML library is one of the most widely used libraries for performing classification and other tasks with large datasets. Therefore, knowing both the predictive performance and the efficiency of its main algorithms before applying an FS technique is crucial for planning computations and saving time. In this work, a comparative study of four Spark ML classification algorithms is carried out, measuring execution times and predictive power as a function of the number of attributes from a colon cancer database. The results were statistically analyzed, showing that, although Random Forest and Naïve Bayes are the algorithms with the shortest execution times, Support Vector Machine obtains the models with the best predictive power. Studying the performance of these algorithms is of interest because they are applied to many different problems, such as classification of pathologies from epigenomic data, image classification, and prediction of computer attacks in network security, among others.
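The measurement of execution times described above can be sketched in a few lines. This is a minimal illustration only, not the protocol used in the study: the helper names `time_runs` and `summarize` are hypothetical, and `train` stands for any zero-argument callable that fits a model (for example, a closure wrapping a Spark ML estimator's `fit` call).

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(train: Callable[[], object], repeats: int = 5) -> List[float]:
    """Return the wall-clock duration in seconds of each of `repeats` calls to `train`."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        train()  # hypothetical training step; its return value is ignored
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Mean and sample standard deviation of a list of durations."""
    return {
        "mean": statistics.mean(durations),
        # The sample standard deviation needs at least two observations.
        "stdev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
    }
```

Repeating each run and reporting the spread, rather than a single timing, is what makes a later statistical comparison between algorithms possible.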
The remarkable growth in the volume of genomic data and the enormous variety of databases that store it make efficient and effective integration mechanisms indispensable. Several tools currently offer APIs (application programming interfaces) for accessing this information, usable both from programming languages and from browsers via web services. However, in specific bioinformatics domains such as microRNAs (small RNA molecules of great interest for their ability to regulate the activity of other genes), most solutions suffer from problems that hinder their use, including the lack of processes that simplify updating their databases as new information is published, inadequate response times, difficulty guaranteeing scalability, inconsistent data exchange formats, extremely limited functionality, and errors caused by lack of maintenance, among other frequent problems. This work presents Modulector, a solution that integrates information from genomic databases with microRNA (miRNA) databases to simplify access to the different dimensions of information on miRNAs of interest (sequences, associated drugs and pathologies, regulated genes, scientific publications), with special emphasis on solving the common technical problems described above. Modulector provides access through a REST (representational state transfer) API, guarantees adequate response times and scalability, and supports sorting, filtering, searching, and pagination of results.
The solution uses containers, which simplifies deployment on any server and makes it adaptable to most use cases where Modulector is to be run privately. All information returned by Modulector is normalized in JSON format, making it efficient to manipulate with any development tool. The source code of Modulector is available at https://github.com/omics-datascience/modulector.
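As an illustration of how a client might combine the sorting, searching, filtering, and pagination that the abstract mentions, the sketch below builds a query URL. Everything specific in it is an assumption: the base URL, the resource name, and the parameter names (`page`, `page_size`, `ordering`, `search`) are hypothetical placeholders, not Modulector's documented API, which should be checked in the repository linked above.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def build_query_url(
    base_url: str,
    resource: str,
    page: int = 1,
    page_size: int = 10,
    ordering: Optional[str] = None,
    search: Optional[str] = None,
    filters: Optional[Dict[str, str]] = None,
) -> str:
    """Build a paginated query URL; all parameter names are hypothetical."""
    params: Dict[str, object] = {"page": page, "page_size": page_size}
    if ordering:
        params["ordering"] = ordering  # e.g. "-name" for descending order (assumed convention)
    if search:
        params["search"] = search
    if filters:
        params.update(filters)  # extra field=value filters, passed through unchanged
    return f"{base_url.rstrip('/')}/{resource.strip('/')}/?{urlencode(params)}"


# Hypothetical usage; example.org is a placeholder host, not a real endpoint.
url = build_query_url(
    "https://example.org/api", "mirna", page=2, ordering="-name", search="hsa-miR-21"
)
```

A response in normalized JSON, as the abstract describes, could then be decoded with the standard `json` module regardless of which HTTP client is used.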
Motivation: Large-scale cancer genome projects have generated genomic, transcriptomic, epigenomic, and clinicopathological data from thousands of samples in almost every human tumor site. Although most omics data and their associated resources are publicly available, their full integration and interpretation to dissect the sources of gene expression modulation require specialized knowledge and software. Results: We present Multiomix, an interactive cloud-based platform that allows biologists to identify genetic and epigenetic events associated with the transcriptional modulation of cancer-related genes through the analysis of multi-omics data available on public functional genomic databases or user-uploaded datasets. Multiomix consists of an integrated set of functions, pipelines, and a graphical user interface that allows retrieval, aggregation, analysis, and visualization of different omics data sources. After the user provides the data to be analyzed, Multiomix identifies all significant correlations between mRNAs and non-mRNA genomic features (e.g., miRNA, DNA methylation, and CNV) across the genome, the predicted sequence-based interactions (e.g., miRNA-mRNA), and their associated prognostic values. Availability: Multiomix is available at https://www.multiomix.org. The source code is freely available at https://github.com/omics-datascience/multiomix. Supplementary information: Supplementary data are available at Bioinformatics online.
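The core idea of pairing each mRNA with each non-mRNA feature and keeping the strongly correlated pairs can be sketched as follows. This is not Multiomix's implementation: the function names are hypothetical, Pearson correlation is an assumed choice of coefficient, and a plain threshold on |r| stands in for the significance testing (p-values, multiple-testing correction) that a real analysis would need.

```python
import math
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length vectors; 0.0 if either is constant."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # correlation is undefined here; treated as "no correlation"
    return cov / math.sqrt(var_x * var_y)


def correlate_pairs(
    mrna: Dict[str, List[float]],
    features: Dict[str, List[float]],
    threshold: float = 0.7,
) -> List[Tuple[str, str, float]]:
    """Return (gene, feature, r) for every pair with |r| >= threshold."""
    hits = []
    for gene, expression in mrna.items():
        for name, values in features.items():
            r = pearson(expression, values)
            if abs(r) >= threshold:
                hits.append((gene, name, r))
    return hits
```

Each vector holds one value per sample, in the same sample order for both inputs. The all-pairs loop is quadratic in the number of genes and features, which is one reason genome-wide analyses of this kind are usually run on a server rather than on the user's machine.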