Extracting valuable information from large datasets is one of the most important research problems. Association rule mining is one of the most widely used methods for this purpose. Finding possible associations between items in large transaction-based datasets (finding frequent itemsets) is the most crucial part of the association rule mining task. Many single-machine association rule mining algorithms exist, but the massive amount of data available today exceeds the capacity of a single-machine algorithm. Therefore, to meet the demands of this ever-growing data, there is a need for a distributed association rule mining algorithm that can run on multiple machines. For these types of parallel/distributed applications, MapReduce is one of the best fault-tolerant frameworks. Hadoop is one of the most popular open-source software frameworks with a MapReduce-based approach for distributed storage and processing of large datasets using standalone clusters built from commodity hardware. However, heavy disk I/O at each iteration of a highly iterative algorithm like Apriori makes Hadoop inefficient. A number of MapReduce-based platforms have been developed for parallel computing in recent years. Among them, Spark has attracted a lot of attention because of its built-in support for distributed computations. Therefore, we implemented a distributed association rule mining algorithm on Spark, named Adaptive-Miner, which uses an adaptive approach for finding frequent patterns with higher accuracy and efficiency. Adaptive-Miner uses an adaptive strategy based on the partial processing of datasets. It makes execution plans before every iteration and goes with the most suitable plan to minimize time and space complexity. Adaptive-Miner is a dynamic association rule mining algorithm that changes its approach based on the nature of the dataset. 
Therefore, it is different and better than state-of-the-art static association rule mining algorithms. We conduct in-depth experiments to gain insight into the effectiveness, efficiency, and scalability of the Adaptive-Miner algorithm on Spark.
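For readers unfamiliar with the iterative structure that makes Apriori expensive on disk-based platforms, the level-wise search can be sketched as follows. This is a minimal single-machine illustration in plain Python, not Adaptive-Miner itself and not a Spark implementation; the function name `apriori` and the toy transactions are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets.

    Hypothetical helper for illustration; min_support is an absolute count.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items in one pass over the data.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset of a candidate must be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)

        # Counting step: one more full pass over the transactions per level.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


if __name__ == "__main__":
    toy = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
    for itemset, count in sorted(apriori(toy, 2).items(),
                                 key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), count)
```

Each pass of the `while` loop rescans every transaction, which is the per-iteration cost the abstract attributes to heavy disk I/O on Hadoop.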
Neoadjuvant chemoradiotherapy is commonly used to treat rectal cancer, but patients have different levels of response and/or toxic effects. As part of the Stratification in COloRecTal cancer (S:CORT) programme, we collected 257 rectal biopsies from two cohorts: Grampian (single hospital) and Aristotle (clinical trial). All patients had been subsequently treated with an identical regimen of neoadjuvant radiotherapy and capecitabine. We performed transcriptomic, mutation and copy number profiling and aimed to identify biomarkers associated with the robust pathological endpoint of complete response (CR). Key biological determinants were identified by linear regression of different pre-defined, hypothesis-driven biomarkers for radiotherapy response, adjusted for the known confounders T and N stage. A novel RNA signature was derived using a personalised bioinformatics pipeline with a wide range of machine learning approaches. Results were validated in a publicly available transcriptomic cohort of 107 patients treated with a similar dose of radiotherapy and 5-fluorouracil infusion. Further comparisons of the biological determinants and the novel RNA signature were performed in the same cohorts and also in TCGA by linear regression. Previously published transcriptomic signatures were retrieved and assessed in the unseen validation cohort. The Grampian and Aristotle cohorts had similar statistical power and showed similar associations of CR with biological candidates, 10 of them being significant or borderline (p<0.1). Accordingly, both cohorts were merged into a single discovery set to better assess which candidates would show additive, independent association. Following multivariable stepwise regression, the final model was composed of the immune biomarkers cytotoxic lymphocytes and CMS1 for radiosensitivity, and the stromal TGFb Fibroblasts and epithelial APC mutations for radioresistance. 
The first three variables were validated in the transcriptomic validation set (Cyt lymph OR 7.09, p=0.01; CMS1 OR 5.39, p=0.02; TGFb Fib OR 0.27, p=0.04). In parallel, a 33-gene signature, trained in the discovery cohort by a comprehensive machine learning pipeline, showed excellent predictive ability in the validation cohort (0.9 AUC; 88% accuracy, 90% sensitivity, 86% specificity). Most genes were associated with at least one of the four biological features identified in the discovery set, validation set and a third cohort of colorectal cancer resections. Our novel signature showed much better predictive ability than other previously published transcriptomic signatures in the validation, unseen cohort. The immune, stromal and epithelial components of rectal tumours are important players for prediction of CR to radiotherapy in rectal cancer. A 33-gene transcriptomic biomarker can be used to effectively select patients that are highly likely to achieve CR allowing organ preservation while modulation of the relevant biological features in the other patients may be tested to improve their poor outcome with current treatment strategies. Citation Format: Enric Domingo, Sanjay Rathee, Andrew Blake, Leslie M. Samuel, Graeme I. Murray, David Sebag-Montefiore, Simon Gollins, Nicholas West, Rubina Begum, Marian Duggan, Laura White, Susan Richman, Philip Quirke, James Robineau, Keara Redmond, Aikaterini Chatzipli, Ultan McDermott, Ian Tomlinson, Philip Dunne, Francesca Buffa, Tim Maughan. Stratification of radiotherapy and fluoropyrimidine-based chemotherapy from multi-omic profiling in rectal cancer biopsies [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2021; 2021 Apr 10-15 and May 17-21. Philadelphia (PA): AACR; Cancer Res 2021;81(13_Suppl):Abstract nr LB129.
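The accuracy, sensitivity and specificity figures quoted above are all derived from a 2x2 confusion matrix. A minimal sketch of that arithmetic is given below; the function name and the counts in the usage example are hypothetical and are not taken from the study's validation cohort.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity and specificity from confusion counts.

    tp/fn: responders predicted as responders / non-responders.
    tn/fp: non-responders predicted as non-responders / responders.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the calculation.
    print(classification_metrics(tp=18, fp=20, tn=60, fn=2))
```

With these illustrative counts, sensitivity is 18/20, specificity is 60/80 and accuracy is 78/100.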
Drug-induced liver injury (DILI) is a class of adverse drug reactions (ADR) that causes problems in both clinical and research settings. It is the most frequent cause of acute liver failure in the majority of Western countries and is a major cause of attrition of novel drug candidates. Manual trawling of the literature is the main route of deriving information on DILI from research studies, which makes the process inefficient and prone to human error. Therefore, an automated AI model capable of retrieving DILI-related articles from the vast literature could be invaluable for the drug discovery community. In this study, we built an artificial intelligence (AI) model combining natural language processing (NLP) and machine learning (ML) to address this problem. The model uses NLP to filter out uninformative text (e.g., stop words) and uses customized functions to extract relevant keywords in the form of singletons, pairs, and triplets. These keywords are processed by an Apriori pattern mining algorithm to extract relevant patterns, which are used to estimate initial weightings for an ML classifier. Along with pattern importance and frequency, an FDA-approved drug list mentioning DILI adds extra confidence in classification. Combining these methods yields a DILI classifier (DILIC) with 94.91% cross-validation accuracy and 94.14% external validation accuracy. To make DILIC as accessible as possible, including to researchers without coding experience, we developed an R Shiny app capable of classifying single or multiple entries for DILI and made it available at https://researchmind.co.uk/diliclassifier/. Additionally, a GitHub link (https://github.com/sanjaysinghrathi/DILI-Classifier) for the app source code and an extended ISMB video talk (https://www.youtube.com/watch?v=j305yIVi_f8) are available as supplementary materials.
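The keyword extraction step described above (stop-word removal followed by singleton, pair, and triplet keywords) can be sketched as below. This is an illustration only: it assumes the pairs and triplets are contiguous word n-grams over the filtered tokens, and the stop-word list, function name, and example sentence are hypothetical rather than taken from DILIC.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = frozenset(
    {"a", "an", "the", "of", "in", "and", "is", "to", "with",
     "was", "were", "for"}
)


def extract_keywords(text, max_n=3, stop_words=STOP_WORDS):
    """Return {n: [n-gram strings]} for n = 1..max_n after stop-word removal."""
    # Lowercase, keep alphanumeric runs, then drop stop words.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if t not in stop_words]
    keywords = {}
    for n in range(1, max_n + 1):
        # Slide a window of length n over the filtered token list.
        keywords[n] = [" ".join(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1)]
    return keywords


if __name__ == "__main__":
    example = "Acute liver injury was observed in the patient"
    for n, grams in extract_keywords(example).items():
        print(n, grams)
```

Because stop words are removed before the windows are formed, words that were separated by a stop word in the original sentence can end up adjacent in a pair or triplet. The resulting keyword lists are the kind of per-document "transactions" a pattern miner such as Apriori can consume.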
Drug-Induced Liver Injury (DILI), despite its low occurrence rate, can cause severe side effects or even lead to death. Thus, it is one of the leading causes for terminating the development of new drugs and for restricting the use of drugs already in circulation. Moreover, its multifactorial nature, combined with a clinical presentation that often mimics other liver diseases, complicates the identification of DILI-related literature, which remains the main medium for sourcing results from clinical practice and experimental studies. In this work, which contributes to the ‘Literature AI for DILI Challenge’ of the Critical Assessment of Massive Data Analysis (CAMDA) 2021, we present an automated pipeline for distinguishing between DILI-positive and DILI-negative papers. We used Natural Language Processing (NLP) to filter out the uninformative parts of a text and to identify and extract mentions of chemicals and diseases. We combined that information with small-molecule and disease embeddings, which are capable of capturing chemical and disease similarities, to improve classification performance. The former are directly sourced from the Chemical Checker (CC). For the latter, we collected data that encode different aspects of disease similarity from the National Library of Medicine’s (NLM) Medical Subject Headings (MeSH) thesaurus and the Comparative Toxicogenomics Database (CTD). Following a procedure similar to the one used in the CC, vector representations for diseases were learnt and evaluated. Two Neural Network (NN) classifiers were developed: one that only accepts texts as input (baseline model) and an augmented classifier that also utilises chemical and disease embeddings (extended model). We trained, validated, and tested the models through a Nested Cross-Validation (NCV) scheme with 10 outer and 5 inner folds. During this, the baseline and extended models performed virtually identically, with macro F1-scores of 95.04 ± 0.61% and 94.80 ± 0.41%, respectively. 
Upon validation on an external, withheld, dataset, representing imbalanced data, the extended model achieved an F1-score of 91.14 ± 1.62%, outperforming its baseline counterpart, which got a lower score of 88.30 ± 2.44%. We make further comparisons between the classifiers and discuss future improvements and directions, including utilising chemical and disease embeddings for visualisation and exploratory analysis of the DILI-positive literature.
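The nested cross-validation scheme mentioned above (10 outer and 5 inner folds) can be sketched as index bookkeeping in plain Python. This is a minimal, non-stratified illustration; the function names, the seed handling, and the sample count in the usage example are assumptions, and the authors' actual splitting procedure is not described in the abstract.

```python
import random


def k_fold_indices(indices, k, seed=0):
    """Yield (train, held_out) index lists for a shuffled k-fold split."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        rest = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield rest, folds[i]


def nested_cv_splits(n_samples, outer_k=10, inner_k=5, seed=0):
    """Yield (outer_train, test, inner_splits) for each outer fold.

    inner_splits is a list of (train, validation) pairs drawn only from
    outer_train, so the outer test fold never influences model selection.
    """
    for outer_train, test in k_fold_indices(range(n_samples), outer_k, seed):
        inner = list(k_fold_indices(outer_train, inner_k, seed + 1))
        yield outer_train, test, inner


if __name__ == "__main__":
    # Hypothetical dataset size, used only to show the fold sizes.
    for outer_train, test, inner in nested_cv_splits(100):
        sizes = [(len(tr), len(va)) for tr, va in inner]
        print(len(outer_train), len(test), sizes)
```

Hyperparameters would be chosen on the inner validation folds and the selected model scored once on the matching outer test fold; the outer-fold scores are then what a mean ± standard deviation summarises.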