Identifying groups of similar objects is a popular first step in biomedical data analysis, but it is error-prone and impossible to perform manually. Many computational methods have been developed to tackle this problem. Here we assessed 13 well-known methods using 24 data sets ranging from gene expression to protein domains. Performance was judged on the basis of 13 common cluster validity indices. We developed a clustering analysis platform, ClustEval (http://clusteval.mpi-inf.mpg.de), to promote streamlined evaluation, comparison and reproducibility of clustering results in the future. This allowed us to objectively evaluate the performance of all tools on all data sets with up to 1,000 different parameter sets each, resulting in a total of more than 4 million calculated cluster validity indices. We observed that there was no universal best performer, but on the basis of this wide-ranging comparison we were able to develop a short guideline for biomedical clustering tasks. ClustEval allows biomedical researchers to pick the appropriate tool for their data type and allows method developers to compare their tool to the state of the art.
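As a rough illustration of what a cluster validity index measures, the sketch below computes the Rand index, which scores the agreement between a clustering and a reference labelling. The abstract does not list its 13 indices, so the choice of the Rand index here is an assumption, and the function name `rand_index` is hypothetical rather than part of ClustEval.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two
    objects in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        # Zero or one object: there is nothing to disagree about.
        return 1.0
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical toy labelings; cluster names themselves do not matter.
reference = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]   # same partition, different names
crossed = [0, 1, 0, 1]      # splits every reference cluster
score_same = rand_index(reference, relabelled)
score_crossed = rand_index(reference, crossed)
```

Evaluating such an index for every tool, data set and parameter set is the kind of sweep the abstract describes, although the platform's actual implementation may differ.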
With the availability of newer and cheaper sequencing methods, genomic data are being generated at an increasingly fast pace. In spite of the high degree of complexity of currently available search routines, the massive number of sequences available virtually prohibits quick and correct identification of large groups of sequences sharing common traits. Hence, there is a need for clustering tools for automatic knowledge extraction enabling the curation of large-scale databases. Current sophisticated approaches to sequence clustering are based on pairwise similarity matrices. This is impractical for databases of hundreds of thousands of sequences, as such a similarity matrix alone would exceed the available memory. In this paper, a new approach called MultiLevel Clustering (MLC) is proposed which avoids a majority of sequence comparisons and therefore significantly reduces the total runtime for clustering. An implementation of the algorithm allowed clustering of all 344,239 ITS (Internal Transcribed Spacer) fungal sequences from GenBank on a normal desktop computer within 22 CPU-hours, whereas the greedy clustering method took up to 242 CPU-hours.
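The memory argument above can be made concrete with a short back-of-the-envelope calculation. This is a sketch under stated assumptions: only the upper triangle of the similarity matrix is stored and each entry takes 4 bytes (a 32-bit float). Neither detail comes from the abstract, and the function name is hypothetical.

```python
def pairwise_matrix_cost(n_sequences, bytes_per_entry=4):
    """Return (number of unordered pairs, bytes to store one value per pair).

    bytes_per_entry=4 assumes 32-bit floats; this is an illustrative
    assumption, not a figure taken from the paper.
    """
    n_pairs = n_sequences * (n_sequences - 1) // 2
    return n_pairs, n_pairs * bytes_per_entry


# 344,239 is the number of ITS sequences quoted in the abstract.
n_pairs, n_bytes = pairwise_matrix_cost(344_239)
gigabytes = n_bytes / 1e9
print(f"{n_pairs:,} pairs, about {gigabytes:.0f} GB")
```

Under these assumptions the matrix holds roughly 5.9 × 10^10 entries and needs about 237 GB, far beyond the memory of a typical desktop computer, which is why avoiding most pairwise comparisons matters.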
Introduction: Drug-resistant infections are becoming increasingly frequent worldwide, causing hundreds of thousands of deaths annually. This is partly due to the very limited set of protein drug targets known for human-infecting viral genomes. The eleven influenza virus proteins, for instance, exploit host cell factors for replication and suppression of the antiviral immune responses. A systems medicine approach to identify relevant and druggable host factors would dramatically expand therapeutic options. Therapeutic target identification, however, has hitherto relied on static molecular networks, whereas in reality the interactome, in particular during an infection, is subject to constant change. Methods: We developed time-course network enrichment (TiCoNE), an expert-centered approach for discovering temporal response pathways. In the first stage of TiCoNE, time-series expression data is clustered in a human-augmented manner to identify groups of biological entities with coherent temporal responses. Throughout this process, the expert can add, remove, merge, or split temporal patterns. The resulting groups can then be mapped to an interaction network to identify enriched pathways and to analyze cross-talk enrichments and depletions between groups. Finally, temporal response groups of two experiments can be intersected to identify condition-variant response patterns that represent promising drug-target candidates. Results: We applied TiCoNE to human gene expression data for influenza A virus infection and rhinovirus infection, respectively. We then identified coherent temporal response patterns and employed our cross-talk analysis to establish two potential timelines of systems-level host responses for either infection. Next, we compared the two phenotypes and unraveled condition-variant temporal groups interacting on a network level. We then validated the highest-ranking ones via literature search and wet-lab experiments.
This not only confirmed many of our candidates as previously known, but also identified phospholipid scramblase 1 (encoded by PLSCR1) as a previously unrecognized host factor that is essential for influenza A virus infection. Conclusion: With TiCoNE we developed a novel approach for conjointly analyzing molecular networks with time-series expression data and demonstrated its power by identifying temporal drug targets. We provide proof of concept not only that novel targets can be identified using our approach, but also that anti-infective drug target discovery can be enhanced by investigating temporal molecular networks of the host in response to viral infection.
In the version of this article initially published, in the graph keys in Fig. 1i, the colors indicating 'Ob' and 'Ad' were red and blue, respectively, but should have been blue and red, respectively; the shapes indicating 'MUS' and 'BM' were a triangle and a square, respectively, but should have been a square and a triangle, respectively. The errors have been corrected in the HTML and PDF versions of the article.