Jungrim Kim scite author profile

As networks grow in size, analyzing large-scale network data is becoming increasingly important, and network clustering algorithms are a useful tool for that analysis. Most research on network clustering has targeted single-machine rather than parallel environments. However, single-machine algorithms cannot analyze large-scale network data because of memory limits. As a solution, we propose a network clustering algorithm for large-scale network data analysis on Apache Spark, changing the paradigm of the conventional clustering algorithm to improve its efficiency in the Apache Spark environment. We also apply optimization approaches such as a Bloom filter and shuffle selection to reduce memory usage and execution time. Evaluating the proposed algorithm by average normalized cut, we confirmed that it can analyze diverse large-scale network datasets, including biological, co-authorship, internet topology, and social networks. Experimental results show that the proposed algorithm produces more accurate clusters than the compared algorithms while using less memory. Furthermore, we verify the proposed optimization approaches and the scalability of the proposed algorithm. In addition, we validate that clusters found by the proposed algorithm can represent biologically meaningful functions.

DSS: A biclustering method to identify diverse and state specific gene modules in gene expression data

Kim, Yeu, Kim, et al. 2016

Discovering phenotype specific gene module using a novel biclustering algorithm in colorectal cancer

Kim, Yong-Jin, Park, et al. 2014

Gene clustering is a method for finding gene sets that are related to the same biological process or molecular function. To find these gene sets, previous studies have clustered genes that show similar mRNA expression or a specific expression pattern in a (sub)set of samples. However, for two contrasting groups of samples, current gene clustering methods cannot easily identify gene sets that show a significant expression pattern in only one group. Existing biclustering methods use only one group (disease) of samples, so although they can find biclusters with a specific expression pattern, it is hard for them to identify disease-specific biclusters that are differentially expressed in the disease. Here, we propose a novel method that applies a genetic algorithm to gene expression data in order to find gene sets that can represent a specific subtype of cancer. The proposed method finds gene sets that show statistically differential mRNA expression between two contrasting sample groups and in a fraction of the cancer samples. The resulting gene modules share a higher number of GO (Gene Ontology) terms related to a specific disease than gene modules identified by current algorithms. We also find that when protein-protein interaction data are integrated with gene expression data from colorectal cancer samples, the proposed method can find more functionally related gene sets.

SSL: Inferring disease-related genes using sentence structure and literature data

Kim, Choi, Kim, et al. 2017


Jungrim Kim

CSnet: Constructing symptom network based on disease-symptom relationships

CASS: A distributed network clustering algorithm based on structure similarity for large-scale network

DSS: A biclustering method to identify diverse and state specific gene modules in gene expression data

Discovering phenotype specific gene module using a novel biclustering algorithm in colorectal cancer

SSL: Inferring disease-related genes using sentence structure and literature data
