The problem of classification has been widely studied in the data mining, machine learning, database, and information retrieval communities with applications in a number of diverse domains, such as target marketing, medical diagnosis, news group filtering, and document organization. In this paper we will provide a survey of a wide variety of text classification algorithms.
Clustering is a widely studied data mining problem in the text domain. The problem finds numerous applications in customer segmentation, classification, collaborative filtering, visualization, document organization, and indexing. In this chapter, we will provide a detailed survey of the problem of text clustering. We will study the key challenges of the clustering problem, as it applies to the text domain. We will discuss the key methods used for text clustering, and their relative advantages. We will also discuss a number of recent advances in the area in the context of social networks and linked data.
Motivation: Systematically predicting gene (or protein) function based on molecular interaction networks has become an important tool in refining and enhancing the existing annotation catalogs, such as the Gene Ontology (GO) database. However, functional labels with only a few (<10) annotated genes, which constitute about half of the GO terms in yeast, mouse and human, pose a unique challenge in that any prediction algorithm that independently considers each label faces a paucity of information and thus is prone to capture non-generalizable patterns in the data, resulting in poor predictive performance. There exist a variety of algorithms for function prediction, but none properly address this ‘overfitting’ issue of sparsely annotated functions, or do so in a manner scalable to tens of thousands of functions in the human catalog.
Results: We propose a novel function prediction algorithm, clusDCA, which transfers information between similar functional labels to alleviate the overfitting problem for sparsely annotated functions. Our method is scalable to datasets with a large number of annotations. In a cross-validation experiment in yeast, mouse and human, our method greatly outperformed previous state-of-the-art function prediction algorithms in predicting sparsely annotated functions, without sacrificing the performance on labels with sufficient information. Furthermore, we show that our method can accurately predict genes that will be assigned a functional label that has no known annotations, based only on the ontology graph structure and genes associated with other labels, which further suggests that our method effectively utilizes the similarity between gene functions.
Availability and implementation: https://github.com/wangshenguiuc/clusDCA
Contact: jianpeng@illinois.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
In this paper, we propose a new language model, namely, a title language model, for information retrieval. Different from the traditional language model used for retrieval, we define the conditional probability P(Q|D) as the probability of using query Q as the title for document D. We adopted the statistical translation model learned from the title and document pairs in the collection to compute the probability P(Q|D). To avoid the sparse data problem, we propose two new smoothing methods. In the experiments with four different TREC document collections, the title language model for information retrieval with the new smoothing method outperforms both the traditional language model and the vector space model for IR significantly.
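The core quantity in the abstract above is P(Q|D), computed with a statistical translation model and smoothed against a background collection model. As a rough illustration (not the paper's actual estimator), the following sketch scores a query against a document by summing, for each query word, the hypothetical translation probabilities from document words, then interpolating with a collection model to avoid zero probabilities. The `trans_prob` table and the fixed interpolation weight are assumptions for illustration; the paper learns its translation model from title–document pairs and proposes its own smoothing methods.

```python
from collections import Counter

def p_query_given_doc(query, doc, trans_prob, collection_prob, lam=0.7):
    """Sketch of a translation-based query likelihood P(Q|D).

    trans_prob[(q, w)]: hypothetical probability of 'translating' document
    word w into query word q (e.g. learned from title-document pairs).
    collection_prob[q]: background probability of q, used here for simple
    linear-interpolation smoothing (one of many possible schemes).
    """
    doc_counts = Counter(doc)
    doc_len = len(doc)
    score = 1.0
    for q in query:
        # P(q|D) under the translation model: sum over document words
        p_trans = sum(trans_prob.get((q, w), 0.0) * c / doc_len
                      for w, c in doc_counts.items())
        # Smooth with the collection model to avoid zero probabilities
        score *= lam * p_trans + (1 - lam) * collection_prob.get(q, 1e-6)
    return score
```

A document containing translation-supported query words scores higher than one without them, which is the basic ranking behavior a query-likelihood retrieval model needs.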
The fast information sharing on Twitter from millions of users all over the world leads to almost real-time reporting of events. It is extremely important for business and administrative decision makers to learn events' popularity as quickly as possible, as it can buy extra precious time for them to make informed decisions. Therefore, we introduce the problem of predicting the future popularity trend of events on microblogging platforms. Traditionally, trend prediction has been performed by using time series analysis of past popularity to forecast future popularity changes. Because microblogging data lets us encode Twitter dynamics with a rich variety of features, we explore regression, classification and hybrid approaches, using a large set of popularity, social and event features, to predict event popularity. Experimental results on two real datasets of 18,382 events extracted from ~133 million tweets show the effectiveness of the extracted features and learning approaches. The predicted popularity trend of events can be directly used for a variety of applications including recommendation systems, ad keyword bidding price decisions, stock trading decisions, dynamic ticket pricing for sports events, etc.
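The time-series baseline that the abstract above contrasts with can be illustrated very simply: fit a least-squares slope to the past popularity counts and classify the trend by its sign. This is only a minimal stand-in for that baseline; the paper's own approaches add popularity, social and event features on top of such signals and learn regression, classification and hybrid models.

```python
def trend_slope(counts):
    """Least-squares slope of a popularity time series (e.g. tweet counts
    per hour). A positive slope suggests rising popularity."""
    n = len(counts)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(counts) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, counts))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def predict_trend(counts):
    """Classify the future trend from the slope's sign; a crude sketch of
    the time-series baseline, not the paper's feature-based learners."""
    return "rising" if trend_slope(counts) > 0 else "falling"
```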
In this poster, we incorporate user query history, as context information, to improve the retrieval performance in interactive retrieval. Experiments using the TREC data show that incorporating such context information indeed consistently improves the retrieval performance in both average precision and precision at 20 documents.
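One simple way to realize the idea in the poster above is to interpolate a unigram model of the current query with a model built from past queries in the session. The fixed interpolation weight and the bag-of-words session model below are assumptions for illustration; the poster's actual weighting scheme may differ.

```python
from collections import Counter

def context_query_model(current_query, history_queries, alpha=0.8):
    """Hypothetical sketch: a unigram query model mixing the current query
    with the user's past queries (fixed-weight linear interpolation)."""
    cur = Counter(current_query)
    hist = Counter(w for q in history_queries for w in q)
    cur_total = sum(cur.values())
    hist_total = sum(hist.values()) or 1  # guard against an empty history
    vocab = set(cur) | set(hist)
    return {w: alpha * cur[w] / cur_total + (1 - alpha) * hist[w] / hist_total
            for w in vocab}
```

For an ambiguous query like "java", earlier session queries shift probability mass toward the sense the user has been pursuing, which is the intuition behind using query history as context.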
Probabilistic topic models have been proven very useful for many text mining tasks. Although many variants of topic models have been proposed, most existing works are based on the bag‐of‐words representation of text in which word combination and order are generally ignored, resulting in inaccurate semantic representation of text. In this paper, we propose a general way to go beyond the bag‐of‐words representation for topic modeling by applying frequent pattern mining to discover frequent word patterns that can capture semantic associations between words and then using them as additional supplementary semantic units to augment the conventional bag‐of‐words representation. By viewing a topic model as a generative model for such augmented text data, we can go beyond the bag‐of‐words assumption to potentially capture more semantic associations between words. Since efficient algorithms for mining frequent word patterns are available, this general strategy for improving topic models can be applied to improve any topic model without substantially increasing the computational complexity of the model. Experimental results show that such a frequent pattern‐based data enrichment approach can improve over two representative existing probabilistic topic models for the classification task. We also studied variations of frequent pattern usage in topic modeling and found that using compressed and closed patterns performs best.
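The enrichment step described in the abstract above can be sketched in miniature: mine word pairs that co-occur in enough documents and append each frequent pair to the documents that contain it as an extra pseudo-token, so a standard topic model sees both the words and their associations. This toy version mines only pairs by brute force; a real implementation would use an efficient frequent-pattern miner and, per the abstract's findings, compressed or closed patterns.

```python
from collections import Counter
from itertools import combinations

def augment_with_frequent_pairs(docs, min_support=2):
    """Append frequent co-occurring word pairs as extra pseudo-tokens.

    docs: list of tokenized documents. A pair is 'frequent' if it occurs
    in at least min_support documents (document frequency, not raw count).
    """
    pair_df = Counter()
    for doc in docs:
        for pair in combinations(sorted(set(doc)), 2):
            pair_df[pair] += 1
    frequent = {p for p, c in pair_df.items() if c >= min_support}
    augmented = []
    for doc in docs:
        extra = ["_".join(p) for p in combinations(sorted(set(doc)), 2)
                 if p in frequent]
        augmented.append(doc + extra)
    return augmented
```

The augmented documents can then be fed to any bag-of-words topic model unchanged, which is why the strategy applies to arbitrary topic models.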