2015
DOI: 10.1016/j.eswa.2014.11.003

A hybrid evolutionary computation approach with its application for optimizing text document clustering

Cited by 47 publications (22 citation statements)
References 31 publications
“…In common practice, we set the number of terms to be constant among topics, so the above equation can be simplified as s(t) = N/n, where n is the number of unique terms in any topic t. Then, if s(t) is greater than a user-defined threshold, t_i and t_j are merged. Compared to extant topic modelling methods (Song, Qiao, Park, & Qian, 2015; Wang, Mao, Wang, & Guo, 2017) that utilise a Gaussian-Poisson distribution to approximate the document-topic and topic-word distributions, or that optimise a log-linear model (Li, Duan, et al., 2017), our method restricts the overlap between topics and is more internally consistent. Based on the above method, the top 5 topics in FLSs are reported in Table 2.…”
Section: Feature Engineering From FLSs (mentioning)
confidence: 99%
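The merging rule quoted above lends itself to a short illustration. Below is a minimal Python sketch, assuming each topic is represented as a set of terms; the function names, the example topics, and the threshold value of 0.5 are all illustrative, not taken from the cited paper.

```python
# Minimal sketch of the topic-merging rule quoted above (illustrative only).
# Assumes each topic is a fixed-size set of terms; s(t) = N / n, where N is
# the number of terms shared by two topics and n is the number of unique
# terms in a topic (constant across topics by assumption).

def overlap_score(topic_i: set, topic_j: set) -> float:
    """s(t) = N / n: shared terms over the (constant) per-topic term count."""
    n = len(topic_i)                # unique terms per topic (constant)
    N = len(topic_i & topic_j)      # terms common to both topics
    return N / n

def merge_if_overlapping(topics: list, threshold: float = 0.5) -> list:
    """Greedily merge any pair of topics whose overlap exceeds the threshold."""
    merged = [set(t) for t in topics]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlap_score(merged[i], merged[j]) > threshold:
                    merged[i] |= merged[j]   # t_i := t_i ∪ t_j
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged

topics = [{"nasa", "orbit", "launch"}, {"orbit", "launch", "shuttle"},
          {"hockey", "goal", "puck"}]
print(merge_if_overlapping(topics, threshold=0.5))
```

On this toy input, the first two topics share 2 of 3 terms (s(t) = 2/3 > 0.5) and are merged, while the third remains separate.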
“…The data in this set is considered particularly noisy and, as might be expected, includes complications such as duplicate entries and cross-postings. We construct a 500-document subset of the 20 Newsgroups dataset in the same way as Song et al. [21], by randomly taking 100 documents from each of five categories (comp.os.ms-windows.misc, misc.forsale, rec.sport.hockey, sci.space, soc.religion.christian).…”
Section: 20 Newsgroup (Name: 20NG5) (mentioning)
confidence: 99%
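A subset of this kind can be reproduced along the following lines with scikit-learn's fetch_20newsgroups loader. The fixed seed and the choice of the full ("all") split are assumptions made for illustration, not details reported by Song et al. [21].

```python
# Sketch of building the 20NG5 subset described above: 100 random documents
# from each of the five named categories (seed and split are illustrative).
import random
from sklearn.datasets import fetch_20newsgroups

CATEGORIES = ["comp.os.ms-windows.misc", "misc.forsale",
              "rec.sport.hockey", "sci.space", "soc.religion.christian"]

rng = random.Random(42)                    # fixed seed for reproducibility
subset_docs, subset_labels = [], []
for label, cat in enumerate(CATEGORIES):
    bundle = fetch_20newsgroups(subset="all", categories=[cat],
                                remove=("headers", "footers", "quotes"))
    picks = rng.sample(range(len(bundle.data)), 100)   # 100 docs per category
    subset_docs.extend(bundle.data[i] for i in picks)
    subset_labels.extend([label] * 100)

print(len(subset_docs))                    # 500 documents in total
```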
“…Genetic algorithms have been used in text clustering [2]. For example, Song et al. [21] used a GA in combination with swarm-intelligence techniques to optimise text clustering. In this case, the GA is used to find an optimal set of centres for the text clusters.…”
Section: Introduction (mentioning)
confidence: 99%
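As a rough illustration of that idea, the sketch below encodes each GA individual as a set of k candidate cluster centres over dense document vectors and evolves the population with selection, uniform crossover, and Gaussian mutation. This is a generic GA sketch, not the hybrid operator set of Song et al. [21]; all parameter values are illustrative.

```python
# Illustrative GA that searches for k cluster centres over dense vectors.
# Sketches the general idea only; the paper's hybrid GA/swarm method
# differs in its operators and update rules.
import numpy as np

def fitness(centres, X):
    """Negative sum of distances from each document to its nearest centre."""
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return -d.min(axis=1).sum()

def ga_cluster(X, k=3, pop_size=20, generations=50, seed=0):
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    # Each individual is k centres initialised from the documents themselves.
    pop = np.stack([X[rng.choice(n, k, replace=False)]
                    for _ in range(pop_size)])
    for _ in range(generations):
        scores = np.array([fitness(ind, X) for ind in pop])
        order = np.argsort(scores)[::-1]
        elite = pop[order[: pop_size // 2]]           # selection: keep top half
        children = []
        while len(children) < pop_size - len(elite):
            a, b = elite[rng.choice(len(elite), 2, replace=False)]
            mask = rng.random((k, 1)) < 0.5           # uniform crossover
            child = np.where(mask, a, b)
            child += rng.normal(0, 0.01, child.shape) # small Gaussian mutation
            children.append(child)
        pop = np.concatenate([elite, np.stack(children)])
    scores = np.array([fitness(ind, X) for ind in pop])
    return pop[scores.argmax()]
```

With TF-IDF features one might call ga_cluster on a densified matrix, e.g. TfidfVectorizer(max_features=100).fit_transform(docs).toarray().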
“…In some of the literature, additional information such as side-information [40] and privileged information [41] is introduced for text clustering. Moreover, several global optimization algorithms have been utilized for text clustering, such as the particle swarm optimization (PSO) algorithm [42,43] and the bee colony optimization (BCO) algorithm [44,45].…”
Section: Clustering Algorithm (mentioning)
confidence: 99%
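For comparison with the GA sketch above, a textbook global-best PSO applied to the same centre-search formulation might look as follows. This is the standard velocity/position update with conventional inertia and acceleration constants, not the specific PSO or BCO variants cited in [42-45].

```python
# Minimal global-best PSO sketch for the same centre-search problem
# (illustrative defaults; lower cost is better).
import numpy as np

def pso_cluster(X, k=3, particles=20, iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    pos = np.stack([X[rng.choice(n, k, replace=False)]
                    for _ in range(particles)])
    vel = np.zeros_like(pos)

    def cost(c):
        # Sum of distances from each document to its nearest centre.
        return np.linalg.norm(X[:, None] - c[None], axis=2).min(axis=1).sum()

    pbest = pos.copy()
    pbest_cost = np.array([cost(p) for p in pos])
    gbest = pbest[pbest_cost.argmin()]
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        # Standard PSO update: inertia + cognitive + social terms.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        costs = np.array([cost(p) for p in pos])
        better = costs < pbest_cost
        pbest[better], pbest_cost[better] = pos[better], costs[better]
        gbest = pbest[pbest_cost.argmin()]
    return gbest
```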