Knowledge Supervised Text Classification with No Labeled Documents

Yang

et al. 2013

AMM

Most Web users today rely heavily on search engines to gather information. Algorithms such as PageRank have been developed to improve search results. However, most Web search engines employ keyword-based search and thus have some natural weaknesses. A well-known one is that it is very difficult for search engines to infer semantics from user queries and returned results. Hence, despite efforts to rank search results, users may still have to navigate through a huge number of Web pages to locate the desired resources. In this research, the researchers developed a clustering-based methodology to improve the performance of search engines. Instead of extracting the features used for clustering from the returned documents, the proposed method extracts features from the Delicious service, which is a tag provider service. By utilizing such information, the resulting system can benefit from crowd intelligence. The obtained information is then used to enhance the performance of the ordinary k-means algorithm and achieve better clustering results.

Section: Related Work

mentioning

confidence: 94%
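
The abstract above describes clustering search results with k-means over features taken from user tags rather than from page text. A minimal sketch of that idea follows. The tag data, the bag-of-tags vectors, and the function names `tag_vector` and `kmeans` are illustrative assumptions, not the authors' implementation, and the abstract does not say how the tag information is combined with k-means.

```python
# Hypothetical sketch: cluster pages by their tags with plain k-means.
# All names and data here are assumptions made for illustration.

def tag_vector(tags, vocabulary):
    """Represent a page as tag counts over a fixed tag vocabulary."""
    return [float(tags.count(t)) for t in vocabulary]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, centroids, iterations=10):
    """Ordinary k-means from the given initial centroids; returns cluster indices."""
    assignment = []
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        assignment = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(v, centroids[k]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return assignment

# Made-up tags for four returned pages.
page_tags = {
    "a": ["python", "code"],
    "b": ["python", "tutorial", "code"],
    "c": ["recipe", "food"],
    "d": ["food", "cooking"],
}
vocabulary = sorted({t for tags in page_tags.values() for t in tags})
vectors = [tag_vector(page_tags[p], vocabulary) for p in sorted(page_tags)]
clusters = kmeans(vectors, [vectors[0], vectors[2]])
```

In this toy data the two programming pages and the two cooking pages share tags, so they end up in separate clusters even though no page text is inspected.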

Cluster Search Engine Results with Crowd Intelligence

Yang

et al. 2013

AMM

Proceedings of the Fourth ACM International Conference on Web Search and Data Mining

“…As far as text classification without labelled data is concerned, several works have been proposed recently for building flat text classifiers without labelled data such as [8,23,25,14,13]. Generally, instead of using labelled documents, their approach uses retrieval or bootstrapping techniques to initially assign documents to topics represented by a title or a few keywords, then incrementally builds a classifier and refines the assignments through many iterations.…”

Section: Related Work

mentioning

confidence: 99%
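
The citation statement above outlines a general recipe: seed each topic from a few keywords, train a classifier on the seeded documents, then reassign and retrain over several iterations. The sketch below illustrates that loop under stated assumptions. The naive Bayes model, the whitespace tokenizer, the toy documents, and every function name are hypothetical choices made here, not the method of any cited work.

```python
# Hypothetical sketch of keyword-seeded bootstrapping without labelled data.
# The classifier choice (multinomial naive Bayes) is an assumption.
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def seed_assign(docs, topic_keywords):
    """Initial labels: topic with most keyword hits, or None if no keyword matches."""
    labels = []
    for doc in docs:
        tokens = set(tokenize(doc))
        hits = {t: len(tokens & set(kws)) for t, kws in topic_keywords.items()}
        best = max(hits, key=hits.get)
        labels.append(best if hits[best] > 0 else None)
    return labels

def train_nb(docs, labels, vocab):
    """Naive Bayes with add-one smoothing, trained on currently labelled docs only."""
    priors, counts = Counter(), {}
    for doc, lab in zip(docs, labels):
        if lab is None:
            continue
        priors[lab] += 1
        counts.setdefault(lab, Counter()).update(tokenize(doc))
    if not priors:
        raise ValueError("no document matched any seed keyword")
    total = sum(priors.values())
    model = {}
    for lab, c in counts.items():
        denom = sum(c.values()) + len(vocab)
        log_probs = {w: math.log((c[w] + 1) / denom) for w in vocab}
        model[lab] = (math.log(priors[lab] / total), log_probs)
    return model

def classify(model, doc):
    def score(lab):
        prior, log_probs = model[lab]
        return prior + sum(log_probs[w] for w in tokenize(doc) if w in log_probs)
    return max(model, key=score)

def bootstrap(docs, topic_keywords, iterations=5):
    """Seed from keywords, then alternate training and reassignment until stable."""
    vocab = {w for d in docs for w in tokenize(d)}
    labels = seed_assign(docs, topic_keywords)
    for _ in range(iterations):
        model = train_nb(docs, labels, vocab)
        new_labels = [classify(model, d) for d in docs]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

docs = [
    "football match goal",
    "goal scored by striker",
    "election vote parliament",
    "parliament passes vote law",
]
topics = {"sport": ["football"], "politics": ["election"]}
final_labels = bootstrap(docs, topics)
```

Only the first and third documents contain a seed keyword. The second and fourth start unlabelled and are picked up in the first refinement pass through words they share with the seeded documents.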

Large-scale hierarchical text classification without labelled data

Ha-Thuc

Renders

2011

The traditional machine learning approaches for text classification often require labelled data for learning classifiers. However, when applied to large-scale classification involving thousands of categories, creating such labelled data is extremely expensive since typically the data is manually labelled by humans. Motivated by this, we propose a novel approach for large-scale hierarchical text classification which does not require any labelled data. We explore a perspective where the meaning of a category is not defined by human-labelled documents, but by its description and more importantly its relationships with other categories (e.g. its ascendants and descendants). Specifically, we take advantage of the ontological knowledge in all phases of the whole process, namely when retrieving pseudo-labelled documents, when iteratively training the category models and when categorizing test documents. Our experiments based on a taxonomy containing 1131 categories and widely adopted in the news industry as a standard for the NewsML framework demonstrate the effectiveness of our approach in these phases both qualitatively and quantitatively. In particular, we emphasize that just by taking the simple ontological knowledge defined in the category hierarchy, we could automatically build a large-scale hierarchical classifier with reasonable performance of 67% in terms of the hierarchy-based F-1 measure.

“…As far as text classification without labelled data is concerned, several works have been proposed recently for building flat text classifiers without labelled data [31,86,95,49,45]. Generally, instead of using labelled documents, their approach … In terms of hierarchical text classification based on language models, our work has to be related to the methods proposed in [50,78,25,29].…”

Section: Related Work

mentioning

confidence: 99%
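
The abstract above reports results with a hierarchy-based F-1 measure but does not define it. One common formulation, assumed here, augments both the true and the predicted category sets with all their ancestors and computes precision and recall over the augmented sets, so that a prediction in the right branch earns partial credit. The toy taxonomy and the function names are hypothetical.

```python
# Hypothetical sketch of a hierarchy-based F-1 measure.
# Assumption: categories are augmented with their ancestors before comparison;
# the source abstract does not state which variant was used.

def ancestors(cat, parent):
    """Return the category together with all of its ancestors."""
    out = set()
    while cat is not None:
        out.add(cat)
        cat = parent.get(cat)
    return out

def hierarchical_f1(true_cats, pred_cats, parent):
    """F-1 over ancestor-augmented true and predicted category sets."""
    true_aug = set().union(*(ancestors(c, parent) for c in true_cats))
    pred_aug = set().union(*(ancestors(c, parent) for c in pred_cats))
    overlap = len(true_aug & pred_aug)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_aug)
    recall = overlap / len(true_aug)
    return 2 * precision * recall / (precision + recall)

# Toy taxonomy: child -> parent, with None marking a top-level category.
parent = {
    "sport": None,
    "football": "sport",
    "tennis": "sport",
    "politics": None,
}
sibling_score = hierarchical_f1({"football"}, {"tennis"}, parent)
exact_score = hierarchical_f1({"football"}, {"football"}, parent)
wrong_branch_score = hierarchical_f1({"football"}, {"politics"}, parent)
```

Predicting a sibling category shares the parent with the true category, so it scores between an exact match and a prediction in an unrelated branch.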

Topic modeling and applications in Web 2.0

Srinivasan

Thuc

Along with the exponential growth of text data on the Web, particularly of the user-generated content, comes an increasing need for hierarchically organizing documents, retrieving documents accurately, and discovering evolutionary trends of various popular topics from the data. However, all of these are challenging due to the diversity, heterogeneity, noisiness and time-sensitivity of Web 2.0 data. Motivated by this, we tackle the challenges at a fundamental level, by proposing a novel topic modeling method with ontological guidance. It may be used to discover topic language models formalizing various terms relevant to given topics using the Web data. The topic model takes into account both the ontological relationships amongst the topics defined in a topic taxonomy and also word co-occurrence patterns in the data to automatically identify the portions in the data relevant to the topics. Then, it estimates language models for these topics from these relevant portions. At an application level, we use the topic model to propose novel approaches for three different tasks, namely hierarchical text classification without labeled data, information retrieval with pseudo-relevance feedback, and discovering topic evolutionary trends. Our classification experiment on the IPTC (International Press and Telecommunications Council) taxonomy, containing more than 1100 topics, shows that our approach achieves a performance of 67% in terms of the hierarchical version of the F-1 measure, without using any labeled data. Our retrieval experiments on five benchmark datasets show that compared to baseline retrieval (without pseudo-relevance feed