Phrase-based document categorization revisited

Koster, C. H. A.; Beney, Jean

doi:10.1145/1651343.1651357

Cited by 12 publications

(6 citation statements)

References 11 publications

“…We used the Balanced Winnow classifier (Dagan et al, 1997; Littlestone, 1988) implemented in the Linguistic Classification System (LCS; Koster et al, 2003; Koster and Beney, 2009). This algorithm assigns two weights (w⁺ and w⁻)…”

Section: Classification Algorithm (mentioning)

confidence: 99%

Choice and pronunciation of words: Individual differences within a homogeneous group of speakers

Hanique, Ernestus, Boves

2015

Corpus Linguistics and Linguistic Theory

This paper investigates whether individual speakers forming a homogeneous group differ in their choice and pronunciation of words when engaged in casual conversation, and if so, how they differ. More specifically, it examines whether the Balanced Winnow classifier is able to distinguish between the twenty speakers of the Ernestus Corpus of Spontaneous Dutch, who all have the same social background. To examine differences in choice and pronunciation of words, instead of characteristics of the speech signal itself, classification was based on lexical and pronunciation features extracted from hand-made orthographic and automatically generated broad phonetic transcriptions. The lexical features consisted of words and two-word combinations. The pronunciation features represented pronunciation variations at the word and phone level that are typical for casual speech. The best classifier achieved a performance of 79.9% and was based on the lexical features and on the pronunciation features representing single phones and triphones. The speakers must thus differ from each other in these features. Inspection of the relevant features indicated that, among other things, the words relevant for classification generally do not contain much semantic content, and that speakers differ not only from each other in the use of these words but also in their pronunciation.

“…As a first step, there is a need to impose some matrix structure on the unstructured data so that it can be accessible to the existing mining algorithms. The most common approach is to create a term document matrix by extracting terms that lead the columns and rows led by documents (Koster and Beney, 2009). Extracting all terms can lead to dimension curse and affect the algorithm efficiency; hence, terms are selected based on the frequency of occurrence.…”

Section: Introduction (mentioning)

confidence: 99%

Semantic key phrase-based model for document management

Bafna, Pramod, Shrwaikar, et al.

2019

BIJ

Purpose: Document management is growing in importance in proportion to the growth of unstructured data, and its applications are increasing from process benchmarking to customer relationship management and so on. The purpose of this paper is to improve important components of document management, namely keyword extraction and document clustering. It is achieved through knowledge extraction by updating the phrase document matrix. The objective is to manage documents by extending the phrase document matrix and achieve refined clusters. The study achieves consistency in cluster quality in spite of the increasing size of the data set. Domain independence of the proposed method is tested and compared with other methods.

Design/methodology/approach: In this paper, a synset-based phrase document matrix construction method is proposed where semantically similar phrases are grouped to reduce the dimension curse. When a large collection of documents is to be processed, it includes some documents that are closely related to the topic of interest, known as model documents, and also documents that deviate from the topic of interest. These non-relevant documents may affect the cluster quality. The first step in knowledge extraction from unstructured textual data is converting it into structured form, either as a term frequency-inverse document frequency matrix or as a phrase document matrix. Once in structured form, a range of mining algorithms from classification to clustering can be applied.

Findings: In the enhanced approach, the model documents are used to extract key phrases with synset groups, whereas the other documents participate in the construction of the feature matrix. It gives a better feature vector representation and improved cluster quality.

Research limitations/implications: Various applications that require managing of unstructured documents can use this approach by specifically incorporating the domain knowledge with a thesaurus.

Practical implications: An experiment pertaining to the academic domain is presented that categorizes research papers according to context and topic, and this will help academicians to organize and build knowledge in a better way. The grouping and feature extraction for resume data can facilitate the candidate selection process.

Social implications: Applications like knowledge management, clustering of search engine results, and different recommender systems such as hotel recommenders and task recommenders will benefit from this study. Hence, the study contributes to improving document management in business domains or areas of interest of its users from various strata of society.

Originality/value: The study proposes an improvement to the document management approach that can be applied in various domains. The efficacy of the proposed approach and its enhancement is validated on three different data sets of well-articulated documents: biographies, resumes and research papers. These results can be used for benchmarking further work carried out in these areas.

“…Textual reviews should be converted into the matrix format before applying any clustering algorithm. Many text mining [12,13] methods use TF-IDF approach, to represent documents [14], but it assumes all words are independent while words usually occur in contextual groups or phrases [15,16]. Table I specifies the significant mile stones in the evolution of context-based hotel recommender system.…”

Section: Introduction (mentioning)

confidence: 99%

A Hotel Recommender System using Context-Based Clustering

Bafna, Pramod

2019

IJRTE

The web is one of the largest textual data repositories in the world, and the volume of digital data is vast. Searching online for hotels that meet a user's specific requirements is not easy. Ratings and reviews available on different travel websites help to some extent but give only generalized recommendations. A recommender system (RS) that uses reviews is known as content-based and is preferred for producing recommendations. The proposed RS maps all requirements of a traveler to features of a hotel and produces person-specific recommendations. A phrase-based recommender system is proposed to reduce effort and time compared with a traditional generalized recommender system. The proposed approach makes use of hotel reviews downloaded from the TripAdvisor site. The technique begins with phrase-based feature extraction, followed by iterative clustering, and ends with feature mapping, producing more relevant recommendations. The improvement of the technique is demonstrated in terms of relevance, accuracy, scalability, and consistency by comparing precision, entropy refinement and corpus size with an existing technique.
