“…For WordStat, the analysis was restricted to all words occurring 10 times or more. While the recommended minimum loading value for topic extraction using FA is 0.30 according to [20], or 0.20 as used by [5], the latter criterion resulted in many topics containing fewer than 10 words. The minimum loading criterion was therefore reduced to 0.01, allowing 10 words to be extracted for each topic across all three datasets.…”
Section: Methods (mentioning)
confidence: 99%
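The word-selection step described in the excerpt above can be sketched as follows. This is a minimal illustration in Python, assuming a precomputed factor-loading matrix; the function name, vocabulary, and loading values are invented for the example and are not from the WordStat pipeline:

```python
import numpy as np

def top_words_per_topic(loadings, vocab, min_loading=0.01, n_words=10):
    """Select up to n_words per topic whose absolute loading meets the threshold.

    loadings: (n_topics, vocab_size) factor-loading matrix.
    vocab:    list of words, aligned with the columns of `loadings`.
    """
    topics = []
    for row in loadings:
        # Keep indices whose absolute loading passes the minimum criterion,
        # then take the n_words with the largest absolute loadings.
        keep = np.where(np.abs(row) >= min_loading)[0]
        keep = keep[np.argsort(-np.abs(row[keep]))][:n_words]
        topics.append([vocab[i] for i in keep])
    return topics

# Toy example: 2 topics over a 5-word vocabulary.
vocab = ["cell", "gene", "court", "law", "data"]
loadings = np.array([
    [0.62, 0.55, 0.02, 0.01, 0.30],
    [0.03, 0.005, 0.71, 0.66, 0.12],
])
print(top_words_per_topic(loadings, vocab, min_loading=0.01, n_words=3))
```

Lowering `min_loading` from 0.20 to 0.01, as the excerpt describes, widens the candidate pool so that each topic can reach the desired 10 words.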
“…FA was originally aimed at reducing the dimensionality of data to discover its latent content [5,22]. In FA, each word wᵢ in the vocabulary V containing all n words in a corpus, wᵢ ∈ V, ∀ i ∈ {1, …, n}, can be represented as a linear function of m (< n) topics (aka common factors), fⱼ ∈ F, ∀ j ∈ {1, …, m}.…”
Section: Factor Analysis (mentioning)
confidence: 99%
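Read as a factor model, each word's occurrence pattern across documents decomposes into loadings on the m common factors. A minimal sketch of extracting such loadings, using scikit-learn's FactorAnalysis on an invented document-term matrix (the library and toy data are illustrative assumptions; the cited 1963 work predates them):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Toy document-term matrix: 40 documents x 6 words (n = 6), built so that
# words 0-2 co-occur and words 3-5 co-occur, i.e. two latent topics (m = 2).
topic = rng.integers(0, 2, size=40)
counts = np.where(topic[:, None] == 0,
                  rng.poisson([5, 4, 4, 0.5, 0.5, 0.5], size=(40, 6)),
                  rng.poisson([0.5, 0.5, 0.5, 5, 4, 4], size=(40, 6)))

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(counts)
loadings = fa.components_   # shape (m, n): each row holds one topic's loadings
print(loadings.shape)
```

Each word's loading on a factor plays the role of the coefficient in the linear function described above; thresholding these loadings is what yields the word lists per topic.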
“…For example, research in information retrieval as early as 1963 used Factor Analysis (FA) on text documents to extract topics and automatically classify documents [5,6]. Whilst this work received a lot of attention as an unsupervised approach to document classification, it has rarely been cited as an example of topic identification.…”
“…TC is used in many application contexts, ranging from automatic document indexing based on a controlled vocabulary (Borko and Bernick 1963; Gray and Harley 1971; Field 1975), to document filtering (Amati and Crestani 1999; Iyer, Lewis et al. 2000; Kim, Hahn et al. 2000), word sense disambiguation (Gale, Church et al. 1992; Escudero, Marquez et al. 2000), population of hierarchical catalogues of Web resources (Chakrabarti, Dom et al. 1998; Attardi, Gulli et al. 1999; Oh, Myaeng et al. 2000), and in general any application requiring document organization or selective and adaptive document dispatching.…”
Automatic categorization of text documents has become an important area of research in the last two decades, with features that make it significantly more difficult than the traditional classification tasks studied in machine learning. A more recent development is the need to classify hypertext documents, most notably web pages. These have features that add further complexity to the categorization task but also offer the possibility of using information that is not available in standard text classification, such as metadata and the content of the web pages that point to and are pointed at by a web page of interest. This chapter surveys the state of the art in text categorization and hypertext categorization, focussing particularly on issues of representation that differentiate them from 'conventional' classification tasks and from each other.
“…Automatic classification of text documents has been one of the biggest challenges in natural language processing for decades [2], [15], [21], [22]. Distinguishing good and bad documents is relevant for various types of real-world situations such as finding useful Web pages or reviewing research papers.…”
Current patent systems face a serious problem of declining patent quality, as the growing number of applications makes it difficult for patent examiners to spend enough time evaluating each application. To build a better patent system, it is necessary to define a public consensus on the quality of patent applications in a quantitative way. In this article, we tackle the problem of assessing the quality of patent applications using machine learning and text mining techniques. For each patent application, our tool automatically computes a score, called patentability, which indicates how likely it is that the application will be approved by the patent office. We employ a new statistical prediction model to estimate examination results (approval or rejection) based on a large data set of 0.3 million patent applications. The model computes the patentability score from a set of feature variables, including the text contents of the specification documents. Experimental results showed that our model outperforms a conventional method that uses only the structural properties of the documents. Since users can access the estimated result through a Web-browser-based GUI, the system allows both patent examiners and applicants to quickly detect weak applications and find their specific flaws.
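The general approach of scoring each application by its predicted probability of approval can be sketched with a simple text classifier. The data, features, and model below are invented stand-ins for illustration, not the authors' actual system or feature set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled specification texts: 1 = approved, 0 = rejected (invented data).
docs = [
    "a novel battery electrode with measured capacity improvements",
    "a method of doing business by thinking about customers",
    "an optical sensor with a specific layered coating structure",
    "a generic idea for making software faster somehow",
]
labels = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, labels)

# A "patentability"-style score: predicted probability of approval for new text.
score = model.predict_proba(["an electrode coating for improved capacity"])[0, 1]
print(round(score, 2))
```

In the actual article the prediction model is trained on roughly 0.3 million applications and combines textual content with structural document properties; this sketch only shows the shape of a text-based probability score.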