“…For WordStat, the analysis was restricted to all words occurring 10 times or more. While the recommended minimum loading value for topic extraction using FA is 0.30 according to [20], or 0.20 as used by [5], the latter criterion resulted in many topics containing fewer than 10 words. The minimum loading criterion was therefore reduced to 0.01, allowing 10 words to be extracted for each topic across all three datasets.…”
Section: Methods (mentioning)
confidence: 99%
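The word-selection step described in the excerpt above can be sketched as follows. This is a minimal illustration in Python, assuming a precomputed factor-loading matrix; the function name, vocabulary, and loading values are invented for the example and are not from the WordStat pipeline:

```python
import numpy as np

def top_words_per_topic(loadings, vocab, min_loading=0.01, n_words=10):
    """Select up to n_words per topic whose absolute loading meets the threshold.

    loadings: (n_topics, vocab_size) factor-loading matrix.
    vocab:    list of words, aligned with the columns of `loadings`.
    """
    topics = []
    for row in loadings:
        # Keep indices whose absolute loading passes the minimum criterion,
        # then take the n_words with the largest absolute loadings.
        keep = np.where(np.abs(row) >= min_loading)[0]
        keep = keep[np.argsort(-np.abs(row[keep]))][:n_words]
        topics.append([vocab[i] for i in keep])
    return topics

# Toy example: 2 topics over a 5-word vocabulary.
vocab = ["cell", "gene", "court", "law", "data"]
loadings = np.array([
    [0.62, 0.55, 0.02, 0.01, 0.30],
    [0.03, 0.005, 0.71, 0.66, 0.12],
])
print(top_words_per_topic(loadings, vocab, min_loading=0.01, n_words=3))
```

Lowering `min_loading` from 0.20 to 0.01, as the excerpt describes, widens the candidate pool so that each topic can reach the desired 10 words.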
“…FA was originally aimed at reducing the dimensionality of data to discover its latent content [5,22]. In FA, each word wᵢ in the vocabulary V containing all n words in a corpus, wᵢ ∈ V, ∀ i ∈ {1, …, n}, can be represented as a linear function of m (< n) topics (aka common factors), fⱼ ∈ F, ∀ j ∈ {1, …, m}.…”
Section: Factor Analysis (mentioning)
confidence: 99%
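Read as a factor model, each word's occurrence pattern across documents decomposes into loadings on the m common factors. A minimal sketch of extracting such loadings, using scikit-learn's FactorAnalysis on an invented document-term matrix (the library and toy data are illustrative assumptions; the cited 1963 work predates them):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Toy document-term matrix: 40 documents x 6 words (n = 6), built so that
# words 0-2 co-occur and words 3-5 co-occur, i.e. two latent topics (m = 2).
topic = rng.integers(0, 2, size=40)
counts = np.where(topic[:, None] == 0,
                  rng.poisson([5, 4, 4, 0.5, 0.5, 0.5], size=(40, 6)),
                  rng.poisson([0.5, 0.5, 0.5, 5, 4, 4], size=(40, 6)))

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(counts)
loadings = fa.components_   # shape (m, n): each row holds one topic's loadings
print(loadings.shape)
```

Each word's loading on a factor plays the role of the coefficient in the linear function described above; thresholding these loadings is what yields the word lists per topic.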
“…For example, research in information retrieval as early as 1963 used Factor Analysis (FA) on text documents to extract topics and automatically classify documents [5,6]. Whilst this work received a lot of attention as an unsupervised approach to document classification, it has rarely been cited as an example of topic identification.…”
“…TC is used in many application contexts, ranging from automatic document indexing based on a controlled vocabulary (Borko and Bernick 1963; Gray and Harley 1971; Field 1975), to document filtering (Amati and Crestani 1999; Iyer, Lewis et al. 2000; Kim, Hahn et al. 2000), word sense disambiguation (Gale, Church et al. 1992; Escudero, Marquez et al. 2000), population of hierarchical catalogues of Web resources (Chakrabarti, Dom et al. 1998; Attardi, Gulli et al. 1999; Oh, Myaeng et al. 2000), and in general any application requiring document organization or selective and adaptive document dispatching.…”
Automatic categorization of text documents has become an important area of research in the last two decades, with features that make it significantly more difficult than the traditional classification tasks studied in machine learning. A more recent development is the need to classify hypertext documents, most notably web pages. These have features that add further complexity to the categorization task but also offer the possibility of using information that is not available in standard text classification, such as metadata and the content of the web pages that point to and are pointed at by a web page of interest. This chapter surveys the state of the art in text categorization and hypertext categorization, focussing particularly on issues of representation that differentiate them from 'conventional' classification tasks and from each other.
“…Automatic classification of text documents has been one of the biggest challenges in natural language processing for decades [2], [15], [21], [22]. Distinguishing good and bad documents is relevant for various types of real-world situations such as finding useful Web pages or reviewing research papers.…”
Current patent systems face a serious problem of declining patent quality, as the growing number of applications makes it difficult for patent examiners to spend enough time evaluating each application. To build a better patent system, it is necessary to define a public consensus on the quality of patent applications in a quantitative way. In this article, we tackle the problem of assessing the quality of patent applications using machine learning and text mining techniques. For each patent application, our tool automatically computes a score, called patentability, which indicates how likely it is that the application will be approved by the patent office. We employ a new statistical prediction model to estimate examination results (approval or rejection) based on a large data set of 0.3 million patent applications. The model computes the patentability score from a set of feature variables, including the text contents of the specification documents. Experimental results showed that our model outperforms a conventional method that uses only the structural properties of the documents. Since users can access the estimated result through a Web-browser-based GUI, the system allows both patent examiners and applicants to quickly detect weak applications and find their specific flaws.
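The general approach of scoring each application by its predicted probability of approval can be sketched with a simple text classifier. The data, features, and model below are invented stand-ins for illustration, not the authors' actual system or feature set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled specification texts: 1 = approved, 0 = rejected (invented data).
docs = [
    "a novel battery electrode with measured capacity improvements",
    "a method of doing business by thinking about customers",
    "an optical sensor with a specific layered coating structure",
    "a generic idea for making software faster somehow",
]
labels = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, labels)

# A "patentability"-style score: predicted probability of approval for new text.
score = model.predict_proba(["an electrode coating for improved capacity"])[0, 1]
print(round(score, 2))
```

In the actual article the prediction model is trained on roughly 0.3 million applications and combines textual content with structural document properties; this sketch only shows the shape of a text-based probability score.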