2003
DOI: 10.1007/s00778-003-0098-9

Fast and accurate text classification via multiple linear discriminant projections

Abstract: Support vector machines (SVMs) have shown superb performance for text classification tasks. They are accurate, robust, and quick to apply to test instances. Their only potential drawback is their training time and memory requirement. For n training instances held in memory, the best-known SVM implementations take time proportional to n^a, where a is typically between 1.8 and 2.1. SVMs have been trained on data sets with several thousand instances, but Web directories today contain millions of instan…
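The SVM-based text classification setup the abstract describes can be sketched as follows — a minimal illustration assuming scikit-learn, with toy documents and class names that are not from the paper:

```python
# Minimal sketch of linear-SVM text classification (illustrative data,
# not the paper's corpus). Training cost grows superlinearly in the
# number of documents n, which motivates the paper's faster alternative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["the stock market fell", "the team won the match",
        "shares rallied today", "the striker scored twice"]
labels = ["finance", "sports", "finance", "sports"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)        # sparse TF-IDF feature vectors
clf = LinearSVC().fit(X, labels)   # linear SVM trained in memory

pred = clf.predict(vec.transform(["the market rallied"]))[0]
print(pred)
```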

Cited by 107 publications (32 citation statements)
References 27 publications
“…20NG contains 19,997 documents of 20 topics (categories). As in Chakrabarti et al (2003), we employ 75% of the documents for training, and the remaining 25% for testing. Therefore, there are 14,997 training documents and 5,000 test documents, which are uniformly extracted from the original 19,997 documents.…”
Section: Methods
confidence: 99%
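The 75/25 split quoted above can be checked arithmetically. Since 0.25 × 19,997 = 4,999.25, reproducing the cited 14,997 / 5,000 figures requires rounding the test fraction up — an assumption about the citing authors' rounding, not something stated in the quote:

```python
# Arithmetic check of the quoted 20NG split: 19,997 documents, 25% held
# out for testing. Rounding the test count up (an assumption) yields
# exactly the cited 14,997 training / 5,000 test documents.
import math

n_docs = 19_997
n_test = math.ceil(0.25 * n_docs)   # 5,000
n_train = n_docs - n_test           # 14,997
print(n_train, n_test)              # 14997 5000
```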
“…SVM is a popular technique in TC (e.g., Bennett & Nguyen, 2009; Xue et al, 2008; Qi & Davison, 2008; Chakrabarti et al, 2003; Yang & Lin, 1999). Previous studies often found that SVM outperforms many classifiers.…”
Section: Methods
confidence: 99%
“…Texts and documents, especially with weighted feature extraction, generate a huge number of features. Many researchers have applied random projection to text data [83,84] for text mining, text classification, and dimensionality reduction. In this section, we review some basic random projection techniques.…”
Section: Random Projection
confidence: 99%
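The basic random projection technique the statement above refers to can be sketched with a Gaussian projection matrix — a minimal NumPy illustration with made-up dimensions, not the cited papers' exact construction:

```python
# Minimal Gaussian random projection for high-dimensional text features.
# By the Johnson-Lindenstrauss lemma, pairwise distances are roughly
# preserved with high probability when k is large enough.
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_features, k = 100, 10_000, 300      # k << n_features

X = rng.random((n_docs, n_features))           # stand-in for TF-IDF vectors
R = rng.normal(0.0, 1.0, (n_features, k)) / np.sqrt(k)  # random projection
X_low = X @ R                                  # reduced (100, 300) matrix

print(X_low.shape)
```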
“…With increasing scientific papers, Internet information, and other text-format data, automatic text categorization plays an important role in information retrieval, data mining and machine learning [13]. Commonly used classification methods are back propagation neural network, decision trees, K-nearest neighbor (KNN), naive Bayes and SVM, and especially SVM achieves good performance on the effectiveness and stability of classification [14][15][16][17][18][19][20][21][22][23][24][25][26][27][28]. However, most of them are supervised learning algorithms and training data or labeled samples often demand great human efforts in practical applications.…”
Section: Introduction
confidence: 99%