With the increasing number of digital documents, the ability to classify those documents automatically, both efficiently and accurately, is becoming more critical and more difficult. One of the major problems in text classification is the high dimensionality of the feature space. We present the ambiguity measure (AM) feature-selection algorithm, which selects the most unambiguous features from the feature set. Unambiguous features are those whose presence in a document indicates, with a strong degree of confidence, that the document belongs to only one specific category. We apply AM feature selection to a naïve Bayes text classifier. Using five benchmark datasets, we show that our approach outperforms eight existing feature-selection methods with a statistical significance of at least 95% confidence. The support vector machine (SVM) text classifier has been shown to perform consistently better than the naïve Bayes text classifier. Its drawback, however, is the time complexity of training a model. We further explore the effect of using the AM feature-selection method on an SVM text classifier. Our results indicate that the training time for the SVM algorithm can be reduced by more than 50%, while still improving the accuracy of the text classifier. Using four standard benchmark datasets, we demonstrate that our approach statistically significantly (99% confidence) outperforms eight existing feature-selection methods.
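The idea of an "unambiguous" feature in the abstract above can be illustrated with a small sketch. The abstract does not state the scoring formula, so the code below assumes one plausible reading: a term's score is the largest share of its occurrences that falls in any single category, and terms at or above a threshold are kept. The function names, the toy data, and the 0.8 threshold are all hypothetical.

```python
from collections import Counter, defaultdict


def ambiguity_measure(docs):
    """Score each term by how strongly it points to a single category.

    docs: iterable of (tokens, category) pairs.
    Returns {term: (score, best_category)}, where score is the fraction of
    the term's occurrences that fall in its most frequent category.
    This formula is an assumption, not taken from the abstract.
    """
    per_term = defaultdict(Counter)  # term -> Counter(category -> occurrences)
    for tokens, category in docs:
        for tok in tokens:
            per_term[tok][category] += 1

    scores = {}
    for term, counts in per_term.items():
        total = sum(counts.values())
        best_cat, best_count = max(counts.items(), key=lambda kv: kv[1])
        scores[term] = (best_count / total, best_cat)
    return scores


def select_features(scores, threshold=0.8):
    """Keep only terms whose score reaches the (hypothetical) threshold."""
    return {term for term, (score, _) in scores.items() if score >= threshold}


# Toy training set: "team" appears in both categories, so it is ambiguous.
train = [
    (["goal", "match", "team"], "sports"),
    (["team", "market", "shares"], "finance"),
    (["goal", "striker"], "sports"),
    (["market", "shares", "bank"], "finance"),
]
scores = ambiguity_measure(train)
selected = select_features(scores, threshold=0.8)
```

In this toy data, "team" is split evenly between the two categories and is dropped, while terms seen in only one category are retained. A classifier such as naïve Bayes or an SVM would then be trained on the reduced vocabulary.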
Passages can be hidden within a text to circumvent restrictions on their transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. Passage retrieval is well studied; we posit, however, that passage detection is not. Passage retrieval is the determination of the degree of relevance of the blocks of text, namely passages, that make up a document. Rather than determining the relevance of a document in its entirety, passage retrieval determines the relevance of the individual passages. To do so, modified traditional information-retrieval techniques compare terms found in user queries with the individual passages to determine a similarity score for passages of interest. In passage detection, passages are classified into predetermined categories. More often than not, passage-detection techniques are deployed to detect hidden paragraphs in documents; that is, to hide information, text is injected into the passages of a document. Rather than matching query terms against passages to determine their relevance, the passages are classified using text-mining techniques. Documents that contain hidden passages are defined as infected. Thus, simply stated, passage retrieval is the search for passages relevant to a user query, while passage detection is the classification of passages: each passage is labeled with one or more categories from a set of predetermined categories. We present a keyword-based dynamic passage (KDP) approach and demonstrate that KDP statistically significantly (99% confidence) outperforms the other document-splitting approaches by 12% to 18% in the passage-detection and passage category-prediction tasks. Furthermore, we evaluate the effects of feature selection, passage length, ambiguous passages, and training-data category distribution on passage-detection accuracy.
With the ever-increasing number of documents on the web, in digital libraries, news sources, etc., the need for a text classifier that can classify massive amounts of data is becoming more critical and more difficult to meet. The major problem in text classification is the high dimensionality of the feature space. The Support Vector Machine (SVM) classifier has been shown to perform consistently better than other text classification algorithms. However, training an SVM model takes more time than training other algorithms. We explore the use of the Ambiguity Measure (AM) feature selection method, which uses only the most unambiguous keywords to predict the category of a document. Our analysis shows that AM reduces the training time by more than 50% compared with using no feature selection, while maintaining an accuracy equivalent to or better than that obtained with the whole feature set. We empirically show the effectiveness of our approach in outperforming seven different feature selection methods using two standard benchmark datasets.
Retrieving, via a search engine, documents that are off-topic with respect to a user's pre-defined area of interest is potentially a violation of access rights and is a concern to every private, commercial, and governmental organization. We improve content-based off-topic search detection approaches by using a sequence of user queries rather than individual queries. In this approach, we reevaluate how off-topic a query is based on the sequence of queries that preceded it. Our empirical results show that using the information from the queries in a given query window reduces the false alarm rate by a statistically significant amount.
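As a rough illustration of how a query window can suppress isolated false alarms, the sketch below averages per-query off-topic scores over the most recent queries before raising an alarm. The abstract does not say how the preceding queries are combined, so the averaging rule, the window size of 3, the 0.5 threshold, and the example scores are all assumptions made for illustration.

```python
from collections import deque


def windowed_alarms(scores, window=3, threshold=0.5):
    """Flag a query as off-topic only if the mean off-topic score over the
    last `window` queries (including the current one) exceeds `threshold`.

    scores: per-query off-topic scores in [0, 1], in arrival order.
    The averaging rule is a hypothetical stand-in for the paper's method.
    """
    recent = deque(maxlen=window)  # automatically discards the oldest score
    alarms = []
    for score in scores:
        recent.append(score)
        alarms.append(sum(recent) / len(recent) > threshold)
    return alarms


# Hypothetical scores: one isolated spike, then a sustained off-topic run.
query_scores = [0.1, 0.2, 0.9, 0.1, 0.8, 0.9, 0.7]
per_query = [s > 0.5 for s in query_scores]   # decision on each query alone
windowed = windowed_alarms(query_scores)      # decision using the window
```

With these made-up scores, the single spike at the third query triggers an alarm when queries are judged individually but not when the window is used, while the sustained run at the end is flagged in both cases.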
Knowledge of relationships among categories is of interest in domains such as text classification, content analysis, and text mining. We propose and evaluate approaches to effectively identify relationships among document categories. Our proposed method capitalizes on the misclassification results of a text classifier to identify potential relationships among categories. We demonstrate that our system detects such relationships, even those that assessors failed to identify in manual evaluation. Furthermore, we favorably compare the effectiveness of our methods with the state-of-the-art method and demonstrate a significant improvement in precision (34%) and recall (5%).
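The use of misclassification results described above can be sketched as follows. The abstract gives no formula, so this example assumes a simple rule: if documents of one category are misclassified into another at a rate above some cutoff, the pair is reported as potentially related. The function name, the 0.3 cutoff, and the labels are hypothetical.

```python
from collections import Counter


def related_categories(true_labels, predicted_labels, min_rate=0.3):
    """Return (true_cat, predicted_cat, rate) triples for category pairs whose
    misclassification rate is at least `min_rate`, highest rate first.

    rate = (# docs of true_cat predicted as predicted_cat) / (# docs of true_cat)
    The thresholding rule is an assumption, not the paper's stated method.
    """
    totals = Counter(true_labels)
    confusions = Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )
    pairs = [
        (t, p, n / totals[t])
        for (t, p), n in confusions.items()
        if n / totals[t] >= min_rate
    ]
    return sorted(pairs, key=lambda triple: -triple[2])


# Hypothetical classifier output on ten documents.
y_true = ["grain", "grain", "grain", "wheat", "wheat",
          "sports", "sports", "sports", "sports", "sports"]
y_pred = ["wheat", "grain", "wheat", "grain", "wheat",
          "sports", "sports", "sports", "sports", "grain"]
pairs = related_categories(y_true, y_pred)
```

Here "grain" and "wheat" are frequently confused in both directions and are reported as a candidate relationship, whereas the single "sports" document predicted as "grain" falls below the cutoff and is ignored.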