Abstract: Text classification is a widely studied problem, and it can be considered solved for some domains and under certain circumstances. There are scenarios, however, that have received little or no attention at all, despite their relevance and applicability. One such scenario is early text classification, where one needs to know the category of a document by using partial information only. A document is processed as a sequence of terms, and the goal is to devise a method that can make predictions as fast as possi…
“…The window size chosen was w = 3, that is, three terms were read between each run of the early text classification framework. Based on Escalante's work [3] we chose a naïve Bayes classifier for the CPI model. The performance for the partial documents can be seen in Fig 2. Clearly, we can classify documents without reading all terms.…”
Section: Experiments and Results (mentioning)
confidence: 99%
“…In [3] the authors propose an adaptation of Naïve Bayes to tackle the problem of classification with partial information. Although they achieve similar performance to state of the art models that read the entire document, they do not approach the DMC problem.…”
Section: Related Work (mentioning)
confidence: 99%
“…To date, only a few papers have approached this kind of scenarios [2,3,5]. Despite its low popularity, this topic has a major potential in practical applications.…”
Abstract. The problem of classification in supervised learning is a widely studied one. Nonetheless, there are scenarios that have received little attention despite their applicability. One such scenario is early text classification, where one needs to know the category of a document as soon as possible. The importance of this variant of the classification problem is evident in tasks like sexual predator detection, where one wants to identify an offender as early as possible. This paper presents a framework for early text classification which highlights the two main pieces involved in this problem: classification with partial information (CPI) and deciding the moment of classification (DMC). In this context, a novel approach that learns the second component (when to classify) and an adaptation of a temporal measurement for multi-class problems are introduced. Results on a classical text classification corpus, compared against a model that reads entire documents, confirm the feasibility of our approach.
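The two-component framework described in this abstract can be illustrated with a minimal sketch. It assumes the document arrives as a list of terms read in windows of w terms (w = 3 in the citing snippet above). The names `early_classify`, `cpi`, and `dmc` are hypothetical, and the fixed-threshold stopping rule in the usage below is only a stand-in: the paper learns its DMC component, and this sketch does not reproduce that.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical type aliases, introduced only for this illustration.
CPI = Callable[[Sequence[str]], Dict[str, float]]   # prefix -> class probabilities
DMC = Callable[[Dict[str, float], int], bool]       # (probabilities, terms read) -> stop?


def early_classify(terms: Sequence[str], cpi: CPI, dmc: DMC,
                   w: int = 3) -> Tuple[Optional[str], int]:
    """Read `terms` w at a time and return (predicted label, terms read).

    After each window, the CPI model scores the prefix read so far and
    the DMC rule decides whether that evidence is enough to stop.
    If the DMC rule never fires, the whole document is read.
    """
    read = 0
    label = None
    while read < len(terms):
        read = min(read + w, len(terms))          # consume the next window
        probs = cpi(terms[:read])                 # classify with partial information
        label = max(probs, key=probs.get)
        if dmc(probs, read):                      # decide the moment of classification
            break
    return label, read


# Toy stand-ins (assumptions, not the paper's models).
def toy_cpi(prefix: Sequence[str]) -> Dict[str, float]:
    p = min(0.5 + 0.25 * list(prefix).count("goal"), 1.0)
    return {"sport": p, "other": 1.0 - p}


def threshold_dmc(probs: Dict[str, float], n_read: int) -> bool:
    return max(probs.values()) >= 0.9


doc = ["the", "a", "of", "goal", "goal", "in", "x", "y", "z"]
label, n_read = early_classify(doc, toy_cpi, threshold_dmc, w=3)
```

Keeping `cpi` and `dmc` as separate callables mirrors the separation the abstract draws between the two pieces, so either one can be swapped without touching the reading loop.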
“…For instance, some works have addressed early text classification by using diverse techniques like modifications of Naive Bayes (Escalante et al, 2016), profile-based representations (Escalante et al, 2017), and Multi-Resolution Concept Representations (López-Monroy et al, 2018). Those approaches have focused on quantifying prediction performance of the classifiers when using partial information in documents, that is, by considering how well they behave when incremental percentages of documents are provided to the classifier.…”
Section: Analysis Of Sequential Data: Early Classification (mentioning)
With the rise of the Internet, there is a growing need to build intelligent systems that are capable of efficiently dealing with early risk detection (ERD) problems on social media, such as early depression detection, early rumor detection or identification of sexual predators. These systems, nowadays mostly based on machine learning techniques, must be able to deal with data streams since users provide their data over time. In addition, these systems must be able to decide when the processed data is sufficient to actually classify users. Moreover, since ERD tasks involve risky decisions by which people's lives could be affected, such systems must also be able to justify their decisions. However, most standard and state-of-the-art supervised machine learning models (such as SVM, MNB, Neural Networks, etc.) are not well suited to deal with this scenario. This is due to the fact that they either act as black boxes or do not support incremental classification/learning. In this paper we introduce SS3, a novel supervised learning model for text classification that naturally supports these aspects. SS3 was designed to be used as a general framework to deal with ERD problems. We evaluated our model on the CLEF's eRisk2017 pilot task on early depression detection. Most of the 30 contributions submitted to this competition used state-of-the-art methods. Experimental results show that our classifier was able to outperform these models and standard classifiers, despite being less computationally expensive and having the ability to explain its rationale.
“…The key aspect of the work is a Markov Decision Process (MDP), where each sentence is modeled in a TFIDF vector. More recently, (Escalante et al, 2016) proposed a straightforward solution for early detection scenarios by using the naïve Bayes classifier. The idea consists in training with full documents, but when partial information has to be classified, the maximum a posteriori probability was estimated over the available text.…”
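The idea summarized in this snippet (train on full documents, then estimate the maximum a posteriori class over whatever text is available) can be sketched as follows. This is a plain multinomial naïve Bayes with Laplace smoothing written for illustration only; the function names `train_nb` and `predict_partial`, the smoothing constant `alpha`, and the toy data are assumptions, not details taken from Escalante et al.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial naive Bayes on full documents (each a list of terms)."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for terms, label in zip(docs, labels):
        term_counts[label].update(terms)
        vocab.update(terms)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # Laplace-smoothed term probabilities for class c.
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((term_counts[c][t] + alpha) / total)
                       for t in vocab}
    return log_prior, log_like


def predict_partial(model, prefix):
    """Return the MAP class using only the terms read so far.

    Terms never seen in training are skipped, which leaves the
    ranking of the classes unchanged.
    """
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][t] for t in prefix if t in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy corpus (hypothetical data): training uses the full documents.
train_docs = [["goal", "match", "team"], ["vote", "election", "party"]]
train_labels = ["sport", "politics"]
model = train_nb(train_docs, train_labels)

# At prediction time only a prefix of the document is available.
early_guess = predict_partial(model, ["vote"])
```

Because the posterior is a sum of per-term log-likelihoods, scoring a prefix needs no change to the trained model; only the set of terms summed over shrinks.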
This paper proposes a novel document representation, called Multi-Resolution Representation (MulR), to improve the early detection of risks in social media sources. The goal is to effectively identify the potential risk using as little evidence as possible and with as much anticipation as possible. MulR allows us to generate multiple "views" of the text. These views capture different semantic meanings for words and documents at different levels of granularity, which is very useful in early scenarios to model the variable amounts of evidence. The experimental evaluation shows that MulR using low resolution is better suited for modeling short documents (very early stages), whereas large documents (medium/late stages) are better modeled with higher resolutions. We evaluate the proposed ideas in two different tasks where anticipation is critical: sexual predator detection and depression detection. The experimental evaluation for these early tasks revealed that the proposed approach outperforms previous methodologies by a considerable margin.