Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management 2014
DOI: 10.1145/2661829.2662060
On Efficient Meta-Level Features for Effective Text Classification

Abstract: This paper addresses the problem of automatically learning to classify texts by exploiting information derived from meta-level features (i.e., features derived from the original bag-of-words representation). We propose new meta-level features derived from the class distribution, the entropy and the within-class cohesion observed in the k nearest neighbors of a given test document x, as well as from the distribution of distances of x to these neighbors. The set of proposed features is capable of transforming th…
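The kNN-derived meta-level features described in the abstract can be sketched as follows. This is a minimal illustration only, assuming cosine distance over dense document vectors; the function name, feature order, and exact statistics are assumptions, not the paper's definitions.

```python
# Sketch of kNN meta-level features in the spirit of the paper: class
# distribution, entropy, and distance statistics over the k nearest
# neighbours of a test document. Details are illustrative, not the
# paper's exact formulation.
import numpy as np

def knn_meta_features(x, train_X, train_y, k=3):
    """Return a meta-level feature vector for test document x."""
    # Cosine distance from x to every training document.
    norms = np.linalg.norm(train_X, axis=1) * np.linalg.norm(x)
    dists = 1.0 - (train_X @ x) / np.where(norms == 0, 1.0, norms)
    nn = np.argsort(dists)[:k]              # indices of the k nearest neighbours
    classes = np.unique(train_y)
    # Class distribution among the neighbours.
    dist_c = np.array([(train_y[nn] == c).mean() for c in classes])
    # Entropy of that distribution (low = the neighbourhood agrees on a class).
    p = dist_c[dist_c > 0]
    entropy = -(p * np.log2(p)).sum()
    # Distance statistics: mean and spread of the neighbour distances.
    return np.concatenate([dist_c, [entropy, dists[nn].mean(), dists[nn].std()]])

# Tiny worked example: two classes in 2-D.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])
feats = knn_meta_features(np.array([0.95, 0.05]), X, y, k=3)
```

The resulting vector can then be fed to any standard classifier in place of (or alongside) the bag-of-words representation.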

Cited by 13 publications (12 citation statements)
References 17 publications
“…As described in Sections 3.1 and 3.2 below, we use bin-based features to capture the characteristics of the differences between vectors and the distribution of word embeddings. This is similar to, e.g., [11], where meta-level features are proposed, in a text classification setting using the kNN algorithm, to exploit the distribution of the nearest-neighbour similarities and the within-class cohesion.…”
Section: Meta-level Features
confidence: 86%
“…In other words, the classifier used to predict the class of documents was not used in the construction phase of the document representation. In terms of text representations, we considered three alternatives, namely traditional term-weighting alternatives (term frequency-inverse document frequency [TFIDF]); weighting based on word and character (n-gram) frequency; and recent representations based on meta-features, which capture statistical information from a document's neighborhood and have obtained state-of-the-art effectiveness in recent benchmarks [35][36][37][38][39].…”
Section: Automatic Text Classification Methods
confidence: 99%
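The TFIDF weighting mentioned in the statement above can be sketched in a few lines. This is one common variant (raw term frequency times log inverse document frequency); the cited papers may use a different formulation, and the toy corpus below is invented for illustration.

```python
# Minimal TFIDF sketch: tf = raw term count, idf = log(N / df),
# where N is the corpus size and df the number of documents
# containing the term. One common variant among several.
import math
from collections import Counter

docs = [["meta", "features", "text"],
        ["text", "classification"],
        ["meta", "level", "features"]]

N = len(docs)
# Document frequency: count each term once per document.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

weights = tfidf(docs[1])
# "classification" occurs in only one document, so it gets weight
# log(3/1); "text" occurs in two of three, so only log(3/2).
```

Rarer terms thus receive higher weights, which is exactly the discriminative signal the term-weighting alternatives in the quoted statement rely on.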
“…In contrast, it is heavily dependent on the specialists and the coverage of the rules on the text expressions. More details about each of the exploited algorithms are provided in Multimedia Appendix 4 [3,35,37,39,[41][42][43][44][45][50][51][52][53][54][55][56][57][58][59][60][61][62][63].…”
Section: Automatic Text Classification Methods
confidence: 99%
“…There is an ongoing debate in the research community over whether additional features can improve the simple bag-of-words model. Some authors find significant improvements (Canuto et al. 2014), while others assert that NLP-derived features are about as good as bag-of-words (Godbole 2006). Owing to the predictive power of bag-of-words and bag-of-n-grams and their ease of use, especially in the predominant case of sentiment analysis, little research has been devoted to the investigation of more complex, NLP-based features.…”
Section: Figure 3: Architecture and Process Flow
confidence: 99%