Proceedings of the 2nd International Conference on Computing and Big Data 2019
DOI: 10.1145/3366650.3366661

Multi-class Document Classification Using Improved Word Embeddings

Cited by 7 publications (4 citation statements)
References 14 publications
“…While the total vocabulary with Word2Vec is around 15.8K, fastText has only 4.7K sub-words. Moreover, fastText shows only a 0.5%-1% improvement, as reported by Benedict et al. [56]. Therefore, we reverted to the older Word2Vec approach for pre-training the WE model, as it is easier to transfer the embedding matrix weights between the pre-trained and actual models.…”
Section: Discussion (mentioning)
confidence: 99%
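
The weight-transfer step this excerpt describes can be illustrated with a short sketch. The example below is not the authors' exact pipeline: it trains a small gensim Word2Vec model, copies its vectors into a matrix, and uses that matrix to initialise the Embedding layer of a downstream Keras classifier. The corpus, vector dimensions, and class count are placeholder assumptions.

```python
import numpy as np
from gensim.models import Word2Vec
from tensorflow import keras

# Placeholder corpus; the cited work pre-trains on its own domain text.
sentences = [["network", "intrusion", "detected"],
             ["normal", "traffic", "observed"]]

w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# Copy the pre-trained vectors into a single weight matrix.
vocab_size = len(w2v.wv)
embedding_matrix = np.zeros((vocab_size, 100))
for i, word in enumerate(w2v.wv.index_to_key):
    embedding_matrix[i] = w2v.wv[word]

# The "actual" model reuses the matrix to initialise its Embedding layer.
model = keras.Sequential([
    keras.layers.Embedding(
        vocab_size, 100,
        embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        trainable=False),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(4, activation="softmax"),  # 4 classes, assumed
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```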
“…The main objective of a document classification task is to assign each item of a text corpus to exactly one category (multiclass) or to one or more categories (multilabel) [2]. Based on the training data, the system can classify previously unseen items into their corresponding categories.…”
Section: Related Work, A. Multi-class Document Classification (mentioning)
confidence: 99%
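
As a concrete illustration of the multiclass setting described in this excerpt, here is a minimal sketch in which each document receives exactly one label. The 20NewsGroup dataset matches the related work cited below; the TF-IDF plus logistic regression classifier is an assumption chosen for brevity, not the method of the cited paper.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 20 categories; every document belongs to exactly one of them.
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

clf = make_pipeline(TfidfVectorizer(max_features=20000),
                    LogisticRegression(max_iter=1000))
clf.fit(train.data, train.target)          # one label per document
print(clf.score(test.data, test.target))   # accuracy on unseen items
```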
“…[1] proposed a pairwise multiclass document classification approach for identifying relationships between Wikipedia articles, and SCDV-MS was presented in [6], which utilized multi-sense embeddings to improve multiclass classification on the 20NewsGroup dataset while also targeting a lower-dimensional representation than that of its predecessor SCDV. The 20NewsGroup dataset was also utilized in [2], in which an extension to the Word2Vec and FastText [12] word embedding algorithms is proposed. The word embeddings were augmented with semantic information by assigning a part-of-speech (POS) tag to each word, with the objective of evaluating the enhanced model's performance on a multiclass classification task.…”
Section: Related Work, A. Multi-class Document Classification (mentioning)
confidence: 99%
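
The POS-augmentation idea summarised in this excerpt, fusing each word with its part-of-speech tag before training embeddings, can be sketched as follows. This is a minimal illustration, not the cited extension itself: the NLTK tagger, the toy corpus, and the word_TAG token format are assumptions.

```python
import nltk
from gensim.models import Word2Vec

# One-time downloads for the tokenizer and tagger models
# (resource names vary slightly across NLTK versions).
for res in ("punkt", "punkt_tab",
            "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(res, quiet=True)

docs = ["They run the experiments daily.",
        "The morning run went smoothly."]

# Fuse each token with its POS tag so that, e.g., the verb and noun
# senses of "run" become distinct vocabulary entries.
tagged_corpus = []
for doc in docs:
    tokens = nltk.word_tokenize(doc)
    tagged_corpus.append([f"{w.lower()}_{pos}"
                          for w, pos in nltk.pos_tag(tokens)])
# e.g. ["they_PRP", "run_VBP", ...] vs ["the_DT", "morning_NN", "run_NN", ...]

model = Word2Vec(tagged_corpus, vector_size=50, min_count=1)
# Each word_TAG combination now holds its own embedding vector.
```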
“…This method is particularly effective at capturing semantic and contextual connections between words because it is trained on a vast corpus of data. Pre-trained word embeddings can improve the accuracy of learning models when the data domain is consistent with the corpus used for training (ALRashdi and O'Keefe, 2019; Asudani et al., 2023; Rabut et al., 2019). Conversely, custom-trained embeddings are trained solely on the specified datasets (Sabbeh and Fasihuddin, 2023).…”
Section: Introduction (mentioning)
confidence: 99%
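
The contrast this excerpt draws between pre-trained and custom-trained embeddings can be made concrete with a short sketch. The pre-trained model name (glove-wiki-gigaword-50 from gensim's downloader) and the toy domain corpus are illustrative assumptions, not the choices of the cited studies.

```python
import gensim.downloader as api
from gensim.models import Word2Vec

# Pre-trained: broad general-domain coverage, useful when the task
# domain is consistent with the training corpus.
pretrained = api.load("glove-wiki-gigaword-50")
print(pretrained.most_similar("classification", topn=3))

# Custom-trained: vectors learned only from the task's own dataset
# (placeholder domain corpus below).
domain_corpus = [["tweet", "reports", "flood", "damage"],
                 ["earthquake", "felt", "downtown"]]
custom = Word2Vec(domain_corpus, vector_size=50, min_count=1)
print(custom.wv.most_similar("flood", topn=2))
```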