A Complete Process of Text Classification System Using State-of-the-Art NLP Models

Dogra, Varun; Verma, Sahil; Kavita, Kavita; Chatterjee, Pushpita; Shafi, Jana; Choi, Jaeyoung; Ijaz, Muhammad Fazal

doi:10.1155/2022/1883698

Cited by 62 publications

(21 citation statements)

References 134 publications


“…Researchers have demonstrated the adaptability of Word2Vec and BERT in the field of biomedical domain to develop models such as BioWordVec [33] and BioBERT [34], as well as other domain-specific models such as SciBERT [35] trained on various scientific and biomedical corpuses, ClinicalBERT [36] trained on clinical notes for various NLP tasks, and MatSciBERT [37] trained on material science publications. Deep learning models that take such trained word representations as input have been employed by researchers to classify unstructured texts documents [38], medical notes [39], health-related social media texts [40], and biomedical text mining tasks [41]. Besides these, handwritten script recognition [42], detection of diseases [43][44][45], and healthcare solutions [46] involve the potential application of deep learning models.…”

Section: Related Work

mentioning

confidence: 99%

Biomedical Text Classification Using Augmented Word Representation Based on Distributional and Relational Contexts

Parwez, Fazil, Arif, et al. 2023

Computational Intelligence and Neuroscience


Due to the increasing use of information technologies by biomedical experts, researchers, public health agencies, and healthcare professionals, a large number of scientific literatures, clinical notes, and other structured and unstructured text resources are rapidly increasing and being stored in various data sources like PubMed. These massive text resources can be leveraged to extract valuable knowledge and insights using machine learning techniques. Recent advancement in neural network-based classification models has gained popularity which takes numeric vectors (aka word representation) of training data as the input to train classification models. Better the input vectors, more accurate would be the classification. Word representations are learned as the distribution of words in an embedding space, wherein each word has its vector and the semantically similar words based on the contexts appear nearby each other. However, such distributional word representations are incapable of encapsulating relational semantics between distant words. In the biomedical domain, relation mining is a well-studied problem which aims to extract relational words, which associates distant entities generally representing the subject and object of a sentence. Our goal is to capture the relational semantics information between distant words from a large corpus to learn enhanced word representation and employ the learned word representation for various natural language processing tasks such as text classification. In this article, we have proposed an application of biomedical relation triplets to learn word representation through incorporating relational semantic information within the distributional representation of words. In other words, the proposed approach aims to capture both distributional and relational contexts of the words to learn their numeric vectors from text corpus. We have also proposed an application of the learned word representations for text classification. 
The proposed approach is evaluated over multiple benchmark datasets, and the efficacy of the learned word representations is tested in terms of word similarity and concept categorization tasks. Our proposed approach provides better performance in comparison to the state-of-the-art GloVe model. Furthermore, we have applied the learned word representations to classify biomedical texts using four neural network-based classification models, and the classification accuracy further confirms the effectiveness of the learned word representations by our proposed approach.



“…The model only required five times training using the loss function binary cross-entropy (BCE) [27] using formula (4).…”

Section: A Training

mentioning

confidence: 99%
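The citing paper's formula (4) is not reproduced on this page. As a minimal sketch, assuming it is the standard binary cross-entropy, the loss averages -[y log p + (1 - y) log(1 - p)] over the examples, where y is the 0/1 label and p the predicted probability. The function name `binary_cross_entropy` and the clipping constant `eps` are illustrative choices, not taken from the paper.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy for 0/1 labels and predicted probabilities.

    Hypothetical helper for illustration; `eps` is an assumed clipping
    constant that keeps log() away from zero.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip the probability into [eps, 1 - eps] so both logs are finite.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Usage: an uninformative prediction of 0.5 costs log(2) per example.
loss = binary_cross_entropy([1, 0], [0.5, 0.5])
```

In a multi-label setting such as the one the citing paper describes, this loss would typically be applied to each label independently and then averaged, though the page does not confirm that detail.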

“…Six labels in a text that correspond to the basic emotions are used in the classification process [3]. Various techniques, including support vector machine (SVM), naïve Bayes, random forest, convolutional neural networks, have been used in numerous prior research [4]-[7]. The cross-lingual language model-robustly optimized bidirectional encoder representations from transformers approach (XLM-RoBERTa) model could improve the classification performance of a hate speech text in Indonesian to 89.52%, compared with the previous research using long short term memory (LSTM), which only reached 77.36% optimization [1].…”

mentioning

confidence: 99%

Mengoptimalkan Akurasi pada Klasifikasi Emosi Majemuk Berdasarkan Semantik Kalimat Menggunakan XLM-RoBERTa

Aripin, Santoso, Haryanto 2023

JNTETI


Basic emotions are divided into six: anger, sadness, happiness, disgust, surprise, and fear. Combining more than one basic emotion can create a new emotion, namely a compound emotion. Compound emotions can be applied in chatbots, language translation, text summarization, and so on. Much research on emotion classification of Indonesian-language text has been carried out using traditional models such as multinomial naïve Bayes, SVM, k-nearest neighbors, and term frequency–inverse document frequency (TF-IDF). That research has weaknesses, including suboptimal performance because the model can only classify from data it has learned, the need for text preprocessing, and the long training time required for large datasets. This study aims to overcome several weaknesses of previous research by using the cross-lingual language model-robustly optimized bidirectional encoder representations from transformers approach (XLM-RoBERTa) model to classify compound emotions based on the semantics, or meaning, of sentences and words. XLM-RoBERTa is a transformer model that can determine the meaning of a word from the attention mechanism on that word, and is a vector that represents a context or word meaning. The attention mechanism is a vector-form word representation used to determine the usage and position of a word in a sentence, and is the way the model can learn the meaning of a word. With the attention mechanism, the model can see sentence patterns from word usage and classify the sentence according to the pattern and order of its words, so that the semantics of the sentence can be determined.
Experimental results show that the proposed model is able to classify Indonesian-language text into the basic emotion classes and their combinations, as the basis for forming compound emotions, with an accuracy of 95.56%. This accuracy is superior to that of research on compound emotion classification using traditional models.


“…The average achieved accuracy was 83.21%, the average F-positive rate was 10.03%, and the average F-measure was 86%. Similarly, Dogra et al. [48] also considered the pattern of the URLs posting in Twitter social media platforms by analyzing the behavior of the URLs posting users and URLs clicking users. Using Twitter APIs combined with Bitly APIs they collected around 7 million tweets that contain shortened URLs created by Bitly and tried different sets of features including Average clicks, Posting count, Median followers, Median friends, Score function Score Category.…”

Section: Literature Review

mentioning

confidence: 99%

An Assessment of Lexical, Network, and Content-Based Features for Detecting Malicious URLs Using Machine Learning and Deep Learning Models

Aljabri, Alhaidari, Mohammad, et al. 2022

Computational Intelligence and Neuroscience


The World Wide Web services are essential in our daily lives and are available to communities through Uniform Resource Locator (URL). Attackers utilize such means of communication and create malicious URLs to conduct fraudulent activities and deceive others by creating deceptive and misleading websites and domains. Such threats open the doors for many critical attacks such as spams, spyware, phishing, and malware. Therefore, detecting malicious URL is crucially important to prevent the occurrence of many cybercriminal activities. In this study, we examined a set of machine learning (ML) and deep learning (DL) models to detect malicious websites using a dataset comprising 66,506 records of URLs. We engineered three different types of features including lexical-based, network-based and content-based features. To extract the most discriminative features in the dataset, we applied several features selection algorithms, namely, correlation analysis, Analysis of Variance (ANOVA), and chi-square. Finally, we conducted a comparative performance evaluation for several ML and DL models considering set of criteria commonly used to evaluate such models. Results depicted that Naïve Bayes (NB) was the best model for detecting malicious URLs using the applied data with an accuracy of 96%. This research has made contribution to the field by conducting significant features engineering and analysis to identify the best features for malicious URLs predictions, compare different models and achieve a high accuracy using a large new URL dataset.
