2022
DOI: 10.14569/ijacsa.2022.0131020
Semi-supervised Text Annotation for Hate Speech Detection using K-Nearest Neighbors and Term Frequency-Inverse Document Frequency

Abstract: Sentiment analysis can detect hate speech using Natural Language Processing (NLP) techniques. This process requires the text to be annotated with labels. When annotation is carried out by people, it must rely on experts in the field of hate speech to avoid subjectivity; moreover, manual annotation of extensive data takes a long time and allows errors in the annotation process. To solve this problem, we propose an automatic annotation process based on semi-supervised learning…
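The title and abstract describe annotating text semi-automatically with TF-IDF features and a K-Nearest Neighbors classifier. A minimal sketch of that idea, assuming scikit-learn as the implementation, an invented toy dataset, and an assumed confidence threshold (none of these details come from the paper):

```python
# Hedged sketch: semi-supervised annotation with TF-IDF + KNN.
# The texts, labels, and the 0.66 threshold are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

labeled = ["you are awful trash", "have a nice day",
           "hate you idiot", "great work friend"]
labels = [1, 0, 1, 0]                      # 1 = hate speech, 0 = not
unlabeled = ["what an idiot you are", "nice day to you"]

vec = TfidfVectorizer()
X = vec.fit_transform(labeled + unlabeled)  # shared TF-IDF vocabulary
X_lab, X_unl = X[:len(labeled)], X[len(labeled):]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_lab, labels)

# Pseudo-label an unlabeled text only when the neighbor vote is confident.
proba = knn.predict_proba(X_unl)
threshold = 0.66                            # assumed acceptance threshold
pseudo = [(t, int(p.argmax())) for t, p in zip(unlabeled, proba)
          if p.max() >= threshold]
```

Accepted pseudo-labels would then be appended to the labeled set and the loop repeated, which is the usual self-training pattern for expanding an annotated corpus.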

Cited by 15 publications (32 citation statements)
References 15 publications
“…As a result, it is still difficult to achieve a scoring of an essay from a system similar to that of an expert (teacher). Thus, the next research will be increased by the Latent Semantic Analysis method or NLP [43,44].…”
Section: Comparison of Essay Assessment Test Results with Writing Err… (mentioning)
confidence: 99%
“…This research focuses on hate speech detection using dataset [20] limited to the realm of politics and law in Indonesia. The dataset includes public opinions from YouTube comments on the presidential debate video [5], and opinions about the COVID-19 pandemic [21]. Several reasons justify considering these comments for further research:…”
Section: Datasets (mentioning)
confidence: 99%
“…In skip-grams [46], each neuron specializes in comprehending the context around a single target word, while CBOW predicts the target word from context. The activation function is linear (Equation (4)), and the hidden layer encodes semantic relationships between words (Equation (5)). The output layer employs softmax to convert outputs into probabilities for accurate prediction (Equation (6)).…”
Section: Word Embedding (Word2vec) (mentioning)
confidence: 99%
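The quoted statement describes the standard CBOW forward pass: a linear hidden layer that averages context embeddings, followed by a softmax over the vocabulary. A hedged numpy sketch of that pass (vocabulary size, embedding dimension, and the word indices are illustrative, not values from the cited work):

```python
# Hedged sketch of a CBOW forward pass: linear hidden layer, softmax output.
import numpy as np

V, D = 6, 4                        # vocab size and embedding dim (assumed)
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, D))     # input -> hidden embedding matrix
W_out = rng.normal(size=(D, V))    # hidden -> output weight matrix

context_ids = [1, 3, 4]            # indices of context words around the target
h = W_in[context_ids].mean(axis=0) # linear hidden layer: average of embeddings

scores = h @ W_out                 # one score per vocabulary word
probs = np.exp(scores - scores.max())
probs /= probs.sum()               # softmax: scores -> probabilities
predicted = int(probs.argmax())    # most probable target word id
```

Training would adjust W_in and W_out so that the true target word receives high probability; skip-gram simply reverses the direction, predicting each context word from the target.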
“…So, we used several strategies to find the SSL-Model. Continuing our previous research in [11] [12], we introduce an SSL model for annotating corpus using Naïve Bayes and Random Forest for the classifier model. In our SSL, we use several classifiers that work together but independently to expand the annotated corpus.…”
Section: Introduction (mentioning)
confidence: 97%
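The last statement describes an SSL scheme where several classifiers (here Naïve Bayes and Random Forest) work together but predict independently to expand the annotated corpus. One plausible reading is an agreement rule: keep a pseudo-label only when both classifiers concur. A sketch under that assumption, with scikit-learn and an invented toy corpus (the agreement rule and data are not confirmed by the source):

```python
# Hedged sketch: two independent classifiers expand the corpus by agreement.
# The seed texts, pool, and agreement rule are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

seed_texts = ["you stupid fool", "lovely weather today",
              "shut up loser", "thanks for helping"]
seed_labels = [1, 0, 1, 0]                 # 1 = hate speech, 0 = not
pool = ["what a stupid loser", "lovely day, thanks"]

vec = TfidfVectorizer()
X_seed = vec.fit_transform(seed_texts)
X_pool = vec.transform(pool)               # same vocabulary as the seed

nb = MultinomialNB().fit(X_seed, seed_labels)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_seed, seed_labels)

# Accept a pseudo-label only when both classifiers independently agree.
expanded = [(text, int(p_nb))
            for text, p_nb, p_rf in zip(pool, nb.predict(X_pool), rf.predict(X_pool))
            if p_nb == p_rf]
```

Texts on which the classifiers disagree stay unlabeled for a later round, which is what keeps the expanded corpus conservative.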