Evaluation of Statistical Text Normalisation Techniques for Twitter

Sosamphan, Phavanh; Liesaputra, Veronica; Yongchareon, Sira; Mohaghegh, Mahsa

doi:10.5220/0006083004130418

Cited by 2 publications

(10 citation statements)

References 6 publications

“…Table 5 lists the number of correct, incorrect, and non-normalized words. These same 300 words were then put through the normalization methods proposed by [12, 39]. Both these models were able to use the regular expression method and a spell-check algorithm to normalize OOV words with repeated letters resulting in impressive outcomes.…”

Section: Experimental Evaluation and Results (mentioning)

confidence: 99%

“…Both these models were able to use the regular expression method and a spell-check algorithm to normalize OOV words with repeated letters resulting in impressive outcomes. Table 5 provides a comparison of the outcomes of the RBPsWRL-Sym model and that of the normalization models proposed by [12, 39]. As seen in Fig 20, the RBPsWRL-Sym model increased the F1 score from 78% and 81% to 88%.…”

Section: Experimental Evaluation and Results (mentioning)

confidence: 99%

“…As seen in Fig 20, the RBPsWRL-Sym model increased the F1 score from 78% and 81% to 88%. Therefore, according to the F1 scores, the RBPsWRL-Sym model performed 9% better than the normalization model by [12] and 13% better than that of [39] (Figs 20 and 21). Both these methods truncate repeated letters followed by a spelling correction algorithm.…”

Section: Experimental Evaluation and Results (mentioning)

confidence: 99%

“…However, the language incorporated into Twitter and various social media networks has evolved [10], whereby most users use slang words when writing. The use of slang-style writing has increased the use of out-of-vocabulary (OOV) words, including misspelled words, emoticons, abbreviations, and words with repeated letters [11, 12].…”

Section: Introduction (mentioning)

confidence: 99%

A normalization model for repeated letters in social media hate speech text based on rules and spelling correction

Mansur, Omar, Tiun et al. 2024, PLoS ONE

As social media booms, abusive online practices such as hate speech have unfortunately increased as well. As letters are often repeated in words used to construct social media messages, these types of words should be eliminated or reduced in number to enhance the efficacy of hate speech detection. Although multiple models have attempted to normalize out-of-vocabulary (OOV) words with repeated letters, they often fail to determine whether the in-vocabulary (IV) replacement words are correct or incorrect. Therefore, this study developed an improved model for normalizing OOV words with repeated letters by replacing them with correct in-vocabulary (IV) replacement words. The improved normalization model is an unsupervised method that does not require the use of a special dictionary or annotated data. It combines rule-based patterns of words with repeated letters and the SymSpell spelling correction algorithm to remove repeated letters within the words by multiple rules regarding the position of repeated letters in a word, be it at the beginning, middle, or end of the word and the repetition pattern. Two hate speech datasets were then used to assess performance. The proposed normalization model was able to decrease the percentage of OOV words to 8%. Its F1 score was also 9% and 13% higher than the models proposed by two extant studies. Therefore, the proposed normalization model performed better than the benchmark studies in replacing OOV words with the correct IV replacement and improved the performance of the detection model. As such, suitable rule-based patterns can be combined with spelling correction to develop a text normalization model to correctly replace words with repeated letters, which would, in turn, improve hate speech detection in texts.
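The general idea named in the abstract, reducing runs of repeated letters and then checking the result against in-vocabulary words, can be illustrated with a minimal sketch. This is not the authors' model: the tiny `VOCAB` set, the `candidates` helper, and the `normalize` function are hypothetical names introduced here, and a plain vocabulary lookup stands in for the SymSpell spelling-correction step.

```python
import re
from itertools import product

# Toy in-vocabulary (IV) word list (an assumption for illustration only);
# a real system would use a full lexicon and a spelling corrector.
VOCAB = {"good", "so", "cool", "happy", "hello", "no"}

def candidates(word):
    """Yield variants of `word` with each run of a repeated letter cut to 2 or 1 letters."""
    # Split into runs of identical characters, e.g. "goooood" -> ["g", "ooooo", "d"].
    runs = [m.group(0) for m in re.finditer(r"(.)\1*", word)]
    # A single letter has one option; a longer run may become a double or a single letter.
    options = [[r] if len(r) == 1 else [r[0] * 2, r[0]] for r in runs]
    for combo in product(*options):
        yield "".join(combo)

def normalize(word, vocab=VOCAB):
    """Return the first IV candidate, or the lowercased word unchanged if none is found."""
    word = word.lower()
    if word in vocab:
        return word
    for cand in candidates(word):
        if cand in vocab:
            return cand
    return word

if __name__ == "__main__":
    for token in ["goooood", "sooo", "hellooo", "happyyy", "xyz"]:
        print(token, "->", normalize(token))
```

Trying the double-letter variant before the single-letter one keeps words such as "good" or "cool" from collapsing to "god" or "col"; a position-aware rule set like the one the abstract describes would refine this ordering further.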

“…Generally, misspelled words were detected by Natural Language Processing (NLP) systems using the multi-channel models which effectively find the lexical variance on some factors such as contextual wounding of the word, phonetic similarity, orthographic factors, and expansion of acronym using the standard dictionary. As suggested by previous researchers [20][21][22][23], they have utilized the Aspell spell corrector to detect the misspelling on Twitter as well as on SMS datasets.…”

Section: Related Work (mentioning)

confidence: 99%

Effective Preprocessing and Normalization Techniques for COVID-19 Twitter Streams with POS Tagging via Lightweight Hidden Markov Model

Narayanasamy, Qaisar et al. 2022, Journal of Sensors

The major focus of this research work is to refine the basic preprocessing steps for unstructured text content and retrieve the potential conceptual features for further enhancement processes such as semantic enrichment and named entity recognition. Although some preprocessing techniques such as text tokenization, normalization, and Part-of-Speech (POS) tagging work exceedingly well on formal text, they have not performed well when applied to informal text such as tweets and short messages. Hence, we present enhanced text normalization techniques to reduce the complexity that persists in Twitter streams and eliminate the overfitting issues such as text anomalies and irregular boundaries while fixing the grammar of the text. The hidden Markov model (HMM) has been pervasively used to extract the core lexical features from the Twitter dataset and suitably adapt the external documents to supplement the extraction techniques to complement the tweet context. Using this Markov process, the POS tags are identified as states of the Markov process, and words are the desired results of the model. As this process is very crucial for the next stage of entity extraction and classification, the effective handling of informal text is considered to be important, and we therefore propose the most effective hybrid approach to deal with the issues appropriately.
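The HMM framing in this abstract, with POS tags as hidden states and words as the emitted observations, is usually decoded with the Viterbi algorithm. The sketch below is a generic illustration under assumed toy probabilities (`START`, `TRANS`, `EMIT` and the two-tag set are invented for this example), not the lightweight model the paper proposes.

```python
import math

# Toy two-tag HMM; every number here is an illustrative assumption.
STATES = ["NOUN", "VERB"]
START = {"NOUN": 0.6, "VERB": 0.4}
TRANS = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
         "VERB": {"NOUN": 0.8, "VERB": 0.2}}
EMIT = {"NOUN": {"people": 0.5, "tweet": 0.3, "news": 0.2},
        "VERB": {"people": 0.05, "tweet": 0.6, "news": 0.35}}
SMOOTH = 1e-6  # fallback emission probability for unseen words

def viterbi(words):
    """Return the most likely tag sequence for a non-empty list of words."""
    # best[t][s] = (log-probability of the best path ending in s at position t, previous state)
    best = [{s: (math.log(START[s]) + math.log(EMIT[s].get(words[0], SMOOTH)), None)
             for s in STATES}]
    for w in words[1:]:
        row = {}
        for s in STATES:
            # Pick the predecessor state that maximises the path score into s.
            prev, score = max(((p, best[-1][p][0] + math.log(TRANS[p][s])) for p in STATES),
                              key=lambda x: x[1])
            row[s] = (score + math.log(EMIT[s].get(w, SMOOTH)), prev)
        best.append(row)
    # Backtrack from the best final state using the stored predecessors.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return path[::-1]

if __name__ == "__main__":
    print(viterbi(["people", "tweet", "news"]))
```

Working in log space avoids numerical underflow on longer tweets, and the small `SMOOTH` constant keeps out-of-vocabulary tokens from zeroing out every path.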
