Improved text language identification for the South African languages

Duvenhage, Bernardt; Ntini, Mfundo; Ramonyai, Phala

doi:10.1109/robomech.2017.8261150

Cited by 8 publications

(3 citation statements)

References 22 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…In the context of NLP, Almutir and Nadeem [4] use confusion matrices to evaluate the performance of named-entity recognition systems by analyzing the discrepancies between predicted and actual entity labels. Pienaar and Snyman use them for the identification of eleven official South African languages [5]. Abandah et al [6] use confusion matrix to correct spelling mistakes in Arabic with insufficient datasets to train the correction models.…”

Section: Confusion Matrixmentioning

confidence: 99%

Error Pattern Discovery in Spellchecking Using Multi-Class Confusion Matrix Analysis for the Croatian Language

Gledec,

Sokele,

Horvat

et al. 2024

Computers

View full text Add to dashboard Cite

This paper introduces a novel approach to the creation and application of confusion matrices for error pattern discovery in spellchecking for the Croatian language. The experimental dataset has been derived from a corpus of mistyped words and user corrections collected since 2008 using the Croatian spellchecker available at ispravi.me. The important role of confusion matrices in enhancing the precision of spellcheckers, particularly within the diverse linguistic context of the Croatian language, is investigated. Common causes of spelling errors, emphasizing the challenges posed by diacritic usage, have been identified and analyzed. This research contributes to the advancement of spellchecking technologies and provides a more comprehensive understanding of linguistic details, particularly in languages with diacritic-rich orthographies, like Croatian. The presented user-data-driven approach demonstrates the potential for custom spellchecking solutions, especially considering the ever-changing dynamics of language use in digital communication.

show abstract

Section: Confusion Matrixmentioning

confidence: 99%

Error Pattern Discovery in Spellchecking Using Multi-Class Confusion Matrix Analysis for the Croatian Language

Gledec,

Sokele,

Horvat

et al. 2024

Computers

View full text Add to dashboard Cite

show abstract

“…South African languages. This dataset was used by Duvenhage et al (2017) and consists of short texts for 11 South African languages, many of which are related; these are the Nguni languages (zul, xho, nbl, ssw), Afrikaans (afr) and English (en), the Sotho languages (nso, sot, tsn), tshiVenda (ven) and Xitsonga (tso). The texts are on average 15-20 characters long.…”

Section: Datamentioning

confidence: 99%

Unsupervised Deep Language and Dialect Identification for Short Texts

Goswami¹,

Sarkar

Chakravarthi

et al. 2020

Proceedings of the 28th International Conference on Computational Linguistics

View full text Add to dashboard Cite

Automatic Language Identification (LI) or Dialect Identification (DI) of short texts of closely related languages or dialects, is one of the primary steps in many natural language processing pipelines. Language identification is considered a solved task in many cases; however, in the case of very closely related languages, or in an unsupervised scenario (where the languages are not known in advance), performance is still poor. In this paper, we propose the Unsupervised Deep Language and Dialect Identification (UDLDI) method, which can simultaneously learn sentence embeddings and cluster assignments from short texts. The UDLDI model understands the sentence constructions of languages by applying attention to character relations which helps to optimize the clustering of languages. We have performed our experiments on three shorttext datasets for different language families, each consisting of closely related languages or dialects, with very minimal training sets. Our experimental evaluations on these datasets have shown significant improvement over state-of-the-art unsupervised methods and our model has outperformed state-of-the-art LI and DI systems in supervised settings.

show abstract

“…South African native languages do not have enough resources to be used to built such contextual Chatbots and Virtual Text Assistant. Therefore, the resources for native languages need to be created so that they can be used to build software agents that understand South African native languages (Duvenhage et al 2017).…”

Section: Introductionmentioning

confidence: 99%

Izindaba-Tindzaba: Machine learning news categorisation for Long and Short Text for isiZulu and Siswati

Madodonga

Marivate

Adendorff³

2023

JDHASA

View full text Add to dashboard Cite

Local/Native South African languages are classified as low-resource languages. As such, it is essential to build the resources for these languages so that they can benefit from advances in the field of natural language processing. In this work, the focus was to create annotated news datasets for the isiZulu and Siswati native languages based on news topic classification tasks and present the findings from these baseline classification models. Due to the shortage of data for these native South African languages, the datasets that were created were augmented and oversampled to increase data size and overcome class classification imbalance. In total, four different classification models were used namely Logistic regression, Naive bayes, XGBoost and LSTM. These models were trained on three different word embeddings namely Bag-Of-Words, TFIDF and Word2vec. The results of this study showed that XGBoost, Logistic Regression and LSTM, trained from Word2vec performed better than the other combinations.

show abstract

Improved text language identification for the South African languages

Cited by 8 publications

References 22 publications

Error Pattern Discovery in Spellchecking Using Multi-Class Confusion Matrix Analysis for the Croatian Language

Error Pattern Discovery in Spellchecking Using Multi-Class Confusion Matrix Analysis for the Croatian Language

Unsupervised Deep Language and Dialect Identification for Short Texts

Izindaba-Tindzaba: Machine learning news categorisation for Long and Short Text for isiZulu and Siswati

Contact Info

Product

Resources

About