Text-based language identification for some of the under-resourced languages of South Africa

Sefara, Tshephisho Joseph; Manamela, Madimetja Jonas; Malatji, Promise Tshepiso

doi:10.1109/icacce.2016.8073765

Cited by 7 publications

(3 citation statements)

References 13 publications


“…Combining signals form other system in addition to acoustics is a good way to further boost the performance of LangID accuracy, such as text-based features, language model features [14,15]. In this work, we have tried both lattice based methods and neural network method to combine text-based semantic features and acoustic features to improve the accuracy of language identification.…”

Section: Related Work

mentioning

confidence: 99%

Signal Combination for Language Identification

Wang, Wan, et al. 2019

Preprint


Google's multilingual speech recognition system combines low-level acoustic signals with language-specific recognizer signals to better predict the language of an utterance. This paper presents our experience with different signal combination methods to improve overall language identification accuracy. We compare the performance of a lattice-based ensemble model and a deep neural network model to combine signals from recognizers with that of a baseline that only uses low-level acoustic signals. Experimental results show that the deep neural network model outperforms the lattice-based ensemble model, and it reduced the error rate from 5.5% in the baseline to 4.3%, which is a 21.8% relative reduction.
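The relative reduction quoted in the abstract can be checked from the two error rates it reports. The snippet below is a minimal sketch of that arithmetic; the function name `relative_reduction` and the variable names are illustrative and do not come from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction from baseline to improved, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Error rates quoted in the abstract, in percent.
baseline_error = 5.5   # acoustic-only baseline
combined_error = 4.3   # deep neural network signal combination

reduction = relative_reduction(baseline_error, combined_error)
print(f"{reduction:.1%}")  # prints 21.8%
```

The absolute drop is 1.2 percentage points; dividing by the 5.5% baseline gives the 21.8% relative figure.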


“…Multiple papers have proposed hierarchical stacked classifiers (including lexicons) that would for example first classify a piece of text by language group and then by exact language [19,20,9,1]. Some work has also been done on classifying surnames between Tshivenda, Xitsonga and Sepedi [21]. Additionally, data augmentation [22] and adversarial training [23] approaches are potentially very useful to reduce the requirement for data.…”

Section: Introduction

mentioning

confidence: 99%

Improved text language identification for the South African languages

Duvenhage, Ntini, Ramonyai 2017

2017 Pattern Recognition Association of South Africa and Robotics and Mechatronics (PRASA-RobMech)


The paper presents a hierarchical naive Bayesian and lexicon based classifier for short text language identification (LID) useful for under resourced languages. The algorithm is evaluated on short pieces of text for the 11 official South African languages some of which are similar languages. The algorithm is compared to recent approaches using test sets from previous works on South African languages as well as the Discriminating between Similar Languages (DSL) shared tasks' datasets. Remaining research opportunities and pressing concerns in evaluating and comparing LID approaches are also discussed.


“…Language Identification (LID) plays a critical role in the expansive field of computational linguistics, essential for interpreting multilingual texts [5], [6]. This task assumes greater significance in today's interconnected global environment, particularly in linguistically diverse regions such as India, with its array of languages, scripts, and dialects [2], [3].…”

mentioning

confidence: 99%

BharatBhasaNet-A Unified Framework to Identify Indian Code Mix Languages

Dey, Thakur, Kandwal et al. 2024

IEEE Access


In the rapidly globalizing digital communication sphere, the imperative for advanced multilingual text recognition and identification is increasingly evident. Contrasting the previous works, which were predominantly constrained to 2-3 languages, this paper explores the rich linguistic diversity of India, addressing challenges in automated language processing for 12 languages. BharatBhasaNet, our comprehensive Language Identification (LID) framework, integrates an extensive dataset covering these 12 Indian languages in both native-script and romanized forms, derived from INDICCORP [30], Bhasha-Abhijnaanam [20], and Aksharantar [32] datasets by AI4Bharat (https://ai4bharat.iitm.ac.in). The framework accommodates two models, Roberta-native and Roberta-Romanized, based on attention mechanism and transformer architecture. With its exceptional accuracy of 99.54% in native script and 60.90% in Romanized text, BharatBhasaNet significantly advances language identification, providing broader language coverage than existing LIDs. It excels in interpreting code-mixed sentences, unveiling crucial accuracy patterns related to sentence length, word span, and complexity in multilingual contexts. The framework underwent rigorous testing using a real-time dataset from the National Informatics Center (NIC), achieving an accuracy rate of 92.67%. Overcoming challenges like limited training data and distinguishing similar languages, BharatBhasaNet marks a significant leap in Romanized text identification within diverse linguistic landscapes.

