Automatic Language Identification for Romance Languages Using Stop Words and Diacritics

Truică, Ciprian-Octavian; Velcin, Julien; Boicea, Alexandru

doi:10.1109/synasc.2015.45

Cited by 7 publications

(4 citation statements)

References 5 publications


“…The simplest Language Identification methods discriminate using elementary distinguishing traits like unique character combinations, frequent or unique words, diacritics, or common n-grams (Dunning, 1994; Souter et al, 1994; Truică et al, 2015). Increasing model complexity, some Language Identification methods model sequences of words, characters, or bytes.…”

Section: Related Work (mentioning)

confidence: 99%

A reproduction of Apple’s bi-directional LSTM models for language identification in short strings

Toftrup, Sørensen, Ciosici et al. 2021

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research W


Language Identification is the task of identifying a document's language. For applications like automatic spell checker selection, language identification must use very short strings such as text message fragments. In this work, we reproduce a language identification architecture that Apple briefly sketched in a blog post. We confirm the bi-LSTM model's performance and find that it outperforms current open-source language identifiers. We further find that its language identification mistakes are due to confusion between related languages.



“…Word level classification has been the trend for quite a time (Dutta et al, 2015; Banerjee et al, 2015) but most of the work does not concern the working of Twitter as the tweets are generally short texts in a single language, hence depriving the need of code mixing. Truica et al (2015) had proposed a statistical method, that detect and classify Twitter data and news articles based on dictionary of stop words and diacritics automatically. The limitation of the above technique is languages might have common words, at that time the detection will be not correct, but such problems are mitigated to a large extend in our technique.…”

Section: Unsuccessful Detection (mentioning)

confidence: 99%

Mining multilingual and multiscript Twitter data: unleashing the language and script barrier

Sarkar, Sinhababu, Roy et al. 2020

IJBIDM


Micro-blogging sites like Twitter have become an opinion hub where views on diverse topics are expressed. Interpreting, comprehending and analysing this emotion-rich information can unearth many valuable insights. The job is trivial if the tweets are in English. But lately, increase in native languages for communication has imposed a great challenge in social media mining. Things become more complicated when people use Roman scripts to write non-English languages. India, being a country with a diverse collection of scripts and languages, encounters the problem severely. We have developed a system that automatically identifies and classifies native tweets, irrespective of the script used. Converting all tweets to English, we get rid of the 'script vs language' problem. The new approach we formulated consists of Script Identification, Language analysis, and Clustered mining. Considering English and the top two Indian languages, we found that the proposed framework gives better precision than the prevailing approaches.
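
As a rough illustration of the stop-word and diacritic approach that the citation statement above attributes to Truica et al. (2015), the following Python sketch scores a text against small per-language sets. The word lists, the diacritic sets, the additive scoring, and the function names score_languages and identify_language are all assumptions made for this example; they are not taken from the cited paper.

```python
import re

# Illustrative toy dictionaries (assumed for this sketch, not the paper's data).
STOP_WORDS = {
    "fr": {"le", "la", "les", "et", "est", "dans", "un", "une"},
    "es": {"el", "la", "los", "y", "es", "en", "un", "una"},
    "ro": {"și", "este", "în", "un", "o", "care", "pe", "la"},
}
DIACRITICS = {
    "fr": set("àâçèéêëîïôùûœ"),
    "es": set("áéíñóúü"),
    "ro": set("ăâîșț"),
}


def score_languages(text):
    """Return a per-language score: stop-word hits plus distinct diacritics seen."""
    lowered = text.lower()
    tokens = re.findall(r"\w+", lowered)  # \w matches Unicode letters in Python 3
    chars = set(lowered)
    scores = {}
    for lang, stop_words in STOP_WORDS.items():
        stop_hits = sum(1 for token in tokens if token in stop_words)
        diacritic_hits = len(chars & DIACRITICS[lang])
        scores[lang] = stop_hits + diacritic_hits
    return scores


def identify_language(text):
    """Return the best-scoring language, or None when there is no clear winner."""
    ranked = sorted(score_languages(text).items(), key=lambda item: item[1], reverse=True)
    best_lang, best_score = ranked[0]
    if best_score == 0 or best_score == ranked[1][1]:
        return None  # no evidence at all, or a tie between languages
    return best_lang


print(identify_language("Aceasta este o carte care se află în bibliotecă și pe masă"))
print(identify_language("la"))
```

A word such as "la" occurs in all three toy lists, so the sketch returns None for it, which mirrors the shared-word limitation noted in the statement.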


“…There are different types of approaches for language identification of textual document including the character n-gram, words with dictionaries of various languages and use language stop words as backlist (Truica et al 2015). Due to nature of spell checker, which detects and corrects misspellings at phrase or sentence level it is not effective to adopt the above techniques of language identification.…”

Section: Language Selection (mentioning)

confidence: 99%

Sentence Level N-Gram Context Feature in Real-Word Spelling Error Detection and Correction: Unsupervised Corpus Based Approach

2020

JIEA


Spell checking is the process of finding misspelled words and possibly correcting them. Most of the modern commercial spell checkers use a straightforward approach to finding misspellings, which considered a word is erroneous when it is not found in the dictionary. However, this approach is not able to check the correctness of words in their context and this is called real-word spelling error. To solve this issue, in the state-of-the-art researchers use context feature at fixed size n-gram (i.e. tri-gram) and this reduces the effectiveness of model due to limited feature. In this paper, we address the problem of this issue by adopting sentence level n-gram feature for real-word spelling error detection and correction. In this technique, all possible word n-grams are used to learn proposed model about properties of target language and this enhance its effectiveness. In this investigation, the only corpus required to training proposed model is unsupervised corpus (or raw text) and this enables the model flexible to be adoptable for any natural languages. But, for demonstration purpose we adopt under-resourced languages such as Amharic, Afaan Oromo and Tigrigna. The model has been evaluated in terms of Recall, Precision, F-measure and a comparison with literature was made (i.e. fixed n-gram context feature) to assess if the technique used performs as good. The experimental result indicates proposed model with sentence level n-gram context feature achieves a better result: for real-word error detection and correction achieves an average F-measure of 90.03%, 85.95%, and 84.24% for Amharic, Afaan Oromo and Tigrigna respectively.
