2015 17th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC) 2015
DOI: 10.1109/synasc.2015.45

Automatic Language Identification for Romance Languages Using Stop Words and Diacritics

Abstract: Automatic language identification is a natural language processing problem that tries to determine the natural language of a given content. In this paper we present a statistical method for automatic language identification of written text using dictionaries containing stop words and diacritics. We propose different approaches that combine the two dictionaries to accurately determine the language of textual corpora. This method was chosen because stop words and diacritics are very specific to a language, altho…
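As a rough illustration of the idea in the abstract, the sketch below scores a text against per-language dictionaries of stop words and diacritics and picks the highest-scoring language. It is not the authors' implementation: the tiny dictionaries, the function names `score_languages` and `identify`, the weights `w_stop` and `w_diac`, and the simple weighted-sum combination are all assumptions made for this example.

```python
import re
import unicodedata

# Hypothetical, very small dictionaries for illustration only;
# they are NOT the dictionaries used in the paper.
STOP_WORDS = {
    "ro": {"și", "este", "un", "în", "care", "pentru", "nu"},
    "es": {"el", "los", "y", "es", "que", "para", "no", "una"},
    "fr": {"le", "les", "et", "est", "que", "pour", "une", "dans"},
}
DIACRITICS = {
    "ro": set("ăâîșț"),
    "es": set("áéíóúñü"),
    "fr": set("àâçéèêëîïôùûü"),
}


def score_languages(text, w_stop=1.0, w_diac=1.0):
    """Return a score per language (assumed weighted sum of two hit rates)."""
    # Normalize to composed characters so "ă" is one code point, then lowercase.
    text = unicodedata.normalize("NFC", text).lower()
    words = re.findall(r"[^\W\d_]+", text)  # runs of Unicode letters
    letters = sum(ch.isalpha() for ch in text)
    scores = {}
    for lang in STOP_WORDS:
        stop_hits = sum(1 for w in words if w in STOP_WORDS[lang])
        diac_hits = sum(1 for ch in text if ch in DIACRITICS[lang])
        stop_rate = stop_hits / len(words) if words else 0.0
        diac_rate = diac_hits / letters if letters else 0.0
        scores[lang] = w_stop * stop_rate + w_diac * diac_rate
    return scores


def identify(text):
    """Return the best-scoring language code, or None if nothing matched."""
    scores = score_languages(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Calling `identify("Aceasta este o propoziție în limba română și nu este lungă.")` selects `"ro"` with these toy dictionaries. Words and diacritics shared between languages (for example `que` or `â` here) add to several scores at once, which is why the two signals are combined rather than used alone.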

Cited by 7 publications (4 citation statements)
References 5 publications
“…The simplest Language Identification methods discriminate using elementary distinguishing traits like unique character combinations, frequent or unique words, diacritics, or common n-grams (Dunning, 1994; Souter et al, 1994; Truicȃ et al, 2015). Increasing model complexity, some Language Identification methods model sequences of words, characters, or bytes.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…Word level classification has been the trend for quite a time (Dutta et al, 2015;Banerjee et al, 2015) but most of the work does not concern the working of Twitter as the tweets are generally short texts in a single language, hence depriving the need of code mixing. Truica et al (2015) had proposed a statistical method, that detect and classify Twitter data and news articles based on dictionary of stop words and diacritics automatically. The limitation of the above technique is languages might have common words, at that time the detection will be not correct, but such problems are mitigated to a large extend in our technique.…”
Section: Unsuccessful Detection (citation type: mentioning)
confidence: 99%
“…There are different types of approaches for language identification of textual document including the character n-gram, words with dictionaries of various languages and use language stop words as backlist (Truica et al 2015). Due to nature of spell checker, which detects and corrects misspellings at phrase or sentence level it is not effective to adopt the above techniques of language identification.…”
Section: Language Selection (citation type: mentioning)
confidence: 99%