2005
DOI: 10.1007/978-3-540-30586-6_87
Disentangling from Babylonian Confusion – Unsupervised Language Identification

Cited by 12 publications (10 citation statements)
References 6 publications
“…A different partial solution to the issue of unseen languages is to design the classifier to be able to output "unknown" as a prediction for language. This helps to alleviate one of the problems commonly associated with the presence of unseen languages - classifiers without an "unknown" facility are forced to pick a language for each document, and in the case of unseen languages, the choice may be arbitrary and unpredictable (Biemann and Teresniak, 2005). When LI is used for filtering purposes, i.e.…”
Section: "Unseen" Languages and Unsupervised LI
confidence: 99%
“…Shiells and Pham (2010) incorporate what they call the "purity" and "authority" into Chinese Whispers to identify the language of one million short Tweets. They find that the algorithm does not seem to converge when using Twitter data as opposed to when using much longer documents (Biemann and Teresniak, 2005; Biemann, 2006), due to many short Tweets mixing words from more than one language. That is, there are many more edges between language clusters.…”
Section: Background and Problem Analysis
confidence: 96%
“…Amine et al. (2010) demonstrate an approach using similarity measures, but performance is greatly reduced when compared to supervised methods. Biemann and Teresniak (2005) present a promising co-occurrence word graph approach, namely Chinese Whispers (Biemann, 2006), claiming an F1 score of 99%, but their work focuses on long documents (each language present must have a minimum of 100 sentences). Shiells and Pham (2010) incorporate what they call the "purity" and "authority" into Chinese Whispers to identify the language of one million short Tweets.…”
Section: Background and Problem Analysis
confidence: 99%
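The Chinese Whispers algorithm referenced in these excerpts can be sketched in a few lines. The following is a minimal illustrative implementation (the function name, parameters, and toy graph are my own, not the authors' code): each node starts in its own cluster, and nodes repeatedly adopt the cluster with the highest total edge weight among their neighbours.

```python
import random
from collections import defaultdict

def chinese_whispers(edges, iterations=30, seed=0):
    """Cluster nodes of a weighted, undirected graph.

    edges: iterable of (u, v, weight) tuples.
    Returns a dict mapping each node to a cluster label.
    """
    rng = random.Random(seed)
    graph = defaultdict(dict)
    for u, v, w in edges:
        graph[u][v] = graph[v][u] = w
    # Every node starts in its own cluster, labelled by its own name.
    label = {node: node for node in graph}
    nodes = list(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)  # random update order, as in the original algorithm
        for node in nodes:
            # Adopt the label with the highest summed edge weight
            # among this node's neighbours.
            scores = defaultdict(float)
            for neigh, w in graph[node].items():
                scores[label[neigh]] += w
            if scores:
                label[node] = max(scores, key=scores.get)
    return label
```

On a word co-occurrence graph, densely connected words of the same language collapse into one cluster, while weak cross-language edges (the problem the Twitter excerpt describes) are outvoted by stronger within-language neighbourhoods.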
“…(Kruengkrai et al., 2005) proposed a feature based on alignment of string kernels using suffix trees, and used it in two different classifiers. Finally, (Biemann and Teresniak, 2005) presented an unsupervised system that clusters the words based on sentence co-occurrence.…”
Section: Related Work
confidence: 99%
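The sentence co-occurrence graph that this clustering operates on can be sketched as follows. This is a simplified illustration (names are my own): raw co-occurrence counts stand in for the statistical significance filtering used in the original paper.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(sentences, min_count=1):
    """Build weighted edges between words that co-occur in a sentence.

    sentences: iterable of whitespace-tokenisable strings.
    Returns a list of (word_u, word_v, count) tuples with count >= min_count.
    """
    counts = Counter()
    for sent in sentences:
        # Each distinct word pair within a sentence counts once.
        words = set(sent.lower().split())
        for u, v in combinations(sorted(words), 2):
            counts[(u, v)] += 1
    return [(u, v, w) for (u, v), w in counts.items() if w >= min_count]
```

Words of the same language co-occur in the same sentences far more often than words of different languages, so the resulting graph has dense within-language regions that a graph clustering algorithm can separate.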