2005
DOI: 10.1007/978-3-540-30586-6_87
Disentangling from Babylonian Confusion – Unsupervised Language Identification

Cited by 12 publications (10 citation statements)
References 6 publications
“…A different partial solution to the issue of unseen languages is to design the classifier to be able to output "unknown" as a prediction for language. This helps to alleviate one of the problems commonly associated with the presence of unseen languages - classifiers without an "unknown" facility are forced to pick a language for each document, and in the case of unseen languages, the choice may be arbitrary and unpredictable (Biemann and Teresniak, 2005). When LI is used for filtering purposes, i.e.…”
Section: "Unseen" Languages and Unsupervised LI
confidence: 99%
“…Shiells and Pham (2010) incorporate what they call the "purity" and "authority" into Chinese Whispers to identify the language of one million short Tweets. They find that the algorithm does not seem to converge when using Twitter data as opposed to when using much longer documents (Biemann and Teresniak, 2005; Biemann, 2006), due to many short Tweets mixing words from more than one language. That is, there are many more edges between language clusters.…”
Section: Background and Problem Analysis
confidence: 96%
“…Amine et al. (2010) demonstrate an approach using similarity measures, but performance is greatly reduced when compared to supervised methods. Biemann and Teresniak (2005) present a promising co-occurrence word graph approach, namely Chinese Whispers (Biemann, 2006), claiming an F1 score of 99%, but their work focuses on long documents (each language present must have a minimum of 100 sentences). Shiells and Pham (2010) incorporate what they call the "purity" and "authority" into Chinese Whispers to identify the language of one million short Tweets.…”
Section: Background and Problem Analysis
confidence: 99%
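The Chinese Whispers algorithm referenced in these excerpts can be sketched in a few lines. The following is a minimal illustrative implementation (the function name, parameters, and toy graph are my own, not the authors' code): each node starts in its own cluster, and nodes repeatedly adopt the cluster with the highest total edge weight among their neighbours.

```python
import random
from collections import defaultdict

def chinese_whispers(edges, iterations=30, seed=0):
    """Cluster nodes of a weighted, undirected graph.

    edges: iterable of (u, v, weight) tuples.
    Returns a dict mapping each node to a cluster label.
    """
    rng = random.Random(seed)
    graph = defaultdict(dict)
    for u, v, w in edges:
        graph[u][v] = graph[v][u] = w
    # Every node starts in its own cluster, labelled by its own name.
    label = {node: node for node in graph}
    nodes = list(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)  # random update order, as in the original algorithm
        for node in nodes:
            # Adopt the label with the highest summed edge weight
            # among this node's neighbours.
            scores = defaultdict(float)
            for neigh, w in graph[node].items():
                scores[label[neigh]] += w
            if scores:
                label[node] = max(scores, key=scores.get)
    return label
```

On a word co-occurrence graph, densely connected words of the same language collapse into one cluster, while weak cross-language edges (the problem the Twitter excerpt describes) are outvoted by stronger within-language neighbourhoods.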
“…(Kruengkrai et al., 2005) proposed a feature based on alignment of string kernels using suffix trees, and used it in two different classifiers. Finally, (Biemann and Teresniak, 2005) presented an unsupervised system that clusters the words based on sentence co-occurrence.…”
Section: Related Work
confidence: 99%
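The sentence co-occurrence graph that this clustering operates on can be sketched as follows. This is a simplified illustration (names are my own): raw co-occurrence counts stand in for the statistical significance filtering used in the original paper.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(sentences, min_count=1):
    """Build weighted edges between words that co-occur in a sentence.

    sentences: iterable of whitespace-tokenisable strings.
    Returns a list of (word_u, word_v, count) tuples with count >= min_count.
    """
    counts = Counter()
    for sent in sentences:
        # Each distinct word pair within a sentence counts once.
        words = set(sent.lower().split())
        for u, v in combinations(sorted(words), 2):
            counts[(u, v)] += 1
    return [(u, v, w) for (u, v), w in counts.items() if w >= min_count]
```

Words of the same language co-occur in the same sentences far more often than words of different languages, so the resulting graph has dense within-language regions that a graph clustering algorithm can separate.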