2012
DOI: 10.1016/j.diin.2012.05.004

Finding and identifying text in 900+ languages

Cited by 30 publications (22 citation statements)
References 1 publication
“…They introduced a new method which uses context within the document, and formulated the task as a coreference resolution problem, achieving higher performance than using existing techniques for collections with a large number of languages and small training data. Similar to ODIN, the work by Ralf Brown [7,8] has focused on expanding the number of languages considered simultaneously (developing a language identification system for over 1,100 languages). Alongside these works, the Crubadan Project, led by Kevin Scannell [60], aimed at building a large corpus for under-resourced languages using the Web as a source.…”
Section: Web-based Approaches (mentioning)
confidence: 99%
“…However, due to restrictions on the use of the Twitter API, we distributed the corpora to the participants by including only the tweet IDs. We also provided them with a script to download the content of the tweets having the IDs, which scrapes the web page of each tweet to retrieve the content.…”
Section: Annotated Corpus and Evaluation Measures (mentioning)
confidence: 99%
“…If the translation has fewer words, we use as many punctuation marks in order as we can; if the translation has more words, we repeat the last mark as many times as necessary. For instance, "menu_buscar_cambios_v26 [1]" becomes "menu_to_search_for_changes_v26 [1]". This heuristic works well for most directories since usually one punctuation mark is used consistently as a delimiter within one directory or file name.…”
Section: Translation (mentioning)
confidence: 99%
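
The delimiter heuristic quoted in the Translation statement can be illustrated with a minimal Python sketch. It is not the citing authors' code: the function name `rejoin_with_delimiters` and the fallback to a space when no delimiter is available are assumptions made here. The sketch also assumes that the name has already been split into words and delimiters, and that untranslated trailing parts such as "v26 [1]" are handled separately, since the quoted excerpt is truncated and does not say how those steps are done.

```python
def rejoin_with_delimiters(translated_words, delimiters):
    """Join translated words using the delimiters of the original name.

    translated_words: words of the translation, in order.
    delimiters: punctuation marks found between the original words, in order.
    """
    if not translated_words:
        return ""
    gaps = len(translated_words) - 1
    if gaps == 0:
        return translated_words[0]
    if not delimiters:
        delimiters = [" "]  # assumption: fall back to a space
    # Fewer translated words: use as many marks, in order, as there are gaps.
    marks = list(delimiters[:gaps])
    # More translated words: repeat the last mark as often as necessary.
    marks += [delimiters[-1]] * (gaps - len(marks))
    parts = [translated_words[0]]
    for mark, word in zip(marks, translated_words[1:]):
        parts.append(mark)
        parts.append(word)
    return "".join(parts)


# Word part of the quoted example: "menu_buscar_cambios" has two "_" marks.
example = rejoin_with_delimiters(
    ["menu", "to", "search", "for", "changes"], ["_", "_"]
)
```

Here the translation has more words than the original, so the last underscore is repeated, which agrees with the word part of the quoted example ("menu_to_search_for_changes").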
“…LI is the first step in text mining, information retrieval, speech processing and machine translation [2][3][4]. Although LI is often considered a solved problem, studies have verified that LI accuracy rapidly drops when identifying short text [5][6][7], and confusion errors often occur between languages in the same family or in similar language groups [3,4,8]. Therefore, considerable room for improvement exists in terms of improving short-text LI performance and similar language identification performance.…”
Section: Introduction (mentioning)
confidence: 99%