Finding and identifying text in 900+ languages

Brown, Ralf D.

doi:10.1016/j.diin.2012.05.004

Cited by 30 publications

(22 citation statements)

References 1 publication


“…They introduced a new method which uses context within the document, and formulated the task as a coreference resolution problem, achieving higher performance than using existing techniques for collections with a large number of languages and small training data. Similar to ODIN, the work by Ralf Brown [7,8] has focused on expanding the number of languages considered simultaneously (developing a language identification system for over 1,100 languages). Alongside these works, the Crubadan Project, led by Kevin Scannell [60], aimed at building a large corpus for under-resourced languages using the Web as a source.…”

Section: Web-based Approaches (mentioning)

confidence: 99%


TweetLID: a benchmark for tweet language identification

Zubiaga, Vicente, Gamallo, et al. 2015

Lang Resources & Evaluation


Language identification, as the task of determining the language a given text is written in, has progressed substantially in recent decades. However, three main issues remain unresolved: (i) distinction of similar languages, (ii) detection of multilingualism in a single document, and (iii) identifying the language of short texts. In this paper, we describe our work on the development of a benchmark to encourage further research in these three directions, set forth an evaluation framework suitable for the task, and make a dataset of annotated tweets publicly available for research purposes. We also describe the shared task we organized to validate and assess the evaluation framework and dataset with systems submitted by seven different participants, and analyze the performance of these systems. The evaluation of the results submitted by the participants of the shared task helped us shed some light on the shortcomings of state-of-the-art language identification systems, and gives insight into the extent to which the brevity, multilingualism, and language similarity found in texts degrade the performance of language identifiers. Our dataset with nearly 35,000 tweets and the evaluation framework provide researchers and practitioners with suitable resources to further study the aforementioned issues on language identification within a common setting that enables results to be compared with one another.



“…However, due to restrictions on the use of the Twitter API, we distributed the corpora to the participants by including only the tweet IDs. We also provided them with a script to download the content of the tweets having the IDs, which scrapes the web page of each tweet to retrieve the content.…”

Section: Annotated Corpus and Evaluation Measures (mentioning)

confidence: 99%

TweetLID: a benchmark for tweet language identification

Zubiaga, Vicente, Gamallo, et al. 2015

Lang Resources & Evaluation


“…If the translation has fewer words, we use as many punctuation marks in order as we can; if the translation has more words, we repeat the last mark as many times as necessary. For instance, "menu_buscar_cambios_v26 [1]" becomes "menu_to_search_for_changes_v26 [1]". This heuristic works well for most directories since usually one punctuation mark is used consistently as a delimiter within one directory or file name.…”

Section: Translation (mentioning)

confidence: 99%

Language translation for file paths

Rowe, Schwamm, Garfinkel 2013

Digital Investigation


Forensic examiners are frequently confronted with content in languages that they do not understand, and they could benefit from machine translation into their native language. But automated translation of file paths is a difficult problem because of the minimal context for translation and the frequent mixing of multiple languages within a path. This work developed a prototype implementation of a file-path translator that first identifies the language for each directory segment of a path, and then translates to English those that are neither already English nor artificial words. Brown's LA-Strings utility for language identification was tried, but its performance was found inadequate on short strings and it was supplemented with clues from dictionary lookup, Unicode character distributions for languages, country of origin, and language-related keywords. To provide better data for language inference, words used in each directory over a large corpus were aggregated for analysis. The resulting directory-language probabilities were combined with those for each path segment from dictionary lookup and character-type distributions to infer the segment's most likely language. Tests were done on a corpus of 50.1 million file paths looking for 35 different languages. Tests showed 90.4% accuracy on identifying languages of directories and 93.7% accuracy on identifying languages of directory/file segments of file paths, even after excluding 44.4% of the paths as obviously English or untranslatable. Two of seven proposed language clues were shown to impair directory-language identification. Experiments also compared three translation methods: the Systran translation tool, Google Translate, and word-for-word substitution using dictionaries. Google Translate usually performed the best, but all still made errors with European languages and a significant number of errors with Arabic and Chinese.
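
The punctuation heuristic quoted in the citation statement for this paper can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `reapply_delimiters` and the sample names are hypothetical, the translated words are supplied by hand rather than by a translator, and the sketch only handles delimiters between words (leading or trailing punctuation such as the " [1]" in the quoted example is not covered).

```python
import re

# Letters and digits, Unicode-aware, excluding the underscore so that
# "_" counts as a delimiter rather than as part of a word.
WORD = re.compile(r"[^\W_]+")


def reapply_delimiters(original, translated_words):
    """Join translated words with the delimiters found in the original name.

    Fewer translated words: use as many original delimiters, in order, as needed.
    More translated words: repeat the last original delimiter.
    """
    if not translated_words:
        return ""
    matches = list(WORD.finditer(original))
    # Text between consecutive words of the original name.
    delimiters = [original[a.end():b.start()] for a, b in zip(matches, matches[1:])]
    if not delimiters:
        delimiters = [" "]  # assumption: fall back to a space for one-word names
    parts = [translated_words[0]]
    for i, word in enumerate(translated_words[1:]):
        mark = delimiters[i] if i < len(delimiters) else delimiters[-1]
        parts.append(mark + word)
    return "".join(parts)


# More words than the original: the single "_" is repeated.
print(reapply_delimiters("buscar_cambios", ["to", "search", "for", "changes"]))
# Fewer words than the original: only the first delimiter "-" is used.
print(reapply_delimiters("lista-de_archivos", ["file", "list"]))
```

Repeating the last delimiter relies on the observation in the quoted statement that a directory or file name usually uses one punctuation mark consistently.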


“…LI is the first step in text mining, information retrieval, speech processing and machine translation [2][3][4]. Although LI is often considered a solved problem, studies have verified that LI accuracy rapidly drops when identifying short text [5][6][7], and confusion errors often occur between languages in the same family or in similar language groups [3,4,8]. Therefore, considerable room for improvement exists in terms of improving short-text LI performance and similar language identification performance.…”

Section: Introduction (mentioning)

confidence: 99%

On Hierarchical Text Language-Identification Algorithms

Hasimu, Silamu 2018

Algorithms


Abstract: Text on the Internet is written in different languages and scripts that can be divided into different language groups. Most of the errors in language identification occur with similar languages. To improve the performance of short-text language identification, we propose four different levels of hierarchical language identification methods and conduct comparative tests in this paper. The efficiency of the algorithms was evaluated on sentences from 97 languages, and the macro-averaged F1-score reached with four-stage language identification was 0.9799. The experimental results verified that, after script identification, language group identification and similar language group identification, the performance of the language identification algorithm improved with each stage. Notably, the language identification accuracy between similar languages improved substantially. We also investigated how foreign content in a language affects language identification.
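
The staged idea in this abstract, narrowing the candidates by script before choosing a language, can be sketched in Python as follows. This is a toy two-stage illustration, not the four-stage method of the paper: the `CANDIDATES` table, its word lists, and both function names are hypothetical, and the second stage uses simple word overlap rather than a trained classifier.

```python
import unicodedata
from collections import Counter

# Hypothetical toy data: candidate languages per script, each with a few
# frequent words. A real system would use trained models instead.
CANDIDATES = {
    "LATIN": {"en": {"the", "and", "of"}, "es": {"el", "los", "y"}},
    "CYRILLIC": {"ru": {"и", "не", "что"}, "uk": {"і", "не", "що"}},
}


def identify_script(text):
    """Stage 1: return the most common script among the letters of the text."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # The first word of a Unicode character name is usually the
            # script, e.g. "LATIN SMALL LETTER A" or "CYRILLIC SMALL LETTER CHE".
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def identify_language(text):
    """Stage 2: choose among the languages that share the detected script."""
    script = identify_script(text)
    languages = CANDIDATES.get(script)
    if not languages:
        return script, None
    words = text.lower().split()
    scores = {lang: sum(w in vocab for w in words) for lang, vocab in languages.items()}
    return script, max(scores, key=scores.get)


print(identify_language("the cat and the dog"))
print(identify_language("что это и не то"))
```

Restricting the second stage to languages of one script means similar languages are only compared against each other, which is the motivation the abstract gives for the hierarchical design.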

