Shared Tasks of the 2015 Workshop on Noisy User-generated Text: Twitter Lexical Normalization and Named Entity Recognition

Baldwin, Timothy; Marneffe, Marie-Catherine de; Han, Bo; Kim, Young-Bum; Ritter, Alan; Xu, Wei

doi:10.18653/v1/w15-4319

Cited by 162 publications

(161 citation statements)

References 27 publications

Supporting

Mentioning

156

Contrasting

Unclassified


“…In 2015, the Workshop on Noisy User-generated Text (W-NUT) [4] … [Table 1: Named Entity Recognition and Linking challenges since 2013] …”

Section: W-nut (mentioning)

confidence: 99%

Lessons learnt from the Named Entity rEcognition and Linking (NEEL) challenge series

Rizzo, Pereira, Varga et al. 2017


Abstract. The large number of tweets generated daily is providing decision makers with means to obtain insights into recent events around the globe in near real-time. The main barrier for extracting such insights is the impossibility of manual inspection of a diverse and dynamic amount of information. This problem has attracted the attention of industry and research communities, resulting in algorithms for the automatic extraction of semantics in tweets and linking them to machine-readable resources. While a tweet is shallowly comparable to any other textual content, it hides a complex and challenging structure that requires domain-specific computational approaches for mining semantics from it. The NEEL challenge series, established in 2013, has contributed to the collection of emerging trends in the field and definition of standardised benchmark corpora for entity recognition and linking in tweets, ensuring high quality labelled data that facilitates comparisons between different approaches. This article reports the findings and lessons learnt through an analysis of specific characteristics of the created corpora, limitations, lessons learnt from the different participants and pointers for furthering the field of entity recognition and linking in tweets.


“…In CMC, items like emoticons have no corresponding standard form and require a special treatment when normalizing these texts. E.g., for the shared task of normalizing Twitter data (Baldwin et al, 2015) only all-alphanumeric tokens are normalized. This excludes tokens like =), :) and :-) from the normalization.…”

Section: Related Work (mentioning)

confidence: 99%
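The token restriction described in the statement above can be illustrated with a small sketch. It assumes that Python's `str.isalnum` is an acceptable stand-in for "all-alphanumeric"; the shared task's exact rule is not given here, and the function name `is_normalization_candidate` is hypothetical.

```python
def is_normalization_candidate(token: str) -> bool:
    """Return True if the token consists only of letters and digits.

    Assumption: str.isalnum approximates the "all-alphanumeric" rule.
    It also accepts non-ASCII letters and digits, which may or may not
    match the shared task's own definition.
    """
    return token.isalnum()


# Emoticons contain punctuation, so they are excluded from normalization.
tokens = ["2mr", "=)", ":)", ":-)", "tomorrow"]
candidates = [t for t in tokens if is_normalization_candidate(t)]
```

Under this assumption, `candidates` keeps `2mr` and `tomorrow` and drops the three emoticons.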

“…This is, however, negligible for standard texts: only 7% of the morphological words appearing more than once in the Tiger corpus (Brants et al, 2004), a corpus consisting of German newspaper texts, show variance, i.e., are realized by more than one type. In non-standard texts, there is more variation: In the English Twitter texts used as the training data for the W-NUT 2015 shared task on normalization (Baldwin et al, 2015), 57% of the morphological words show variation. This can be reduced to 16% by lowercasing every type.…”

Section: Related Work (mentioning)

confidence: 99%

Detecting spelling variants in non-standard texts

Barteld 2017

Proceedings of the Student Research Workshop at the 15th Conference Of the European Chapter of the Association for Co


Spelling variation in non-standard language, e.g. computer-mediated communication and historical texts, is usually treated as a deviation from a standard spelling, e.g. 2mr as a non-standard spelling for tomorrow. Consequently, in normalization (the standard approach of dealing with spelling variation), so-called non-standard words are mapped to their corresponding standard words. However, there is not always a corresponding standard word. This can be the case for single types (like emoticons in computer-mediated communication) or a complete language, e.g. texts from historical languages that did not develop to a standard variety. The approach presented in this thesis proposal deals with spelling variation in absence of reference to a standard. The task is to detect pairs of types that are variants of the same morphological word. An approach for spelling-variant detection is presented, where pairs of potential spelling variants are generated with Levenshtein distance and subsequently filtered by supervised machine learning. The approach is evaluated on historical Low German texts. Finally, further perspectives are discussed.
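The candidate-generation step mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names, the distance threshold `max_distance=1`, and the toy word list are assumptions, and the supervised filtering step is not shown.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def candidate_pairs(types, max_distance=1):
    """Return all pairs of distinct types within max_distance edits.

    Assumption: a plain all-pairs comparison, which is quadratic in the
    number of types and only suitable for small vocabularies.
    """
    unique = sorted(set(types))
    return [(a, b) for a, b in combinations(unique, 2)
            if levenshtein(a, b) <= max_distance]


# Toy vocabulary (hypothetical); only one pair is a single edit apart.
pairs = candidate_pairs(["tomorrow", "tomorow", "tmrw", "today"])
```

In a pipeline of the kind the abstract describes, each pair in `pairs` would then be passed to a trained classifier that decides whether the two types are variants of the same morphological word.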


“…These new challenges will push the state of the art in these speech processing tasks. The orthographic regularization shared task builds on other work on orthographic regularization in widely spoken languages (see, for example (Mohit et al, 2014; Rozovskaya et al, 2015; Baldwin et al, 2015) on social media text and Dale and Kilgariff (2011) on text produced by language learners), but pushes the frontiers of work in this area in several ways: While this proposed shared task has much in common with these previous shared tasks, endangered language text normalization poses additional interesting problems. In languages like English or Arabic, there is usually a single, established orthography in which almost all users have formal schooling and extensive digital corpora in this orthography that establish "correct" practices.…”

Section: Intellectual Merit: Research Interest In (mentioning)

confidence: 99%

STREAMLInED Challenges: Aligning Research Interests with Shared Tasks

Levow, Bender, Littell et al. 2017

Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages


While there have been significant improvements in speech and language processing, it remains difficult to bring these new tools to bear on challenges in endangered language documentation. We describe an effort to bridge this gap through Shared Task Evaluation Campaigns (STECs) by designing tasks that are compelling to speech and natural language processing researchers while addressing technical challenges in language documentation and exploiting growing archives of endangered language data. Based on discussions at a recent NSF-funded workshop, we present overarching design principles for these tasks: including realistic settings, diversity of data, accessibility of data and systems, and extensibility, that aim to ensure the utility of the resulting systems. Three planned tasks embodying these principles are highlighted: spanning audio processing, orthographic regularization, and automatic production of interlinear glossed text. The planned data and evaluation methodologies are also presented, motivating each task by its potential to accelerate the work of researchers and archivists working with endangered languages. Finally, we articulate the interest of the tasks to both speech and NLP researchers and speaker communities.
