Proceedings of the 10th Linguistic Annotation Workshop Held in Conjunction with ACL 2016 (LAW-X 2016), 2016
DOI: 10.18653/v1/w16-1709

Different Flavors of GUM: Evaluating Genre and Sentence Type Effects on Multilayer Corpus Annotation Quality

Abstract: Genre and domain are well-known covariates of both manual and automatic annotation quality. Comparatively less is known about the effect of sentence types, such as imperatives, questions or fragments, and how they interact with text type effects. Using mixed effects models, we evaluate the relative influence of genre and sentence types on automatic and manual annotation quality for three related tasks in English data: POS tagging, dependency parsing and coreference resolution. For the latter task, we also deve…
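The modeling approach named in the abstract (mixed effects models of annotation quality against genre and sentence type) can be illustrated with a short sketch. The example below is not the authors' setup: it uses simulated per-sentence accuracy scores, hypothetical genre and sentence-type labels, and statsmodels' mixedlm with a random intercept per document, just to show how fixed effects for genre and sentence type are estimated alongside document-level grouping.

```python
# Minimal sketch of a mixed effects analysis like the one described in the
# abstract. All data here are simulated and the column names are hypothetical;
# this illustrates the modeling approach, not the paper's actual setup.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 400

# Hypothetical genre and sentence-type labels per sentence
genre = rng.choice(["news", "interview", "how-to", "travel"], size=n)
sent_type = rng.choice(["declarative", "imperative", "question", "fragment"], size=n)
doc_id = rng.integers(0, 20, size=n)  # 20 simulated documents

# Simulated per-sentence tagging accuracy with small genre/sentence-type offsets
accuracy = (0.95
            - 0.04 * (sent_type == "fragment")
            - 0.02 * (genre == "how-to")
            + rng.normal(0.0, 0.03, size=n))

df = pd.DataFrame({"accuracy": accuracy, "genre": genre,
                   "sent_type": sent_type, "doc_id": doc_id})

# Fixed effects for genre and sentence type, random intercept per document
model = smf.mixedlm("accuracy ~ C(genre) + C(sent_type)", df, groups=df["doc_id"])
result = model.fit()
print(result.summary())
```

In this sketch the coefficients on C(genre) and C(sent_type) give the relative influence of each factor on the simulated quality measure, while the group variance absorbs document-level variation; the paper applies mixed effects models of this kind to real POS tagging, dependency parsing and coreference quality measures.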

Cited by 3 publications (4 citation statements)
References: 19 publications
“…We see three directions for future research in this space. This type of quantitative characterization of semantic relations could be extended to other genres [2][3][4]. Alternatively, additional semantic or pragmatic relations could be annotated at both the sentence and document level [9].…”
Section: Discussion (mentioning)
confidence: 99%
“…Multi-layered corpora are corpora that are annotated for multiple, mutually independent layers of natural language information on the same text [1][2][3][4][5]. While the contents of one layer may not be directly and immediately inferred from the contents of another, there may nonetheless be some correlation between elements of one layer and another.…”
Section: Related Research 2.1 Multi-layered Corpora (mentioning)
confidence: 99%
“…One reviewer has asked how StanfordNLP compares to other available libraries, such as Spacy (https://spacy.io/). While we do not have up to date numbers for Spacy, which was not featured in the recent CoNLL shared task on Universal Dependencies parsing, the most recent numbers reported in (Zeldes and Simonson, 2016) do not suggest that it would outperform StanfordNLP.…”
Citation type: mentioning
confidence: 81%
“…Interest in this question re-emerged recently. For example, focusing on annotation difficulty, Zeldes and Simonson (2016) remark "that domain adaptation may be folding in sentence type effects", motivated by earlier findings by Silveira et al. (2014) who remark that "[t]he most striking difference between the two types of data [Web and newswire] has to do with imperatives, which occur two orders of magnitude more often in the EWT [English Web Treebank]." A very recent paper examines word order properties and their impact on parsing taking a control experiment approach (Gulordava and Merlo, 2016).…”
Section: Fortuitous Data (mentioning)
confidence: 99%