Olga Uryupina scite author profile

Anaphora Resolution with the ARRAU Corpus

The ARRAU corpus is an anaphorically annotated corpus of English providing rich linguistic information about anaphora resolution. The most distinctive feature of the corpus is the annotation of a wide range of anaphoric relations, including bridging references and discourse deixis in addition to identity (coreference). Other distinctive features include treating all NPs as markables, including nonreferring NPs, and the annotation of a variety of morphosyntactic and semantic mention and entity attributes, including the genericity status of the entities referred to by markables. The corpus, however, has not been extensively used for anaphora resolution research so far. In this paper, we discuss three datasets extracted from the ARRAU corpus to support the three subtasks of the CRAC 2018 Shared Task (identity anaphora resolution over ARRAU-style markables, bridging reference resolution, and discourse deixis); the evaluation scripts assessing system performance on those datasets; and preliminary results on these three tasks that may serve as a baseline for subsequent research on these phenomena.

Annotating a broad range of anaphoric phenomena, in a variety of genres: the ARRAU Corpus

Uryupina, Artstein, Bristot et al. 2019, Nat. Lang. Eng.

This paper presents the second release of ARRAU, a multigenre corpus of anaphoric information created over 10 years to provide data for the next generation of coreference/anaphora resolution systems combining different types of linguistic and world knowledge with advanced discourse modeling supporting rich linguistic annotations. The distinguishing features of ARRAU include the following: treating all NPs as markables, including non-referring NPs, and annotating their (non-)referentiality status; distinguishing between several categories of non-referentiality and annotating non-anaphoric mentions; thorough annotation of markable boundaries (minimal/maximal spans, discontinuous markables); annotating a variety of mention attributes, ranging from morphosyntactic parameters to semantic category; annotating the genericity status of mentions; annotating a wide range of anaphoric relations, including bridging relations and discourse deixis; and, finally, annotating anaphoric ambiguity. The current version of the dataset contains 350K tokens and is publicly available from LDC. In this paper, we discuss in detail all the distinguishing features of the corpus, so far only partially presented in a number of conference and workshop papers, and we also discuss the development between the first release of ARRAU in 2008 and this second one.

Supervised Clustering of Questions into Intents for Dialog System Applications

Haponchyk, Uva, Yu et al. 2018

Modern automated dialog systems require complex dialog managers able to deal with user intent triggered by high-level semantic questions. In this paper, we propose a model for automatically clustering questions into user intents to help the design tasks. Since questions are short texts, uncovering their semantics to group them together can be very challenging. We approach the problem by using powerful semantic classifiers from question duplicate/matching research along with a novel idea of supervised clustering methods based on structured output. We test our approach on two intent clustering corpora, showing an impressive improvement over previous methods for two languages/domains.

Opinion Mining on YouTube

Severyn, Moschitti, Uryupina et al. 2014

This paper defines a systematic approach to Opinion Mining (OM) on YouTube comments by (i) modeling classifiers for predicting the opinion polarity and the type of comment and (ii) proposing robust shallow syntactic structures for improving model adaptability. We rely on the tree kernel technology to automatically extract and learn features with better generalization power than bag-of-words. An extensive empirical evaluation on our manually annotated YouTube comments corpus shows a high classification accuracy and highlights the benefits of structural models in a cross-domain setting.

High-precision identification of discourse new and unique noun phrases

Uryupina 2003

Coreference resolution systems usually attempt to find a suitable antecedent for (almost) every noun phrase. Recent studies, however, show that many definite NPs are not anaphoric. The same claim, obviously, holds for the indefinites as well. In this study we try to learn automatically two classifications, discourse new and unique, relevant for this problem. We use a small training corpus (MUC-7), but also acquire some data from the Internet. Combining our classifiers sequentially, we achieve 88.9% precision and 84.6% recall for discourse new entities. We expect our classifiers to provide a good prefiltering for coreference resolution systems, improving both their speed and performance.

Brand new NPs introduce entities which are both discourse and hearer new ("a bus"); a subclass of them, brand new anchored NPs, contain an explicit link to some given discourse entity ("a guy I work with"). Unused NPs introduce discourse new, but hearer old, entities ("Noam Chomsky").
