Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)
DOI: 10.18653/v1/2021.acl-long.345

Evaluating Entity Disambiguation and the Role of Popularity in Retrieval-Based NLP

Abstract: Retrieval is a core component for open-domain NLP tasks. In open-domain tasks, multiple entities can share a name, making disambiguation an inherent yet under-explored problem. We propose an evaluation benchmark for assessing the entity disambiguation capabilities of these retrievers, which we call Ambiguous Entity Retrieval (AmbER) sets. We define an AmbER set as a collection of entities that share a name along with queries about those entities. By covering the set of entities for polysemous names, AmbER sets…

Cited by 15 publications (30 citation statements). References 33 publications (26 reference statements).

“…$L_{\text{type}}(Q)$ uses type labels to form positive and negative pairs over queries. Let $P_{\text{type}}(q)$ be the set of all queries in a batch that share the same type $t$ as a query $q$, and let $N_{\text{type}}(q)$ be the other queries in the batch with a different type. Then $L_{\text{type}}(Q)$ is:…”
Section: Type-enforced Contrastive Loss (mentioning; confidence: 99%)
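The equation itself is elided in the excerpt. Under the definitions quoted above, one plausible reconstruction is the standard supervised contrastive form (an assumption, not necessarily the cited paper's exact loss), where $\mathrm{sim}(\cdot,\cdot)$ is a similarity between encoded queries and $\tau$ a temperature:

$$L_{\text{type}}(Q) = \sum_{q \in Q} \frac{-1}{|P_{\text{type}}(q)|} \sum_{p \in P_{\text{type}}(q)} \log \frac{\exp(\mathrm{sim}(q, p)/\tau)}{\sum_{k \in P_{\text{type}}(q) \cup N_{\text{type}}(q)} \exp(\mathrm{sim}(q, k)/\tau)}$$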
“…Retrieving the correct George Washington in the query above (George Washington the baseball player, rather than George Washington the president) requires the retriever to recognize that the keywords "team" and "play" imply George Washington is an athlete. However, recent work has shown that state-of-the-art retrievers exhibit popularity biases and struggle to resolve ambiguous mentions of rare "tail" entities [6].…”
Section: Introduction (mentioning; confidence: 99%)
“…The framework maps a QA instance $x = (q, a, c)$, with query $q$, answer $a$, and the context passage $c$ in which $a$ appears, to $x' = (q, a', c')$, where the substitution answer $a'$ replaces $a$ as the gold answer and all occurrences of $a$ in $c$ have been replaced with $a'$, producing the new context $c'$. This substitution framework extends the partially-automated dataset creation techniques introduced by Chen et al. (2021) for Ambiguous Entity Retrieval (AmbER). Our dataset derivation follows two steps: (1) identifying QA instances with named entity answers, and (2) replacing all occurrences of the answer in the context with a substituted entity, effectively changing the answer.…”
Section: Substitution Framework (mentioning; confidence: 99%)
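A minimal sketch of the substitution step described above; the function name and example values are illustrative, not code from the cited papers:

# Sketch of the (q, a, c) -> (q, a', c') substitution described above.
# Names and example data are illustrative, not from the cited papers.
def substitute_answer(query: str, answer: str, context: str, new_answer: str):
    """Replace every occurrence of `answer` in `context` with `new_answer`,
    which becomes the new gold answer for `query`."""
    new_context = context.replace(answer, new_answer)
    return query, new_answer, new_context

q, a_new, c_new = substitute_answer(
    query="Who was the first U.S. president?",
    answer="George Washington",
    context="George Washington was the first U.S. president.",
    new_answer="John Adams",
)
# c_new == "John Adams was the first U.S. president."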
“…How does Popularity of an Answer Entity impact Memorization? Using popularity substitution, we examine whether models are biased towards predicting more popular answers (Shwartz et al., 2020; Chen et al., 2021). Limiting our focus to the Person answer category, we order all PER Wikidata entities by popularity (approximated by Wikipedia monthly page views) and stratify them into five evenly sized popularity buckets.…”
Section: How Does Popularity of an Answer Entity Impact Memorization? (mentioning; confidence: 99%)
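A short sketch of the bucketing step under stated assumptions: synthetic heavy-tailed values stand in for Wikipedia monthly page views, and the entity IDs are placeholders.

# Order entities by (synthetic) popularity and split into five
# evenly sized buckets, as described in the quoted passage.
import numpy as np

rng = np.random.default_rng(0)
entity_ids = np.array([f"Q{i}" for i in range(1000)])    # stand-in Wikidata IDs
page_views = rng.pareto(a=1.5, size=len(entity_ids))     # heavy-tailed popularity proxy

order = np.argsort(page_views)         # least to most popular
buckets = np.array_split(order, 5)     # five evenly sized popularity buckets
for i, idx in enumerate(buckets):
    print(f"bucket {i}: {len(idx)} entities, mean views {page_views[idx].mean():.2f}")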
“…WikiDisamb30 (Ferragina and Scaiella, 2012), ACE and MSNBC (Ratinov et al., 2011), WNED-CWEB and WNED-WIKI (Guo and Barbosa, 2018), CoNLL-YAGO (Hoffart et al., 2011), and the TAC KBP Entity Discovery and Linking dataset (Ji et al., 2017). The recently introduced Ambiguous Entity Retrieval (AmbER) dataset by Chen et al. (2021) is an exception, including subsets of identically named entities for the purpose of fact checking, slot filling, and question-answering tasks. AmbER is limited to Wikipedia text and was automatically generated.…”
Section: Introduction (mentioning; confidence: 99%)