Learning Dense Representations for Entity Retrieval

Gillick, Daniel; Kulkarni, Sayali; Lansing, Larry; Presta, Alessandro; Baldridge, Jason; Ie, Eugene; García-Olano, Diego

doi:10.18653/v1/k19-1049

Cited by 160 publications

(180 citation statements)

References 25 publications

Supporting

Mentioning

167

Contrasting

Order By: Relevance

“…Basing entity representations on features of their Wikipedia pages has been a common approach in EL (e.g. Sil and Florian, 2016;Francis-Landau et al, 2016;Gillick et al, 2019;Wu et al, 2019), but we will need to generalize this to include multiple Wikipedia pages with possibly redundant features in many languages.…”

Section: Mel With Wikidata and Wikipediamentioning

confidence: 99%

“…Prior work showed that a dual encoder architecture can encode entities and contextual mentions in a dense vector space to facilitate efficient entity retrieval via nearest-neighbors search (Gillick et al, 2019;Wu et al, 2019). We take the same approach.…”

Section: Modelmentioning

confidence: 99%

“…Training with hard-negatives is highly effective in monolingual entity retrieval (Gillick et al, 2019), and we apply the technique they detail to our multilingual setting. Figure B1 in the Appendix for a larger view.…”

Section: Hard-negative Miningmentioning

confidence: 99%

See 2 more Smart Citations

Entity Linking in 100 Languages

Botha¹,

Shan²,

Gillick³

2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Self Cite

120

View full text Add to dashboard Cite

We propose a new formulation for multilingual entity linking, where language-specific mentions resolve to a language-agnostic Knowledge Base. We train a dual encoder in this new setting, building on prior work with improved feature representation, negative mining, and an auxiliary entity-pairing task, to obtain a single entity retrieval model that covers 100+ languages and 20 million entities. The model outperforms state-of-the-art results from a far more limited cross-lingual linking task. Rare entities and low-resource languages pose challenges at this large-scale, so we advocate for an increased focus on zero-and few-shot evaluation. To this end, we provide Mewsli-9, a large new multilingual dataset 1 matched to our setting, and show how frequency-based analysis provided key insights for our model and training enhancements.

show abstract

Section: Mel With Wikidata and Wikipediamentioning

confidence: 99%

Section: Modelmentioning

confidence: 99%

See 1 more Smart Citation

Entity Linking in 100 Languages

Botha¹,

Shan²,

Gillick³

2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Self Cite

120

View full text Add to dashboard Cite

show abstract

“…In contrast to interaction-based models, which are applied to query-document pairs, our approach is to decouple entity encoding and document encoding. Therefore we follow recent work in representation-based Entity Linking [14] and embed textual knowledge from the clinical domain into this representation. Our goal is to generalize entity representations, so the model will be able to align to existing taxonomies without retraining.…”

Section: Entity Spacementioning

confidence: 99%

“…The model uses multi-task learning [9] to align the sequence of sentences in a long document to the clinical knowledge encoded in pre-trained entity and aspect vector spaces. We use 1 Code and evaluation data is available at https://github.com/sebastianarnold/cdv a dual encoder architecture [15], which allows us to precompute discourse vectors for all documents and later answer ad-hoc queries over that corpus with short latency [14]. Consequently, the model predicts similarity scores with sentence granularity and does not require an extra inference step after the initial document indexing.…”

Section: Introductionmentioning

confidence: 99%

Learning Contextualized Document Representations for Healthcare Answer Retrieval

Arnold

Aken

Grundmann

et al. 2020

Proceedings of the Web Conference 2020

View full text Add to dashboard Cite

We present Contextual Discourse Vectors (CDV), a distributed document representation for efficient answer retrieval from long healthcare documents. Our approach is based on structured query tuples of entities and aspects from free text and medical taxonomies. Our model leverages a dual encoder architecture with hierarchical LSTM layers and multi-task training to encode the position of clinical entities and aspects alongside the document discourse. We use our continuous representations to resolve queries with short latency using approximate nearest neighbor search on sentence level. We apply the CDV model for retrieving coherent answer passages from nine English public health resources from the Web, addressing both patients and medical professionals. Because there is no end-to-end training data available for all application scenarios, we train our model with self-supervised data from Wikipedia. We show that our generalized model significantly outperforms several state-of-theart baselines for healthcare passage ranking and is able to adapt to heterogeneous domains without additional fine-tuning.

show abstract