2019
DOI: 10.48550/arxiv.1906.08401
Preprint
Hierarchical Document Encoder for Parallel Corpus Mining

Cited by 1 publication (4 citation statements)
References 12 publications
“…To understand how its representation quality changes with input length, we report its alignment performance of encoding whole documents and encoding only the first 512 tokens respectively. The average of sentence embedding has been shown to be a strong approach to derive document representation (Guo et al, 2019). We include it as another strong baseline in our experiments, with which we can analyze the gains brought by sentence weighting.…”
Section: Results
confidence: 99%
“…Averaging is a simple yet strong approach to represent compositional semantics. Strong performance comparable to supervised neural models has been achieved by sentence embeddings derived from average word embeddings (Arora et al, 2017) and document embeddings from average sentence embeddings (Guo et al, 2019) respectively.…”
Section: Weighted Document Representation
confidence: 99%
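The citing papers describe the baseline they attribute to Guo et al. (2019): a document embedding obtained by averaging the embeddings of its sentences. A minimal sketch of that operation, assuming sentence embeddings are already available as a 2-D array (the random vectors below are illustrative, not from any real encoder):

```python
import numpy as np

def document_embedding(sentence_embeddings: np.ndarray) -> np.ndarray:
    """Average the sentence embeddings of a document (shape: [n_sentences, dim])
    into a single document vector (shape: [dim])."""
    return sentence_embeddings.mean(axis=0)

# Toy document: 3 sentences, each embedded in 4 dimensions.
rng = np.random.default_rng(0)
sent_embs = rng.normal(size=(3, 4))
doc_emb = document_embedding(sent_embs)
print(doc_emb.shape)  # (4,)
```

The resulting vector can be compared across languages with cosine similarity for document alignment, which is the setting in which the citing work uses this baseline.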