Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL) 2019
DOI: 10.18653/v1/k19-1005

Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering

Abstract: Producing diverse paraphrases of a sentence is a challenging task. Natural paraphrase corpora are scarce and limited, while existing large-scale resources are automatically generated via back-translation and rely on beam search, which tends to lack diversity. We describe PARABANK 2, a new resource that contains multiple diverse sentential paraphrases, produced from a bilingual corpus using negative constraints, inference sampling, and clustering. We show that PARABANK 2 significantly surpasses prior work in bo…
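As a rough illustration of the clustering step mentioned in the abstract, the sketch below groups sampled paraphrase candidates for a single source sentence and keeps one representative per cluster. TF-IDF features and k-means are illustrative stand-ins chosen here; the paper's actual pipeline (negative constraints and sampled MT inference) is not reproduced.

```python
# Minimal sketch of the clustering idea: given several sampled paraphrase
# candidates for one source sentence, group similar candidates and keep one
# representative per group so the retained set is diverse.
# TF-IDF + k-means are illustrative stand-ins, not the paper's method.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

candidates = [
    "He finished the report before the deadline.",
    "He completed the report ahead of the deadline.",
    "The report was done by him before it was due.",
    "He wrapped up the report in time.",
    "Before the due date, he had the report finished.",
]

vecs = TfidfVectorizer().fit_transform(candidates)
k = 3                                   # number of diverse paraphrases to keep
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vecs)

# Keep the first candidate seen in each cluster as its representative.
diverse = {}
for sent, lab in zip(candidates, labels):
    diverse.setdefault(lab, sent)
print(list(diverse.values()))
```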

Cited by 37 publications (48 citation statements)
References 41 publications

“…BART is a Transformer (Vaswani et al, 2017) neural network trained on a large unlabeled corpus with a sentence reconstruction loss. We fine-tune it for 4 epochs on sentence pairs from PARABANK 2 (Hu et al, 2019a), which is a paraphrase dataset constructed by back-translating the Czech portion of an English-Czech parallel corpus. We use a subset of 5 million sentence pairs with the highest dual conditional cross-entropy score (Junczys-Dowmunt, 2018), and use only one of the five paraphrases provided for each sentence.…”
Section: AutoQA Implementation (mentioning)
confidence: 99%
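A minimal sketch of the fine-tuning setup this citation describes, assuming the selected ParaBank 2 pairs are available as a tab-separated file. The file name parabank2_top5m.tsv, the facebook/bart-base checkpoint, and the batch size and learning rate are illustrative assumptions; only the 4-epoch figure comes from the quoted passage.

```python
# Hedged sketch: fine-tune BART on (sentence, paraphrase) pairs with a
# standard sequence-to-sequence cross-entropy loss, as the citing paper
# describes doing on ParaBank 2 pairs for 4 epochs.
# "parabank2_top5m.tsv", the checkpoint, batch size, and learning rate are
# assumptions made for this example, not details from the paper.
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BartForConditionalGeneration, BartTokenizerFast

class PairDataset(Dataset):
    def __init__(self, path, tokenizer, max_len=64):
        with open(path, encoding="utf-8") as f:
            self.pairs = [line.rstrip("\n").split("\t")[:2] for line in f]
        self.tok, self.max_len = tokenizer, max_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, i):
        src, tgt = self.pairs[i]
        enc = self.tok(src, truncation=True, max_length=self.max_len,
                       padding="max_length", return_tensors="pt")
        lab = self.tok(text_target=tgt, truncation=True, max_length=self.max_len,
                       padding="max_length", return_tensors="pt")
        labels = lab["input_ids"].squeeze(0)
        labels[labels == self.tok.pad_token_id] = -100  # ignore padding in the loss
        return {"input_ids": enc["input_ids"].squeeze(0),
                "attention_mask": enc["attention_mask"].squeeze(0),
                "labels": labels}

tok = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
loader = DataLoader(PairDataset("parabank2_top5m.tsv", tok),
                    batch_size=32, shuffle=True)
optim = torch.optim.AdamW(model.parameters(), lr=3e-5)

model.train()
for epoch in range(4):                 # the citing paper fine-tunes for 4 epochs
    for batch in loader:
        loss = model(**batch).loss     # seq2seq reconstruction cross-entropy
        loss.backward()
        optim.step()
        optim.zero_grad()
```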
“…Most large-scale paraphrasing datasets are built using bilingual text (Ganitkevitch et al, 2013) and machine translation (Mallinson et al, 2017), or obtained with noisy heuristics (Prakash et al, 2016). Based on human judgement, even some of the better paraphrasing datasets score only 68%-84% on semantic similarity (Hu et al, 2019a; Yang et al, 2019).…”
(mentioning)
confidence: 99%
“…We optimize using Adam (Kingma and Ba, 2015). We train on PARABANK2 (Hu et al, 2019c), an English paraphrase dataset. PARABANK2 was generated by training an MT system on CzEng 1.7 (a Czech-English bitext with over 50 million lines (Bojar et al, 2016)), re-translating the Czech training sentences, and pairing the English output with the original English translation.…”
Section: Paraphraser (mentioning)
confidence: 99%
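The generation recipe described in this quote can be sketched as follows, assuming a line-aligned bitext in czeng.cs/czeng.en and using a public cs-to-en model (Helsinki-NLP/opus-mt-cs-en) as a stand-in for the MT system the authors trained on CzEng; the sampling and constraints used to build PARABANK2 itself are omitted.

```python
# Sketch of the ParaBank 2-style pairing described above: re-translate the
# Czech side of a Czech-English bitext into English and pair each machine
# translation with the original human English translation.
# File names and the Helsinki-NLP/opus-mt-cs-en checkpoint are assumptions
# for illustration, not the resources used in the paper.
from transformers import MarianMTModel, MarianTokenizer

tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-cs-en")
mt = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-cs-en")

with open("czeng.cs", encoding="utf-8") as cs, \
     open("czeng.en", encoding="utf-8") as en, \
     open("parabank_pairs.tsv", "w", encoding="utf-8") as out:
    for cz_line, en_line in zip(cs, en):
        batch = tok(cz_line.strip(), return_tensors="pt", truncation=True)
        ids = mt.generate(**batch, num_beams=4, max_new_tokens=128)
        paraphrase = tok.batch_decode(ids, skip_special_tokens=True)[0]
        # Each output line is an (original English, re-translated English) pair.
        out.write(f"{en_line.strip()}\t{paraphrase}\n")
```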
“…Table 8 shows the Spearman ρ coefficient with STSbenchmark judgments for cosine and approximate LSH Hamming distances of embeddings for BERT, SBERT (and the larger variant SRoBERTa), and pBERT (Hu et al, 2019b), a BERT model fine-tuned to predict paraphrastic similarity, albeit not via angular similarity of embeddings. Table 9 provides details regarding the distributions of sentences into LSH bins of differing levels of granularity using SRoBERTa-L embeddings.…”
Section: E Cosine/LSH Hamming Correlations with STS and Bin Statistics (mentioning)
confidence: 99%
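A small sketch of the comparison this appendix quote reports: cosine similarity of embedding pairs versus an approximate Hamming distance over random-hyperplane LSH bits, correlated with Spearman's ρ. Random vectors stand in for SBERT/pBERT embeddings and STS gold scores here, so the output only demonstrates the mechanics, not the paper's numbers.

```python
# Illustration of cosine vs. approximate LSH Hamming distance.
# Random-hyperplane LSH: each bit is the sign of a random projection, so the
# fraction of differing bits approximates the angular distance between vectors
# and should correlate (negatively) with cosine similarity.
# Embeddings are random stand-ins, not real SBERT/pBERT encodings.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
dim, n_bits, n_pairs = 768, 256, 1000

a = rng.normal(size=(n_pairs, dim))
b = rng.normal(size=(n_pairs, dim))

planes = rng.normal(size=(dim, n_bits))   # random hyperplanes
bits_a = (a @ planes) > 0
bits_b = (b @ planes) > 0

cosine = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
hamming = np.mean(bits_a != bits_b, axis=1)   # fraction of differing LSH bits

rho, _ = spearmanr(cosine, -hamming)          # negate: smaller distance = more similar
print(f"Spearman rho between cosine and (negated) LSH Hamming: {rho:.3f}")
```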