Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.484

XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation

Abstract: In this paper, we introduce XGLUE, a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models using multilingual and bilingual corpora and evaluate their performance across a diverse set of cross-lingual tasks. Compared to GLUE (Wang et al., 2019), which is labeled in English for natural language understanding tasks only, XGLUE has two main advantages: (1) it provides 11 diversified tasks that cover both natural language understanding and generation scenarios; (2) for each…

Cited by 177 publications (149 citation statements)
References 13 publications (11 reference statements)

“…We show that our new contrastive learning alignment objectives outperform previous work (Cao et al., 2020) when applied to bitext from previous works or the OPUS-100 bitext. However, our experiments also produce a negative result.…”
Section: Introduction
confidence: 65%
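
As a rough illustration of what such a contrastive alignment objective over bitext can look like (a hedged sketch, not the cited authors' implementation; the function name, temperature value, and random embeddings standing in for encoder outputs are assumptions made here), aligned source/target sentence pairs are treated as positives and all other pairs in the batch as in-batch negatives:

```python
# Minimal sketch of a symmetric InfoNCE-style alignment loss over parallel sentences.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(src_emb: torch.Tensor,
                               tgt_emb: torch.Tensor,
                               temperature: float = 0.05) -> torch.Tensor:
    """src_emb, tgt_emb: (batch, dim) sentence embeddings of aligned bitext pairs."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.t() / temperature                  # (batch, batch) similarities
    labels = torch.arange(src.size(0), device=src.device) # matched pair i <-> i
    # Average the source-to-target and target-to-source directions.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

# Toy usage with random vectors in place of real encoder outputs.
loss = contrastive_alignment_loss(torch.randn(8, 768), torch.randn(8, 768))
```
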
“…enforcing similar words from different languages to have similar representations, improvements can be attained through the use of explicit cross-lingually linked data during pretraining, such as bitexts (Conneau and Lample, 2019; Huang et al., 2019; Ji et al., 2019) and dictionaries. As with cross-lingual embeddings (Ruder et al., 2019), these data can be used to support explicit alignment objectives with either linear mappings (Wang et al., 2019, 2020) or fine-tuning (Cao et al., 2020).…”
Section: Introduction
confidence: 99%
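
The "linear mapping" flavour of alignment mentioned in that passage can be sketched with the classic orthogonal Procrustes solution. The following is illustrative only; the dictionary pairs are random placeholders and the helper name is chosen here, not taken from the cited papers:

```python
# Minimal sketch: learn an orthogonal matrix W mapping source-language word
# vectors onto their target-language translations (orthogonal Procrustes via SVD).
import numpy as np

def procrustes_mapping(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """X, Y: (n_pairs, dim) embeddings of dictionary-aligned word pairs.
    Returns an orthogonal W such that X @ W approximates Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 300))   # placeholder source-language word vectors
Y = rng.normal(size=(100, 300))   # placeholder target-language translations
W = procrustes_mapping(X, Y)
aligned = X @ W                   # source vectors mapped into the target space
```
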
“…Namely, the English dataset is automatically translated into the desired language(s) using machine translation (MT); an augmented dataset composed of the original English text and all the translated copies is created; the mBERT model is fine-tuned on a subset of the dataset; and the resultant model is then used to solve the relevant downstream task in the desired language. Previous works have suggested that translating the original dataset into as large a number of languages as possible is beneficial (Liang et al., 2020). In this work, we show a more nuanced picture, where often selecting a subset of related languages is preferable.…”
Section: Introduction
confidence: 82%
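
A minimal sketch of the translate-train recipe described in that passage might look as follows; the Marian MT checkpoints, toy training pairs, and helper function are illustrative assumptions, not the cited authors' setup:

```python
# Sketch: machine-translate the English training set, pool original and translated
# copies, then fine-tune mBERT on the augmented data as usual.
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name="Helsinki-NLP/opus-mt-en-de"):
    """Translate a list of English sentences with a Marian MT checkpoint."""
    tok = MarianTokenizer.from_pretrained(model_name)
    mt = MarianMTModel.from_pretrained(model_name)
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    out = mt.generate(**batch)
    return tok.batch_decode(out, skip_special_tokens=True)

english_train = [("the plot is gripping", 1), ("a dull, lifeless film", 0)]  # toy data
augmented = list(english_train)
for mt_name in ["Helsinki-NLP/opus-mt-en-de", "Helsinki-NLP/opus-mt-en-fr"]:
    translations = translate([text for text, _ in english_train], model_name=mt_name)
    augmented += list(zip(translations, [label for _, label in english_train]))

# `augmented` would then be used to fine-tune "bert-base-multilingual-cased"
# with a standard sequence-classification head (fine-tuning loop omitted here).
```
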
“…This allows us to draw more holistic conclusions on the efficacy, and pitfalls, of transfer learning in the argument mining domain. It is interesting to compare these conclusions with other wide-scope multilingual NLU research, such as XTREME (Hu et al., 2020) and XGLUE (Liang et al., 2020).…”
Section: Related Work
confidence: 90%
“…XLM-RoBERTa is a variant of BERT with a different training objective, and is trained in an unsupervised manner on a multilingual corpus. These models have achieved state-of-the-art results in NLU and NLG tasks across multiple languages on popular benchmarks such as XGLUE [13] and XTREME [10].…”
Section: Methods
confidence: 99%
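
For context, a typical fine-tuning setup with XLM-RoBERTa on such a benchmark task can be sketched as below; the label count and example sentence are placeholders, not details from the cited work:

```python
# Sketch: load XLM-RoBERTa with a sequence-classification head for a
# cross-lingual task (e.g. a 3-way XNLI-style label set).
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3)

# After fine-tuning on English labelled data, the same checkpoint is typically
# evaluated zero-shot on the other languages covered by the benchmark.
inputs = tokenizer("Ceci est un exemple.", return_tensors="pt")
logits = model(**inputs).logits
```
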