Unsupervised Cross-lingual Representation Learning at Scale
Preprint, 2019
DOI: 10.48550/arxiv.1911.02116
Cited by 266 publications (419 citation statements). References 0 publications.
“…As examples of neural CLIR models, we evaluated vanilla reranking models [26] fine-tuned with MS-MARCO-v1 [2] for at most one epoch on top of various multilingual pretrained models, including multilingual BERT (mBERT) [13], XLM-RoBERTa-large (XLM-R) [8], and InfoXLM-large [6]. Model checkpoints were selected by nDCG@100 on the HC4 dev sets.…”
Section: Baseline Runs (mentioning)
confidence: 99%
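The excerpt above describes a vanilla cross-encoder reranking setup built on multilingual encoders such as XLM-R. The sketch below is a rough illustration only, not the cited authors' code: the model name, query, and documents are placeholder assumptions, and in the cited setup the classification head would first be fine-tuned on MS-MARCO-v1 before scoring.

```python
# Minimal sketch (assumptions noted above): scoring query-document pairs
# with an XLM-R cross-encoder as a reranker.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "xlm-roberta-large"  # placeholder; any multilingual encoder could be swapped in
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.eval()

query = "masked language model pretraining for cross-lingual retrieval"  # illustrative query
docs = [
    "XLM-R is pretrained on CommonCrawl data in about 100 languages.",
    "The weather in Lisbon is mild in spring.",
]

# Encode each (query, document) pair jointly; the single logit serves as a relevance score.
with torch.no_grad():
    inputs = tokenizer([query] * len(docs), docs, padding=True,
                       truncation=True, return_tensors="pt")
    scores = model(**inputs).logits.squeeze(-1)

# Rank documents by descending score (only meaningful once the head is fine-tuned).
for doc, score in sorted(zip(docs, scores.tolist()), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```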
“…• Multilingual BERT (mBERT) is a BERT [41] model pretrained with a masked language modeling objective on Wikipedia data covering over 100 languages. • XLM-RoBERTa (XLM-R) [42] is a transformer-based masked language model pretrained on Common Crawl data covering about 100 languages. It was proposed by Facebook and is one of the best-performing transformer models for multilingual tasks.…”
Section: Baseline Approach: Fine-tuning Transformer Models (mentioning)
confidence: 99%
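The baseline described above amounts to standard fine-tuning of a multilingual masked language model on a downstream task. A minimal sketch, assuming a toy two-example batch, hypothetical sentiment labels, and the smaller xlm-roberta-base checkpoint (all assumptions, not the cited setup), using the Hugging Face Transformers API:

```python
# Minimal fine-tuning sketch for XLM-R sequence classification (illustrative only).
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "xlm-roberta-base"  # smaller variant chosen purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

texts = ["Das Produkt ist hervorragend.", "The service was terrible."]
labels = torch.tensor([1, 0])  # hypothetical sentiment labels

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few steps only; real fine-tuning iterates over a full dataset
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```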
“…A plethora of architectures implementing the attention mechanism have been proposed since its introduction. Models such as BERT [7], RoBERTa [8], XLM [9] or XLM-RoBERTa [10] are being used in a large number of NLP tasks with great success.…”
Section: The Transformer Architecture (mentioning)
confidence: 99%