2018
DOI: 10.48550/arxiv.1804.07461
Preprint

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Cited by 561 publications (745 citation statements)
References 28 publications
“…The best-known representatives of NLP benchmarks are GLUE [24] and SuperGLUE [23]. The latter is the successor of the former, proposed with more challenging tasks to keep up with the pace of progress in the NLP area.…”
Section: Related Work
Mentioning confidence: 99%
“…In the Artificial Intelligence (AI) research field, similar tasks are often grouped into a special benchmark containing a set of formalized Machine Learning (ML) problems with defined input data and performance metrics, for example the ImageNet [6] benchmark for image classification or the General Language Understanding Evaluation (GLUE) benchmark [24]. Comparing human and ML-model performance allows measuring the progress in a particular field.…”
Section: Introduction
Mentioning confidence: 99%
“…PTMs such as GPT (Generative Pre-trained Transformer) and BERT [26] (Bidirectional Encoder Representations from Transformers) have recently achieved great success in many complex natural language processing (NLP) tasks and become a milestone in the wider machine learning community. Thanks to the immensity of the training data (for BERT, the pre-training corpus contains 3,300 million words [26]) and the huge number of model parameters (the base version of BERT contains 110 million parameters, while the large version contains 340 million), some of these PTMs have surpassed human performance on multiple language understanding benchmarks [27] [28] [29], such as GLUE [30]. PTMs are now generally used as backbones for downstream tasks, because the rich knowledge stored implicitly in the huge number of model parameters can be leveraged by fine-tuning them for specific tasks.…”
Section: Deep Learning in Automatic Hateful Message Detection
Mentioning confidence: 99%
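The fine-tuning workflow described in the statement above (a pretrained encoder reused as a backbone, with a small task-specific head trained on a GLUE task) is commonly implemented along the following lines. This is a minimal sketch assuming the Hugging Face transformers and datasets libraries, which the cited work does not mention; the checkpoint name, the GLUE task (SST-2), and the hyperparameters are illustrative choices, not values from the paper.

```python
# Minimal sketch (illustrative, not from the cited papers): fine-tuning a
# pretrained BERT backbone on one GLUE task (SST-2) with Hugging Face
# `transformers` and `datasets`. Hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

dataset = load_dataset("glue", "sst2")  # one of the GLUE tasks
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # SST-2 is single-sentence classification; other GLUE tasks pass sentence pairs.
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)

# The pretrained encoder is reused as-is; only a small classification head is
# added on top, and all parameters are then fine-tuned on the task.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="sst2-bert",
                         per_device_train_batch_size=32,
                         num_train_epochs=3,
                         learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"])
trainer.train()
```

This pattern is what the quoted passage refers to as using PTMs as backbones: the knowledge acquired during pre-training is transferred by continuing to train the same weights on the downstream task rather than training a task model from scratch.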
“…The baselines provided by Kiela et al. [37] include both unimodal and multimodal PTMs. The unimodal PTMs are BERT [14] (Text BERT), standard ResNet-152 [30] convolutional features from res-5c with average pooling (Image-Grid), and features from the fc6 layer that are fine-tuned using weights of the fc7 layer (Image-Region). The multimodal baseline methods include supervised multimodal bitransformers [45] using either Image-Grid or Image-Region features (MMBT-Grid and MMBT-Region), and versions of ViLBERT [31] and Visual BERT [46] that were only unimodally pretrained and not pretrained on multimodal data (ViLBERT and Visual BERT). The multimodal baselines are ViLBERT trained on Conceptual Captions [47] (ViLBERT CC) and Visual BERT trained on the COCO dataset [48] (Visual BERT COCO).…”
Section: Visual-Language PTM
Mentioning confidence: 99%
“…The Transformer architecture [44] allowed the concept of attention (and specifically self-attention) to be used very efficiently, generating new and long sequences effectively and more coherently. BERT [19] applied a bidirectional Transformer to language modeling and presented state-of-the-art results on a variety of NLP tasks, such as the GLUE (General Language Understanding Evaluation) [45] task set, SQuAD (Stanford Question Answering Dataset) [36] v1.1 and v2.0, and SWAG (Situations With Adversarial Generations) [48]. In regards to generating novel text, even hard problems like literature (e.g.…”
Section: Introduction
Mentioning confidence: 99%