2020
DOI: 10.15439/2020f20
Overview of the Transformer-based Models for NLP Tasks

Abstract: In 2017, Vaswani et al. proposed a new neural network architecture named the Transformer. That architecture quickly revolutionized the natural language processing world. Models like GPT and BERT, which rely on this Transformer architecture, have fully outperformed the previous state-of-the-art networks. It surpassed the earlier approaches by such a wide margin that all the recent cutting-edge models seem to rely on these Transformer-based architectures. In this paper, we provide an overview and explanations of th…

Cited by 174 publications (84 citation statements)
References 24 publications
“…During masked language modelling, input tokens are randomly masked and subsequently predicted in order to obtain a "deep bidirectional representation" [15]. This allows BERT to counter the "unidirectional constraint" [19] of other language models such as GPT [46] by not allowing the model to "see itself" and thus "trivially predict the next token" when learning both right to left and left to right [19]. The next stage of pretraining takes the form of binarised next sentence prediction where sentence A precedes sentence B 50% of the time, allowing the model to learn the "relationship between two sentences" [19].…”
Section: B. BERT (mentioning)
confidence: 99%
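The masking scheme described in the statement above can be illustrated with a short, self-contained sketch. The 15% selection rate and the 80/10/10 corruption split follow the BERT paper; the toy vocabulary and the helper function below are purely illustrative assumptions, not part of the cited work.

```python
# Minimal sketch of BERT-style masked language modelling input corruption.
# Real implementations operate on WordPiece token ids, not word strings.
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "[MASK]"]  # toy vocabulary (assumption)

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly select ~15% of tokens as prediction targets.
    Of the selected tokens: 80% become [MASK], 10% become a random token,
    10% are left unchanged (the 80/10/10 scheme from Devlin et al.)."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                    # model must predict the original
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")
            elif r < 0.9:
                corrupted.append(rng.choice(VOCAB[:-1]))
            else:
                corrupted.append(tok)
        else:
            corrupted.append(tok)
            labels.append(None)                   # not a prediction target
    return corrupted, labels

print(mask_tokens("the cat sat on the mat".split()))
```

Because the model must recover the original token from both left and right context, it learns the "deep bidirectional representation" the statement refers to, without being able to "see itself" and trivially copy the answer.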
“…This allows BERT to counter the "unidirectional constraint" [19] of other language models such as GPT [46] by not allowing the model to "see itself" and thus "trivially predict the next token" when learning both right to left and left to right [19]. The next stage of pretraining takes the form of binarised next sentence prediction where sentence A precedes sentence B 50% of the time, allowing the model to learn the "relationship between two sentences" [19]. BERT models are then fine-tuned by adding a classification layer and updating all parameters based on a downstream task, in this case, fake news classification.…”
Section: B. BERT (mentioning)
confidence: 99%
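As a rough illustration of the fine-tuning step mentioned above (adding a classification layer and updating all parameters on a downstream task), the following hedged sketch uses the Hugging Face transformers API. The model name, toy texts, labels, and hyperparameters are assumptions for illustration, not the cited authors' setup.

```python
# Hedged sketch: fine-tuning BERT for binary (real/fake) news classification
# by adding a classification head on top of the pretrained encoder.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)      # adds a linear classification layer

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["Breaking: central bank raises rates", "Celebrity spotted on the moon"]
labels = torch.tensor([0, 1])               # 0 = real, 1 = fake (toy labels)

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)     # all parameters receive gradients
outputs.loss.backward()
optimizer.step()
```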
“…[26] Specifically, in their work Devlin et al. reported that the BERT transformer model significantly outperformed a bidirectional LSTM (state-of-the-art at that time) on the General Language Understanding Evaluation (GLUE) [61] benchmark, with average GLUE scores of 71 and 82. Transformer models leverage large text corpora, akin to BookCorpus [62] or the English Wikipedia data set, and high expressive capacity to define the new state-of-the-art performance on a plethora of NLP tasks. These tasks include text classification, named entity recognition (NER), semantic text similarity (STS), text summarization, question answering (QA), reading comprehension, and knowledge discovery (KD) and mapping, among others (reviewed in [63]). A further boost in performance in the novel transformer architectures is achieved through the multi-headed attention mechanism.…”
Section: Host-Pathogen Interactions Analysis From the Language Data in the Scientific Publications (mentioning)
confidence: 99%
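The multi-headed attention mechanism credited above with the performance boost can be sketched in a few lines of PyTorch. This is a minimal, assumption-laden illustration (default dimensions, no masking or dropout), not the exact formulation used by any of the cited models.

```python
# Minimal sketch of multi-headed scaled dot-product attention.
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_head = d_model // num_heads
        self.num_heads = num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch, seq, d_model = x.shape
        # Project and split into heads: (batch, heads, seq, d_head)
        def split(t):
            return t.view(batch, seq, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        # Scaled dot-product attention, computed independently per head
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = scores.softmax(dim=-1)
        context = weights @ v
        # Concatenate the heads and project back to d_model
        context = context.transpose(1, 2).reshape(batch, seq, d_model)
        return self.out_proj(context)

attn = MultiHeadAttention()
print(attn(torch.randn(2, 10, 512)).shape)   # torch.Size([2, 10, 512])
```

Each head attends to the sequence in its own learned subspace, which is what lets the model capture several kinds of token-to-token relationships in parallel.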
“…Multitask learning [10] aims to learn individual sub-tasks separately and use those learnings inductively to solve a main task by identifying the dependence between the tasks. Separate multitask models are built to predict the rate of change of stock prices and to predict the actual stock price itself.…”
Section: I.D. Multitask Learning (mentioning)
confidence: 99%
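For intuition, the idea of learning related sub-tasks jointly can be illustrated with a generic hard-parameter-sharing sketch: a shared encoder with one head for the rate of change and one for the price. The cited work builds separate multitask models, so this single-network layout, along with all layer sizes and names, is an assumption for illustration only.

```python
# Generic hard-parameter-sharing multitask sketch (illustrative architecture).
import torch
import torch.nn as nn

class MultiTaskStockModel(nn.Module):
    def __init__(self, n_features=16, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(            # layers shared across both tasks
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.rate_head = nn.Linear(hidden, 1)   # task 1: rate of change of price
        self.price_head = nn.Linear(hidden, 1)  # task 2: actual stock price

    def forward(self, x):
        h = self.shared(x)
        return self.rate_head(h), self.price_head(h)

model = MultiTaskStockModel()
x = torch.randn(8, 16)                          # toy batch of feature vectors
rate_pred, price_pred = model(x)
# A joint loss couples the tasks so the shared layers learn from both signals.
loss = nn.functional.mse_loss(rate_pred, torch.zeros(8, 1)) \
     + nn.functional.mse_loss(price_pred, torch.zeros(8, 1))
loss.backward()
```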