2020
DOI: 10.48550/arxiv.2008.09144
Preprint

PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data

Abstract: In natural language processing (NLP), there is a need for more resources in Portuguese, since much of the data used in the state-of-the-art research is in other languages. In this paper, we pretrain a T5 model on the BrWac corpus, an extensive collection of web pages in Portuguese, and evaluate its performance against other Portuguese pretrained models and multilingual models on the sentence similarity and sentence entailment tasks. We show that our Portuguese pretrained models have significantly better performance…
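As a rough illustration of how a pretrained checkpoint like this is typically used, the sketch below loads a PTT5 model with the Hugging Face Transformers library and runs it text-to-text on a Portuguese sentence pair, the way T5 frames tasks such as sentence entailment. The checkpoint id, task prefix, and label handling are assumptions made for this example, not details confirmed by the paper.

# Minimal sketch, assuming a PTT5 checkpoint published on the Hugging Face Hub.
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "unicamp-dl/ptt5-base-portuguese-vocab"  # assumed checkpoint id
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# T5 casts classification as text generation: feed a task prefix plus the
# sentence pair, and read the predicted label back as text. The prefix and
# label strings here are illustrative, not the ones used in the paper.
premise = "O modelo foi pré-treinado no corpus BrWac."
hypothesis = "O modelo usa dados em português."
inputs = tokenizer(
    f"rte premise: {premise} hypothesis: {hypothesis}",
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))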

Cited by 11 publications (19 citation statements)
References 12 publications

“…Rows 1 to 6 in Table 3 present the results. We experimented with distinct T5 Base models: the original T5 Base with English tokenizer (row 1), with further pretraining on Portuguese corpus [7] but using the English tokenizer (row 2), and PTT5 Base (row 3). We notice that the adoption of a Portuguese tokenizer by PTT5 provided an error reduction in EM of 70.3% over the previous experiment that used the same pretraining dataset (rows 2 vs 3).…”
Section: Results (citation type: mentioning; confidence: 99%)

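For context on the "error reduction" figure in the statement above: the relative error reduction compares the exact-match (EM) error before and after a change. A minimal sketch of that arithmetic, using made-up EM scores purely for illustration (the cited paper's actual numbers are in its Table 3):

# Relative reduction of the exact-match error (100 - EM), as a percentage.
def em_error_reduction(em_before: float, em_after: float) -> float:
    err_before = 100.0 - em_before
    err_after = 100.0 - em_after
    return 100.0 * (err_before - err_after) / err_before

# Illustrative values only: EM going from 40.0 to 82.2 is a ~70.3% error reduction.
print(em_error_reduction(40.0, 82.2))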
“…We work with texts in Portuguese, so this English model acts as a weak baseline. PTT5: It consists of a T5 model, without changes in the architecture, pretrained on a large Brazilian Portuguese corpus [7]. We employ PTT5 Base which showed to be not only more efficient but also more effective than the PTT5 Large [7].…”
Section: Models (citation type: mentioning; confidence: 99%)

“…This significantly limits their use given that roughly 80% of the world population does not speak English (Crystal, 2008). One way the community has addressed this English-centricity has been to release dozens of models that have instead been pre-trained on a single non-English language (Carmo et al, 2020;de Vries et al, 2019;Le et al, 2019;Martin et al, 2019;Delobelle et al, 2020;Malmsten et al, 2020;Nguyen and Nguyen, 2020;Polignano et al, 2019, etc.). A more general solution is to produce multilingual models that have been pre-trained on a mixture of many languages.…”
Section: Introduction (citation type: mentioning; confidence: 99%)

“…Our contribution in this paper is twofold: (1) To the best of our knowledge, this is the first work to apply Transformer networks for MDAS in Brazilian Portuguese. In particular, we fine-tune and compare recently created Transformer-based models, notably PTT5 [Carmo et al 2020] that is pre-trained on Portuguese data. (2) We also release the BRWac2Wiki dataset, automatically generated from thousands of (website, Wikipedia) pairs, which is a milestone for the Portuguese MDAS.…”
Section: Introduction (citation type: mentioning; confidence: 99%)