2020
DOI: 10.48550/arxiv.2008.09144
Preprint

PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data

Abstract: In natural language processing (NLP), there is a need for more resources in Portuguese, since much of the data used in the state-of-the-art research is in other languages. In this paper, we pretrain a T5 model on the BrWac corpus, an extensive collection of web pages in Portuguese, and evaluate its performance against other Portuguese pretrained models and multilingual models on the sentence similarity and sentence entailment tasks. We show that our Portuguese pretrained models have significantly better performance…
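As a rough illustration of how a pretrained checkpoint like this is typically used, the sketch below loads a PTT5 model with the Hugging Face Transformers library and runs it text-to-text on a Portuguese sentence pair, the way T5 frames tasks such as sentence entailment. The checkpoint id, task prefix, and label handling are assumptions made for this example, not details confirmed by the paper.

# Minimal sketch, assuming a PTT5 checkpoint published on the Hugging Face Hub.
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "unicamp-dl/ptt5-base-portuguese-vocab"  # assumed checkpoint id
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# T5 casts classification as text generation: feed a task prefix plus the
# sentence pair, and read the predicted label back as text. The prefix and
# label strings here are illustrative, not the ones used in the paper.
premise = "O modelo foi pré-treinado no corpus BrWac."
hypothesis = "O modelo usa dados em português."
inputs = tokenizer(
    f"rte premise: {premise} hypothesis: {hypothesis}",
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))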

Cited by 11 publications (19 citation statements)
References 12 publications

“…Rows 1 to 6 in Table 3 present the results. We experimented with distinct T5 Base models: the original T5 Base with English tokenizer (row 1), with further pretraining on Portuguese corpus [7] but using the English tokenizer (row 2), and PTT5 Base (row 3). We notice that the adoption of a Portuguese tokenizer by PTT5 provided an error reduction in EM of 70.3% over the previous experiment that used the same pretraining dataset (rows 2 vs 3).…”
Section: Results (citation type: mentioning; confidence: 99%)

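For context on the "error reduction" figure in the statement above: the relative error reduction compares the exact-match (EM) error before and after a change. A minimal sketch of that arithmetic, using made-up EM scores purely for illustration (the cited paper's actual numbers are in its Table 3):

# Relative reduction of the exact-match error (100 - EM), as a percentage.
def em_error_reduction(em_before: float, em_after: float) -> float:
    err_before = 100.0 - em_before
    err_after = 100.0 - em_after
    return 100.0 * (err_before - err_after) / err_before

# Illustrative values only: EM going from 40.0 to 82.2 is a ~70.3% error reduction.
print(em_error_reduction(40.0, 82.2))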
“…We work with texts in Portuguese, so this English model acts as a weak baseline. PTT5: It consists of a T5 model, without changes in the architecture, pretrained on a large Brazilian Portuguese corpus [7]. We employ PTT5 Base which showed to be not only more efficient but also more effective than the PTT5 Large [7].…”
Section: Models (citation type: mentioning; confidence: 99%)

“…This significantly limits their use given that roughly 80% of the world population does not speak English (Crystal, 2008). One way the community has addressed this English-centricity has been to release dozens of models that have instead been pre-trained on a single non-English language (Carmo et al, 2020;de Vries et al, 2019;Le et al, 2019;Martin et al, 2019;Delobelle et al, 2020;Malmsten et al, 2020;Nguyen and Nguyen, 2020;Polignano et al, 2019, etc.). A more general solution is to produce multilingual models that have been pre-trained on a mixture of many languages.…”
Section: Introduction (citation type: mentioning; confidence: 99%)

“…Our contribution in this paper is twofold: (1) To the best of our knowledge, this is the first work to apply Transformer networks for MDAS in Brazilian Portuguese. In particular, we fine-tune and compare recently created Transformer-based models, notably PTT5 [Carmo et al 2020] that is pre-trained on Portuguese data. (2) We also release the BRWac2Wiki dataset, automatically generated from thousands of (website, Wikipedia) pairs, which is a milestone for the Portuguese MDAS.…”
Section: Introduction (citation type: mentioning; confidence: 99%)