2020
DOI: 10.48550/arxiv.2010.11934
Preprint

mT5: A massively multilingual pre-trained text-to-text transformer

Abstract: The recent "Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We describe the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. All of the code and model checkpoints used in this work are publicly available.
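As a rough illustration of the text-to-text setup the abstract describes, the sketch below loads a released mT5 checkpoint through the Hugging Face transformers library (an assumption about tooling, not necessarily the toolkit used in the paper). The released checkpoints are pre-trained only, so in practice the model would be fine-tuned on a downstream task before its generations are meaningful; the prompt here is purely illustrative.

```python
# Minimal sketch of the unified text-to-text interface: every task is framed
# as text in -> text out, so generation replaces task-specific heads.
# Assumes the Hugging Face `transformers` and `sentencepiece` packages and the
# publicly released "google/mt5-small" checkpoint.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# Illustrative prompt only: the raw pre-trained checkpoint has seen no
# supervised task data, so expect useful output only after fine-tuning.
inputs = tokenizer(
    "summarize: mT5 is a multilingual variant of T5 pre-trained on a "
    "Common Crawl-based corpus covering 101 languages.",
    return_tensors="pt",
)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```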

Cited by 97 publications (103 citation statements)
References 36 publications (38 reference statements)
“…For example, T5 (Raffel et al., 2019) demonstrated that many language tasks previously addressed with separate models could be addressed using a single text-to-text encoder-decoder Transformer model. Extending this approach, mT5 (Xue et al., 2020) used a single Transformer to model multiple languages, demonstrating that a unified architecture could also serve as a general multilingual model, leveraging high-resource language datasets to improve model performance on lower-resource datasets.…”
Section: Transformers For Sequence Modeling
confidence: 99%
“…In addition to removing the cumbersome task of constructing specialized architectures and loss functions for different instrumentations and datasets, our general output vocabulary also allows our model to be trained on a mixture of several datasets simultaneously, similar to how multilingual translation models such as mT5 are trained on several languages (Xue et al., 2020). This approach not only simplifies model design and training, but also increases the amount and diversity of training data available to the model.…”
Section: Multi-task Mixture
confidence: 99%
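The "trained on a mixture of several datasets simultaneously" point above is, in mT5's case, handled by sampling languages with probability proportional to a boosted power of their corpus size (the paper uses an exponent of about 0.3). The snippet below is a minimal sketch of that kind of exponent-based mixing; the corpus sizes are illustrative assumptions and the helper name mixing_probabilities is hypothetical.

```python
import numpy as np

def mixing_probabilities(sizes, alpha=0.3):
    # Sample each dataset with probability proportional to |D|**alpha,
    # boosting low-resource datasets relative to their raw share.
    boosted = np.array(sizes, dtype=np.float64) ** alpha
    return boosted / boosted.sum()

# Illustrative per-language example counts (assumed, not from the paper).
sizes = {"en": 3_000_000, "sw": 30_000, "yo": 3_000}
probs = mixing_probabilities(list(sizes.values()))
total = sum(sizes.values())
for lang, p in zip(sizes, probs):
    print(f"{lang}: raw share {sizes[lang] / total:.4f}, sampled share {p:.4f}")
```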
“…We decode the output by searching for occurrences of the predicted acronyms and long-forms and detecting their character spans in the input text. We use mT5 for our experiments (Xue et al. 2021).…”
Section: Baselines
confidence: 99%
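As a sketch of the decoding step this quote describes, the function below locates the character spans of model-predicted strings (for example, an acronym and its long form) in the input text via substring search. The function name and the case-insensitive matching policy are assumptions for illustration, not details taken from the cited work.

```python
# Hypothetical span-recovery helper: map strings generated by a seq2seq model
# back to (start, end) character offsets in the source text.
def find_spans(text: str, predicted: list[str]) -> dict[str, list[tuple[int, int]]]:
    spans = {}
    lowered = text.lower()
    for phrase in predicted:
        hits, start = [], 0
        needle = phrase.lower()
        # Collect every (possibly overlapping) occurrence of the phrase.
        while (idx := lowered.find(needle, start)) != -1:
            hits.append((idx, idx + len(phrase)))
            start = idx + 1
        spans[phrase] = hits
    return spans

text = "Natural language processing (NLP) models benefit from scale; NLP is ubiquitous."
print(find_spans(text, ["NLP", "Natural language processing"]))
```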