MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages

FitzGerald, Jack; Hench, Christopher; Peris, Charith; Mackie, Scott; Rottmann, Kay; Sanchez, Ana M.; Nash, Aaron; Urbach, Liam; Kakarala, Vishesh; Singh, Rajesh; Swetha, Ranganath; Crist, Laurie; Britan, Misha; Leeuwis, Wouter; Tür, Gökhan; Natarajan, Prem

doi:10.48550/arxiv.2204.08582

Cited by 11 publications

(23 citation statements)

References 0 publications


“…Therefore, we can gauge the method's efficacy independent of any potential biases towards our own specific data. We also perform the same experiments on the Farsi-translated section of the Massive [15] corpus to gain a better understanding of model's performance on Persian. So, Firstly, we fine-tune conditional BERT with our selected set which includes 79 slot types from ATIS dataset over 10 epochs with batch-size of 8.…”

Section: Assessment Of Full-automatic Augmentation Methods (mentioning)

confidence: 99%

“…Consequently, the final corpus consisted of 3,000 automated dialogues and 600 semi-automated dialogues, resulting in 117 intents and 262 slots. MASSIVE dataset [15] as a part of Multilingual Amazon Slu resource package (SLURP) which was developed for Slot-filling and Intent classification, can be regarded as another source in Persian. It contains 1 million realistic, parallel, labeled virtual assistant utterances including 51 languages, 18 domains, 60 intents, and 55 slots.…”

Section: Related Work (mentioning)

confidence: 99%


Data Augmentation and Preparation Process of PerInfEx: A Persian Chatbot With the Ability of Information Extraction

Safari, Shamsfard

2024

IEEE Access


In this paper, we describe data preparation for our proposed chatbot PerInfEx (Persian Information Extraction chatbot). It aims to chit-chat interactively with users in Persian and, by asking the fewest possible direct questions, extract as much personal information as possible, such as the user's age or occupation. Collecting data of considerable size that is aligned with our system's specifics is a crucial step in training the data-hungry modules of Natural Language Understanding (NLU) and Natural Language Generation (NLG). Initially, for the NLU module, we collect 99 free-discussion dialogues and crawl 74 English training conversations as more general datasets, while also manually translating 72 dialogues of the ConvAI2 corpus. Moreover, we gamify collection by implementing a chatting website, which results in 94 dialogues. It detects direct questions and assigns random profiles to participants, who should guess their opponent's profile. We also propose two augmentation methods, a semi-automatic method and a novel fully automatic method, comprehensively evaluated on NLU benchmarks and applied to our datasets. In addition, by prompting OpenAI's GPT-3.5 model, we automatically generate 304 dialogues. The first part of these datasets is manually annotated, while we use an active learning method to annotate the rest. Next, to evaluate data quality, we assess the datasets extrinsically using an NLU baseline, which results in intent-accuracy=88.64, slot-F1=83.68 and exact-match=78.22. For the NLG module, we automatically translate almost all of the rest of the ConvAI2 corpus (16,217 dialogues) and paraphrase the previous sets for its fine-tuning using the GPT-3.5 model. Their assessment using our NLG baseline results in a perplexity of 15.74 on the train set and 52.17 on the test set.


“…We use a mixture of accents originating from non-native English speakers to resemble a real-world scenario. Voice assistants do not support the majority of the world's languages [4]. Therefore, many users have to voice their questions in a language different from their native one.…”

Section: Natural ASR Noise (mentioning)

confidence: 99%

“…Such voice assistants do not only increase the convenience with which users can query them but can support users with visual and motor impairments for which the use of conventional text entry mechanisms (keyboard) is not applicable [11]. Despite the popularity of voice assistants among users globally and the advancements in spoken-language understanding [2,4], there are surprisingly limited efforts in studying spoken QA and its limitations.…”

Section: Introduction (mentioning)

confidence: 99%

On the Impact of Speech Recognition Errors in Passage Retrieval for Spoken Question Answering

Sidiropoulos, Vakulenko, Kanoulas

2022

Proceedings of the 31st ACM International Conference on Information & Knowledge Management


Interacting with a speech interface to query a Question Answering (QA) system is becoming increasingly popular. Typically, QA systems rely on passage retrieval to select candidate contexts and reading comprehension to extract the final answer. While there has been some attention to improving the reading comprehension part of QA systems against errors that automatic speech recognition (ASR) models introduce, the passage retrieval part remains unexplored. However, such errors can affect the performance of passage retrieval, leading to inferior end-to-end performance. To address this gap, we augment two existing large-scale passage ranking and open domain QA datasets with synthetic ASR noise and study the robustness of lexical and dense retrievers against questions with ASR noise. Furthermore, we study the generalizability of data augmentation techniques across different domains; with each domain being a different language dialect or accent. Finally, we create a new dataset with questions voiced by human users and use their transcriptions to show that the retrieval performance can further degrade when dealing with natural ASR noise instead of synthetic ASR noise. CCS Concepts: Information systems → Retrieval models and ranking.


“…In recognition of this, new efforts have started to be undertaken to ensure that a diversity of languages and cultural contexts are represented. For example, the Amazon Alexa team have been spearheading a "massive" crowdsourced translation and localization initiative of their MASSIVE data set into 51 languages [19], including a global competition. Indeed, most of these initiatives rely on crowdsourcing activities.…”

Section: Introduction (mentioning)

confidence: 99%

“I'm” Lost in Translation: Pronoun Missteps in Crowdsourced Data Sets

Seaborn, Kim

2023

Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems


As virtual assistants continue to be taken up globally, there is an ever-greater need for these speech-based systems to communicate naturally in a variety of languages. Crowdsourcing initiatives have focused on multilingual translation of big, open data sets for use in natural language processing (NLP). Yet, language translation is often not one-to-one, and biases can trickle in. In this late-breaking work, we focus on the case of pronouns translated between English and Japanese in the crowdsourced Tatoeba database. We found that masculine pronoun biases were present overall, even though plurality in language was accounted for in other ways. Importantly, we detected biases in the translation process that reflect nuanced reactions to the presence of feminine, neutral, and/or non-binary pronouns. We raise the issue of translation bias for pronouns and offer a practical solution to embed plurality in NLP data sets.

