Findings of the Association for Computational Linguistics: ACL 2022
DOI: 10.18653/v1/2022.findings-acl.160

Data Augmentation and Learned Layer Aggregation for Improved Multilingual Language Understanding in Dialogue

Abstract: Scaling dialogue systems to a multitude of domains, tasks and languages relies on costly and time-consuming data annotation for different domain-task-language configurations. The annotation efforts might be substantially reduced by methods that generalise well in zero- and few-shot scenarios, and also effectively leverage external unannotated data sources (e.g., Web-scale corpora). We propose two methods to this aim, offering improved dialogue natural language understanding (NLU) across multiple languages: …

Cited by 5 publications (4 citation statements)
References 79 publications
“…The goal in Step 1 is to ensure high recall (i.e., to avoid too aggressive filtering), which is why we opt for earlier stopping. As a baseline, we fine-tune XLM-R for the token classification task, in the standard SL task format (Xu et al., 2020; Razumovskaia et al., 2022b). Detailed training hyperparameters are provided in Appendix B.…”
Section: Methods (mentioning, confidence: 99%)
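The baseline mentioned in this excerpt can be illustrated with a minimal sketch: XLM-R fine-tuned for token classification over slot labels via the Hugging Face transformers API. The slot inventory, the toy utterance, and the simplified sub-word tag alignment are assumptions for illustration, not the cited authors' exact setup.

    # Minimal sketch: XLM-R as a token-classification (slot labelling) baseline.
    from transformers import AutoTokenizer, AutoModelForTokenClassification
    import torch

    slot_labels = ["O", "B-time", "I-time"]   # hypothetical slot inventory
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModelForTokenClassification.from_pretrained(
        "xlm-roberta-base", num_labels=len(slot_labels))

    tokens = ["wake", "me", "at", "seven", "am"]   # one toy utterance
    tags = [0, 0, 0, 1, 2]                         # word-level indices into slot_labels

    enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
    # Simplified alignment: copy each word's tag to all of its sub-word pieces;
    # special tokens get -100 so the loss ignores them.
    labels = [tags[w] if w is not None else -100 for w in enc.word_ids()]

    out = model(**enc, labels=torch.tensor([labels]))
    out.loss.backward()   # one gradient step of standard fine-tuning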
“…The current approach to mitigating the issue is standard cross-lingual transfer. The main 'transfer' assumption is that a suitable large English annotated dataset is always available for a particular task and domain: (i) the systems are trained on the English data and then directly deployed to the target language (i.e., zero-shot transfer), or (ii) further adapted to the target language relying on a small set of target-language examples (Xu et al., 2020; Razumovskaia et al., 2022b) which are combined with the large English dataset (i.e., few-shot transfer). However, this assumption might often be unrealistic in the context of TOD due to the large number of potential tasks and domains that should be supported by TOD systems.…”
Section: Introduction (mentioning, confidence: 99%)
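As a rough illustration of the two transfer regimes contrasted in this excerpt, the sketch below shows how the training sets differ; load_annotated is a hypothetical loader and the 50-example target budget is an arbitrary assumption.

    # Sketch: training-data composition for zero-shot vs. few-shot transfer.
    import random

    english_train = load_annotated("en")          # hypothetical loader of labelled NLU data
    target_few = load_annotated("sw")[:50]        # small annotated target-language sample

    zero_shot_train = english_train               # (i) train on English only, deploy to target
    few_shot_train = english_train + target_few   # (ii) mix in the few target examples
    random.Random(0).shuffle(few_shot_train)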
“…Firstly, several automated approaches to reduce human effort were proposed, predominantly focusing on data augmentation. These approaches, previously applied to multilingual dialogue, include (i) word or span substitution, creating code-switched data between source and target languages (Krishnan et al., 2021; Qin et al., 2021) or synonymous span substitution in a target language (Louvan & Magnini, 2020b); and (ii) semi-supervised training with target-language sentence retrieval (Razumovskaia et al., 2022).…”
Section: Outlook for Multilingual TOD Datasets (mentioning, confidence: 99%)
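A minimal sketch of the first augmentation strategy listed above: word substitution producing code-switched data from a bilingual lexicon. The toy English-German lexicon and the substitution rate are illustrative assumptions.

    # Sketch: code-switched augmentation by word substitution.
    import random

    en_de_lexicon = {"alarm": "Wecker", "seven": "sieben", "tomorrow": "morgen"}  # toy lexicon

    def code_switch(tokens, lexicon, rate=0.5, rng=random):
        """Swap each translatable token for its target-language counterpart with
        probability `rate`; token-level slot tags stay aligned one-to-one."""
        return [lexicon[t] if t in lexicon and rng.random() < rate else t
                for t in tokens]

    print(code_switch(["set", "alarm", "for", "seven"], en_de_lexicon, rate=1.0))
    # -> ['set', 'Wecker', 'for', 'sieben']

Because substitution is token-for-token, any word-level slot annotations on the source sentence carry over to the augmented sentence unchanged.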
“…Due to the lack of OOS annotations in open-world settings, previous research usually detects OOS samples indirectly, such as by resorting to in-scope (INS) samples. Recently, data augmentation methods (Ng et al., 2020; Razumovskaia et al., 2022) have made it possible to detect OOS directly using a binary classifier.…”
Section: Introduction (mentioning, confidence: 99%)
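The direct detection setup described here reduces to a standard binary sequence classifier. A minimal sketch follows, assuming XLM-R as the encoder (a choice of convenience, not necessarily the cited papers') and hand-written examples standing in for augmented OOS data.

    # Sketch: direct OOS detection as binary sequence classification.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "xlm-roberta-base", num_labels=2)          # 0 = in-scope, 1 = out-of-scope

    texts = ["set an alarm for seven am",          # in-scope (INS) for an alarm domain
             "tell me a joke about penguins"]      # OOS stand-in for augmented data
    labels = torch.tensor([0, 1])

    batch = tokenizer(texts, padding=True, return_tensors="pt")
    out = model(**batch, labels=labels)            # cross-entropy over the two classes
    out.loss.backward()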