GlobalWoZ: Globalizing MultiWoZ to Develop Multilingual Task-Oriented Dialogue Systems

Ding, Bosheng; Hu, Junjie; Aljunied, Sharifah Mahani; Joty, Shafiq; Si, Luo; Chen, Miao

doi:10.48550/arxiv.2110.07679

Cited by 2 publications

(3 citation statements)

References 18 publications

(27 reference statements)


“…In DST, irrespective of the transfer method and target language, cross-lingual performance is near-zero (not shown). These findings are in line with prior work (Ding et al, 2021) and are due to the DST task complexity. This is even more pronounced in zero-shot cross-lingual settings and especially for COD, where culture-specific slot values are obtained via outline-based generation.…”

Section: Results (supporting)

confidence: 92%

“…However, even when available, these resources suffer from several pitfalls. Most are obtained by manual or semi-automatic translation of an English source (Susanto and Lu, 2017; Upadhyay et al, 2018; Xu et al, 2020; Ding et al, 2021; Zuo et al, 2021, inter alia). While this process is cost-efficient and typically makes data and results comparable across languages, it yields dialogues that lack naturalness (Lembersky et al, 2012; Volansky et al, 2015), are not properly localised nor culture-specific (Clark et al, 2020).…”

Section: Introduction (mentioning)

confidence: 99%


Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation

Majewska, Razumovskaia, Ponti et al. 2022

Preprint


Multilingual task-oriented dialogue (TOD) facilitates access to services and information for many (communities of) speakers. Nevertheless, the potential of this technology is not fully realised, as current datasets for multilingual TOD, both for modular and end-to-end modelling, suffer from severe limitations. 1) When created from scratch, they are usually small in scale and fail to cover many possible dialogue flows. 2) Translation-based TOD datasets might lack naturalness and cultural specificity in the target language. In this work, to tackle these limitations we propose a novel outline-based annotation process for multilingual TOD datasets, where domain-specific abstract schemata of dialogue are mapped into natural language outlines. These in turn guide the target language annotators in writing a dialogue by providing instructions about each turn's intents and slots. Through this process we annotate a new large-scale dataset for training and evaluation of multilingual and cross-lingual TOD systems. Our Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages: Arabic, Indonesian, Russian, and Kiswahili. Qualitative and quantitative analyses of COD versus an equivalent translation-based dataset demonstrate improvements in data quality, unlocked by the outline-based approach. Finally, we benchmark a series of state-of-the-art systems for cross-lingual TOD, setting reference scores for future work and demonstrating that COD prevents over-inflated performance, typically met with prior translation-based TOD datasets.


“…There have been many studies on cross-lingual transfer for classification tasks (Hu et al, 2020; Jiang et al, 2020; Ruder et al, 2021; Ding et al, 2021). For generation tasks, however, much less attention has been paid to it and the results are far from satisfactory (Cao et al, 2020; Chen et al, 2021; Žagar and Robnik-Šikonja, 2021; Shen et al, 2023).…”

Section: Introduction (mentioning)

confidence: 99%

Cross-lingual Sentiment Analysis via AAE and BiGRU

Shen, Liu, Shuai 2020

2020 Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC)


Cross-lingual transfer is important for developing high-quality chatbots in multiple languages due to the strongly imbalanced distribution of language resources. A typical approach is to leverage off-the-shelf machine translation (MT) systems to utilize either the training corpus or developed models from high-resource languages. In this work, we investigate whether it is helpful to utilize MT at all in this task. To do so, we simulate a low-resource scenario assuming access to limited Chinese dialog data in the movie domain and large amounts of English dialog data from multiple domains. Experiments show that leveraging English dialog corpora can indeed improve the naturalness, relevance and cross-domain transferability in Chinese. However, directly using English dialog corpora in its original form, surprisingly, is better than using its translated version. As the topics and wording habits in daily conversations are strongly culture-dependent, MT can reinforce the bias from high-resource languages, yielding unnatural generations in the target language. Considering the cost of translating large amounts of text and the strong effects of the translation quality, we suggest future research should rather focus on utilizing the original English data for cross-lingual transfer in dialog generation. We perform extensive human evaluations and ablation studies. The analysis results, together with the collected dataset, are presented to draw attention towards this area and benefit future research.
