2020
DOI: 10.1609/aaai.v34i05.6349

MOSS: End-to-End Dialog System Framework with Modular Supervision

Abstract: A major bottleneck in training end-to-end task-oriented dialog systems is the lack of data. To utilize limited training data more efficiently, we propose Modular Supervision Network (MOSS), an encoder-decoder training framework that can incorporate supervision from various intermediate dialog system modules, including natural language understanding, dialog state tracking, dialog policy learning, and natural language generation. With only 60% of the training data, MOSS-all (i.e., MOSS with supervision from all f…
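The modular supervision idea sketched in the abstract amounts to training one shared encoder-decoder with auxiliary losses attached to each intermediate module. The sketch below illustrates that pattern only; the module names, layer sizes, label formats, and the plain sum of losses are assumptions made for illustration, not the MOSS authors' architecture.

# Minimal sketch of "modular supervision": a shared encoder-decoder with
# auxiliary losses from intermediate dialog modules (NLU, DST, policy, NLG).
# Module names, dimensions, and the simple sum of losses are illustrative
# assumptions, not the MOSS implementation.
import torch
import torch.nn as nn

class ModularSupervisionModel(nn.Module):
    def __init__(self, vocab_size=1000, hidden=128, n_slots=30, n_acts=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)  # shared encoder
        self.nlu_head = nn.Linear(hidden, n_acts)      # user act / intent
        self.dst_head = nn.Linear(hidden, n_slots)     # belief-state slots
        self.policy_head = nn.Linear(hidden, n_acts)   # system act
        self.nlg_decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.nlg_out = nn.Linear(hidden, vocab_size)   # response tokens

    def forward(self, user_tokens, response_tokens):
        enc_out, h = self.encoder(self.embed(user_tokens))
        summary = enc_out[:, -1]                        # last encoder state
        dec_out, _ = self.nlg_decoder(self.embed(response_tokens), h)
        return {
            "nlu": self.nlu_head(summary),
            "dst": self.dst_head(summary),
            "policy": self.policy_head(summary),
            "nlg": self.nlg_out(dec_out),
        }

def modular_loss(outputs, labels):
    # Supervision from every intermediate module is summed into one objective;
    # labels["act"], labels["sys_act"], labels["response"] are class indices,
    # labels["belief"] is a float multi-hot slot vector (all assumed formats).
    ce = nn.CrossEntropyLoss()
    bce = nn.BCEWithLogitsLoss()
    return (ce(outputs["nlu"], labels["act"])
            + bce(outputs["dst"], labels["belief"])
            + ce(outputs["policy"], labels["sys_act"])
            + ce(outputs["nlg"].flatten(0, 1), labels["response"].flatten()))

Dropping any one of the four terms recovers a model trained with supervision from only a subset of modules, which is the comparison the paper's MOSS-all variant refers to.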

Cited by 39 publications (28 citation statements)
References 13 publications

“…One major difference between our Graph Reasoning Module and a standard GNN is that we want the message passing in layer L to be conditioned on the L-th instruction vector. Inspired by language-model-type conditioning (Liang et al., 2020b), we adopt a general design that is compatible with any graph neural network design: before running the L-th GNN layer, we concatenate the L-th instruction vector to every node and edge feature from the previous layer. Specifically,…”
Section: Graph Reasoning Module (mentioning)
confidence: 99%
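The conditioning step quoted above, concatenating the layer-specific instruction vector to every node and edge feature before each GNN layer, can be sketched roughly as follows. The tensor shapes and the gnn_layers interface are assumptions for illustration, not the cited paper's actual code.

# Rough sketch of instruction-conditioned message passing: before the l-th GNN
# layer, broadcast the l-th instruction vector and concatenate it to every node
# and edge feature from the previous layer. Shapes and the layer interface are
# illustrative assumptions.
import torch

def run_conditioned_gnn(node_feats, edge_feats, instructions, gnn_layers):
    # node_feats: (num_nodes, d_node), edge_feats: (num_edges, d_edge)
    # instructions: (num_layers, d_instr), one vector per GNN layer
    for l, layer in enumerate(gnn_layers):
        instr = instructions[l]
        node_in = torch.cat(
            [node_feats, instr.expand(node_feats.size(0), -1)], dim=-1)
        edge_in = torch.cat(
            [edge_feats, instr.expand(edge_feats.size(0), -1)], dim=-1)
        # each layer is assumed to map the concatenated features back to the
        # original node/edge dimensions so the loop can keep iterating
        node_feats, edge_feats = layer(node_in, edge_in)
    return node_feats, edge_feats

Because the instruction vector is simply concatenated rather than baked into the layer, the same trick works regardless of which GNN variant implements the message passing, which is the generality the excerpt claims.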
“…We borrow some elements from the Sequicity model, such as representing the belief state as a natural language sequence (a text span) and using copy-augmented Seq2Seq learning (Gu et al., 2016). But compared to Sequicity and all its follow-up works (Shu et al., 2019; Liang et al., 2020), a distinguishing feature of our LABES-S2S model is that the transition between belief states across turns and the dependency between system responses and belief states are statistically modeled. This new design results in a completely different graphical model structure, which enables rigorous probabilistic variational learning.…”
Section: Related Work (mentioning)
confidence: 99%
“…E2E Models: E2E models can be divided into three sub-categories. TSCP, SEDST, FSDM (Shu et al., 2019), MOSS (Liang et al., 2020) and DAMD are based on the copy-augmented Seq2Seq learning framework proposed by . LIDM (Wen et al., 2017a), SFN (Mehri et al., 2019) and UniConv (Le et al., 2020a) are modularly designed, connected through neural states, and trained end-to-end.…”
Section: Baselines (mentioning)
confidence: 99%
“…1(b) shows an O2O model with a conditional chain mapping. This method for multiple-sequence modeling has been applied to dialog modeling (Liang et al., 2020), speaker diarization (Fujita et al., 2020a), and multi-speaker ASR (Shi et al., 2020). Unlike the O2M model, this model can predict a variable number of output sequences while explicitly considering dependencies between the multiple sequences based on the probabilistic chain rule.…”
Section: Introduction (mentioning)
confidence: 99%
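The probabilistic chain rule mentioned in this excerpt factorizes the joint distribution over a variable number of output sequences, each conditioned on the input and on all previously generated sequences (notation assumed for illustration):

P(Y_1, \dots, Y_N \mid X) = \prod_{n=1}^{N} P\left(Y_n \mid Y_1, \dots, Y_{n-1}, X\right)

Because the product runs over however many sequences are actually produced, the number of outputs need not be fixed in advance, which is the advantage the excerpt attributes to this conditional chain formulation over the O2M model.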