A Large-Scale Corpus for Conversation Disentanglement

Kummerfeld, Jonathan K.; Gouravajhala, Sai R.; Peper, Joseph; Athreya, Vignesh; Gunasekara, R. Chulaka; Ganhotra, Jatin; Patel, Siva Sankalp; Polymenakos, Lazaros; Lasecki, Walter S.

doi:10.18653/v1/p19-1374

Cited by 59 publications

(82 citation statements)

References 29 publications

“…Several researchers have defined tasks related to discourse structure, including sentence ordering (Chen et al, 2016; Logeswaran et al, 2016; Cui et al, 2018), sentence clustering (Wang et al, 2018b), and disentangling textual threads (Elsner and Charniak, 2008, 2010; Lowe et al, 2015; Mehri and Carenini, 2017; Jiang et al, 2018; Kummerfeld et al, 2019).…”

Section: Related Work (mentioning)

confidence: 99%

Evaluation Benchmarks and Learning Criteria for Discourse-Aware Sentence Representations

Chen, Chu, Gimpel

2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen

Prior work on pretrained sentence embeddings and benchmarks focuses on the capabilities of representations for stand-alone sentences. We propose DiscoEval, a test suite of tasks to evaluate whether sentence representations include information about the role of a sentence in its discourse context. We also propose a variety of training objectives that make use of natural annotations from Wikipedia to build sentence encoders capable of modeling discourse information. We benchmark sentence encoders trained with our proposed objectives, as well as other popular pretrained sentence encoders, on DiscoEval and other sentence evaluation tasks. Empirically, we show that these training objectives help to encode different aspects of information from the surrounding document structure. Moreover, BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018a) demonstrate strong performance across DiscoEval tasks with individual hidden layers showing different characteristics. Data processing and evaluation scripts are available at https://github.com/ZeweiChu/DiscoEval.

“…However, these methods heavily rely on hand-engineered features that are often too specific to the particular datasets (or domains) on which the model is trained and evaluated. For example, many of the features used in (Kummerfeld et al, 2019) are only applicable to the Ubuntu IRC dataset. This hinders the model's generalization and adaptability to other domains.…”

Section: Introduction (mentioning)

confidence: 99%

Online Conversation Disentanglement with Pointer Networks

Yu, Joty

2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Huge amounts of textual conversations occur online every day, where multiple conversations take place concurrently. Interleaved conversations lead to difficulties in not only following the ongoing discussions but also extracting relevant information from simultaneous messages. Conversation disentanglement aims to separate intermingled messages into detached conversations. However, existing disentanglement methods rely mostly on handcrafted features that are dataset specific, which hinders generalization and adaptability. In this work, we propose an end-to-end online framework for conversation disentanglement that avoids time-consuming domain-specific feature engineering. We design a novel way to embed the whole utterance that comprises timestamp, speaker, and message text, and propose a custom attention mechanism that models disentanglement as a pointing problem while effectively capturing inter-utterance interactions in an end-to-end fashion. We also introduce a joint-learning objective to better capture contextual information. Our experiments on the Ubuntu IRC dataset show that our method achieves state-of-the-art performance in both link and conversation prediction tasks.

“…LRL Corpora - Social Media: Today, social media is a rich source to develop text corpora for different NLP tools [71]-[75]. Leveraging the content of these social media platforms, Cross-Lingual Arabic Blog Alerts (COLBA) [76] project has focused on collecting Arabic content from different social media platforms like blogs, discussion forums, and chats to develop NLP tools.…”

Section: Related Work (mentioning)

confidence: 99%