Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d15-1229
LCSTS: A Large Scale Chinese Short Text Summarization Dataset

Abstract: Automatic text summarization is widely regarded as a highly difficult problem, partly because of the lack of large text summarization datasets. Given the great challenge of constructing large-scale summaries for full texts, in this paper we introduce a large Chinese short text summarization dataset constructed from the Chinese microblogging website Sina Weibo, which is released to the public. This corpus consists of over 2 million real Chinese short texts with short summaries given by th…

Citations: cited by 238 publications (220 citation statements)
References: 17 publications
“…Additionally, we also report other standard language generation metrics (as motivated recently by ): METEOR (Denkowski and Lavie, 2014), BLEU-4 (Papineni et al., 2002), and CIDEr-D (Vedantam et al., 2015), based on the MS-COCO evaluation script (Chen et al., 2015).…”
Section: Discussion (mentioning)
confidence: 99%
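For a standalone check of one of these metrics, sentence-level BLEU-4 can be computed with NLTK. Below is a minimal sketch, assuming NLTK is installed; the token sequences are hypothetical examples, not outputs of the cited MS-COCO evaluation script.

```python
# Minimal sketch: sentence-level BLEU-4 with NLTK (hypothetical token lists).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "cat", "sat", "on", "the", "mat"]   # tokenized gold summary
hypothesis = ["the", "cat", "is", "on", "the", "mat"]   # tokenized system output

# BLEU-4 uses uniform weights over 1- to 4-gram precisions; smoothing
# avoids zero scores when short sentences lack higher-order n-gram matches.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-4: {score:.3f}")
```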
“…Automatic abstractive summarization can be considered one of the most challenging variants of automatic summarization (Gambhir and Gupta, 2017). But with recent advancements in the field of deep learning, new ground was broken using various kinds of neural network models (Rush et al., 2015; Hu et al., 2015; Chopra et al., 2016; See et al., 2017). The performance of these kinds of summarization models strongly depends on large amounts of suitable training data.…”
Section: Introduction (mentioning)
confidence: 99%
“…Next to the English resources listed in Table 1, the LCSTS dataset collected by Hu et al. (2015) is perhaps closest to our own work, both in terms of text genre and collection method. Their dataset comprises 2.5 million content-summary pairs collected from the Chinese social media platform Weibo, a service similar to Twitter in that a post is limited to 140 characters.…”
Section: Related Work (mentioning)
confidence: 99%
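To give a concrete picture of working with such content-summary pairs, here is a minimal loading sketch. It assumes a hypothetical tab-separated layout ("summary<TAB>text", one pair per line); the actual LCSTS release uses its own markup, so the format here is illustrative only.

```python
# Minimal sketch: stream (text, summary) pairs from a hypothetical TSV layout.
import io

def load_pairs(lines, max_chars=140):
    """Yield (text, summary) pairs, keeping texts within the Weibo length limit."""
    for line in lines:
        summary, text = line.rstrip("\n").split("\t", 1)
        if len(text) <= max_chars:  # posts are limited to 140 characters
            yield text, summary

# In-memory demo; with a real file, pass open("pairs.tsv", encoding="utf-8").
sample = io.StringIO("A short summary\tThe full Weibo post being summarized\n")
for text, summary in load_pairs(sample):
    print(summary, "<-", text)
```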
“…K. Lopyrev built an abstractive headline-generation model in 2015 on the encoder-decoder framework, using an RNN (Recurrent Neural Network) with LSTM (Long Short-Term Memory) units [5] and an attention mechanism to generate news headlines [6]. Secondly, two papers [7, 8] published between 2015 and 2016 by Rush et al. of Facebook AI Research addressed the abstractive summarization task: building on the encoder-decoder architecture, they proposed encoder variants based on CNNs (Convolutional Neural Networks) and attention mechanisms, with a decoder based on an RNNLM (Recurrent Neural Network Language Model). Hu et al. [9] applied the RNN-based encoder-decoder architecture to Chinese text summarization and constructed the Chinese summarization dataset LCSTS to facilitate research on Chinese abstractive summarization. This paper mainly studies sentence-level abstractive summarization of Chinese short texts and builds a summary generation model on the LCSTS dataset.…”
Section: Introduction (mentioning)
confidence: 99%
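The encoder-decoder summarizers cited above share a common shape. Below is a minimal sketch, assuming PyTorch, of an RNN encoder-decoder with dot-product attention; the vocabulary size, dimensions, and attention form are illustrative choices, not the cited papers' exact models.

```python
# Minimal sketch: GRU encoder-decoder with dot-product attention (PyTorch).
import torch
import torch.nn as nn

class Seq2SeqAttn(nn.Module):
    def __init__(self, vocab_size=4000, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRUCell(emb + hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src, tgt):
        enc_out, h = self.encoder(self.embed(src))       # (B, S, H), (1, B, H)
        state = h.squeeze(0)                             # decoder init state (B, H)
        logits = []
        for t in range(tgt.size(1)):                     # teacher forcing over tgt
            # Dot-product attention: score each encoder state against the decoder state.
            scores = torch.bmm(enc_out, state.unsqueeze(2)).squeeze(2)          # (B, S)
            ctx = torch.bmm(scores.softmax(dim=1).unsqueeze(1), enc_out).squeeze(1)
            inp = torch.cat([self.embed(tgt[:, t]), ctx], dim=1)
            state = self.decoder(inp, state)
            logits.append(self.out(state))
        return torch.stack(logits, dim=1)                # (B, T, vocab_size)

# Usage demo with random token indices (hypothetical character-level vocabulary):
model = Seq2SeqAttn()
src = torch.randint(0, 4000, (2, 140))   # two source texts, Weibo-length
tgt = torch.randint(0, 4000, (2, 20))    # teacher-forced summary prefixes
print(model(src, tgt).shape)             # torch.Size([2, 20, 4000])
```

In a real training setup, the logits would feed a cross-entropy loss against the gold summary tokens; decoding at test time would replace teacher forcing with greedy or beam search.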