Live video comments, or "danmu", are an emerging social feature on Asian online video platforms. These time-synchronous comments are overlaid on the video playback and uniquely enrich the viewing experience, engaging hundreds of millions of users in rich community discussions. The presence of danmu comments has become a determining factor in video popularity. Recent work has proposed models to automatically generate comments, but little work so far has considered where to insert comments in the video timeline. In this work, we address both the what and the where of automatic danmu generation by jointly predicting the comment content to be generated and its optimal insertion point in the video timeline. Our model exploits the video's visual content, subtitles, audio signals, and any existing surrounding comments in one unified architecture, and it handles both videos that are already heavily commented and videos that have no comments yet. Experiments show that our unified framework generally outperforms state-of-the-art comment generation methods.
CCS CONCEPTS
• Computing methodologies → Natural language generation; Activity recognition and understanding.