Speech Shot Extraction From Broadcast News Videos

Kumagai, Seiji; Doman, Keisuke; Takahashi, Tomokazu; Deguchi, Daisuke; Ide, Ichiro; Murase, Hiroshi

doi:10.1142/s1793351x12400077

Cited by 5 publications

(3 citation statements)

References 18 publications

“…To solve this problem, as a special case, we should consider using the speaker's original voice when the selected shot contains a monologue, instead. This could be detected by, for example, Kumagai et al's method [17]. For sentence #3 ( Fig.…”

Section: Results (mentioning)

confidence: 98%

“…Although in their work, it is shown that this approach is effective to some extent, if we do not consider the more high-level visual contents actually present in a scene, it will limit the cases that it could handle properly. Recently, Kumagai et al attempted to detect such inconsistency in news videos based on the relation between audio-visual features [17], but it could only handle monologue (speech) scenes.…”

Section: Introduction (mentioning)

confidence: 99%

Summarization of Multiple News Videos Considering the Consistency of Audio-Visual Contents

Zhang, Tanishige, Ide, et al. 2019

Int. J. Semantic Computing

Self Cite

News videos are valuable multimedia information on real-world events. However, due to the incremental nature of the contents, a sequence of news videos on a related news topic could be redundant and lengthy. Thus, a number of methods have been proposed for their summarization. However, there is a problem that most of these methods do not consider the consistency between the auditory and visual contents. This becomes a problem in the case of news videos, since both contents do not always come from the same source. Considering this, in this paper, we propose a method for summarizing a sequence of news videos considering the consistency of auditory and visual contents. The proposed method first selects key-sentences from the auditory contents (Closed Caption) of each news story in the sequence, and next selects a shot in the news story whose “Visual Concepts” detected from the visual contents are the most consistent with the selected key-sentence. In the end, the audio segment corresponding to each key-sentence is synthesized with the selected shot, and then these clips are concatenated into a summarized video. Results from subjective experiments on summarized videos on several news topics show the effectiveness of the proposed method.

“…Thus, we have developed a method that automatically learns and excludes voices of the anchorperson and the reporters according to specific keywords in the CC [12] and a method that learns the correlation of the features between the lip shape and the audio [13], in order to detect monologue scenes. See corresponding references for details of the works.…”

Section: Detection Of Monologue Scenes (mentioning)

confidence: 99%