News videos are a valuable source of multimedia information on real-world events. However, because their contents accumulate incrementally, a sequence of news videos on a related news topic can be redundant and lengthy. A number of methods have therefore been proposed for summarizing them. Most of these methods, however, do not consider the consistency between the auditory and visual contents. This is a problem for news videos in particular, since the two kinds of content do not always come from the same source. In this paper, we propose a method for summarizing a sequence of news videos that takes the consistency between auditory and visual contents into account. The proposed method first selects key-sentences from the auditory contents (Closed Caption) of each news story in the sequence, and then selects the shot in the news story whose “Visual Concepts”, detected from the visual contents, are the most consistent with the selected key-sentence. Finally, the audio segment corresponding to each key-sentence is combined with the selected shot, and the resulting clips are concatenated into a summarized video. Results from subjective experiments on summarized videos for several news topics show the effectiveness of the proposed method.
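
The two selection steps described above can be illustrated with a minimal sketch. It is not the implementation used in this paper: the key-sentence score (average word frequency within the story) and the consistency measure (Jaccard overlap between caption words and concept labels) are simple stand-ins assumed for illustration, and the names `Shot`, `select_key_sentence`, `consistency`, `select_shot`, and `summarize` are hypothetical. Visual concept labels are assumed to be already detected for each shot.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Shot:
    shot_id: str
    concepts: set  # lowercase visual concept labels, assumed already detected


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    words = (w.strip(".,!?\"'").lower() for w in text.split())
    return [w for w in words if w]


def select_key_sentence(sentences):
    """Pick the caption sentence with the highest average word frequency.

    Assumption: this frequency score is only a placeholder for the
    key-sentence selection step.
    """
    counts = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        return sum(counts[w] for w in words) / len(words) if words else 0.0

    return max(sentences, key=score)


def consistency(sentence, shot):
    """Jaccard overlap between sentence words and a shot's concept labels.

    Assumption: a stand-in for the audio-visual consistency measure.
    """
    words = set(tokenize(sentence))
    union = words | shot.concepts
    return len(words & shot.concepts) / len(union) if union else 0.0


def select_shot(sentence, shots):
    """Return the shot most consistent with the key-sentence."""
    return max(shots, key=lambda shot: consistency(sentence, shot))


def summarize(stories):
    """For each story (captions, shots), pair its key-sentence with a shot.

    Returns a list of (key_sentence, shot_id) pairs in story order; the
    actual audio/video concatenation is outside the scope of this sketch.
    """
    clips = []
    for captions, shots in stories:
        key = select_key_sentence(captions)
        clips.append((key, select_shot(key, shots).shot_id))
    return clips
```

For a story whose captions mention a flood and a bridge, this sketch would prefer a field shot labeled with `flood` and `bridge` over a studio shot of the anchor, which is the kind of audio-visual agreement the method aims for.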