Abstract: Despite the ready availability of digital recording technology and the continually decreasing cost of digital storage, browsing audio recordings remains a tedious task. This paper presents evidence in support of a system designed to assist with information comprehension and retrieval tasks from a large collection of recorded speech. Two techniques are employed to assist users with these tasks. First, a speech recognizer creates necessarily error-laden transcripts of the recorded speech. Second, audio playback …
“…In Vemuri et al [9], an audio playback interface was tested using recognition results with and without confidence visualization. No difference in users' comprehension rate was found.…”
Section: Related Work (mentioning)
confidence: 99%
“…[1,8,9]), in this paper, we focus on the first part of the correction problem only: finding errors. Detection of errors can be tricky for users as errors made by a recognizer are all valid words in a language.…”
In a typical speech dictation interface, the recognizer's best guess is displayed as normal, unannotated text. This ignores potentially useful information about the recognizer's confidence in its recognition hypothesis. Using a confidence measure (which itself may sometimes be inaccurate), we investigated providing visual feedback about low-confidence portions of the recognition using shaded, red underlining. An evaluation showed that, compared to a baseline without underlining, underlining low-confidence areas did not increase users' speed or accuracy in detecting errors. However, we found that when recognition errors were correctly underlined, they were discovered significantly more often than in the baseline. Conversely, when errors failed to be underlined, they were discovered less often. Our results indicate that confidence visualization can be effective, but only if the confidence measure has high accuracy. Further, since our results show that users tend to trust confidence visualization, designers should be careful in its application if a high-accuracy confidence measure is not available.
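The technique described above — flagging low-confidence portions of a recognition hypothesis for visual treatment such as shaded red underlining — can be sketched as a simple thresholding step. The function name, data format (word/confidence pairs), and threshold value below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: mark words in an ASR hypothesis whose recognizer
# confidence falls below a threshold, so a UI layer could render them
# with low-confidence styling (e.g., shaded red underlining).

def flag_low_confidence(words, threshold=0.6):
    """Given (word, confidence) pairs, return (word, flagged) pairs.

    flagged is True when the confidence is below the threshold,
    i.e., the word is a candidate for visual error highlighting.
    """
    return [(word, conf < threshold) for word, conf in words]

# Example hypothesis with made-up confidence scores.
hypothesis = [("browsing", 0.93), ("audio", 0.41), ("recordings", 0.88)]
print(flag_low_confidence(hypothesis))
```

Because the evaluation found that users tend to trust the visualization, a design like this only helps when the underlying confidence measure is accurate; with a noisy measure, the same thresholding would both miss real errors and underline correct words.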
“…In discussing recorded speech, Vemuri and colleagues discuss one reason why: aural speech delivery presents unique challenges [17]. The average speech rate of an English speaker is less than half the average reading rate.…”
Section: Introduction (mentioning)
confidence: 99%
“…This large disparity suggests that automatically transcribing audio and then accessing it as a written document would be most effective for information retrieval tasks. However, in reading a text transcript, the prosodic cues, which make speech rich in meaning and subtlety, are lost [17].…”
Section: Introduction (mentioning)
confidence: 99%
“…The value of improved navigation into linear media through text transcripts has been acknowledged for webcast lectures [15] and discussed in the context of experimentation with error-laden transcripts from automatic speech recognition (ASR) [17]. The inherent value of a searchable transcript for navigating into linear audio (or the narrative audio track of linear video) can be seen in recent efforts by major Internet corporations such as Google and Microsoft to search within video as opposed to only searching for a video.…”
A digital video library of over 900 hours of video and 18000 stories from The HistoryMakers was used by 266 students, faculty, librarians, and life-long learners interacting with a system providing multiple search and viewing capabilities over a trial period of several months. User demographics and actions were logged with this multimedia collection, providing quantitative and qualitative metrics on system use. These transaction logs were complemented with heuristic evaluation, interviews, and contextual inquiry with representative users. Collectively, these mixed methods informed the development of the next generation web-based interface for the HistoryMakers video oral histories to improve access to and dissemination of this rich cultural resource. In particular, the feature of a synchronized text transcript in the video player for the narratives merited further investigation. Such an interface has not seen widespread use in digital video players available on the web, yet was valued highly by oral history archive viewers. A user study with 27 participants measured the utility of the HistoryMakers web interface incorporating the synchronized transcript video player for stated fact-finding and open-ended tasks. For life oral histories, an aligned text transcript is valued for both tasks, with the video rated significantly more useful for open-ended tasks over fact-finding. These results suggest a task-dependent role of modality in presentation of oral histories, with synchronized transcripts rated highly across tasks.
We presented participants with lecture videos at different speeds and tested immediate and delayed (1 week) comprehension. Results revealed minimal costs incurred by increasing video speed from 1x to 1.5x or 2x, but performance declined beyond 2x. We also compared learning outcomes after watching videos once at 1x or twice at 2x speed. There was no advantage to watching twice at 2x speed, but if participants watched the video again at 2x speed immediately before the test, compared with watching once at 1x a week before the test, comprehension improved. Thus, increasing the speed of videos (up to 2x) may be an efficient strategy, especially if students use the time saved for additional studying or rewatching the videos, but learners should do this additional studying shortly before an exam. However, these trends may differ for videos with different speech rates, complexity or difficulty, and audiovisual overlap.