Interspeech 2018
DOI: 10.21437/interspeech.2018-40

Speech Recognition for Medical Conversations

Abstract: In this paper we document our experiences with developing speech recognition for medical transcription - a system that automatically transcribes doctor-patient conversations. Towards this goal, we built a system along two different methodological lines - a Connectionist Temporal Classification (CTC) phoneme-based model and a Listen, Attend and Spell (LAS) grapheme-based model. To train these models we used a corpus of anonymized conversations representing approximately 14,000 hours of speech. Because of noisy tra…
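The two modeling lines named in the abstract can be illustrated with a small sketch. Below is a minimal, hypothetical PyTorch example of the CTC phoneme-based direction only (a LAS model would pair a similar encoder with an attention decoder over graphemes); the layer sizes, label inventory, and tensor shapes are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class CTCAcousticModel(nn.Module):
    """Bidirectional-LSTM encoder with a CTC output layer over phoneme labels (illustrative only)."""
    def __init__(self, num_features=80, num_phonemes=42, hidden=320):
        super().__init__()
        self.encoder = nn.LSTM(num_features, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        # +1 output unit for the CTC blank symbol
        self.output = nn.Linear(2 * hidden, num_phonemes + 1)

    def forward(self, features):                # features: (batch, time, num_features)
        encoded, _ = self.encoder(features)
        return self.output(encoded).log_softmax(dim=-1)

# One CTC training step; random tensors stand in for real features and labels.
model = CTCAcousticModel()
ctc_loss = nn.CTCLoss(blank=42)                 # blank index = the extra output unit
features = torch.randn(4, 200, 80)              # 4 utterances, 200 frames, 80-dim filterbanks
targets = torch.randint(0, 42, (4, 30))         # phoneme label sequences (no blank labels)
log_probs = model(features).transpose(0, 1)     # nn.CTCLoss expects (time, batch, classes)
loss = ctc_loss(log_probs, targets,
                torch.full((4,), 200, dtype=torch.long),   # input lengths
                torch.full((4,), 30, dtype=torch.long))    # target lengths
loss.backward()
```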

Cited by 57 publications (52 citation statements)
References 9 publications
“…We compare the behavior of CTC and S2S models on a clean (WSJ) and a noisy (How-To) dataset, and see that CTC output tends to be very close to the acoustics of an utterance, while S2S output appears to be closer to the style of the transcriptions. Similar to [36], we find that S2S approaches are perhaps surprisingly robust against "real-world" data that has not been carefully prepared for speech recognition experiments. On the WSJ dataset, our system outperforms previous S2S implementations such as [30].…”
Section: Discussion (supporting)
confidence: 55%
“…The speaker ID inventory, H, consists of the invited speaker names (e.g., 'Alice' or 'Bob') and anonymous 'guest' IDs produced by the vision module (e.g., 'Speaker1' or 'Speaker2'). In what follows, we propose a model for combining face tracking, face identification, speaker identification, SSL, and the TF masks generated by the preceding CSS module to calculate the speaker ID posterior probability of equation (1). The integration of these complementary cues would make speaker attribution robust to real-world challenges, including speech overlaps, speaker co-location, and the presence of guest speakers.…”
Section: Speaker Diarization (mentioning)
confidence: 99%
“…Meeting transcription and analytics would be key to enhancing productivity as well as improving accessibility in the workplace. It can also be used for conversation transcription in other domains such as healthcare [1]. Research in this space was promoted in the 2000s by the NIST Rich Transcription Evaluation series and the public release of relevant corpora [2][3][4].…”
Section: Introduction (mentioning)
confidence: 99%
“…Importantly, we did not perform text normalization specific to each domain. In the medical domain, we use I2B2'14 (Stubbs and Uzuner, 2015), which consists of identified textual medical notes with PHI tagging, and the Audio Medical Conversations dataset from Chiu et al. (2017), denoted AMC'17, which contains de-identified audio of doctor-patient conversations and their corresponding manual transcripts. Processing the AMC'17 conversations was facilitated by the fact that it is a de-identified dataset, which provides us with the locations of the PHI in the audio and the transcripts.…”
Section: Datasets (mentioning)
confidence: 99%