2015
DOI: 10.1007/978-3-319-23192-1_21

Multimodal Output Combination for Transcribing Historical Handwritten Documents

Abstract: Transcription of digitised historical documents is an interesting task in the document analysis area. This transcription can be achieved by using Handwritten Text Recognition (HTR) on digitised pages or by using Automatic Speech Recognition (ASR) on the dictation of contents. Moreover, another option is using both systems in a multimodal combination to obtain a draft transcription, given that combining the outputs of different recognition systems will generally improve the recognition accuracy. …
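
As a rough illustration of the idea of combining recognizer outputs (the paper itself combines Confusion Networks, as the excerpts below describe), here is a minimal Python sketch that merges two already-aligned word-level hypotheses by confidence-weighted selection. The word-by-word alignment, the word lists and the confidence scores are all hypothetical, not taken from the paper.

# Illustrative sketch only: word-level combination of an HTR and an ASR
# hypothesis by confidence-weighted voting, assuming both hypotheses are
# already aligned word by word. All scores below are made up.

def combine_hypotheses(htr_words, asr_words):
    """Pick, for each aligned position, the word with the higher confidence."""
    combined = []
    for (w_htr, p_htr), (w_asr, p_asr) in zip(htr_words, asr_words):
        if w_htr == w_asr:
            combined.append(w_htr)      # both recognizers agree
        elif p_htr >= p_asr:
            combined.append(w_htr)      # trust the more confident source
        else:
            combined.append(w_asr)
    return combined

if __name__ == "__main__":
    # Hypothetical aligned outputs: (word, confidence)
    htr = [("the", 0.9), ("qvick", 0.4), ("brown", 0.8), ("fox", 0.7)]
    asr = [("the", 0.8), ("quick", 0.9), ("brown", 0.6), ("box", 0.5)]
    print(" ".join(combine_hypotheses(htr, asr)))  # -> the quick brown fox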

Cited by 7 publications (6 citation statements)
References 11 publications (9 reference statements)

“…We assume, however, that for a music practitioner it would be, at least, more appealing to play a composition while reading a music sheet rather than manually transcribing it. Note that we find the same scenario in the field of Handwritten Text Recognition, where producing an utterance from a written text and using a speech recognition system, and then fusing the decisions, required less effort than manually transcribing the text or correcting the errors produced by the text recognition system [8].…”
Section: Introduction (mentioning)
confidence: 53%
“…This framework employs the bimodal Confusion Network combination method defined in [7], [8]. Specifically, starting from the system and the speech decoding outputs in CN format, the following steps are taken: 1) Anchor subnetworks are searched for in order to align the subnetworks of both Confusion Networks.…”
Section: B. Multimodal Combination (mentioning)
confidence: 99%
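
The excerpt above outlines the bimodal Confusion Network (CN) combination of [7], [8]: anchor subnetworks are located so that the two CNs can be aligned before merging. The Python sketch below illustrates only that anchor-finding idea under simplifying assumptions; the CN representation (a list of word-to-posterior dictionaries), the anchor criterion (identical top-scoring word) and the merging by averaged posteriors are illustrative choices, not the authors' exact procedure, and the alignment of the subnetworks between anchors is omitted.

# Minimal sketch of anchor-based alignment of two Confusion Networks.
# Each CN is a list of segments; each segment maps candidate words to
# posterior probabilities. Anchors are segment pairs whose top-scoring
# word coincides; the subnetworks between anchors would then be aligned
# and merged as well (omitted here). Illustrative assumption, not the
# method of [7], [8].

def top_word(segment):
    return max(segment, key=segment.get)

def find_anchors(cn_a, cn_b):
    """Return index pairs (i, j) where both CNs share the same top word."""
    anchors, j_start = [], 0
    for i, seg_a in enumerate(cn_a):
        for j in range(j_start, len(cn_b)):
            if top_word(seg_a) == top_word(cn_b[j]):
                anchors.append((i, j))
                j_start = j + 1
                break
    return anchors

def merge_segments(seg_a, seg_b):
    """Average the posteriors of two aligned segments."""
    words = set(seg_a) | set(seg_b)
    return {w: 0.5 * (seg_a.get(w, 0.0) + seg_b.get(w, 0.0)) for w in words}

if __name__ == "__main__":
    # Hypothetical CNs from an HTR system and an ASR system.
    cn_htr = [{"the": 0.9}, {"qvick": 0.5, "quick": 0.4}, {"fox": 0.8}]
    cn_asr = [{"the": 0.8}, {"quick": 0.7, "quit": 0.2}, {"fox": 0.6, "box": 0.3}]
    anchors = find_anchors(cn_htr, cn_asr)
    merged = [merge_segments(cn_htr[i], cn_asr[j]) for i, j in anchors]
    print(anchors)                       # -> [(0, 0), (2, 2)]
    print([top_word(s) for s in merged])  # -> ['the', 'fox']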
“…Read4SpeechExperiments is a free-software Android application designed to facilitate speech acquisition from mobile devices. The source code is available on GitLab 8, and it can be installed from the Google Play 9 and the F-Droid 10 platforms.…”
Section: B. Crowdsourcing Speech Acquisition (mentioning)
confidence: 99%
“…The multimodal paradigm has experienced spectacular growth in recent years owing to the development of mobile devices (Di Fabbrizio et al, 2009), where different modalities (mainly speech and touch) are employed to manage the device. In the case of Image or Natural Language Processing tasks, multimodality has been applied to problems where signals of a different nature that represent the same final object are available (Mihalcea, 2012; Potamianos et al, 2003; Sebe et al, 2005; Granell and Martínez-Hinarejos, 2015b). In any case, multimodality is strongly linked to human-computer interaction, since the user may employ different modalities to achieve a more ergonomic or faster interaction towards an objective.…”
Section: Introduction (mentioning)
confidence: 99%