2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2017.7953260
Weakly supervised spoken term discovery using cross-lingual side information

Abstract: Recent work on unsupervised term discovery (UTD) aims to identify and cluster repeated word-like units from audio alone. These systems are promising for some very low-resource languages where transcribed audio is unavailable, or where no written form of the language exists. However, in some cases it may still be feasible (e.g., through crowdsourcing) to obtain (possibly noisy) text translations of the audio. If so, this information could be used as a source of side information to improve UTD. Here, we present …

Cited by 10 publications (8 citation statements); references 21 publications (30 reference statements).
“…Poor cross-speaker matches and low audio coverage prevent our system from achieving a high recall, suggesting the use of speech features that are effective in multi-speaker settings (Kamper et al., 2015; Kamper et al., 2016a) and speaker normalization (Zeghidour et al., 2016). Finally, Bansal et al. (2017) recently showed that UTD can be improved using the translations themselves as a source of information, which suggests joint learning as an attractive area for future work.…”
Section: Discussion
Confidence: 98%
“…Many of these errors are due to cross-speaker matches, which are known to be more challenging for UTD (Carlin et al., 2011; Kamper et al., 2015; Bansal et al., 2017). Most matches in our corpus are across calls, yet these are also the least accurate (Table 1).…”
Section: Assigning Wrong Words to a Cluster
Confidence: 91%
“…For endangered languages (extremely low-resource settings) the lack of training data leads to the problem being framed as a sparse translation problem. This semi-supervised task lies between speech translation and keyword spotting, with cross-lingual supervision being used for word segmentation [30,31,32,33]. Bilingual setups for word segmentation were discussed by [34,35,36,37], but applied to speech transcripts (true phones).…”
Section: Related Work
Confidence: 99%
“…We extend s2t to identify new instances of those prototypes in the unlabeled speech, using a modified version of ZRTools (Jansen et al., 2010), the same UTD toolkit used by UTD-align. Previous work has indicated that using translation text to inform acoustic clustering provides more accurate clusters than just using UTD (Bansal et al., 2017a), so we initially expected that this straightforward extension of s2t would work better than UTD-align. However, early experiments indicated that the text had too much influence on clustering, yielding clusters with highly diverse audio, and thus poor prototypes.…”
Section: Methods
Confidence: 99%