Unsupervised Word Segmentation from Speech with Attention

Godard, Pierre; Boito, Marcely Zanon; Ondel, Lucas; Bérard, Alexandre; Yvon, François; Villavicencio, Aline; Besacier, Laurent

doi:10.21437/interspeech.2018-1308

Cited by 23 publications

(32 citation statements)

References 21 publications

(52 reference statements)


“…As in language documentation scenarios available corpora usually contain speech in the language to document aligned with translations in a well-resourced language, Godard et al [5] introduced a pipeline for performing Unsupervised Word Segmentation (UWS) from speech. The system outputs timestamps delimiting stretches of speech, associated with class labels, corresponding to real words in the language.…”

Section: Unsupervised Word Segmentation From Speech (mentioning)

confidence: 99%

“…For each S2S architecture, and each of the three corpora, we train five models (runs) with different initialization seeds. Before segmenting, we average the produced matrices from the five different runs as in [5]. Evaluation is done in a bilingual segmentation condition that corresponds to the real UWS task.…”

Section: Comparing S2S Architectures (mentioning)

confidence: 99%
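
The statement above mentions averaging the soft-alignment matrices of several runs before segmenting. A minimal sketch of that idea follows. It rests on assumptions not stated on this page: rows index target phonemes, columns index source words, and a word boundary is placed wherever two consecutive phonemes put their highest weight on different source words. The function names and the toy matrices are hypothetical and serve illustration only.

```python
from typing import List

# Assumed layout: one row per target phoneme, one column per source word.
Matrix = List[List[float]]


def average_matrices(runs: List[Matrix]) -> Matrix:
    """Element-wise mean of same-shaped soft-alignment matrices."""
    n_runs = len(runs)
    n_rows, n_cols = len(runs[0]), len(runs[0][0])
    return [
        [sum(run[i][j] for run in runs) / n_runs for j in range(n_cols)]
        for i in range(n_rows)
    ]


def segment(phonemes: List[str], matrix: Matrix) -> List[str]:
    """Assumed rule: start a new word whenever the best-aligned
    source word changes between two consecutive phonemes."""
    best = [max(range(len(row)), key=row.__getitem__) for row in matrix]
    words, current = [], phonemes[0]
    for prev, cur, ph in zip(best, best[1:], phonemes[1:]):
        if cur != prev:
            words.append(current)
            current = ph
        else:
            current += ph
    words.append(current)
    return words


# Toy data, made up for illustration: 4 phonemes, 2 source words, 2 runs.
run_a = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
run_b = [[0.7, 0.3], [0.8, 0.2], [0.1, 0.9], [0.4, 0.6]]
avg = average_matrices([run_a, run_b])
result = segment(["a", "b", "c", "d"], avg)
print(result)
```

Averaging before taking the per-row maximum lets runs with different seeds vote on each alignment, which is one plausible reason to combine them at the matrix level rather than combining segmentations.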

“…This paper proposes an empirical evaluation of well-known S2S models for a particular S2S modeling task. This task consists of aligning word sequences in a source language with phoneme sequences in a target language, inferring from it word segmentation on the target side [5]. We concentrate on three models: Convolutional Neural Networks (CNN) [2], Recurrent Neural Networks (RNN) [1] and Transformer-based models [3].…”

Section: Introduction (mentioning)

confidence: 99%


Empirical Evaluation of Sequence-to-Sequence Models for Word Discovery in Low-Resource Settings

Boito, Villavicencio, Besacier

2019

Interspeech 2019

Self Cite


Since Bahdanau et al. [1] first introduced attention for neural machine translation, most sequence-to-sequence models have made use of attention mechanisms [2, 3, 4]. While they produce soft-alignment matrices that could be interpreted as alignments between target and source languages, we lack metrics to quantify their quality, so it remains unclear which approach produces the best alignments. This paper presents an empirical evaluation of three of the main sequence-to-sequence models for word discovery from unsegmented phoneme sequences: CNN, RNN and Transformer-based. This task consists of aligning word sequences in a source language with phoneme sequences in a target language and inferring from this alignment a word segmentation on the target side [5]. Evaluating word segmentation quality can be seen as an extrinsic evaluation of the soft-alignment matrices produced during training. Our experiments in a low-resource scenario on Mboshi and English languages (both aligned to French) show that RNNs surprisingly outperform CNNs and Transformers for this task. Our results are confirmed by an intrinsic evaluation of alignment quality through the use of Average Normalized Entropy (ANE). Lastly, we improve our best word discovery model by using an alignment entropy confidence measure that accumulates ANE over all the occurrences of a given alignment pair in the collection.
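
The abstract names Average Normalized Entropy (ANE) without defining it on this page. The sketch below shows one plausible reading, offered as an assumption rather than the authors' definition: each row of a soft-alignment matrix is treated as a probability distribution, its Shannon entropy is divided by the log of the row length so that it lies in [0, 1], and the per-row values are averaged. The function names and toy matrices are hypothetical.

```python
import math
from typing import List


def normalized_entropy(row: List[float]) -> float:
    """Shannon entropy of one alignment distribution, scaled to [0, 1]
    by dividing by log(len(row)). Assumes the row sums to 1."""
    if len(row) < 2:
        return 0.0  # a single-column row carries no uncertainty
    entropy = -sum(p * math.log(p) for p in row if p > 0.0)
    return entropy / math.log(len(row))


def average_normalized_entropy(matrix: List[List[float]]) -> float:
    """Mean of the per-row normalized entropies (assumed reading of ANE)."""
    return sum(normalized_entropy(row) for row in matrix) / len(matrix)


# Toy matrices, made up for illustration.
peaked = [[1.0, 0.0], [0.0, 1.0]]   # every row fully committed
flat = [[0.5, 0.5], [0.5, 0.5]]     # every row maximally uncertain
ane_peaked = average_normalized_entropy(peaked)
ane_flat = average_normalized_entropy(flat)
print(ane_peaked, ane_flat)
```

Under this reading, lower values indicate sharper alignments, so the measure can rank models without reference segmentations.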


“…Aside from directly improving performance on various tasks, attention (Luong et al, 2015) has proven to be extremely useful when used indirectly in a wide variety of other ways (for example, for segmentation (Tang and Yang, 2018) and unsupervised speech-to-text alignment (Boito et al, 2017; Godard et al, 2018)). In addition, using attention-based models for object segmentation in a weakly supervised setting has been well explored in the vision domain (Teh et al, 2016; Zhang et al, 2018).…”

Section: Related Work (mentioning)

confidence: 99%

Weakly Supervised Attention Networks for Entity Recognition

Patra

Moniz

2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen


The task of entity recognition has traditionally been modelled as a sequence labelling task. However, this usually requires a large amount of fine-grained data annotated at the token level, which in turn can be expensive and cumbersome to obtain. In this work, we aim to circumvent this requirement of word-level annotated data. To achieve this, we propose a novel architecture for entity recognition from a corpus containing weak binary presence/absence labels, which are relatively easier to obtain. We show that our proposed weakly supervised model, trained solely on a multi-label classification task, performs reasonably well on the task of entity recognition, despite not having access to any token-level ground truth data.


“…The same task has been attempted [14] using NMT with attention [15] to align speech or phone sequences to the word labels of the high-resourced language; modifications of the attention mechanism to ensure coverage and richer context. If the true phone sequence in the under-resourced language is unknown, pseudo-phone labels generated by an unsupervised non-parametric Bayesian model [6] can be used as input to the NMT [16].…”

Section: Introduction (mentioning)

confidence: 99%

Multimodal Word Discovery and Retrieval with Phone Sequence and Image Concepts

Wang, Hasegawa‐Johnson

2019

Interspeech 2019


This paper demonstrates three different systems capable of performing the multimodal word discovery task. A multimodal word discovery system accepts, as input, a database of spoken descriptions of images (or a set of corresponding phone transcripts), and learns a lexicon which is a mapping from phone strings to their associated image concepts. Three systems are demonstrated: one based on a statistical machine translation (SMT) model, two based on neural machine translation (NMT). On Flickr8k, the SMT-based model performs much better than the NMT-based one, achieving a 49.6% F1 score. Finally, we apply our word discovery system to the task of image retrieval and achieve 29.1% recall@10 on the standard 1000-image Flickr8k test set.
