2018
DOI: 10.1007/978-3-030-00794-2_30

Morphological and Language-Agnostic Word Segmentation for NMT

Abstract: The state of the art of handling rich morphology in neural machine translation (NMT) is to break word forms into subword units, so that the overall vocabulary size of these units fits the practical limits given by the NMT model and GPU memory capacity. In this paper, we compare two common but linguistically uninformed methods of subword construction (BPE and STE, the method implemented in the Tensor2Tensor toolkit) and two linguistically-motivated methods: Morfessor and one novel method, based on a derivational di…

Cited by 10 publications (9 citation statements)
References 5 publications (4 reference statements)

“…Banerjee and Bhattacharyya (2018) also use unsupervised morphological units generated by Morfessor (Virpioja et al, 2013) as input for a neural machine translation system and report improvements for low-resource conditions. Macháček et al (2018) follow a similar approach for translation into Czech on WMT data, but were not able to obtain improvements over the standard BPE approach.…”
Section: Related Work (mentioning)
confidence: 99%
“…On the other side of the spectrum, it has been observed that automatically learned subwords generally do not correspond to linguistic entities such as morphemes, suffixes, affixes etc. However, linguistically-motivated subword units as proposed by Huck et al (2017), Macháček et al (2018), Ataman et al (2017), Pinnis et al (2017) that also take morpheme boundaries into account do not always improve over completely data-driven ones.…”
Section: Subword-unit-based NMT (mentioning)
confidence: 99%
“…Using unsupervisedly obtained "morphological" subwords on the other hand, only Ataman and Federico (2018b) find that a model based on Morfessor FlatCat can outperform BPE; Zhou (2018), , Macháček et al (2018), and Saleva and Lignos (2021) find no reliable improvement over BPE for translation. Banerjee and Bhattacharyya (2018) analyze translations obtained segmenting with Morfessor and BPE, and conclude that a possible improvement depends on the similarity of the languages.…”
Section: Comparing Morphological Segmentation to BPE and Friends (mentioning)
confidence: 99%