Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017
DOI: 10.18653/v1/p17-1184
From Characters to Words to in Between: Do We Capture Morphology?

Abstract: Words can be represented by composing the representations of subword units such as word segments, characters, and/or character n-grams. While such representations are effective and may capture the morphological regularities of words, they have not been systematically compared, and it is not understood how they interact with different morphological typologies. On a language modeling task, we present experiments that systematically vary (1) the basic unit of representation, (2) the composition of these represent…

Cited by 69 publications (111 citation statements)
References 24 publications (24 reference statements)
“…With a few notable exceptions (Vania and Lopez, 2017; Heigold et al., 2017), there was no systematic investigation of the various modelling architectures. In our work we address the question of what linguistic lexical aspects are best encoded in each type of architecture, and their efficacy as part of a machine translation model when translating from morpho- …”
Section: Related Work
Confidence: 99%
“…Recent studies are exploring representations at the subword level that can provide information even for rare and unseen words. Well-known examples are character and character-n-gram-based embeddings (Sperr et al., 2013; Vania and Lopez, 2017), morphological embeddings (Luong et al., 2013; Botha and Blunsom, 2014; Cotterell and Schütze, 2015; Cao and Rei, 2016), or byte embeddings (Plank et al., 2016; Gillick et al., 2016). Ballesteros et al. (2015) were the first to integrate character-based embeddings into a syntactic parser and compared the effect for different languages with different levels of morphological richness.…”
Section: Introduction
Confidence: 99%