An open diachronic corpus of historical Spanish

Sánchez-Martínez, Felipe; Martínez-Sempere, Isabel; Ivars-Ribes, Xavier; Carrasco, Rafael C.

doi:10.1007/s10579-013-9239-y

Cited by 11 publications

(14 citation statements)

References 6 publications


“…In order to create a CSMT translation model, the training word pairs need to be aligned character by character. While this can be done using weighted finite state transducers (Jiampojamarn, Kondrak and Sherif 2007) or using a simple method based on the longest common subsequence (Sánchez-Martínez et al 2013), better results have been obtained with GIZA++, a more complex tool originally developed for aligning words in parallel sentences (Pettersson et al 2013b, 2014).…”

Section: Experiments and Results (mentioning)

confidence: 99%
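The "simple method based on the longest common subsequence" mentioned in the statement above can be illustrated with a minimal Python sketch. It is not the implementation from Sánchez-Martínez et al (2013): the function name `lcs_align`, the tie-breaking rule, and the example word pair are assumptions made for illustration only. Characters that belong to the longest common subsequence are paired with each other, and every other character is paired with the empty string.

```python
from typing import List, Tuple


def lcs_align(source: str, target: str) -> List[Tuple[str, str]]:
    """Align two words character by character using their longest common subsequence.

    Returns (source_char, target_char) pairs; "" marks a gap on that side.
    Hypothetical helper for illustration, not taken from the cited papers.
    """
    n, m = len(source), len(target)
    # table[i][j] holds the LCS length of source[i:] and target[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if source[i] == target[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start, emitting matches and gaps.
    pairs: List[Tuple[str, str]] = []
    i = j = 0
    while i < n and j < m:
        if source[i] == target[j]:
            pairs.append((source[i], target[j]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            # Assumed tie-break: skip a source character first.
            pairs.append((source[i], ""))
            i += 1
        else:
            pairs.append(("", target[j]))
            j += 1
    # Whatever remains on either side is unmatched.
    pairs.extend((c, "") for c in source[i:])
    pairs.extend(("", c) for c in target[j:])
    return pairs


if __name__ == "__main__":
    # Illustrative historical/modern Spanish pair (an assumption, not corpus data).
    for src, tgt in lcs_align("fazer", "hacer"):
        print(repr(src), "->", repr(tgt))
```

An alignment of this kind only pairs identical characters, so substitutions surface as a gap on each side. This is one reason the statement reports better results with a statistical aligner such as GIZA++.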

“…CSMT requires less training data than word-level SMT but is limited to applications where regular changes occur at the character level. It has been successfully used for translation between closely related languages (Vilar et al 2007; Tiedemann 2009), transliteration (Tiedemann 2009), lexicon induction (Scherrer and Sagot 2014), cognate generation (Beinborn, Zesch and Gurevych 2013), standardisation of user-generated content (De Clercq et al 2013; Ljubešić, Erjavec and Fišer 2014) and finally normalisation of historical words (Pettersson, Megyesi and Tiedemann 2013b; Sánchez-Martínez et al 2013; Scherrer and Erjavec 2013; Pettersson, Megyesi and Nivre 2014). CSMT models have been shown to outperform stochastic transducers on a number of tasks (Tiedemann 2009); they are more flexible as phrases can be long (up to ten characters) and of variable length.…”

Section: Related Work (mentioning)

confidence: 99%

“…We present two experiments in this paper. In the first experiment – a supervised setting – we build a CSMT system analogously to previous work such as Pettersson et al (2013b) or Sánchez-Martínez et al (2013), assuming that training word pairs are available. In the second experiment – a setting which we call unsupervised – we only rely on monolingual word lists (i.e., no word pairs) for training.…”

Section: Related Work (mentioning)

confidence: 99%


Modernising historical Slovene words

Scherrer, Erjavec

2015

Nat. Lang. Eng.


We propose a language-independent word normalisation method and exemplify it on modernising historical Slovene words. Our method relies on character-level statistical machine translation (CSMT) and uses only shallow knowledge. We present relevant data on historical Slovene, consisting of two (partially) manually annotated corpora and the lexicons derived from these corpora, containing historical word-modern word pairs. The two lexicons are disjoint, with one serving as the training set containing 40,000 entries, and the other as a test set with 20,000 entries. The data spans the years 1750-1900, and the lexicons are split into 50-year slices, with all the experiments carried out separately on the three time periods. We perform two sets of experiments. In the first one, a supervised setting, we build a CSMT system using the lexicon of word pairs as training data. In the second one, an unsupervised setting, we simulate a scenario in which word pairs are not available. We propose a two-step method where we first extract a noisy list of word pairs by matching historical words with cognate modern words, and then train a CSMT system on these pairs. In both sets of experiments we also optionally make use of a lexicon of modern words to filter the modernisation hypotheses. While we show that both methods produce significantly better results than the baselines, their accuracy, and which method works best, correlate strongly with the age of the texts, meaning that the choice of the best method will depend on the properties of the historical language which is to be modernised. As an extrinsic evaluation we also compare the quality of part-of-speech tagging and lemmatisation directly on historical text and on its modernised words. We show that, depending on the age of the text, annotation on modernised words also produces significantly better results than annotation on the original text.
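The optional lexicon filter described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the function name `filter_hypotheses`, the assumption that hypotheses arrive ranked best-first, and the fallback to the top-ranked hypothesis when none is attested in the lexicon are all hypothetical choices.

```python
from typing import Iterable, Sequence


def filter_hypotheses(candidates: Sequence[str], modern_lexicon: Iterable[str]) -> str:
    """Pick a modernised form from ranked hypotheses using a modern-word lexicon.

    `candidates` is assumed to be ordered best-first (for example, an n-best
    list produced by a CSMT system). Hypothetical helper for illustration.
    """
    if not candidates:
        raise ValueError("at least one hypothesis is required")
    lexicon = set(modern_lexicon)
    # Prefer the highest-ranked hypothesis that is an attested modern word.
    for candidate in candidates:
        if candidate in lexicon:
            return candidate
    # Assumed fallback: keep the top-ranked hypothesis unchanged.
    return candidates[0]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    lexicon = {"hacer", "decir"}
    print(filter_hypotheses(["hazer", "hacer"], lexicon))
    print(filter_hypotheses(["dezir"], lexicon))
```

Falling back to the top hypothesis keeps the filter from discarding out-of-lexicon words such as proper nouns. Whether that matches the behaviour evaluated in the paper is not stated in the abstract.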


“…A more recent approach is based on character-based statistical machine translation applied to historical text (Pettersson et al, 2013; Sánchez-Martínez et al, 2013; Scherrer and Erjavec, 2013) or dialectal data (Scherrer and Ljubešić, 2016). This is conceptually very similar to our approach, except that we substitute the classical SMT algorithms for neural networks.…”

Section: Related Work (mentioning)

confidence: 99%

Learning attention for historical text normalization by learning to pronounce

Bollmann

Bingel

Søgaard

2017

Proceedings of the 55th Annual Meeting of the Association For Computational Linguistics (Volume 1: Long Papers)


Automated processing of historical texts often relies on pre-normalization to modern word forms. Training encoder-decoder architectures to solve such problems typically requires a lot of training data, which is not available for the named task. We address this problem by using several novel encoder-decoder architectures, including a multi-task learning (MTL) architecture using a grapheme-to-phoneme dictionary as auxiliary data, pushing the state-of-the-art by an absolute 2% increase in performance. We analyze the induced models across 44 different texts from Early New High German. Interestingly, we observe that, as previously conjectured, multi-task learning can learn to focus attention during decoding, in ways remarkably similar to recently proposed attention mechanisms. This, we believe, is an important step toward understanding how MTL works.


“…For Italian, they were Google Italian Ngram (40 billion words, spanning the early 1500s to the early 2000s) (Lin et al 2012) and DiaCoris (20 million words, spanning the late 1800s to the early 2000s) (Onelli et al 2006). For Spanish, they were Google Spanish Ngram (84 billion words, spanning the early 1500s to the early 2000s) (Lin et al 2012) and IMPACT-es (8 million words, spanning the late 1400s to the mid 1700s) (Sánchez-Martínez et al 2013).…”

Section: Diachronic Study: Corpora (mentioning)

confidence: 99%

The rise and fall of the L-shaped morphome: diachronic and experimental studies

Nevins

Rodrigues

Tang

2015

Probus


It has been suggested that the Romance first person singular indicative constitutes a natural class with the present subjunctive paradigm for the purposes of stem selection (Maiden 2005), thus forming a kind of 'diagonal syncretism', as the latter shares no morphosyntactic features with the former. The existence of such patterns has been taken to be an argument for autonomous morphology and the existence of unnatural 'morphomes', in the sense of Aronoff (1994). Our experimental investigations with native speakers of Portuguese, Italian, and Spanish reveal that this pattern is underlearned, and that speakers do not generalize it to novel forms, instead preferring the 2nd person singular indicative to the 1st person as the base for the derivation of the subjunctive paradigm (and the 2nd person indicative as opposed to the 2nd person subjunctive as the base for the derivation of the 1st person indicative as well). The results implicate a role for naturalness biases in morphological structure, and an awareness that the first person singular is an unreliable and idiosyncratic base for productive inflectional identity. We then study the underlearning of the L-morphome in terms of historical change in the salience of these patterns. We demonstrate, through means of diachronic corpus studies spanning five centuries, a change in the ratio of first conjugation verbs to second & third conjugation verbs, and a resulting decrease in the relative type frequency of where morphomic verbs reside. If indeed learners need increased evidence in order to incorporate and actively uptake unnatural patterns, this lexical support has dwindled over time. Even though many of the morphomic verbs have maintained a very high token frequency (allowing them to survive as memorized), their productivity has diminished over time, and hence they go unlearned as a generalizable pattern. When the distribution of irregular alternations is overshadowed in the lexicon, a morphologically unnatural pattern may cease to maintain its status as part of the grammar.
