2013
DOI: 10.1007/s10579-013-9239-y

An open diachronic corpus of historical Spanish

Abstract: The impact-es diachronic corpus of historical Spanish compiles over one hundred books (containing approximately 8 million words), in addition to a complementary lexicon which links more than 10 thousand lemmas with attestations of the different variants found in the documents. This textual corpus and the accompanying lexicon have been released under an open license (Creative Commons by-nc-sa) in order to permit their intensive exploitation in linguistic research. Approximately 7% of the words in the corpus (a sel…

Cited by 11 publications (14 citation statements)
References: 6 publications
“…In order to create a CSMT translation model, the training word pairs need to be aligned character by character. While this can be done using weighted finite state transducers (Jiampojamarn, Kondrak and Sherif 2007) or using a simple method based on the longest common subsequence (Sánchez-Martínez et al 2013), better results have been obtained with GIZA++, a more complex tool originally developed for aligning words in parallel sentences (Pettersson et al 2013b, 2014).…”
Section: Experiments and Results (mentioning)
confidence: 99%
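
The longest-common-subsequence alignment mentioned in the statement above can be illustrated with a minimal sketch. This is not the procedure of Sánchez-Martínez et al. (2013) or of the citing paper; the function names `lcs_table` and `align`, the use of an empty string to mark an unaligned character, and the example pair dixo/dijo are all assumptions made for illustration.

```python
def lcs_table(a, b):
    """t[i][j] = length of the longest common subsequence of a[i:] and b[j:]."""
    m, n = len(a), len(b)
    t = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                t[i][j] = 1 + t[i + 1][j + 1]
            else:
                t[i][j] = max(t[i + 1][j], t[i][j + 1])
    return t


def align(old, new):
    """Align two spellings character by character along one LCS.

    Returns (old_char, new_char) pairs; "" marks a character that has
    no counterpart in the other word (hypothetical convention).
    """
    t = lcs_table(old, new)
    i = j = 0
    pairs = []
    while i < len(old) and j < len(new):
        if old[i] == new[j]:
            # Shared character: part of the common subsequence.
            pairs.append((old[i], new[j]))
            i += 1
            j += 1
        elif t[i + 1][j] >= t[i][j + 1]:
            # Skipping the old character keeps the LCS at least as long.
            pairs.append((old[i], ""))
            i += 1
        else:
            pairs.append(("", new[j]))
            j += 1
    # Whatever remains on either side is unaligned.
    pairs.extend((c, "") for c in old[i:])
    pairs.extend(("", c) for c in new[j:])
    return pairs


if __name__ == "__main__":
    for old_char, new_char in align("dixo", "dijo"):
        print(repr(old_char), "->", repr(new_char))
```

The characters left unaligned between matched positions (here the historical x and the modern j) are the candidate character-level correspondences; a statistical aligner such as GIZA++ would learn them from data instead of relying on exact matches.
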
“…CSMT requires less training data than word-level SMT but is limited to applications where regular changes occur at the character level. It has been successfully used for translation between closely related languages (Vilar et al 2007; Tiedemann 2009), transliteration (Tiedemann 2009), lexicon induction (Scherrer and Sagot 2014), cognate generation (Beinborn, Zesch and Gurevych 2013), standardisation of user-generated content (De Clercq et al 2013; Ljubešić, Erjavec and Fišer 2014) and finally normalisation of historical words (Pettersson, Megyesi and Tiedemann 2013b; Sánchez-Martínez et al 2013; Scherrer and Erjavec 2013; Pettersson, Megyesi and Nivre 2014). CSMT models have been shown to outperform stochastic transducers on a number of tasks (Tiedemann 2009); they are more flexible as phrases can be long (up to ten characters) and of variable length.…”
Section: Related Work (mentioning)
confidence: 99%
“…A more recent approach is based on character-based statistical machine translation applied to historical text (Pettersson et al., 2013; Sánchez-Martínez et al., 2013; Scherrer and Erjavec, 2013) or dialectal data (Scherrer and Ljubešić, 2016). This is conceptually very similar to our approach, except that we substitute the classical SMT algorithms for neural networks.…”
Section: Related Work (mentioning)
confidence: 99%
“…For Italian, they were Google Italian Ngram (40 billion words, spanning the early 1500s to the early 2000s) (Lin et al 2012) and DiaCoris (20 million words, spanning the late 1800s to the early 2000s) (Onelli et al 2006). For Spanish, they were Google Spanish Ngram (84 billion words, spanning the early 1500s to the early 2000s) (Lin et al 2012) and IMPACT-es (8 million words, spanning the late 1400s to the mid 1700s) (Sánchez-Martínez et al 2013).…”
Section: Diachronic Study: Corpora (mentioning)
confidence: 99%