Proceedings of the 13th Linguistic Annotation Workshop 2019
DOI: 10.18653/v1/w19-4018

One format to rule them all – The emtsv pipeline for Hungarian

Abstract: We present a more efficient version of the e-magyar NLP pipeline for Hungarian called emtsv. It integrates Hungarian NLP tools in a framework whose individual modules can be developed or replaced independently and allows new ones to be added. The design also allows convenient investigation and manual correction of the data flow from one module to another. The improvements we publish include effective communication between the modules and support of the use of individual modules both in the chain and standing a…

Cited by 10 publications (9 citation statements)
References 9 publications
“…The poems in raw TXT were tokenized, lemmatized, and morphologically analyzed by means of the emtsv system (Indig et al (2019) also known as E-magyar; Váradi et al (2018)). In addition, they were phonetically transcribed using the eSpeak synthesizer.…”
Section: Data and Annotation (mentioning)
confidence: 99%
“…This was done as follows: the 300K subset of the 2020 Hungarian news subcorpus was downloaded from the Leipzig Corpora Collection by Goldhahn, Eckart & Quasthoff (2012). Morphological and syntactic dependency analysis were performed on these sentences using the emagyar text processing system by Indig et al (2019) and Váradi et al (2018). This allowed to annotate the sentences as follows:…”
Section: Linguistic Probing Tasks (mentioning)
confidence: 99%
“…The TEI XML files contain not only the text of the poems but among other types of annotations, the lemma, the part of speech and the morphosyntactic features of words as well. These grammatical annotations have been created by the program e-magyar, an NLP tool for the automatic analysis of the grammatical features of Hungarian texts (Váradi et al 2018; Indig et al 2019). The research corpus containing the texts of 23 Hungarian poets has 11,262 poems and 2,120,996 words.…”
Section: Corpus and Tools (mentioning)
confidence: 99%