2014
DOI: 10.4000/jtei.1114

The DTA “Base Format”: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources

Abstract: Until recently, the creation of large historical reference corpora was, from the point of view of their encoding, a rather project-specific activity. Although reference corpora were built from texts of various origins, the texts had to be converted into a tailor-made format. For example, corpora like the well-known British National Corpus 1 and the DWDS core corpus (Geyken 2007) are both annotated on the basis of the Guidelines of the Text Encoding Initiative (most recent release: P5; see TEI Consortium 2014). How…

Cited by 13 publications (6 citation statements) · References 3 publications
“…The resulting OCR full text was processed by the OCR software ABBYY Finereader 9 and consists of approximately 500 million characters and 65 million tokens. As a second aspect of text quality, we enhanced the level of document structure according to an agreed standard format together with our partners, the Deutsches Textarchiv (DTA; Haaf, Geyken, and Wiegand, 2014/15). Figure 1 shows manually corrected and tagged "zoning information" based on coordinates provided by the ABBYY Finereader XML files.…”
Section: Digitizing University Libraries as Full-Text Providers for CLARIN
Mentioning confidence: 99%
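The "zoning information" described in this statement can be recovered programmatically from the FineReader output. The following is a minimal sketch, not the project's actual pipeline: it assumes an ABBYY FineReader XML export in which page blocks carry a blockType and l/t/r/b pixel coordinates (as in the FineReader XML schema), matches tags namespace-agnostically because the namespace URI varies across FineReader versions, and uses a placeholder file name.

import xml.etree.ElementTree as ET

def extract_zones(path):
    """Yield (blockType, left, top, right, bottom) for each block on a page."""
    root = ET.parse(path).getroot()
    for block in root.iter('{*}block'):        # '{*}' matches any namespace (Python 3.8+)
        yield (
            block.get('blockType'),            # e.g. 'Text', 'Picture', 'Table'
            int(block.get('l')), int(block.get('t')),
            int(block.get('r')), int(block.get('b')),
        )

for zone in extract_zones('page_0001.xml'):    # placeholder file name
    print(zone)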
“…There has to be a decision for ALTO, PAGE, TEI, or other file formats, possibly together with 'annotation guidelines' (Haaf et al., 2014/15). Supplemental note: the use of format converters should be considered with care. There will always be some loss of information when converting from one format to another.…”
Section: Prospects for Future Collaboration between CLARIN and Academic Libraries
Mentioning confidence: 99%
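To make the supplemental note concrete, here is a hedged sketch of the kind of loss it warns about: flattening an ALTO file to plain text preserves the word sequence but silently drops the coordinates, styles, and hyphenation data that ALTO encodes. Tag and attribute names (TextLine, String, CONTENT) follow the ALTO schema; tags are matched namespace-agnostically because the namespace URI differs between ALTO versions, and the file name is a placeholder.

import xml.etree.ElementTree as ET

def alto_to_text(path):
    """Flatten an ALTO file to plain text; all layout information is dropped."""
    root = ET.parse(path).getroot()
    lines = []
    for text_line in root.iter('{*}TextLine'):
        words = [s.get('CONTENT', '') for s in text_line.iter('{*}String')]
        lines.append(' '.join(words))          # HPOS/VPOS/WIDTH/HEIGHT are ignored here
    return '\n'.join(lines)

print(alto_to_text('page_0001.alto.xml'))      # placeholder file name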
“…The .json format is intended for ease of use and speed of processing while retaining some expressiveness. Our XML format is built on top of a "Base Format", the so-called DTA-Basisformat 3 (Haaf et al., 2014), which not only constrains the data to the TEI P5 guidelines but also to a stricter RelaxNG schema that we modified for our annotation. 4 We built a large, comprehensive, and easily searchable resource of New High German poetry by collecting and parsing the bulk of digitized corpora that contain public-domain German literature.…”
Section: Large Poetry Corpora
Mentioning confidence: 99%
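Validation against such a RelaxNG schema is routinely done with lxml. The sketch below is illustrative rather than the authors' actual setup; both file paths are placeholders, standing in for the DTA-Basisformat schema (or the modified variant the statement mentions) and a corpus document.

from lxml import etree

schema = etree.RelaxNG(etree.parse('basisformat.rng'))  # placeholder schema path
doc = etree.parse('poem.tei.xml')                       # placeholder document path

if schema.validate(doc):
    print('document is valid against the schema')
else:
    for error in schema.error_log:                      # per-line diagnostics
        print(error)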
“…To follow our logic of interoperability, like many other literary corpora, we have decided to encode our corpus in XML-TEI P5 [Burnard 2014]. Because documenting encoding choices is (sadly, according to Burnard [2019]) not common in France, our decisions are inspired by two non-French projects: the Deutsches Textarchiv (DTA) [Haaf et al. 2014]…”
Section: Data Structure
Mentioning confidence: 99%
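For readers unfamiliar with TEI P5, the sketch below shows the smallest well-formed TEI document, the kind of baseline that corpora like these then constrain further with project schemas such as the DTA-Basisformat. All metadata values are placeholders; only the element structure (a teiHeader containing a fileDesc, plus text/body) and the TEI namespace are prescribed by the Guidelines.

import xml.etree.ElementTree as ET

TEI_MINIMAL = """\
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Placeholder title</title></titleStmt>
      <publicationStmt><p>Placeholder publication statement</p></publicationStmt>
      <sourceDesc><p>Placeholder source description</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Placeholder text.</p>
    </body>
  </text>
</TEI>
"""

root = ET.fromstring(TEI_MINIMAL)   # well-formedness check only, not schema validation
print(root.tag)                     # -> '{http://www.tei-c.org/ns/1.0}TEI'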