2014
DOI: 10.4000/jtei.1114

The DTA “Base Format”: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources

Abstract: Until recently, the creation of large historical reference corpora was, from the point of view of their encoding, a rather project-specific activity. Although reference corpora were built from texts of various origins, the texts had to be converted into a tailor-made format. For example, corpora like the well-known British National Corpus 1 and the DWDS core corpus (Geyken 2007) are both annotated on the basis of the Guidelines of the Text Encoding Initiative (most recent release: P5; see TEI Consortium 2014). How…

Cited by 13 publications (6 citation statements) · References 3 publications
“…The resulting OCR full text was processed by the OCR software ABBYY Finereader 9 and consists of approximately 500 million characters and 65 million tokens. As a second aspect of text quality, we enhanced the level of document structure according to an agreed standard format together with our partners, the Deutsches Textarchiv (DTA; Haaf, Geyken, and Wiegand, 2014/15). Figure 1 shows manually corrected and tagged "zoning information" based on coordinates provided by the ABBYY Finereader XML files.…”
Section: Digitizing University Libraries as Full-Text Providers for CLARIN
Mentioning confidence: 99%
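The "zoning information" described in this statement can be recovered programmatically from the FineReader output. The following is a minimal sketch, not the project's actual pipeline: it assumes an ABBYY FineReader XML export in which page blocks carry a blockType and l/t/r/b pixel coordinates (as in the FineReader XML schema), matches tags namespace-agnostically because the namespace URI varies across FineReader versions, and uses a placeholder file name.

import xml.etree.ElementTree as ET

def extract_zones(path):
    """Yield (blockType, left, top, right, bottom) for each block on a page."""
    root = ET.parse(path).getroot()
    for block in root.iter('{*}block'):        # '{*}' matches any namespace (Python 3.8+)
        yield (
            block.get('blockType'),            # e.g. 'Text', 'Picture', 'Table'
            int(block.get('l')), int(block.get('t')),
            int(block.get('r')), int(block.get('b')),
        )

for zone in extract_zones('page_0001.xml'):    # placeholder file name
    print(zone)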
“…There has to be a decision for ALTO, PAGE, TEI, or other file formats, possibly together with 'annotation guidelines' (Haaf et al., 2014/15). Supplemental note: the use of format converters should be considered with care. There will always be some loss of information when converting from one format to another.…”
Section: Prospects for Future Collaboration between CLARIN and Academic Libraries
Mentioning confidence: 99%
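To make the supplemental note concrete, here is a hedged sketch of the kind of loss it warns about: flattening an ALTO file to plain text preserves the word sequence but silently drops the coordinates, styles, and hyphenation data that ALTO encodes. Tag and attribute names (TextLine, String, CONTENT) follow the ALTO schema; tags are matched namespace-agnostically because the namespace URI differs between ALTO versions, and the file name is a placeholder.

import xml.etree.ElementTree as ET

def alto_to_text(path):
    """Flatten an ALTO file to plain text; all layout information is dropped."""
    root = ET.parse(path).getroot()
    lines = []
    for text_line in root.iter('{*}TextLine'):
        words = [s.get('CONTENT', '') for s in text_line.iter('{*}String')]
        lines.append(' '.join(words))          # HPOS/VPOS/WIDTH/HEIGHT are ignored here
    return '\n'.join(lines)

print(alto_to_text('page_0001.alto.xml'))      # placeholder file name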
“…The .json format is intended for ease of use and speed of processing while retaining some expressiveness. Our XML format is built on top of a "Base Format", the so-called DTA-Basisformat 3 (Haaf et al., 2014), which not only constrains the data to the TEI P5 guidelines but also to a stricter RelaxNG schema that we modified for our annotation. 4 We built a large, comprehensive, and easily searchable resource of New High German poetry by collecting and parsing the bulk of digitized corpora that contain public-domain German literature.…”
Section: Large Poetry Corpora
Mentioning confidence: 99%
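Validation against such a RelaxNG schema is routinely done with lxml. The sketch below is illustrative rather than the authors' actual setup; both file paths are placeholders, standing in for the DTA-Basisformat schema (or the modified variant the statement mentions) and a corpus document.

from lxml import etree

schema = etree.RelaxNG(etree.parse('basisformat.rng'))  # placeholder schema path
doc = etree.parse('poem.tei.xml')                       # placeholder document path

if schema.validate(doc):
    print('document is valid against the schema')
else:
    for error in schema.error_log:                      # per-line diagnostics
        print(error)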
“…To follow our logic of interoperability, like many other literary corpora, we have decided to encode our corpus in XML-TEI P5 [Burnard 2014]. Because documenting encoding choices is (sadly, according to Burnard [2019]) not common in France, our decisions are inspired by two non-French projects: the Deutsches Textarchiv (DTA) [Haaf et al. 2014]…”
Section: Data Structure
Mentioning confidence: 99%
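For readers unfamiliar with TEI P5, the sketch below shows the smallest well-formed TEI document, the kind of baseline that corpora like these then constrain further with project schemas such as the DTA-Basisformat. All metadata values are placeholders; only the element structure (a teiHeader containing a fileDesc, plus text/body) and the TEI namespace are prescribed by the Guidelines.

import xml.etree.ElementTree as ET

TEI_MINIMAL = """\
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Placeholder title</title></titleStmt>
      <publicationStmt><p>Placeholder publication statement</p></publicationStmt>
      <sourceDesc><p>Placeholder source description</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Placeholder text.</p>
    </body>
  </text>
</TEI>
"""

root = ET.fromstring(TEI_MINIMAL)   # well-formedness check only, not schema validation
print(root.tag)                     # -> '{http://www.tei-c.org/ns/1.0}TEI'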