Susanne Haaf scite author profile

Susanne Haaf

5 Publications

17 Citation Statements Received

2 Citation Statements Given


Affiliations

Berlin-Brandenburg Academy of Sciences and Humanities

Publications


Measuring the Correctness of Double-Keying: Error Classification and Quality Control in a Large Corpus of TEI-Annotated Historical Text

Haaf, Wiegand, Geyken. 2013. jtei


Among mass digitization methods, double-keying is considered to be the one with the lowest error rate. This method requires two independent transcriptions of a text by two different operators. It is particularly well suited to historical texts, which often exhibit deficiencies like poor master copies or other difficulties such as spelling variation or complex text structures. Providers of data entry services using the double-keying method generally advertise very high accuracy rates (around 99.95% to 99.98%). These advertised percentages are generally estimated on the basis of small samples, and little if anything is said about either the actual amount of text or the text genres which have been proofread, about error types, proofreaders, etc. In order to obtain significant data on this problem it is necessary to analyze a large amount of text representing a balanced sample of different text types, to distinguish the structural XML/TEI level from the typographical level, and to differentiate between various types of errors which may originate from different sources and may not be equally severe. This paper presents an extensive and complex approach to the analysis and correction of double-keying errors which has been applied by the DFG-funded project "Deutsches Textarchiv" (German Text Archive, hereafter DTA) in order to evaluate and preferably to increase the transcription and annotation accuracy of double-keyed DTA texts. Statistical analyses of the results gained from proofreading a large quantity of text are presented, which verify the common accuracy rates for the double-keying method.


The DTA “Base Format”: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources

Haaf, Geyken, Wiegand. 2014. jtei


Until recently the creation of large historical reference corpora was, from the point of view of its encoding, a rather project-specific activity. Although reference corpora were built from texts of various origins, the texts had to be converted into a tailor-made format. For example, corpora like the well-known British National Corpus and the DWDS core corpus (Geyken 2007) are both annotated on the basis of the Guidelines of the Text Encoding Initiative (most recent release: P5; see TEI Consortium 2014). However, the encoding in these cases was typically carried out specifically for the creation of these corpora; that is, it was a unidirectional process. The interchange, and more importantly the interoperability, of corpora with other corpora played only a minor role. In recent years the picture has dramatically changed. With the availability of more and more digitized texts in TEI P5 format and the advent of many academic and non-academic corpus-building projects, the task of creating reference corpora has shifted from a project-specific task to a more general task, requiring joint efforts by many stakeholders. In this new situation, individual ...


10. Das Deutsche Textarchiv als Forschungsplattform für historische Daten in CLARIN

Geyken, Boenig, Haaf, et al. 2018


Enabling the Encoding of Manuscripts within the DTABf: Extension and Modularization of the Format

Haaf, Thomas. 2016. jtei


Mehrdimensionale Beschreibung erbaulicher Textsorten des 17. Jahrhunderts

Haaf. 2019


This article presents a method for the systematic, corpus-based investigation of the essential constituents of devotional text types and illustrates it with three sample analyses. It rests on the assumption that features of different linguistic dimensions often manifest themselves as patterns on the text surface. Given the growing availability of large historical corpora and computational-linguistic methods, it is now possible to systematically test and refine qualitatively obtained findings on the features and typical text patterns of devotional text types. The focus is on seventeenth-century devotional books and funeral sermons, which are available in digital, structured form in the Deutsches Textarchiv. The automatic extraction of text patterns is framed by qualitative steps for specifying, interpreting, and classifying those patterns.
