Among mass digitization methods, double-keying is considered to have the lowest error rate. The method requires two independent transcriptions of a text by two different operators. It is particularly well suited to historical texts, which often pose problems such as poor master copies, spelling variation, or complex text structures. Providers of data entry services using the double-keying method generally advertise very high accuracy rates (around 99.95% to 99.98%). These advertised percentages are usually estimated from small samples, and little if anything is said about the amount of text or the text genres that were proofread, or about error types, proofreaders, etc. To obtain significant data on this problem, it is necessary to analyze a large amount of text representing a balanced sample of different text types, to distinguish the structural XML/TEI level from the typographical level, and to differentiate between various types of errors, which may originate from different sources and may not be equally severe. This paper presents an extensive and complex approach to the analysis and correction of double-keying errors that has been applied by the DFG-funded project "Deutsches Textarchiv" (German Text Archive, hereafter DTA) in order to evaluate, and preferably to increase, the transcription and annotation accuracy of double-keyed DTA texts. Statistical analyses of the results gained from proofreading a large quantity of text are presented, which verify the common accuracy rates for the double-keying method.
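To make the quoted figures concrete, the following is a minimal illustrative sketch, not the DTA's actual workflow. It assumes that the two keyings are available as plain strings and that accuracy is measured per character; the function names `keying_discrepancies` and `character_accuracy` are hypothetical. The first function lists the places where two independent transcriptions disagree, and the second converts an error count into an accuracy percentage.

```python
from difflib import SequenceMatcher


def keying_discrepancies(first_keying, second_keying):
    """List the spans where two independent transcriptions disagree.

    Each entry is (tag, span_in_first, span_in_second), where tag is one of
    'replace', 'delete' or 'insert' as reported by difflib.
    """
    matcher = SequenceMatcher(None, first_keying, second_keying, autojunk=False)
    return [
        (tag, first_keying[i1:i2], second_keying[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def character_accuracy(error_count, total_characters):
    """Percentage of correctly transcribed characters (assumed per-character metric)."""
    if total_characters <= 0:
        raise ValueError("total_characters must be positive")
    return 100.0 * (1 - error_count / total_characters)


if __name__ == "__main__":
    # Hypothetical example: one operator keys a long s, the other a round s.
    print(keying_discrepancies("Hauſe", "Hause"))
    # 5 errors in 10,000 characters corresponds to the lower advertised rate.
    print(character_accuracy(5, 10_000))
```

Under this per-character assumption, the advertised range of 99.95% to 99.98% corresponds to roughly five to two remaining errors per 10,000 characters.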
Until recently the creation of large historical reference corpora was, from the point of view of its encoding, a rather project-specific activity. Although reference corpora were built from texts of various origins, the texts had to be converted into a tailor-made format. For example, corpora like the well-known British National Corpus and the DWDS core corpus (Geyken 2007) are both annotated on the basis of the Guidelines of the Text Encoding Initiative (most recent release: P5; see TEI Consortium 2014). However, the encoding in these cases was typically carried out specifically for the creation of these corpora; that is, it was a unidirectional process. The interchange, and more importantly the interoperability, of corpora with other corpora played only a minor role. In recent years the picture has dramatically changed. With the availability of more and more digitized texts in TEI P5 format and the advent of many academic and non-academic corpus-building projects, the task of creating reference corpora has shifted from a project-specific task to a more general task, requiring joint efforts by many stakeholders. In this new situation, individual
This article presents a method for the systematic, corpus-based investigation of the essential constituents of devotional text genres and illustrates it with three sample analyses. It rests on the assumption that features of different linguistic dimensions often manifest themselves as patterns on the textual surface. Given the increasing availability of large historical corpora and computational-linguistic methods, it is now possible to systematically test, and to make more specific, qualitatively obtained findings about the features and typical text patterns of devotional text genres. The focus is on devotional books and funeral sermons of the 17th century, which are available in digital and structured form in the Deutsches Textarchiv. The automatic extraction of text patterns is framed by qualitative steps for their specification, interpretation, and classification.