A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error into the text. The impact of these errors on natural language processing (NLP) tasks has only been partially studied. We perform a series of extrinsic assessment tasks (sentence segmentation, named entity recognition, dependency parsing, information retrieval, topic modelling and neural language model fine-tuning) using popular, out-of-the-box tools in order to quantify the impact of OCR quality on these tasks. We find a consistent impact resulting from OCR errors on our downstream tasks, with some tasks more irredeemably harmed by OCR errors than others. Based on these results, we offer some preliminary guidelines for working with text produced through OCR.
Although the Ordnance Survey has itself been the subject of historical research, scholars have not systematically used its maps as primary sources of information. This is partly for disciplinary reasons and partly for the technical reason that high-quality maps have not until recently been available digitally, geo-referenced, and in color. A final, and crucial, addition has been the creation of item-level metadata which allows map collections to become corpora which can for the first time be interrogated en masse as source material. By applying new Computer Vision methods leveraging machine learning, we outline a research pipeline for working with thousands (rather than a handful) of maps at once, which enables new forms of historical inquiry based on spatial analysis. Our ‘patchwork method’ draws on the longstanding desire to adopt an overall or ‘complete’ view of a territory, and in so doing highlights certain parallels between the situation faced by today’s users of digitized maps, and a similar inflexion point faced by their predecessors in the nineteenth century, as the project to map the nation approached a form of completion.