This exploratory study proposes a prototype sentence-level parallel corpus to support the study of optical character recognition (OCR) quality in curated digitized library collections. Existing data resources, such as ICDAR2019 [21] and GT4HistOCR [23], generally align content by artifact publishing units such as documents or lines, which limits the exploration of OCR noise at natural-language granularities such as sentences and chapters. Building upon an existing volume-aligned corpus that pairs human-proofread texts from Project Gutenberg with OCR views from the HathiTrust Digital Library, we extracted and aligned 167,079 sentences from 189 sampled books in four domains published from 1793 to 1984. To support downstream research on OCR quality, we analyzed OCR errors with a specific focus on their associations with source-text metadata. We found that sampled data in agriculture have a higher ratio of real-word errors than other domains, while sentences from social-science volumes contain more non-word errors. Moreover, data sampled from earlier volumes tend to have a higher ratio of non-word errors, while samples from recently published volumes are likely to have more real-word errors. Based on these findings, we suggest that scholars consider the potential influence of source-data characteristics on their results when studying OCR quality issues.

CCS CONCEPTS: • Information systems → Digital libraries and archives; • Applied computing → Document management and text processing; Document capture.
While digital libraries (DLs) have made large-scale collections of digitized books increasingly available to researchers [31, 67], there remains a dearth of similar data provisions or infrastructure for computational studies of the consumption and reception of books. In the last two decades, user-generated book reviews on social media have opened up unprecedented research possibilities for humanities and social sciences (HSS) scholars interested in book reception. However, limitations and gaps have emerged from existing DH research that utilizes social media data to answer HSS questions. To shed light on the under-investigated features of user-generated book reviews and the challenges they might pose to scholarly research, we conducted three exemplar case studies: (1) a longitudinal analysis profiling the temporal changes in the ratings and popularity of 552 books across ten years; (2) a cross-cultural comparison of the ratings of the same 538 books across two platforms; and (3) a classification experiment on 20,000 sponsored and non-sponsored book reviews. Correspondingly, our research reveals the real-world complexities and under-investigated features of user-generated book reviews in three dimensions: the transience of book ratings and popularity (temporal dimension), the cross-cultural differences in reading interests and book reception (cultural dimension), and the user power dynamics behind publicly accessible reviews ("political" dimension). Our case studies also demonstrate the challenges these real-world complexities pose to the scholarly use of user-generated book reviews, and we propose solutions to these challenges. We conclude that DL stakeholders and scholars working with user-generated book reviews should look
This article introduces an original documentation and archiving tool, CloudPad, which integrates 'cloud computing' into the annotation and synchronisation of mixed-media resources. Through CloudPad, users are able to view a documentation, edit a version of it, and record their own comments in response to it. Whether users have created and/or experienced a particular work, or simply wish to consult a work's documentation, their journeys through these records and annotations are subsumed into the work's documentation, thus augmenting the 'original' artwork's field of social engagement. Before discussing CloudPad in detail, we explain how recent debates in performance documentation influenced our methodology and development, and the general challenges of mixed-reality documentation that CloudPad aims to address. © 2012 Taylor and Francis Group, LLC