The Labeled Segmentation of Printed Books

McConnaughey, Lara; Dai, Jiazhong; Bamman, David

doi:10.18653/v1/d17-1077

Cited by 9 publications

(8 citation statements)

References 32 publications


“…The Research Center supports a number of venues for scholarship over the HathiTrust Digital Library, including the public release of the Extracted Features dataset on which HT+BW is built (Jett et al, 2020). The HathiTrust corpus has seen especially strong scholarly adoption by those interested in cultural analytics and digital humanists (e.g., Underwood 2019, Manovich 2018, Evans and Wilkens 2018; McConnaughey et al 2017), but the breadth of languages and subjects combined with the depth of the collection makes it very broadly useful.…”

Section: Background On Digital Libraries (mentioning)

confidence: 99%

“…Recent work has sought to better identify duplication in the HathiTrust (Organisciak et al 2019). Additionally, it has been found that date of first publication can often be inferred from the earliest known duplicate in the HathiTrust (Bamman et al 2017). This work will inform future iterations of HT+BW, reducing duplication bias and better aligning texts and dates.…”

Section: Selection Biases and Omissions (mentioning)

confidence: 99%

“…The rapid recent development of scanned text digital libraries provides the material for new scales of historic and humanistic inquiry into the published word. However, while the size of collections such as the HathiTrust Digital Library, Google Books, and Internet Archive allows for more comprehensive, aggregate-level insights of culture and language across eras (e.g., Michel et al 2011, Aiden and Michel 2014, McConnaughey et al 2017, Manovich 2018, Evans and Wilkens 2018), the burdens of scale also limit the flexibility and approachability of those insights. This paper argues for flexible and easy-to-use exploratory tools to understand massive text collections, in order to support new forms of corpus-based scholarship.…”

Section: Introduction (mentioning)

confidence: 99%


Giving shape to large digital libraries through exploratory data analysis

Organisciak

Schmidt

Downie

2021

Asso for Info Science & Tech


The emergence of large multi-institutional digital libraries has opened the door to aggregate-level examinations of the published word. Such large-scale analysis offers a new way to pursue traditional problems in the humanities and social sciences, using digital methods to ask routine questions of large corpora. However, inquiry into multiple centuries of books is constrained by the burdens of scale, where statistical inference is technically complex and limited by hurdles to access and flexibility. This work examines the role that exploratory data analysis and visualization tools may play in understanding large bibliographic datasets. We present one such tool, HathiTrust+Bookworm, which allows multi-faceted exploration of the multi-million work HathiTrust Digital Library, and center it in the broader space of scholarly tools for exploratory data analysis.



“…The HathiTrust books were provided as a folder of text files representing pages. These were preprocessed to strip headers using the HathiTrust Research Center RunningHeaders tool 3 as well as to separate out the body of the book from its front and back matter (McConnaughey et al, 2017). We also performed further preprocessing to split them into paragraphs, sentences, and tokens.…”

Section: Dataset Preparation (mentioning)

confidence: 99%

What time is it? Temporal Analysis of Novels

Allen

Pethe

Skiena

2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)


Recognizing the flow of time in a story is a crucial aspect of understanding it. Prior work related to time has primarily focused on identifying temporal expressions or relative sequencing of events, but here we propose computationally annotating each line of a book with wall clock times, even in the absence of explicit time-descriptive phrases. To do so, we construct a data set of hourly time phrases from 52,183 fictional books. We then construct a time-of-day classification model that achieves an average error of 2.27 hours. Furthermore, we show that by analyzing a book in whole using dynamic programming of breakpoints, we can roughly partition a book into segments that each correspond to a particular time-of-day. This approach improves upon baselines by over two hours. Finally, we apply our model to a corpus of literature categorized by different periods in history, to show interesting trends of hourly activity throughout the past. Among several observations we find that the fraction of events taking place past 10 P.M. jumps past 1880, coincident with the advent of the electric light bulb and city lights.


“…In the natural language processing research area, previous research has been carried out to extract document structure mainly from scientific articles and books. [5][6][7] Other than this, there has been much recent work in using text mining and sentiment analysis in particular to Twitter with the goal of predicting stock market performance [8][9][10][11][12] although presumably any really successful methods would not be published.…”

Section: Related Work (mentioning)

confidence: 99%

Multilingual Financial Narrative Processing: Analyzing Annual Reports in English, Spanish, and Portuguese

El-Haj

Rayson

Alves

et al.

2019

Multilingual Text Analysis


This chapter describes and evaluates the use of Information Extraction and Natural Language Processing methods for extraction and analysis of financial annual reports in three languages: English, Spanish and Portuguese. The work described retains information on document structure which is needed to enable a clear distinction between narrative and financial statement components of annual reports and between individual sections within the narratives component. Extraction accuracy varies between languages with English exceeding 95%. We apply the extraction methods on a comprehensive sample of annual reports published by UK, Spanish and Portuguese non-financial firms between 2003 and 2014.

