NovelTM Datasets for English-Language Fiction, 1700-2009

Underwood, Ted; Kimutis, Patrick; Witte, Jessica

doi:10.22148/001c.13147

Cited by 9 publications

(14 citation statements)

References 8 publications

(8 reference statements)

“…1. Fiction (411 documents): We randomly sample two or three volumes per year from the class of English-language fiction drawn from the "most frequently reproduced" fiction in Underwood et al (2020). Our training data is thus designed to represent a yearly cross-section of the most reproduced English-language fiction between 1800 and 2000.…”

Section: Training Data

mentioning

confidence: 99%

“…While admirable technical solutions have been implemented to make copyright-restricted material accessible, these solutions require high levels of technical expertise that exceed most researchers. It is with these challenges and opportunities in mind that we build on prior work (Underwood et al, 2020) to generate a large sample of richly annotated prose data in English drawn from the Hathi Trust Digital Library. The Hathi Trust Digital Library represents the largest collection of digitized historical documents in English, with over 17 million volumes.…”

mentioning

confidence: 99%

“…For the period under investigation, writing in prose is by far the dominant mode during these years. While prior work has developed models for the detection of prose fiction during this period (Underwood et al, 2020), we build models for the creation of symmetrical collections of fictional and non-fictional modes of writing. Our work is motivated by theoretical frameworks grounded in theories of social differentiation (Luhmann, 1995), where the meaning and function of different modes of communication evolve in distinction to one another.…”

mentioning

confidence: 99%

“…Single model. Our data is generated from a single predictive model for each mode of writing (fiction/non-fiction) across the entire sampled time-frame based on manually reviewed training data derived from prior work (Underwood et al, 2020). While there is still much work to be done regarding the implications of applying predictive models on collections spanning long historical time frames, as we show the use of a single model overcomes important anomalies introduced by the conjoining of multiple models from prior work (see Figure 4).…”

mentioning

confidence: 99%

HATHI 1M: Introducing a Million Page Historical Prose Dataset in English from the Hathi Trust

Bagga

Piper

2022

Journal of Open Humanities Data

We present a new dataset built on prior work consisting of 1,671,370 randomly sampled pages of English-language prose roughly divided between modes of fictional and non-fictional writing and published between the years 1800 and 2000. In addition to focusing on the "page" as the basic bibliographic unit, our work employs a single predictive model for the historical period under consideration in contrast to prior work. Besides publication metadata, we also provide an enriched feature set of 107 features including part-of-speech tags, sentiment scores, word supersenses and more. Our data is designed to give researchers in the digital humanities large yet portable random samples of historical writing across two foundational modes of English prose writing. We present initial insights into transformations of linguistic patterns across this historical period using our enriched features as possible pointers to future work. The data can be accessed at https://doi.org/10.7910/DVN/HAKKUA.

“…HathiTrust volume identifiers were matched based on string comparisons against the "volumemeta" dataset described by Ted Underwood et al (2020). I used the shorttitle and author metadata fields.…”

Section: Collection and Creation

mentioning

confidence: 99%

New York Times Hardcover Fiction Bestsellers, 1931—2020

Pruett

2022

Post45 Data Collective

The New York Times Hardcover Fiction Bestsellers (1931–2020) contains three related datasets. The first dataset provides a tabular representation of the hardcover fiction bestseller list of The New York Times every week between 1931 and 2020. The second dataset provides title-level data for every unique title that appeared on the hardcover fiction bestseller list during this time period. The third dataset provides HathiTrust Digital Library identifiers for every unique title that appeared on the hardcover fiction bestseller list and that also has a corresponding volume in the HathiTrust Digital Library.
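The citation statement for this dataset describes matching HathiTrust volume identifiers by string comparison on the shorttitle and author fields of the "volumemeta" dataset. A minimal sketch of that kind of match follows. Only the field names shorttitle and author come from the statement; the docid column, the bestseller record layout, the normalization rules, and the function names are assumptions made for illustration, not the procedure the dataset's author actually used.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def author_key(name: str) -> str:
    """Order-insensitive author key, so 'Steinbeck, John' equals 'John Steinbeck'.

    Assumption: name parts differ only in order and punctuation.
    """
    return " ".join(sorted(normalize(name).split()))


def match_volumes(bestsellers, volumemeta):
    """Map each (title, author) bestseller to a list of matching volume ids.

    `volumemeta` rows are dicts with 'shorttitle', 'author', and a
    hypothetical 'docid' key holding the HathiTrust identifier.
    """
    index = {}
    for row in volumemeta:
        key = (normalize(row["shorttitle"]), author_key(row["author"]))
        index.setdefault(key, []).append(row["docid"])

    matches = {}
    for book in bestsellers:
        key = (normalize(book["title"]), author_key(book["author"]))
        matches[(book["title"], book["author"])] = index.get(key, [])
    return matches
```

Exact matching on normalized strings will miss titles that differ by subtitle or abbreviation, so any real use of this approach would likely need manual review of the unmatched titles.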

Uncovering Black Fantastic: Piloting A Word Feature Analysis and Machine Learning Approach for Genre Classification

Parulian

Dubnicek

Worthey

et al. 2022

Proceedings of the Association for Information Science and Tech

Given the size of digital library collections and the inconsistencies in their genre‐related bibliographic metadata, as digital libraries grow and their contents are opened for computational analysis, finding materials of interest becomes a major challenge. This challenge increases for sub‐genres and other categories of text data that are less distinct from the whole. This project pilots machine learning methods and word feature analysis for identifying Black Fantastic genre texts within the HathiTrust Digital Library. These texts are sometimes referred to as “Afrofuturism” but more commonly today described as “Black Fantastic,” in which African Diaspora artists and creators engage with the intersections of race and technology in their works with a primary focus on world‐building. Black Fantastic texts pose a challenge to genre classification, as they incorporate aspects of science fiction and fantasy with typical characteristics of African Diaspora‐produced literature. This paper presents and reports on results from a pilot predictive modeling process to computationally identify Black Fantastic texts using curated word feature sets for each class of data: general English‐language fiction, Black‐authored fiction, and Black Fantastic fiction.
