2020
DOI: 10.22148/001c.13147

NovelTM Datasets for English-Language Fiction, 1700-2009

Abstract: This report accompanies a collection of 210,266 volumes, predicted to be fiction, that researchers are encouraged to borrow for their own work. We divide the collection into seven subsets with different emphases (for instance, one where books written by men and women are represented equally, and one composed of only the most prominent and widely-held books). Comparing the pictures produced by these different subsets allows us to assess the resilience or fragility of recent quantitative arguments about literary…

Cited by 9 publications (14 citation statements)
References 8 publications (8 reference statements)
“…1. Fiction (411 documents): We randomly sample two or three volumes per year from the class of English-language fiction drawn from the "most frequently reproduced" fiction in Underwood et al (2020). Our training data is thus designed to represent a yearly cross-section of the most reproduced English-language fiction between 1800 and 2000.…”
Section: Training Data
mentioning
confidence: 99%
“…While admirable technical solutions have been implemented to make copyright-restricted material accessible, these solutions require high levels of technical expertise that exceed most researchers. 1 It is with these challenges and opportunities in mind that we build on prior work (Underwood et al, 2020) to generate a large sample of richly annotated prose data in English drawn from the Hathi Trust Digital Library. The Hathi Trust Digital Library represents the largest collection of digitized historical documents in English, with over 17 million volumes.…”
mentioning
confidence: 99%
“…HathiTrust volume identifiers were matched based on string comparisons against the "volumemeta" dataset described by Ted Underwood et al (2020). I used the shorttitle and author metadata fields.…”
Section: Collection and Creation
mentioning
confidence: 99%