Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing 2017
DOI: 10.18653/v1/d17-1077

The Labeled Segmentation of Printed Books

Abstract: We introduce the task of book structure labeling: segmenting and assigning a fixed category (such as TABLE OF CONTENTS, PREFACE, INDEX) to the document structure of printed books. We manually annotate the page-level structural categories for a large dataset totaling 294,816 pages in 1,055 books evenly sampled from 1750-1922, and present empirical results comparing the performance of several classes of models. The best-performing model, a bidirectional LSTM with rich features, achieves an overall accuracy of 95…

Cited by 9 publications (8 citation statements)
References 32 publications
“…The Research Center supports a number of venues for scholarship over the HathiTrust Digital Library, including the public release of the Extracted Features dataset on which HT+BW is built (Jett et al, 2020). The HathiTrust corpus has seen especially strong scholarly adoption by those interested in cultural analytics and digital humanists (e.g., Underwood 2019; Manovich 2018; Evans and Wilkens 2018; McConnaughey et al 2017), but the breadth of languages and subjects combined with the depth of the collection makes it very broadly useful.…”
Section: Background On Digital Libraries (mentioning)
confidence: 99%
“…Recent work has sought to better identify duplication in the HathiTrust (Organisciak et al 2019). Additionally, it has been found that date of first publication can often be inferred from the earliest known duplicate in the HathiTrust (Bamman et al 2017). This work will inform future iterations of HT+BW, reducing duplication bias and better aligning texts and dates.…”
Section: Selection Biases and Omissions (mentioning)
confidence: 99%
“…The HathiTrust books were provided as a folder of text files representing pages. These were preprocessed to strip headers using the HathiTrust Research Center RunningHeaders tool as well as to separate out the body of the book from its front and back matter (McConnaughey et al, 2017). We also performed further preprocessing to split them into paragraphs, sentences, and tokens.…”
Section: Dataset Preparation (mentioning)
confidence: 99%
“…4 In the natural language processing research area, previous research has been carried out to extract document structure mainly from scientific articles and books. [5][6][7] Other than this, there has been much recent work in using text mining and sentiment analysis in particular to Twitter with the goal of predicting stock market performance [8][9][10][11][12] although presumably any really successful methods would not be published.…”
Section: Related Work (mentioning)
confidence: 99%