2021
DOI: 10.48550/arxiv.2106.15320
Preprint

ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations

Abstract: We focus on electronic theses and dissertations (ETDs), aiming to improve access and expand their utility, since more than 6 million are publicly available, and they constitute an important corpus to aid research and education across disciplines. The corpus is growing as new born-digital documents are included, and since millions of older theses and dissertations have been converted to digital form to be disseminated electronically in institutional repositories. In ETDs, as with other scholarly works, figures …

Cited by 2 publications (9 citation statements)
References 16 publications

“…Typical modern methods rely on deep learning techniques to detect layout elements on pages [5], often in combination with heuristics [38]. Methods span the range of object detection using models like YOLO [32,38,47] to, more recently, Faster R-CNN [16,34,37,43,49] and Mask-CNN [2,18,26]. Additionally, several pixel-by-pixel segmentation models have been proposed using semantic segmentation [46] and fully convolutional networks [22,24], including "fully convolutional instance segmentation" [13,14,27].…”
Section: Deep Learning Model and Feature Selection
Citation type: mentioning
confidence: 99%
“…Table 3 shows how other deep learning models fare on our final test dataset. Here, we use ScanBank [21,47] (based on DeepFigures [38] and trained on a corpus of pre-digital electronic theses and dissertations (ETDs)) and a version of detectron2 [45] trained on the PubLayNet dataset [52]. ScanBank and detectron2 are used for comparison as they are applied to raster-formatted articles (as opposed to vector-based methods like pdffigures2 [11] which, applied to our data, results in F1 scores of <15% in feature and final test splits).…”
Section: Benchmarks at High Levels of Localization (IoU=0.9)
Citation type: mentioning
confidence: 99%
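
The statement above reports F1 scores at a high localization threshold (IoU=0.9). As a rough illustration of what such a number measures, the sketch below computes intersection over union (IoU) for axis-aligned boxes and an F1 score in which a predicted figure box counts as correct only if its IoU with a still-unmatched ground-truth box meets the threshold. This is not the citing paper's evaluation code: the function names, the `(x_min, y_min, x_max, y_max)` box format, and the greedy one-to-one matching are all assumptions made for the example.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in page coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(predicted: List[Box], truth: List[Box], threshold: float = 0.9) -> float:
    """F1 score with greedy one-to-one matching (an assumption of this sketch).

    A prediction is a true positive if its best-overlapping, not yet matched
    ground-truth box has IoU >= threshold.
    """
    unmatched = list(truth)
    true_pos = 0
    for pred in predicted:
        best = max(unmatched, key=lambda t: iou(pred, t), default=None)
        if best is not None and iou(pred, best) >= threshold:
            unmatched.remove(best)
            true_pos += 1
    false_pos = len(predicted) - true_pos
    false_neg = len(truth) - true_pos
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0
```

A box that covers 80% of the ground-truth figure would count as a hit at a threshold of 0.5 but as a miss at 0.9, which is why scores at high localization thresholds tend to be much lower.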