ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations

Kahu, Sampanna Yashwant; Ingram, William A.; Fox, Edward A.; Wu, Jihuai

doi:10.48550/arxiv.2106.15320

Cited by 2 publications

(9 citation statements)

References 16 publications

“…Typical modern methods rely on deep learning techniques to detect layout elements on pages [5], often in combination with heuristics [38]. Methods span the range of object detection using models like YOLO [32,38,47] to, more recently, Faster R-CNN [16,34,37,43,49] and Mask-CNN [2,18,26]. Additionally, several pixel-by-pixel segmentation models have been proposed using semantic segmentation [46] and fully convolutional networks [22,24], including "fully convolutional instance segmentation" [13,14,27].…”

Section: Deep Learning Model and Feature Selection (mentioning)

confidence: 99%

“…Table 3 shows how other deep learning models fair on our final test dataset. Here, we use ScanBank [21,47] (based on DeepFigures [38] and trained on a corpus of pre-digital electronic thesis and dissertations (ETDs)) and a version of detectron2 [45] trained on the PubLayNet dataset [52]. ScanBank and detectron2 are used for comparison as they are applied to raster-formatted articles (as opposed to vector-based methods like pdffigures2 [11] which, applied to our data, results in F1 scores of <15% in feature and final test splits).…”

Section: Benchmarks At High Levels Of Localization (IoU=0.9) (mentioning)

confidence: 99%

“…Deep learning methods have become popular recently for vector and raster documents [5,38], including those that use methods of semantic segmentation [46] and object detection [35]. While these methods are vital to the extraction of data products from recent academic literature, pre-digital literature is often included in digital platforms with older articles scanned at varying resolutions and deep learning methods developed with newer article training sets often perform poorly on this pre-digital literature [47]. Additionally, layouts, fonts, and article styles are typically different for historical documents when compared to "born-digital" scientific literature [47].…”

Section: Introduction (mentioning)

confidence: 99%

“…While these methods are vital to the extraction of data products from recent academic literature, pre-digital literature is often included in digital platforms with older articles scanned at varying resolutions and deep learning methods developed with newer article training sets often perform poorly on this pre-digital literature [47]. Additionally, layouts, fonts, and article styles are typically different for historical documents when compared to "born-digital" scientific literature [47]. In these cases, text extraction must be performed with optical character recognition (OCR), and figures and tables are extracted from the raw OCR results.…”

Section: Introduction (mentioning)

confidence: 99%

“…In these cases, text extraction must be performed with optical character recognition (OCR), and figures and tables are extracted from the raw OCR results. When applied to raster-PDF's with text generated from OCR, deep learning document layout analysis methods trained with newer or vector-based PDFs are often not as robust [46,47]. While progress has been made in augmenting these methods for OCR'd pages, especially for electronic theses and dissertations (ETDs) [47], much work can still be done to extract layout elements from these older, raster-based documents.…”

Section: Introduction (mentioning)

confidence: 99%

Figure and Figure Caption Extraction for Mixed Raster and Vector PDFs: Digitization of Astronomical Literature with OCR Features

Naiman, Williams, Goodman. 2022. Lecture Notes in Computer Science.

Scientific articles published prior to the "age of digitization" in the late 1990s contain figures which are "trapped" within their scanned pages. While progress to extract figures and their captions has been made, there is currently no robust method for this process. We present a YOLO-based method for use on scanned pages, post-Optical Character Recognition (OCR), which uses both grayscale and OCR-features. When applied to the astrophysics literature holdings of the Astrophysics Data System (ADS), we find F1 scores of 90.9% (92.2%) for figures (figure captions) with the intersection-over-union (IOU) cut-off of 0.9 which is a significant improvement over other state-of-the-art methods.
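
The IOU cut-off mentioned in this abstract can be illustrated with a minimal sketch. It assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) and a simple greedy one-to-one matching between predicted and true boxes; the function names and the matching rule are illustrative assumptions, not the evaluation protocol used by the cited paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(predicted, truth, threshold=0.9):
    """F1 score where a prediction counts as a true positive only if its
    IoU with a not-yet-matched true box reaches the threshold.

    Greedy matching is an assumption made for this sketch.
    """
    unmatched = list(truth)
    true_pos = 0
    for pred in predicted:
        # Pick the remaining true box that overlaps this prediction most.
        best = max(unmatched, key=lambda t: iou(pred, t), default=None)
        if best is not None and iou(pred, best) >= threshold:
            true_pos += 1
            unmatched.remove(best)
    false_pos = len(predicted) - true_pos
    false_neg = len(truth) - true_pos
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0


# Made-up boxes: the first prediction overlaps its target with IoU 0.95,
# the second with IoU 0.6, so only one passes a 0.9 cut-off.
truth = [(0, 0, 100, 100), (200, 200, 300, 300)]
predicted = [(0, 0, 100, 95), (200, 200, 300, 260)]
score = f1_at_iou(predicted, truth, threshold=0.9)
```

A cut-off of 0.9 is strict: a box that covers only 60% of its target counts as both a false positive and a missed detection, which is why scores at this threshold drop quickly for loosely localized predictions.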

From Detection to Application: Recent Advances in Understanding Scientific Tables and Figures

Huang, Chen, et al. 2024. ACM Comput. Surv.

Tables and figures are usually used to present information in a structured and visual way in scientific documents. Understanding the tables and figures in scientific documents is significant for a series of downstream tasks, such as academic search, scientific knowledge graphs, and so on. Existing studies mainly focus on detecting figures and tables from scientific documents, interpreting their semantics, and integrating them into downstream tasks. However, a systematic and comprehensive literature review on the mining and application of tables and figures in academic papers is still missing. In this article, we introduce the research framework and the whole pipeline for understanding tables and figures, including detection, structural analysis, interpretation, and application. We deliver a thorough analysis of benchmark datasets, recent techniques, and their pros and cons. Additionally, a quantitative analysis of the effectiveness of different models on popular benchmarks is presented. We further outline several important applications that exploit the semantics of scientific tables and figures. Finally, we highlight the challenges and some potential directions for future research. We believe this is the first comprehensive survey in understanding scientific tables and figures that covers the landscape from detection to application.
