Extracting Scientific Figures with Distantly Supervised Neural Networks

Siegel, Noah; Lourie, Nicholas; Power, Russell; Ammar, Waleed

doi:10.1145/3197026.3197040

Cited by 109 publications

(83 citation statements)

References 22 publications


“…We evaluate our approaches by comparing them to the approaches from the related work. Specifically, we compare our approaches against PDFFigures [3], PDFFigures2 [4] and DeepFigures [5]. The results of our extraction pipeline show that our best approach automatically extracts figures with a precision of 0.73 and a recall of 0.80.…”

Section: Document Element Recognition (mentioning)

confidence: 99%

“…Furthermore, a series of systems called PDFFigures [3], PDFFigures2 [4], which were developed by Clark et al as well as DeepFigures [5] that was developed by Siegel et al were developed for the inclusion in the Semantic Scholar search engine (https://www.semanticscholar.org) and discussed in the literature. In the following, we refer to these systems as the PDFFigures systems.…”

Section: Related Work (mentioning)

confidence: 99%

“…A Table is defined similarly, but using the term "Table". This definition system was implicitly proposed by Clark et al in work [3] and also used in the subsequent works [4,5]. It has the advantageous properties of being very suited to their use case and being easy to implement by a rule-based system.…”

Section: Problem Definition (mentioning)

confidence: 99%


Data-Driven Recognition and Extraction of PDF Document Elements

et al. 2019


In the age of digitalization, the collection and analysis of large amounts of data is becoming increasingly important for enterprises to improve their businesses and processes, such as the introduction of new services or the realization of resource-efficient production. Enterprises concentrate strongly on the integration, analysis and processing of their data. Unfortunately, the majority of data analysis focuses on structured and semi-structured data, although unstructured data such as text documents or images account for the largest share of all available enterprise data. One reason for this is that most of this data is not machine-readable and requires dedicated analysis methods, such as natural language processing for analyzing textual documents or object recognition for recognizing objects in images. Especially in the latter case, the analysis methods depend strongly on the application. However, there are also data formats, such as PDF documents, which are not machine-readable and consist of many different document elements such as tables, figures or text sections. Although the analysis of PDF documents is a major challenge, they are used in all enterprises and contain various information that may contribute to analysis use cases. In order to enable their efficient retrievability and analysis, it is necessary to identify the different types of document elements so that we are able to process them with tailor-made approaches. In this paper, we propose a system that forms the basis for structuring unstructured PDF documents, so that the identified document elements can subsequently be retrieved and analyzed with tailor-made approaches. Due to the high diversity of possible document elements and analysis methods, this paper focuses on the automatic identification and extraction of data visualizations, algorithms, other diagram-like objects and tables from a mixed document body. For that, we present two different approaches. 
The first approach combines deep learning with rule-based image processing, whereas the second approach is based purely on deep learning. To train our neural networks, we manually annotated a large corpus of PDF documents with our own annotation tool; both the corpus and the tool are being published together with this paper. The results of our extraction pipeline show that we are able to automatically extract graphical items with a precision of 0.73 and a recall of 0.80. For tables, we reach a precision of 0.78 and a recall of 0.94.



“…Being able to automatically identify and decode mathematics (Lin et al, 2011;Wang and Liu, 2017a,b) in PDF files will enable a wide range of high-level applications such as information retrieval, machine reading, similarity analysis, information aggregation, and reasoning. Siegel et al (2018) discuss how to recover the positional information of figures in PDF files. The proposed methods could be also used for the alignment of MEs in PDF and XML files.…”

Section: A3 Action-graphs From Real Annotated Graphs (mentioning)

confidence: 99%

“…There is also an ongoing work on constructing knowledge graph from the scientific literature. Sinha et al (2015) builds a heterogeneous graph consisting of six types of entities: field of study, author, institution (the affiliation of the author), paper, venue (journal and conference series) and event. Ammar et al (2018) focussed on constructing literature graph consisting of papers, authors, entities nodes and various interactions between…”

mentioning

confidence: 99%

Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications

2019


Just when I thought I was out, they pull me back in: The role of KG in AKBC

We thank our authors, speakers and program committee members for helping us assemble an exciting program on this timely topic. We are grateful to our sponsors, BASF SE Ludwigshafen, the Leibniz Science Campus "Empirical Linguistics and Computational Language Modeling" (LiMo), and the German Research Foundation (DFG grant RO5127/2-1), for making such a diverse and speaker-rich program possible.


Machine Learning in Chemical Engineering: A Perspective

Schweidtmann, Esche, Fischer, et al. 2021

Chemie Ingenieur Technik


The transformation of the chemical industry to renewable energy and feedstock supply requires new paradigms for the design of flexible plants, (bio-)catalysts, and functional materials. Recent breakthroughs in machine learning (ML) provide unique opportunities, but only joint interdisciplinary research between the ML and chemical engineering (CE) communities will unfold the full potential. We identify six challenges that will open new methods for CE and formulate new types of problems for ML: (1) optimal decision making, (2) introducing and enforcing physics in ML, (3) information and knowledge representation, (4) heterogeneity of data, (5) safety and trust in ML applications, and (6) creativity. Under the umbrella of these challenges, we discuss perspectives for future interdisciplinary research that will enable the transformation of CE.

