An End-to-End Approach for Extracting and Segmenting High-Variance References from PDF Documents

Boukhers, Zeyd; Ambhore, Shriharsh; Staab, Steffen

doi:10.1109/jcdl.2019.00035

Cited by 10 publications

(10 citation statements)

References 16 publications

“…Given the large number of authors who share the same name (i.e., homonymy), it is difficult to link names in bibliographic sources to their real-world authors, especially when the source of the reference is not available or does not provide indicators of the author's identity. The problem is even more critical when names are substituted by their initials to save space and when they are erroneous due to wrong manual editing as found in our previous work [1]. Disciplines such as social sciences and humanities suffer more from this problem as most of the publishers are small or medium-sized and cannot ensure a continuous integrity of the bibliographic data.…”

Section: Introduction (mentioning)

confidence: 94%

“…Here, an author name denotes a set of character sequences that refer to one or more people¹, whereas real-world author entity indicates a unique author that cannot be identified only by his/her name² but with the help of other identifiers such as ORCID. However, in bibliographic data (e.g., references), authors are usually referred to by name only.…”

Section: Introduction (mentioning)

confidence: 99%

Bib2Auth: Deep Learning Approach for Author Disambiguation using Bibliographic Data

Boukhers, Nagaraj, Thulsi et al. 2021

Preprint

Self Cite

Author name ambiguity remains a critical open problem in digital libraries due to synonymy and homonymy of names. In this paper, we propose a novel approach to link author names to their real-world entities by relying on their co-authorship pattern and area of research. Our supervised deep learning model identifies an author by capturing his/her relationship with his/her co-authors and area of research, which is represented by the titles and sources of the target author's publications. These attributes are encoded by their semantic and symbolic representations. To this end, Bib2Auth uses ∼ 22K bibliographic records from DBLP repository and is trained with each pair of co-authors. The extensive experiments have proved the capability of the approach to distinguish between authors sharing the same name and recognize authors with different name variations. Bib2Auth has shown good performance on a relatively large dataset, which qualifies it to be directly integrated into bibliographic indices.

“…The problem is that this type of approach is more perceptive to accumulated errors within the pipeline. To overcome such an issue Boukhers et al [4] proposed an end-to-end approach for reference segmentation and extraction from PDF documents. It learns the different characteristics of references to be used in a coherent scheme that reduces the error accumulation using a probabilistic approach.…”

Section: Related Work (mentioning)

confidence: 99%

Multimodal Approach for Metadata Extraction from German Scientific Publications

Bouabdallah, Gavilan, Jennifer et al. 2021

Preprint

Nowadays, metadata information is often given by the authors themselves upon submission. However, a significant part of already existing research papers have missing or incomplete metadata information. German scientific papers come in a large variety of layouts which makes the extraction of metadata a non-trivial task that requires a precise way to classify the metadata extracted from the documents. In this paper, we propose a multimodal deep learning approach for metadata extraction from scientific papers in the German language. We consider multiple types of input data by combining natural language processing and image vision processing. This model aims to increase the overall accuracy of metadata extraction compared to other state-of-the-art approaches. It enables the utilization of both spatial and contextual features in order to achieve a more reliable extraction. Our model for this approach was trained on a dataset consisting of around 8800 documents and is able to obtain an overall F1-score of 0.923.

“…font type, neighbor distance, text location, font typography, and lexical properties to identify components of a scientific publication and later extract metadata like Authors name, affiliation, email, headings, etc. Boukhers et al [12] proposed an approach in which all text lines are individually classified using a pre-trained random forest model with the probability to be a potential reference line and later uses the format, lexical, semantic and shape features to identify and segment reference strings.…”

Section: A. Text-based Approaches (mentioning)

confidence: 99%

A Hybrid Approach and Unified Framework for Bibliographic Reference Extraction

2020

Publications are an integral part of a scientific community. Bibliographic reference extraction from scientific publication is a challenging task due to diversity in referencing styles and document layout. Existing methods perform sufficiently on one dataset however, applying these solutions to a different dataset proves to be challenging. Therefore, a generic solution was anticipated which could overcome the limitations of the previous approaches. The contribution of this paper is three-fold. First, it presents a novel approach called DeepBiRD which is inspired by human visual perception and exploits layout features to identify individual references in a scientific publication. Second, we release a large dataset for image-based reference detection with 2401 scans containing 38863 references, all manually annotated for individual reference. Third, we present a unified and highly configurable end-to-end automatic bibliographic reference extraction framework called BRExSys which employs DeepBiRD along with state-of-the-art text-based models to detect and visualize references from a bibliographic document. Our proposed approach pre-processes the images in which a hybrid representation is obtained by processing the given image using different computer vision techniques. Then, it performs layout driven reference detection using Mask R-CNN on a given scientific publication. DeepBiRD was evaluated on two different datasets to demonstrate the generalization of this approach. The proposed system achieved an AP50 of 98.56% on our dataset. DeepBiRD significantly outperformed the current state-of-the-art approach on their dataset. Therefore, suggesting that DeepBiRD is significantly superior in performance, generalized, and independent of any domain or referencing style.
