2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL) 2019
DOI: 10.1109/jcdl.2019.00035

An End-to-End Approach for Extracting and Segmenting High-Variance References from PDF Documents

Abstract: This paper addresses the problem of extracting and segmenting references from PDF documents. The novelty of the presented approach lies in its capability to discover highly varying references mainly in terms of content, length and location in the document. Unlike existing works, the proposed method does not follow the classical pipeline that consists of sequential phases. It rather learns the different characteristics of references to be used in a coherent scheme that reduces the error accumulation by followin…

Cited by 10 publications (10 citation statements)
References 16 publications
“…Given the large number of authors who share the same name (i.e., homonymy), it is difficult to link names in bibliographic sources to their real-world authors, especially when the source of the reference is not available or does not provide indicators of the author's identity. The problem is even more critical when names are substituted by their initials to save space and when they are erroneous due to wrong manual editing as found in our previous work [1]. Disciplines such as social sciences and humanities suffer more from this problem as most of the publishers are small or medium-sized and cannot ensure a continuous integrity of the bibliographic data.…”
Section: Introduction (citation type: mentioning)
confidence: 94%
“…Here, an author name denotes a set of character sequences that refer to one or more people 1 , whereas real-world author entity indicates a unique author that cannot be identified only by his/her name 2 but with the help of other identifiers such as ORCID. However, in bibliographic data (e.g., references), authors are usually referred to by name only.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…The problem is that this type of approach is more perceptive to accumulated errors within the pipeline. To overcome such an issue Boukhers et al [4] proposed an end-to-end approach for reference segmentation and extraction from PDF documents. It learns the different characteristics of references to be used in a coherent scheme that reduces the error accumulation using a probabilistic approach.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…font type, neighbor distance, text location, font typography, and lexical properties to identify components of a scientific publication and later extract metadata like Authors name, affiliation, email, headings, etc. Boukhers et al [12] proposed an approach in which all text lines are individually classified using a pre-trained random forest model with the probability to be a potential reference line and later uses the format, lexical, semantic and shape features to identify and segment reference strings.…”
Section: A Text-based Approaches (citation type: mentioning)
confidence: 99%
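
The last citation statement describes classifying each text line individually and assigning it a probability of being a reference line. The sketch below illustrates only that general idea and is not the authors' implementation: the feature names, the weights, and the logistic scoring function are all assumptions made for illustration, and the hand-set scorer stands in for the pre-trained random forest mentioned in the statement.

```python
import math
import re

# Hypothetical per-line features. The citation statement mentions format,
# lexical, semantic and shape features but does not enumerate them.
def line_features(line: str) -> dict:
    tokens = line.split()
    n_chars = max(len(line), 1)
    n_tokens = max(len(tokens), 1)
    return {
        "has_year": bool(re.search(r"\b(19|20)\d{2}\b", line)),
        "starts_with_marker": bool(re.match(r"\s*(\[\d+\]|\d+\.)\s", line)),
        "has_initials": bool(re.search(r"\b[A-Z]\.", line)),
        "punct_ratio": sum(c in ",.;:()" for c in line) / n_chars,
        "capitalized_ratio": sum(t[:1].isupper() for t in tokens) / n_tokens,
    }

# Illustrative, hand-set weights (an assumption). A trained classifier such as
# a random forest would learn its decision function from labeled lines instead.
WEIGHTS = {
    "has_year": 2.0,
    "starts_with_marker": 2.0,
    "has_initials": 1.5,
    "punct_ratio": 10.0,
    "capitalized_ratio": 1.0,
}
BIAS = -3.5

def reference_probability(line: str) -> float:
    """Return a score in (0, 1) that the line belongs to a reference."""
    feats = line_features(line)
    z = BIAS + sum(WEIGHTS[name] * float(value) for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

def candidate_reference_lines(lines, threshold=0.5):
    """Keep the lines whose score reaches the threshold."""
    return [ln for ln in lines if reference_probability(ln) >= threshold]
```

Under these assumed weights, a line such as `[1] A. Author and B. Writer. A sample title. Some Venue, 2019.` (a made-up placeholder, not a real reference) scores above the threshold, while an ordinary body sentence with no year, marker, or initials scores well below it. Scoring lines independently like this keeps each decision local, which is why the statement describes a later step that uses further features to identify and segment complete reference strings.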