Automatic Identification of Research Articles Containing Data Usage Statements

Zhang, Qiuzi; Lu, Wei; Yang, Yunhan; Chen, Haihua; Chen, Jiangping

doi:10.1142/9789813234482_0004

Cited by 2 publications (2 citation statements)

References 26 publications (29 reference statements)

“…This heterogeneity in the data set mentions translates into a difference in the performance of the empirical strategies used to solve the problem, according to the scientific field in which it is used. Zhang et al. (2016, 2018) describe implementing a bootstrapping-based unsupervised training strategy based on previous work to distinguish articles with data use and reuse from those without data usage.…”

Section: Data Set Mentions Are Domain-specific (mentioning)

confidence: 99%

Data Inventories for the Modern Age? Using Data Science to Open Government Data

Lane, Gimeno, Levitskaya, et al. 2022

Harvard Data Science Review
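
The citation statements on this page describe a bootstrapping strategy that generates text patterns automatically and uses them to decide whether an article contains a data usage statement. The following is a minimal sketch of that general idea, not the cited authors' implementation. The seed phrase, the use of word trigrams as patterns, the `min_count` threshold, and all function names are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n=3):
    """Return all contiguous word n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bootstrap_patterns(sentences, seeds, rounds=2, min_count=2):
    """Grow a pattern list from seed phrases (hypothetical procedure).

    Each round: collect sentences matching any known pattern, then promote
    trigrams that occur in at least `min_count` of those sentences.
    """
    patterns = set(seeds)
    for _ in range(rounds):
        matched = [s for s in sentences if any(p in s for p in patterns)]
        counts = Counter(g for s in matched for g in set(ngrams(s.split())))
        new = {g for g, c in counts.items() if c >= min_count} - patterns
        if not new:
            break  # no new patterns found, so further rounds change nothing
        patterns |= new
    return patterns


def has_data_usage(article_sentences, patterns):
    """Article-level decision: True if any sentence contains any pattern."""
    return any(p in s for s in article_sentences for p in patterns)


# Toy corpus (lowercased, invented for this example).
corpus = [
    "we evaluate our model on the mnist dataset",
    "we evaluate our model on the cifar benchmark",
    "we evaluate our model using the mnist dataset",
    "this paper surveys related work on learning",
]
patterns = bootstrap_patterns(corpus, seeds=["mnist dataset"])
uses_data = has_data_usage(["we evaluate our model on imagenet"], patterns)
no_data = has_data_usage(["this paper surveys related work on learning"], patterns)
```

In this toy run the single seed leads to context patterns such as "evaluate our model", which then match a sentence that never mentions the seed. The same mechanism also promotes generic trigrams such as "model on the", so a real system would need some filtering or scoring step to limit that drift.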

“…Several methods, such as weakly supervised (Hoffmann et al., 2011) and unsupervised learning (Zhang and Elhadad, 2013), have been proposed to address training corpus acquisition. Zhang et al. (2017) proposed an unsupervised approach based on pattern lists to identify data usage at the article level. By applying a bootstrapping strategy to generate text patterns automatically, their method can achieve an F-measure of 85% in determining whether a data usage statement is included in computer science literature.…”

Section: Literature Review (mentioning)

confidence: 99%

Data set entity recognition based on distant supervision

Liu, Cheng, et al. 2021

Self Cite

Purpose: This paper aims to identify data set entities in scientific literature. To address poor recognition caused by a lack of training corpora in existing studies, a distant supervised learning-based approach is proposed to identify data set entities automatically from large-scale scientific literature in an open domain.

Design/methodology/approach: Firstly, the authors use a dictionary combined with a bootstrapping strategy to create a labelled corpus to apply supervised learning. Secondly, a bidirectional encoder representation from transformers (BERT)-based neural model was applied to identify data set entities in the scientific literature automatically. Finally, two data augmentation techniques, entity replacement and entity masking, were introduced to enhance the model generalisability and improve the recognition of data set entities.

Findings: In the absence of training data, the proposed method can effectively identify data set entities in large-scale scientific papers. The BERT-based vectorised representation and data augmentation techniques enable significant improvements in the generality and robustness of named entity recognition models, especially in long-tailed data set entity recognition.

Originality/value: This paper provides a practical research method for automatically recognising data set entities in scientific literature. To the best of the authors’ knowledge, this is the first attempt to apply distant learning to the study of data set entity recognition. The authors introduce a robust vectorised representation and two data augmentation strategies (entity replacement and entity masking) to address the problem inherent in distant supervised learning methods, which the existing research has mostly ignored. The experimental results demonstrate that our approach effectively improves the recognition of data set entities, especially long-tailed data set entities.
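
The abstract above names two data augmentation techniques, entity replacement and entity masking, without giving their details. The sketch below shows one plausible reading of each on a BIO-tagged sentence. The tag names (`B-DATASET`, `I-DATASET`), the `[MASK]` placeholder, the sample dictionary, and the function names are all assumptions for illustration, not the authors' code.

```python
import random


def find_spans(tags):
    """Return (start, end) index pairs for each B-/I- tagged entity."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B"):
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def replace_entities(tokens, tags, dictionary, rng):
    """Entity replacement: swap each entity for a random dictionary entry."""
    out_tokens, out_tags, prev = [], [], 0
    for start, end in find_spans(tags):
        out_tokens += tokens[prev:start]
        out_tags += tags[prev:start]
        new = rng.choice(dictionary).split()
        out_tokens += new
        out_tags += ["B-DATASET"] + ["I-DATASET"] * (len(new) - 1)
        prev = end
    out_tokens += tokens[prev:]
    out_tags += tags[prev:]
    return out_tokens, out_tags


def mask_entities(tokens, tags, mask="[MASK]"):
    """Entity masking: hide entity tokens so only the context remains."""
    masked = [mask if tag != "O" else tok for tok, tag in zip(tokens, tags)]
    return masked, list(tags)


# Invented example sentence and dictionary.
tokens = "we train on the Penn Treebank corpus".split()
tags = ["O", "O", "O", "O", "B-DATASET", "I-DATASET", "O"]
dictionary = ["ImageNet", "CIFAR 10"]

r_tokens, r_tags = replace_entities(tokens, tags, dictionary, random.Random(0))
m_tokens, m_tags = mask_entities(tokens, tags)
```

Replacement exposes the model to more entity surface forms in the same context, while masking forces it to rely on context alone. Both are plausible ways to help with the long-tailed data set entities the abstract mentions, though the exact procedure used in the paper may differ.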
