Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.447

S2ORC: The Semantic Scholar Open Research Corpus

Abstract: We introduce S2ORC, a large corpus of 81.1M English-language academic papers spanning many academic disciplines. The corpus consists of rich metadata, paper abstracts, resolved bibliographic references, as well as structured full text for 8.1M open access papers. Full text is annotated with automatically-detected inline mentions of citations, figures, and tables, each linked to their corresponding paper objects. In S2ORC, we aggregate papers from hundreds of academic publishers and digital archives into a uni…

Cited by 274 publications (234 citation statements)
References 44 publications
“…We observed documentation of this explosive growth as well when surveying Literature Mining systems built on top of this dataset. The CORD-19 data is cleaned and implemented with the same system used for the Semantic Scholar Open Research Corpus [41].…”
Section: Literature Mining (mentioning)
confidence: 99%
“…The corpus combines papers from the PubMed Central (PMC), PubMed, World Health Organization (WHO)’s COVID-19 database (https://www.who.int/emergencies/diseases/novel-coronavirus-2019/global-research-on-novel-coronavirus-2019-ncov) and preprint servers bioRxiv, medRxiv and arXiv. Paper metadata from these sources are harmonized, PDFs are converted into machine-readable JSON using the S2ORC pipeline described in [54] and HTML representations of tables in papers are added using IBM Watson Discovery’s Global Table Extractor [115]. As of 15 September 2020, the corpus contains more than 260 000 paper entries (with 105 000 full text entries).…”
Section: Text Mining Corpora (mentioning)
confidence: 99%
“…We extract figures and captions from open access papers in PubMed Central using the results of Siegel et al (2018). To add inline references, we match extracted figures to corresponding figures in the publicly-available S2ORC corpus (Lo et al, 2020), then extract inline references for these figures from the S2ORC full text (see Appendix A for details). We exclude figures found in ROCO (Pelka et al, 2018) to create a disjoint dataset, though we identify and release inline references for ROCO figures.…”
Section: The Medicat Dataset (mentioning)
confidence: 99%