Machine Learning vs. Rules and Out-of-the-Box vs. Retrained

Tkaczyk, Dominika; Collins, Andrew; Sheridan, Paraic; Beel, Joeran

doi:10.1145/3197026.3197048

Cited by 33 publications

(11 citation statements)

References 33 publications


“…Ahmad and Afzal (2018) evaluate GROBID for detecting inline citations using a corpus of 5k CiteSeer papers, and found GROBID to have an F1 score of 0.89 on this task. Tkaczyk et al (2018) report GROBID as the best among 10 out-of-the-box tools for parsing bibliographies, also achieving an F1 of 0.89 in an evaluation corpus of 9.5k papers.…”

Section: Discussion (mentioning)

confidence: 95%

S2ORC: The Semantic Scholar Open Research Corpus

Lo, Wang, Neumann et al. 2020

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics


We introduce S2ORC, a large corpus of 81.1M English-language academic papers spanning many academic disciplines. The corpus consists of rich metadata, paper abstracts, resolved bibliographic references, as well as structured full text for 8.1M open access papers. Full text is annotated with automatically-detected inline mentions of citations, figures, and tables, each linked to their corresponding paper objects. In S2ORC, we aggregate papers from hundreds of academic publishers and digital archives into a unified source, and create the largest publicly-available collection of machine-readable academic text to date. We hope this resource will facilitate research and development of tools and tasks for text mining over academic text.


“…Popular approaches include regular expressions, knowledge bases, supervised machine learning, and hybrid approaches. Regular expressions are usually combined with additional approaches, for example with knowledge bases such as thesauri or ontologies, however in such approaches the system must first be filled with available knowledge [13]. Recently in 2019, an unsupervised rule-based approach was proposed that identifies units in source data and provides a corresponding semantic representation based on NASA's QUDT (Quantity, Unit, Dimension and Type) ontology using Arpeggio as a grammar parser [11].…”

Section: Related Work (mentioning)

confidence: 99%

“…In a supervised machine learning-based approach, measurement parsing is usually formally defined as a sequence labelling problem encompassing a variety of tasks, e.g. part-of-speech (POS) tagging or named-entity recognition (NER) [13]. Most of the existing tools are trainable, which means that they are able to automatically learn complex features and adapt parsing rules from training data.…”

Section: Related Work (mentioning)

confidence: 99%

Geo-Quantities: A Framework for Automatic Extraction of Measurements and Spatial Context from Scientific Documents

Petersen, Suryani, Beth et al. 2021

17th International Symposium on Spatial and Temporal Databases


Quantitative information derived from scientific documents provides an important source of data for studies in almost all domains, however, manual extraction of this information is very time consuming. In this paper we will introduce a system Geo-Quantities that supports the automatic extraction of quantitative, spatial and temporal information of a given measurement entity from scientific literature using text mining techniques. The difficulty of automatic measurement recognition is mainly caused by the diverse expressions in the papers. Geo-Quantities offers an interactive interface for the visualization of extracted user-defined information, in particular spatial and temporal context. In our demonstration, we will showcase the capabilities of our system by retrieving measurements such as "mass accumulation rates" and "sedimentation rates" from scientific publications in the field of marine geology, which could have high impact in studies for building global mass accumulation rate maps. For training and evaluation of Geo-Quantities we use a corpus of domain-relevant papers.

CCS Concepts: Applied computing → Document management and text processing; Computing methodologies → Information extraction.


“…These RME methods have their merits and shortcomings, as presented in Tkaczyk, Collins, Sheridan, and Beel's (2018) comprehensive comparison of these methods. Although machine learning‐based approaches require minimal human involvement (aside from annotation on training data) to obtain satisfactory performance, they suffer from data sparseness and a lack of generality.…”

Section: Related Work (mentioning)

confidence: 99%

A flexible template generation and matching method with applications for publication reference metadata extraction

Yang, Hsieh, Liu et al. 2020

Asso for Info Science & Tech


Conventional rule‐based approaches use exact template matching to capture linguistic information and necessarily need to enumerate all variations. We propose a novel flexible template generation and matching scheme called the principle‐based approach (PBA) based on sequence alignment, and employ it for reference metadata extraction (RME) to demonstrate its effectiveness. The main contributions of this research are threefold. First, we propose an automatic template generation that can capture prominent patterns using the dominating set algorithm. Second, we devise an alignment‐based template‐matching technique that uses a logistic regression model, which makes it more general and flexible than pure rule‐based approaches. Last, we apply PBA to RME on extensive cross‐domain corpora and demonstrate its robustness and generality. Experiments reveal that the same set of templates produced by the PBA framework not only deliver consistent performance on various unseen domains, but also surpass hand‐crafted knowledge (templates). We use four independent journal style test sets and one conference style test set in the experiments. When compared to renowned machine learning methods, such as conditional random fields (CRF), as well as recent deep learning methods (i.e., bi‐directional long short‐term memory with a CRF layer, Bi‐LSTM‐CRF), PBA has the best performance for all datasets.
