CiteSeerX: AI in a Digital Library Search Engine

Wu, Jian; Williams, Kyle; Chen, Hung-Hsuan; Khabsa, Madian; Caragea, Cornelia; Ororbia, Alexander G.; Jordan, Douglas; Giles, C. Lee

doi:10.1609/aimag.v36i3.2601

Cited by 79 publications

(49 citation statements)

References 36 publications

(52 reference statements)


“…GROBID (GeneRation Of BIbliographical Data, version 0.4.1) is a complex metadata extraction tool for header metadata and bibliographical extractions. GROBID uses pdftoxml for content and layout extraction and conditional random fields for learning [11,12]. As the results in Table 3 show, our method performs well on title extraction, getting almost the same accuracies as GROBID, which obtained the best overall results in the experiments of [10].…”

mentioning

confidence: 86%


The Use of Simple Cellular Automata in Image Processing

Bodó

Csató

2017

Studia UBB Informatica


Abstract. Metadata extraction from documents forms an essential part of web or desktop search systems. Similarly, digital libraries that index scholarly literature require to find and extract the title, the list of authors and other publication-related information from an article. We present a hybrid approach for metadata extraction, combining classification and clustering to extract the desired information without the need of a conventional labeled dataset for training. An important asset of the proposed method is that the resulting clustering parameters can be used in other problems, e.g. document layout analysis.


“…Another possibility is to use a digital library, from where the documents and metadata can be obtained in a more straightforward manner. One such digital library is CiteSeerX [5,17], which offers an OAI collection for metadata harvesting. also offer a huge amount of bibliographic data.…”

Section: A Hybrid Approach For Metadata Extraction

mentioning

confidence: 99%


“…He has built several open source tools for metadata extraction using machine learning methods from PDFs and text for many unique entities such as figures, tables, equations, etc. and incorporated them into scholarly search engines such as CiteSeerX (Wu et al, 2015a) using an open source ingestion system (Wu et al, 2015b). Recent work has been on linking data and metadata in different databases such as PubMed and the Web of Science.…”

Section: Status Of Mvp On Day

mentioning

confidence: 99%

An Open, FAIRified Data Commons: Proposal for NIH Data Commons Pilot

Nosek

Spies

Benjamin

et al. 2017

Preprint


This proposal is a response to NIH's call for creation of a Data Commons (RM-17-026). The Commons must support use cases of many stakeholders who need access to scholarly process, content, and outcomes in pursuit of knowledge. Moreover, the Commons must be flexible enough to respect researchers’ idiosyncratic workflows, yet specific enough to solve problems that researchers are trying to solve. To meet both demands, a successful Commons will provide core services that are shared across workflows, and flexible interfaces that meet the individual needs of stakeholders. By leveraging existing open tools, an expansive community network, and in-depth expertise, this collaborative team is well positioned to contribute to the Data Commons pilot and beyond.


“…CiteSeerX has proven to be a rich source of scholarly information beyond publications as exemplified through various derived data-sets, ranging from citation graphs to publication acknowledgments [16], meant to aid academic content management and analysis research [1]. Furthermore, CiteSeerX's open-source nature allows easy access to its implementations of tools that span focused web crawling to record linkage [35] to meta-data extraction to leveraging user-provided meta-data corrections [31]. A key aspect of CiteSeerX's future lies in not only serving as an engine for continuously building an ever-improving collection of scholarly knowledge at web-scale, but also as a set of publicly-available tools to aid those interested in building digital library and search engine systems of their own.…”

Section: Overview: Architecture

mentioning

confidence: 99%

Big Scholarly Data in CiteSeerX

Ororbia

Khabsa

et al. 2015

Proceedings of the 24th International Conference on World Wide Web

Self Cite


We examine CiteSeerX, an intelligent system designed with the goal of automatically acquiring and organizing large-scale collections of scholarly documents from the world wide web. From the perspective of automatic information extraction and modes of alternative search, we examine various functional aspects of this complex system in order to investigate and explore ongoing and future research developments.
