CiteSeerX: A Scholarly Big Dataset

Caragea, Cornelia; Wu, Jian; Ciobanu, Alina Maria; Williams, Kyle; Fernández-Ramírez, Juan; Chen, Hung-Hsuan; Wu, Zhaohui; Giles, Lee

doi:10.1007/978-3-319-06028-6_26

Cited by 43 publications

(30 citation statements)

References 21 publications


“…We used the implementation of topic models from Mallet (footnote 3: http://mallet.cs.umass.edu/). To train the topic model, we used a subset of about 45,000 paper abstracts extracted from the CiteSeerX scholarly big dataset introduced by Caragea et al. (2014b). For all models, the score of a phrase is obtained by summing the score of the constituent words in the phrase.…”

Section: Results (mentioning)

confidence: 99%

PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents

Florescu, Caragea

2017

Proceedings of the 55th Annual Meeting of the Association For Computational Linguistics (Volume 1: Long Papers)

Self Cite


The large and growing amounts of online scholarly data present both challenges and opportunities to enhance knowledge discovery. One such challenge is to automatically extract a small set of keyphrases from a document that can accurately describe the document's content and can facilitate fast information processing. In this paper, we propose PositionRank, an unsupervised model for keyphrase extraction from scholarly documents that incorporates information from all positions of a word's occurrences into a biased PageRank. Our model obtains remarkable improvements in performance over PageRank models that do not take into account word positions as well as over strong baselines for this task. Specifically, on several datasets of research papers, PositionRank achieves improvements as high as 29.09%.
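The idea in this abstract, a PageRank biased by every position at which a word occurs, can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenization, the `window`, `alpha`, and `iters` parameters, and the function names are assumptions made for illustration, and no part-of-speech filtering is applied. The phrase score follows the citation statement above, which sums the scores of a phrase's constituent words.

```python
from collections import defaultdict


def position_biased_pagerank(tokens, window=2, alpha=0.85, iters=50):
    """Score words with a PageRank whose teleport vector favors early positions."""
    # Bias: every occurrence at 1-based position i contributes 1/i,
    # so words that appear early and often receive more teleport mass.
    bias = defaultdict(float)
    for i, tok in enumerate(tokens, start=1):
        bias[tok] += 1.0 / i
    total = sum(bias.values())
    bias = {w: b / total for w, b in bias.items()}

    # Undirected co-occurrence graph: an edge weight counts how often two
    # distinct words fall inside a sliding span of `window` consecutive tokens.
    weight = defaultdict(lambda: defaultdict(float))
    for i, u in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            if u != v:
                weight[u][v] += 1.0
                weight[v][u] += 1.0

    words = list(bias)
    score = {w: 1.0 / len(words) for w in words}
    for _ in range(iters):
        new = {}
        for w in words:
            # Each neighbor u passes on its score in proportion to edge weight.
            rank = sum(
                score[u] * weight[u][w] / sum(weight[u].values())
                for u in weight[w]
            )
            new[w] = alpha * rank + (1 - alpha) * bias[w]
        score = new
    return score


def phrase_score(phrase, score):
    """Score a candidate phrase as the sum of its words' scores."""
    return sum(score.get(w, 0.0) for w in phrase.split())
```

Calling `position_biased_pagerank(text.lower().split())` and then ranking candidate phrases with `phrase_score` gives a rough feel for how early, frequent words pull their phrases upward.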


“…Fang et al's method [10] focuses on the header detection of different tables available in PDF documents collected from the dataset of CiteSeer [11]. Some techniques focused on table exploring by using table layout characteristics; however, table structure mattered a lot.…”

Section: Table and Header Detection (mentioning)

confidence: 99%

A Novel Approach to Data Extraction on Hyperlinked Webpages

2019


The World Wide Web has an enormous amount of useful data presented as HTML tables. These tables are often linked to other web pages, providing further detailed information to certain attribute values. Extracting schema of such relational tables is a challenge due to the non-existence of a standard format and a lack of published algorithms. We downloaded 15,000 web pages using our in-house developed web-crawler, from various web sites. Tables from the HTML code were extracted and table rows were labeled with appropriate class labels. Conditional random fields (CRF) were used for the classification of table rows, and a nondeterministic finite automaton (NFA) algorithm was designed to identify simple, complex, hyperlinked, or non-linked tables. A simple schema for non-linked tables was extracted and for the linked-tables, relational schema in the form of primary and foreign-keys (PK and FK) were developed. Child tables were concatenated with the parent table’s attribute value (PK), serving as foreign keys (FKs). Resultantly, these tables could assist with performing better and stronger queries using the join operation. A manual checking of the linked web table results revealed a 99% precision and 68% recall values. Our 15,000-strong downloadable corpus and a novel algorithm will provide the basis for further research in this field.


“…Another possibility is to use a digital library, from where the documents and metadata can be obtained in a more straightforward manner. One such digital library is CiteSeerX [5,17], which offers an OAI collection for metadata harvesting. also offer a huge amount of bibliographic data.…”

Section: A Hybrid Approach For Metadata Extraction (mentioning)

confidence: 99%

The Use of Simple Cellular Automata in Image Processing

Bodó, Csató

2017

Studia UBB Informatica


Abstract. Metadata extraction from documents forms an essential part of web or desktop search systems. Similarly, digital libraries that index scholarly literature require to find and extract the title, the list of authors and other publication-related information from an article. We present a hybrid approach for metadata extraction, combining classification and clustering to extract the desired information without the need of a conventional labeled dataset for training. An important asset of the proposed method is that the resulting clustering parameters can be used in other problems, e.g. document layout analysis.
