The World Wide Web is transforming itself into the largest information resource available, making the process of information extraction (IE) from the Web an important and challenging problem. In this paper, we present an automated, domain-independent IE system that can automatically transform a given Web page into a semi-structured hierarchical document using presentation regularities. The resulting documents are weakly annotated in the sense that they may contain many incorrect annotations and missing labels. We also describe how to improve the quality of weakly annotated data by using domain knowledge in the form of a statistical domain model. We demonstrate that such a system can recover from ambiguities in the presentation and boost the overall accuracy of a base information extractor by up to 20%. Our experimental evaluations with TAP data, computer science department Web sites, and the RoadRunner document sets indicate that our algorithms scale to very large data sets.
In this paper we describe the semantic partitioner algorithm, which uses the structural and presentation regularities of Web pages to automatically transform them into hierarchical content structures. These content structures enable us to automatically annotate the labels in Web pages with their semantic roles, thus yielding meta-data and instance information for the pages. Experimental results with the TAP knowledge base and computer science department Web sites, comprising 16,861 Web pages, indicate that our algorithm is able to gather meta-data accurately from various types of Web pages. The algorithm achieves this performance without any domain-specific engineering.
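The core intuition behind partitioning by presentation regularities can be sketched as follows: leaf texts that share the same root-to-leaf tag path are grouped together, repeated paths suggest instance lists, and infrequent paths suggest candidate labels (meta-data). This is a minimal illustrative sketch, not the authors' implementation; all names and the one-vs-many heuristic are assumptions.

```python
# Illustrative sketch: group leaf texts by their root-to-leaf tag path.
# Paths shared by many leaves are treated as repeated instances; paths with
# a single leaf are treated as candidate labels. Hypothetical heuristic.
from collections import defaultdict
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Collects (tag-path, text) pairs from an HTML document."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.groups = defaultdict(list)  # tag path -> leaf texts under it

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.groups["/".join(self.stack)].append(text)

def partition(html):
    """Splits leaf texts into candidate labels vs. repeated instances."""
    p = PathCollector()
    p.feed(html)
    labels = {path: ts for path, ts in p.groups.items() if len(ts) == 1}
    instances = {path: ts for path, ts in p.groups.items() if len(ts) > 1}
    return labels, instances

html = """<html><body>
<h2>Courses</h2>
<ul><li>CS101</li><li>CS202</li><li>CS303</li></ul>
</body></html>"""
labels, instances = partition(html)
```

Here the heading "Courses" surfaces as a label for the repeated list items beneath it, which mirrors how a hierarchical content structure pairs meta-data with instances.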
Service-based systems have many applications, such as e-business, health care, and homeland security. In these systems, it is necessary to provide users with the capability of composing services into workflows that provide higher-level functionality. In dynamic service-oriented computing environments, it is desirable that service composition be automated and situation-aware in order to generate robust and adaptive workflows. In this paper, an automated situation-aware service composition approach is presented. This approach is based on the a-logic, the a-calculus, and a declarative model for situation awareness (SAW). It consists of four major components: (1) analyzing SAW requirements using our SAW model; (2) translating our SAW model representation into a-logic specifications and specifying a control flow graph in a-logic as the service composition goal; (3) automated synthesis of a-calculus terms defining situation-aware workflow agents, based on the a-logic specifications for the SAW requirements and the control flow graph; and (4) compilation of the a-calculus terms into executable components.
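The pipeline above can be caricatured in a few lines: the composition goal is a sequence of steps, each step offers alternative services guarded by SAW conditions, and "synthesis" selects the alternative whose guard holds in the current situation. This is a deliberately simplified sketch under those assumptions; the service names, guards, and data shapes are all hypothetical, and it does not reproduce the a-logic/a-calculus machinery.

```python
# Hedged sketch of situation-aware composition: for each step in the goal,
# pick the first candidate service whose SAW guard is satisfied by the
# current situation (a dict of boolean facts). All names are illustrative.
def synthesize(goal, situation):
    """goal: list of steps; each step is a list of (service, guard) pairs.
    Returns the ordered list of services enabled in this situation."""
    workflow = []
    for step in goal:
        for service, guard in step:
            if guard(situation):
                workflow.append(service)
                break
        else:
            raise ValueError("no service enabled for this step")
    return workflow

# A toy health-care goal: always triage, then branch on the situation.
goal = [
    [("triage", lambda s: True)],
    [("dispatch_ambulance", lambda s: s["emergency"]),
     ("schedule_visit", lambda s: not s["emergency"])],
]
workflow = synthesize(goal, {"emergency": True})
```

Changing the situation to `{"emergency": False}` yields a different workflow from the same goal, which is the adaptivity the abstract describes.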
The Web has established itself as the largest public data repository ever available. Even though the vast majority of information on the Web is formatted to be easily readable by the human eye, "meaningful information" is still largely inaccessible to computer applications. In this paper, we present automated algorithms to gather meta-data and instance information by utilizing global regularities on the Web and incorporating contextual information. Experimental evaluations were successfully performed on the TAP knowledge base and the faculty-course home pages of computer science departments, containing 16,861 Web pages in total. The system achieves this performance without any domain-specific engineering.
Abstract. In this paper, we present a fast and scalable Bayesian model for improving weakly annotated data, which is typically generated by a (semi-)automated information extraction (IE) system from Web documents. Weakly annotated data suffers from two major problems: it (i) might contain incorrect ontological role assignments, and (ii) might have many missing attributes. Our experimental evaluations with the TAP and RoadRunner data sets, and a collection of 20,000 home pages from university, shopping, and sports Web sites, indicate that the model described here can improve the accuracy of role assignments from 40% to 85% for template-driven sites and from 68% to 87% for non-template-driven sites. The Bayesian model is also shown to be useful for improving the performance of IE systems by informing them with additional domain information.
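One simple way to see how a statistical domain model can repair incorrect role assignments is a naive Bayes sketch: fit priors and token likelihoods from the (weak) annotations, then re-score each record and reassign it to the most probable role. This is a minimal illustration under that assumption, not the paper's actual model; the roles, tokens, and smoothing choice are hypothetical.

```python
# Hedged sketch: a naive Bayes domain model used to re-score weakly
# annotated (role, tokens) records and correct likely mislabelings.
from collections import Counter, defaultdict
import math

def train(records):
    """records: list of (role, tokens). Returns priors and per-role
    token counts (the 'domain model')."""
    priors, likelihoods = Counter(), defaultdict(Counter)
    for role, tokens in records:
        priors[role] += 1
        likelihoods[role].update(tokens)
    return priors, likelihoods

def best_role(tokens, priors, likelihoods, alpha=1.0):
    """Picks the role maximizing log P(role) + sum_t log P(t | role),
    with Laplace smoothing parameter alpha."""
    vocab = {t for c in likelihoods.values() for t in c}
    def score(role):
        total = sum(likelihoods[role].values())
        s = math.log(priors[role] / sum(priors.values()))
        for t in tokens:
            s += math.log((likelihoods[role][t] + alpha) /
                          (total + alpha * len(vocab)))
        return s
    return max(priors, key=score)

# Toy weak annotations from a department site.
data = [
    ("course", ["cs101", "intro", "programming"]),
    ("course", ["cs202", "data", "structures"]),
    ("faculty", ["dr", "smith", "professor"]),
    ("faculty", ["dr", "jones", "professor"]),
]
priors, likes = train(data)
# A record the base extractor mislabeled is reassigned by the model.
corrected = best_role(["dr", "lee", "professor"], priors, likes)
```

In the same spirit, missing attributes can be filled by taking the highest-probability value under the fitted model, which is the second repair the abstract mentions.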