Qualitative effects of knowledge rules and user feedback in probabilistic data integration

Keulen, Maurice van; Keijzer, Ander de

doi:10.1007/s00778-009-0156-z

Cited by 42 publications

(31 citation statements)

References 29 publications


“…Generating more XPaths may improve the ability to find the most suitable XPath. Also, the use of a probabilistic database approach may be able to more robustly address ambiguous situations [17,18].…”

Section: Discussion (mentioning)

confidence: 99%

Sample-based XPath Ranking for Web Information Extraction

Jundt, Keulen

2013

Proceedings of the 8th Conference of the European Society for Fuzzy Logic and Technology

Self Cite


Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract some information from web pages at a specific website. Manually creating and maintaining wrappers is a cumbersome and error-prone task. It may even be prohibitive as some applications require information extraction from previously unseen websites. This paper targets automatic on-the-fly wrapper creation for websites that provide attribute data for objects in a 'search - search result page - detail page' setup. It is a wrapper induction approach which uses a small and easily obtainable set of sample data for ranking XPaths on their suitability for extracting the wanted attribute data. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted data. Moreover, it appears that 20 to 25 input samples suffice for finding a suitable XPath for an attribute.


“…- A developer gradually and interactively defines an ontology with positive and negative knowledge about the correctness of certain (combinations of) annotations. At each iteration, added knowledge is immediately applied improving the extraction result until the result is good enough (see also [17]). - Storage, querying and manipulation of annotations should be scalable.…”

Section: Future Research Directions (mentioning)

confidence: 99%

Uncertainty Handling in Named Entity Extraction and Disambiguation for Informal Text

Keulen, Habib

2014

Uncertainty Reasoning for the Semantic Web III

Self Cite


Abstract. Social media content represents a large portion of all textual content appearing on the Internet. These streams of user-generated content (UGC) provide an opportunity and challenge for media analysts to analyze huge amounts of new data and use them to infer and reason with new information. A main challenge of natural language is its ambiguity and vagueness. To automatically resolve ambiguity, the grammatical structure of sentences is used. However, when we move to informal language widely used in social media, the language becomes more ambiguous and thus more challenging for automatic understanding. Information Extraction (IE) is the research field that enables the use of unstructured text in a structured way. Named Entity Extraction (NEE) is a subtask of IE that aims to locate phrases (mentions) in the text that represent names of entities such as persons, organizations or locations regardless of their type. Named Entity Disambiguation (NED) is the task of determining which correct person, place, event, etc. is referred to by a mention. The goal of this paper is to provide an overview of some approaches that mimic the human way of recognition and disambiguation of named entities especially for domains that lack formal sentence structure. The proposed methods open the doors for more sophisticated applications based on users' contributions on social media. We propose a robust combined framework for NEE and NED in semiformal and informal text. The achieved robustness has been proven to be valid across languages and domains and to be independent of the selected extraction and disambiguation techniques. It is also shown to be robust against the informality of the language used. We have discovered a reinforcement effect and exploited it in a technique that improves extraction quality by feeding back disambiguation results. We present a method of handling the uncertainty involved in extraction to improve the disambiguation results.


“…To represent the result of the integration, we need a way to capture the uncertainty in the schema mappings, in deduplication, or in resolving conflicting information. This uncertainty can be characterized by probabilistic mappings [26] and probabilistic data integration rules [38,39]. The outcome of the integration process can naturally be viewed as probabilistic XML (which is useful to query, update, and so on).…”

Section: Probabilistic XML Applications (mentioning)

confidence: 99%

Probabilistic XML: Models and Complexity

Kimelfeld, Senellart

2013

Advances in Probabilistic Databases for Uncertain Information Management


Abstract. Uncertainty in data naturally arises in various applications, such as data integration and Web information extraction. Probabilistic XML is one of the concepts that have been proposed to model and manage various kinds of uncertain data. In essence, a probabilistic XML document is a compact representation of a probability distribution over ordinary XML documents. Various models of probabilistic XML provide different languages, with various degrees of expressiveness, for such compact representations. Beyond representation, probabilistic XML systems are expected to support data management in a way that properly reflects the uncertainty. For instance, query evaluation entails probabilistic inference, and update operations need to properly change the entire probability space. Efficiently and effectively accomplishing data-management tasks in that manner is a major technical challenge. This chapter reviews the literature on probabilistic XML. Specifically, this chapter discusses the probabilistic XML models that have been proposed, and the complexity of query evaluation therein. Also discussed are other data-management tasks like updates and compression, as well as systemic and implementation aspects.
