2006
DOI: 10.1007/11687238_46
XML Duplicate Detection Using Sorted Neighborhoods

Cited by 38 publications (33 citation statements)
References 15 publications
“…adaptive vs. constant) for maximal performance (Puhlmann, Weis & Naumann, 2006; Yan, Lee, Kan & Giles, 2007).¹⁹ A major trend has been the proposal of SN algorithms that run on distributed architectures (Ma & Yang, 2015).…” [Footnote 19: A reasonable assumption, since a window size of < 10 was found to be empirically sufficient (Hernández & Stolfo, 1998).]
Section: Sorted Neighborhood
confidence: 99%
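The snippet above refers to the classic Sorted Neighborhood (SN) method: records are given a sorting key, sorted, and then compared only within a small sliding window. The sketch below illustrates that idea only; the function names, the toy sorting key, and the crude match predicate are assumptions for illustration, not the cited papers' implementations.

```python
def sorted_neighborhood(records, key, is_match, window=10):
    """Return candidate duplicate pairs found within a sliding window.

    records  -- list of records (e.g. dicts)
    key      -- function building a sorting key from a record
    is_match -- pairwise comparison predicate
    window   -- window size; < 10 was found empirically sufficient
                in Hernández & Stolfo's experiments
    """
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        # Compare each record only against the next (window - 1)
        # records in sort order, not against the whole dataset.
        for other in ordered[i + 1 : i + window]:
            if is_match(rec, other):
                pairs.append((rec, other))
    return pairs


# Tiny illustrative run (hypothetical data).
people = [
    {"name": "John Smith", "city": "Berlin"},
    {"name": "Jon Smith", "city": "Berlin"},
    {"name": "Jane Doe", "city": "Paris"},
]
dupes = sorted_neighborhood(
    people,
    key=lambda r: (r["name"][:3], r["city"]),  # crude sorting key
    is_match=lambda a, b: a["city"] == b["city"]
    and a["name"].split()[-1] == b["name"].split()[-1],
)
# dupes pairs "John Smith" with "Jon Smith" only.
```

The window keeps the number of comparisons linear in the dataset size (O(n·w) instead of O(n²)), at the cost of missing duplicates whose keys sort far apart; the multi-pass and adaptive-window variants mentioned in the citation address exactly that trade-off.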
“…al. [10] proposed a unique method that automatically restructures database objects to take full advantage of the relations among their attributes. This new structure of objects reflects the relative importance of the attributes in the database and avoids manual selection.…”
Section: Duplicate Detection Through Structure Optimization
confidence: 99%
“…Nevertheless, and because of its more general nature, their approach does not take advantage of the useful features existing in XML databases, such as the element structure or tag semantics. Only more recently has research been performed with the specific goal of discovering duplicate object representations in XML databases [5], [6], [8], [10]. These works differ from previous approaches since they were specifically designed to exploit the distinctive characteristics of XML object representations: their structure, textual content, and the semantics implicit in the XML labels.…”
Section: III
confidence: 99%
“…The problem discussed in this paper can be viewed as a graph-based deduplication problem (see [21] for a recent survey), where one of the objects under study is described in a structured form (the enterprise ontology), whereas the other is described in an unstructured fashion (forum entries, query). Recent work has started to address less rigidly structured instances, such as XML objects (e.g., [45]). We are not aware of deduplication approaches encompassing unstructured and structured data.…”
Section: Related Work
confidence: 99%