Structured entities are commonly abstracted, such as from XML, RDF or hidden-web databases. Direct retrieval of various structured entities is highly demanded in data lakes, e.g., given a JSON object, to find the XML entities that denote the same real-world object. Existing approaches on evaluating structured entity similarity emphasize too much the structural inconsistency. Indeed, entities from heterogeneous sources could have very distinct structures, owing to various information representation conventions. We argue that the retrieval could be more tolerant to structural differences and focus more on the contents of the entities. In this paper, we first identify the unique challenge of parent-child (containment) relationships among structured entities, which unfortunately prevent the retrieval of proper entities (returning parents or children). To solve the problem, a novel hierarchy smooth function is proposed to combine the term scores in different nodes of a structured entity. Entities sharing the same structure, namely an entity family, are employed to learn the coefficient in aggregating the scores, and thus distinguish/prune the parent or child entities. Remarkably, the proposed method could cooperate with both the bag-of-words (BOW) and word embedding models, successful in retrieving unstructured documents, for querying structured entities. Extensive experiments on real datasets demonstrate that our proposal is effective and efficient.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.