Structured entities are commonly abstracted from sources such as XML, RDF, or hidden-web databases. Direct retrieval of heterogeneous structured entities is in high demand in data lakes, e.g., finding the XML entities that denote the same real-world object as a given JSON object. Existing approaches to evaluating structured entity similarity place too much emphasis on structural inconsistency. Indeed, entities from heterogeneous sources can have very distinct structures, owing to differing conventions for representing information. We argue that retrieval should be more tolerant of structural differences and focus more on the contents of the entities. In this paper, we first identify the unique challenge of parent-child (containment) relationships among structured entities, which unfortunately prevent the retrieval of the proper entities (parents or children are returned instead). To solve the problem, a novel hierarchy smooth function is proposed to combine the term scores in different nodes of a structured entity. Entities sharing the same structure, namely an entity family, are employed to learn the coefficient used in aggregating the scores, and thus to distinguish and prune the parent or child entities. Remarkably, the proposed method can work with both the bag-of-words (BOW) and word embedding models, which have been successful in retrieving unstructured documents, to query structured entities. Extensive experiments on real datasets demonstrate that our proposal is effective and efficient.
For large-scale, real-time processing, cloud systems are widely used because of their high scalability and availability. In this paper, we propose workload distribution methods for location-based event processing on cloud systems. We define a measure of the workload and focus on its balanced distribution, because in cloud systems the workload distribution strongly affects system performance. For the balanced distribution of workload, we propose four methods: (1) round-robin data distribution, (2) round-robin query distribution, (3) data/query distribution via space partitioning, and (4) skew-aware distribution. The round-robin data distribution method focuses on a balanced distribution of event data, whereas queries are replicated on all cluster nodes. In the round-robin query distribution method, queries are evenly distributed, whereas the event data is replicated. The data/query distribution via space partitioning method distributes event data and queries based on their spatial attribute values. Lastly, the skew-aware distribution method considers the non-uniformity of event data and queries. Through extensive experiments, we evaluate the performance of the proposed methods.
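The first three distribution methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the representation of events as (x, y) points and queries as (x_lo, y_lo, x_hi, y_hi) rectangles, and the equal-width strips along the x-axis are all assumptions made for illustration. The skew-aware method is omitted because the abstract does not describe how it works.

```python
def _empty_nodes(n_nodes):
    # Hypothetical node layout: each node holds a list of events and queries.
    return [{"events": [], "queries": []} for _ in range(n_nodes)]


def round_robin_data(events, queries, n_nodes):
    """Spread events evenly across nodes; replicate every query on every node."""
    nodes = _empty_nodes(n_nodes)
    for node in nodes:
        node["queries"] = list(queries)
    for i, event in enumerate(events):
        nodes[i % n_nodes]["events"].append(event)
    return nodes


def round_robin_query(events, queries, n_nodes):
    """Spread queries evenly across nodes; replicate every event on every node."""
    nodes = _empty_nodes(n_nodes)
    for node in nodes:
        node["events"] = list(events)
    for i, query in enumerate(queries):
        nodes[i % n_nodes]["queries"].append(query)
    return nodes


def space_partition(events, queries, n_nodes, x_min, x_max):
    """Assign events and queries to equal-width strips along the x-axis.

    Assumption: one strip per node. A query rectangle is placed on every
    strip it overlaps, so a query spanning a boundary is stored more than once.
    """
    width = (x_max - x_min) / n_nodes

    def strip(x):
        # Clamp so that x == x_max falls into the last strip.
        return min(n_nodes - 1, max(0, int((x - x_min) / width)))

    nodes = _empty_nodes(n_nodes)
    for event in events:
        nodes[strip(event[0])]["events"].append(event)
    for query in queries:
        for s in range(strip(query[0]), strip(query[2]) + 1):
            nodes[s]["queries"].append(query)
    return nodes


if __name__ == "__main__":
    events = [(1, 1), (2, 2), (6, 6), (9, 9)]
    queries = [(0, 0, 3, 3), (4, 4, 10, 10)]
    for node in space_partition(events, queries, 2, 0, 10):
        print(len(node["events"]), "events,", len(node["queries"]), "queries")
```

In this sketch the two round-robin variants trade replication for balance: one copies all queries to every node, the other copies all events. Space partitioning avoids full replication but can become unbalanced when events or queries cluster in a few strips, which is the situation the skew-aware method is described as addressing.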