Leong Hou U scite author profile

et al. 2015

Discovering motifs in sequence databases has received considerable attention from both the database and data mining communities, where a motif is the most correlated pair of subsequences in a sequence object. Motif discovery is expensive for emerging applications, which may have very long sequences (e.g., a million observations per sequence) or rapidly arriving queries (e.g., one every 10 seconds). Prior work cannot offer fast correlation computation and prune subsequence pairs at the same time, as these two techniques require different orderings for examining subsequence pairs. In this work, we propose a novel framework named Quick-Motif, which adopts a two-level approach to enable batch pruning at the outer level and fast correlation calculation at the inner level. We further propose two optimization techniques for the outer and the inner levels. In our experimental study, our method is up to 3 orders of magnitude faster than the state-of-the-art methods.
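To make the motif definition concrete, the following is a minimal brute-force sketch of finding the most correlated pair of subsequences for a fixed length. It is not the Quick-Motif framework: it has no batch pruning and no incremental correlation computation. The function names `pearson` and `brute_force_motif`, the fixed length `m`, and the non-overlap rule are assumptions made for illustration.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

def brute_force_motif(seq, m):
    """Return (correlation, i, j) for the most correlated pair of
    non-overlapping length-m subsequences seq[i:i+m] and seq[j:j+m].

    Hypothetical baseline: examines every pair, so it costs
    O(n^2 * m) time for a sequence of n observations.
    """
    best = (-2.0, -1, -1)  # correlations lie in [-1, 1], so -2 is a safe sentinel
    last_start = len(seq) - m
    for i in range(last_start + 1):
        # Start j at i + m so the two subsequences never overlap.
        for j in range(i + m, last_start + 1):
            r = pearson(seq[i:i + m], seq[j:j + m])
            if r > best[0]:
                best = (r, i, j)
    return best

# The pattern [0, 1, 5, 2] occurs at offsets 0 and 7.
series = [0, 1, 5, 2, 9, 9, 3, 0, 1, 5, 2, 7]
motif = brute_force_motif(series, 4)
```

The quadratic number of pairs is what makes this baseline impractical for sequences with a million observations, which is the cost the abstract refers to.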

Identifying points of interest by self-tuning clustering

Yang

Gong

2011

Deducing trip-related information from web-scale datasets has received a great deal of attention recently. Identifying points of interest (POIs) in geo-tagged photos is one of these problems. The problem can be viewed as a standard clustering problem of partitioning two-dimensional objects. In this work, we study spectral clustering, in the first attempt to apply it to POI identification. However, there is no unified approach to assigning the clustering parameters, especially because the features of POIs vary immensely across metropolitan areas and locations. To address this, we study a self-tuning technique that can properly assign the parameters needed for clustering. Besides geographical information, web photos inherently store rich information. These pieces of information influence one another and should be taken into account in trip-related mining tasks. To address this, we study reinforcement, which constructs the relationship over multiple sources by iterative learning. Finally, we thoroughly demonstrate our findings on web-scale datasets collected from Flickr.

Discovering longest-lasting correlation in sequence databases

et al. 2013

Proc. VLDB Endow.

Most existing work on sequence databases uses correlation (e.g., Euclidean distance and Pearson correlation) as a core function for various analytical tasks. Typically, it requires users to set a length for the similarity queries. However, there is no reliable way to define the proper length for different application needs. In this work we focus on discovering longest-lasting highly correlated subsequences in sequence databases, which is particularly useful for analyses that lack prior knowledge about the query length. Surprisingly, there has been limited work on this problem. A baseline solution is to calculate the correlations for every possible subsequence combination. Obviously, this brute-force solution does not scale to large datasets. In this work we study a space-constrained index that gives a tight correlation bound for subsequences of similar length and offset by means of intra-object grouping and inter-object grouping techniques. To the best of our knowledge, this is the first index to support normalized distance metrics on arbitrary-length subsequences. Extensive experimental evaluation on both real and synthetic sequence datasets verifies the efficiency and effectiveness of our proposed methods.
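The brute-force baseline mentioned above can be sketched as follows, under simplifying assumptions that are not taken from the paper: only two sequences are compared, the subsequences are aligned at the same offset, Pearson correlation is the measure, and a hypothetical `min_len` parameter excludes very short windows (any two-point window is trivially perfectly correlated). The index with intra-object and inter-object grouping is not shown.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

def longest_correlated_window(a, b, threshold, min_len=3):
    """Return (start, length) of the longest aligned window whose Pearson
    correlation is at least `threshold`, or None if no window qualifies.

    Hypothetical baseline: tries every (start, length) combination,
    longest lengths first, so the first hit is a longest window.
    """
    n = min(len(a), len(b))
    for length in range(n, min_len - 1, -1):
        for start in range(n - length + 1):
            window_a = a[start:start + length]
            window_b = b[start:start + length]
            if pearson(window_a, window_b) >= threshold:
                return start, length
    return None

# b is exactly 2 * a over the first five observations, then diverges.
a = [1, 2, 3, 4, 5, 1, 9, 2]
b = [2, 4, 6, 8, 10, 7, 0, 5]
window = longest_correlated_window(a, b, threshold=0.999)
```

With O(n^2) candidate windows and O(n) work per correlation, the sketch is cubic in the sequence length for a single pair, which illustrates why the baseline does not scale.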

Efficient proximity detection among mobile users via self-tuning policies

Šaltenis

et al. 2010

Proc. VLDB Endow.

Given a set of users, their friend relationships, and a distance threshold per friend pair, the proximity detection problem is to find each pair of friends such that the Euclidean distance between them is within the given threshold. This problem plays an essential role in friend-locator applications and massively multiplayer online games. Existing proximity detection solutions either incur substantial location update costs or do not scale well to a large number of users. Motivated by this, we present a centralized proximity detection solution that assigns each mobile client a mobile region. We then design a self-tuning policy to adjust the radius of the region automatically, in order to minimize communication cost. In addition, we analyze the communication cost of our solutions and provide valuable insights into their behavior. Extensive experiments suggest that our proposed solution is efficient and robust with respect to various parameters.
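One way a server might use per-client regions to avoid location updates is sketched below. It assumes static circular regions, each known to contain its client, and classifies a friend pair from the regions alone. This is a simplification for illustration: the paper's mobile regions and its self-tuning radius policy are not reproduced, and `classify_pair` is a hypothetical name.

```python
from math import hypot

def classify_pair(center_a, radius_a, center_b, radius_b, threshold):
    """Decide a friend pair's proximity status from circular regions only.

    Each client is assumed to lie somewhere inside its region, so the true
    distance is bounded by d - r_a - r_b below and d + r_a + r_b above,
    where d is the distance between the region centers.
    """
    d = hypot(center_a[0] - center_b[0], center_a[1] - center_b[1])
    if d + radius_a + radius_b <= threshold:
        return "within"   # even the farthest possible positions are close enough
    if d - radius_a - radius_b > threshold:
        return "apart"    # even the closest possible positions are too far
    return "probe"        # regions are inconclusive; exact locations are needed

# Centers are 5 units apart and each region has radius 1.
status_close = classify_pair((0, 0), 1, (3, 4), 1, threshold=10)
status_far = classify_pair((0, 0), 1, (3, 4), 1, threshold=2)
status_unknown = classify_pair((0, 0), 1, (3, 4), 1, threshold=5)
```

The sketch shows the trade-off a radius policy has to balance: larger regions mean fewer updates from clients leaving their regions, but more inconclusive pairs that require probing.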

An experimental study on hub labeling based shortest path algorithms

et al. 2017

Proc. VLDB Endow.

Shortest path distance retrieval is a core component in many important applications. For a decade, hub labeling (HL) techniques have been considered a practical solution with fast query response time (e.g., 1 to 3 orders of magnitude faster), competitive indexing time, and slightly larger storage overhead (e.g., several times larger). These techniques raise query throughput to as much as hundreds of thousands of queries per second, which is particularly helpful in environments with many users. Despite the importance of HL techniques, we are not aware of any comprehensive experimental study of them. Thus it is difficult for a practitioner to adopt HL techniques for her applications. To address the above issues, we provide a comprehensive experimental study of the state-of-the-art HL techniques, with an analysis of their efficiency, effectiveness and applicability. Drawing on an insightful summary of the different HL techniques, we further develop a simple yet effective HL technique called Significant path based Hub Pushing (SHP), which greatly improves on the indexing time of previous techniques while retaining good query performance. We also provide extensive comparisons between HL techniques and other shortest path solutions to demonstrate the robustness and efficiency of HL techniques.
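The fast query time of hub labeling comes from answering a distance query with label lookups alone. A minimal sketch of the generic query step for an undirected graph follows; it assumes labels are already built and stored as dictionaries mapping each hub to its distance, and it does not show label construction or SHP. The name `hl_distance` and the toy labels are illustrative.

```python
def hl_distance(label_s, label_t):
    """Return the shortest-path distance implied by two hub labels.

    Each label maps hub vertex -> distance from the labelled vertex.
    If the labels cover every shortest path, the answer is the minimum
    of d(s, h) + d(h, t) over hubs h present in both labels.
    Returns infinity when the labels share no hub.
    """
    # Iterate over the smaller label and look up hubs in the larger one.
    if len(label_s) > len(label_t):
        label_s, label_t = label_t, label_s
    best = float("inf")
    for hub, dist_s in label_s.items():
        dist_t = label_t.get(hub)
        if dist_t is not None and dist_s + dist_t < best:
            best = dist_s + dist_t
    return best

# Toy labels for the unit-weight path graph a - b - c, with b as the shared hub.
labels = {
    "a": {"a": 0, "b": 1},
    "b": {"b": 0},
    "c": {"c": 0, "b": 1},
}
dist_ac = hl_distance(labels["a"], labels["c"])
```

Query cost depends only on the label sizes, not on the graph size, which is why storage overhead and query speed are traded against each other in these techniques.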