Many applications need to query a large text database for similar strings or substrings. Traditional approximate substring matching requires the user to specify a similarity threshold. Without top-k approximate substring matching, users must repeatedly try different maximum distance thresholds when the proper threshold is not known in advance. In this paper, we first propose efficient algorithms for finding the top-k approximate substring matches for a given query string in a set of data strings. To reduce the number of expensive distance computations, the proposed algorithms use novel filtering techniques that take advantage of q-grams and available inverted q-gram indexes. We conduct extensive experiments with real-life data sets. The experimental results confirm the effectiveness and scalability of the proposed algorithms.
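The general idea of q-gram filtering for top-k search can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes edit distance as the distance measure, returns the best-matching substring distance per data string, and uses the standard count-filter bound (one edit destroys at most q of the query's q-grams). All function names (`qgrams`, `substring_edit_distance`, `build_index`, `top_k`) are hypothetical.

```python
import heapq


def qgrams(s, q):
    """All overlapping substrings of length q (empty list if len(s) < q)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def substring_edit_distance(query, text):
    """Sellers' algorithm: smallest edit distance between `query`
    and any substring of `text`."""
    prev = [0] * (len(text) + 1)  # an empty query matches anywhere at cost 0
    for i, qc in enumerate(query, 1):
        curr = [i] + [0] * len(text)
        for j, tc in enumerate(text, 1):
            curr[j] = min(prev[j - 1] + (qc != tc),  # match or substitute
                          prev[j] + 1,               # skip a query character
                          curr[j - 1] + 1)           # skip a text character
        prev = curr
    return min(prev)


def build_index(strings, q):
    """Inverted q-gram index: q-gram -> set of ids of strings containing it."""
    index = {}
    for sid, s in enumerate(strings):
        for g in set(qgrams(s, q)):
            index.setdefault(g, set()).add(sid)
    return index


def top_k(query, strings, k, q=2):
    """Return ([(distance, string_id), ...], number_of_distance_computations)."""
    index = build_index(strings, q)
    grams = qgrams(query, q)
    # shared[sid] = number of query q-gram positions whose q-gram occurs in the string
    shared = [0] * len(strings)
    for g in grams:
        for sid in index.get(g, ()):
            shared[sid] += 1
    # Visit strings with the most shared q-grams first.
    order = sorted(range(len(strings)), key=lambda sid: -shared[sid])
    heap = []  # max-heap on distance, stored as (-distance, string_id)
    computed = 0
    for sid in order:
        # Count filter: each edit destroys at most q query q-grams, so
        # distance >= ceil((len(grams) - shared) / q).
        lower_bound = max(0, -(-(len(grams) - shared[sid]) // q))
        if len(heap) == k and lower_bound >= -heap[0][0]:
            break  # bounds only grow along `order`; nothing left can improve
        d = substring_edit_distance(query, strings[sid])
        computed += 1
        if len(heap) < k:
            heapq.heappush(heap, (-d, sid))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, sid))
    return sorted((-nd, sid) for nd, sid in heap), computed
```

Calling `top_k("matchng", data_strings, k=2)` needs no distance threshold: the current k-th best distance plays that role and tightens as the search proceeds, which is what lets the filter skip distance computations.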
Social network services (SNSs) such as Twitter and Facebook have emerged as a new medium for communication. They offer a unique mechanism for sharing information by allowing users to receive all messages posted by those whom they "follow". Because information in today's SNSs often spreads in the form of hashtags, detecting rapidly spreading hashtags has recently attracted much attention. In this paper, we propose realistic epidemic models to describe the probabilistic process of hashtag propagation. Our models take into account the way users communicate in SNSs; moreover, they consider the influence of external media and separate it from internal diffusion within the network. Based on the proposed models, we develop efficient inference algorithms that measure the propagation rates of hashtags in social networks. Using real-life social network data that includes hashtags, together with synthetic data obtained by simulating information diffusion, we show that the proposed algorithms find fast-spreading hashtags more accurately than existing algorithms. Moreover, our in-depth case study demonstrates that our algorithms correctly find the internal diffusion rates of hashtags as well as external media influences.
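To make the distinction between internal diffusion and external influence concrete, here is a minimal simulation sketch. It is not the paper's model: it assumes a simple discrete-time process in which each non-adopting user independently adopts the hashtag from each adopting followee with probability `beta`, or from outside the network with probability `ext`. The function `simulate_hashtag` and all parameter names are hypothetical.

```python
import random


def simulate_hashtag(followees, beta, ext, steps, seeds=(), rng=None):
    """Simulate hashtag adoption over a follower graph.

    followees: dict mapping every user to the list of users they follow
    beta:      per-step chance of adopting from one adopting followee (internal)
    ext:       per-step chance of adopting from outside the network (external)
    Returns the number of adopters after each step, starting with step 0.
    """
    rng = rng or random.Random(0)
    adopted = set(seeds)
    history = [len(adopted)]
    for _ in range(steps):
        new = set()
        for user, follows in followees.items():
            if user in adopted:
                continue
            exposed = sum(1 for f in follows if f in adopted)
            # The user stays a non-adopter only if the external source and
            # every adopting followee all fail to trigger adoption.
            p = 1.0 - (1.0 - ext) * (1.0 - beta) ** exposed
            if rng.random() < p:
                new.add(user)
        adopted |= new  # updates are synchronous: applied after the sweep
        history.append(len(adopted))
    return history
```

Setting `ext=0` isolates spreading along follow links, while setting `beta=0` leaves only external media influence. An inference procedure would aim to recover both rates from the observed adoption times.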