Abstract. We propose PolderCast, a P2P topic-based Pub/Sub system that is (a) fault-tolerant and robust, (b) scalable with respect to the number of nodes interested in a topic and the number of topics that nodes are interested in, and (c) fast in terms of dissemination latency while (d) attaining a low communication overhead. This combination of properties is provided by an implementation that blends deterministic propagation over maintained rings with probabilistic dissemination following a limited number of random shortcuts. The rings are constructed and maintained using gossiping techniques. The random shortcuts are provided by two distinct peer-sampling services: Cyclon generates purely random links while Vicinity produces interest-induced random links. We analyze PolderCast and survey it in the context of existing approaches. We evaluate PolderCast experimentally using real-world workloads from Twitter and Facebook traces. We use the widely renowned Scribe [5] as a baseline in a number of experiments. Robustness with respect to node churn is evaluated through traces from the Skype superpeer network. We show that the experimental results corroborate all of the above properties in settings of up to 10K nodes, 10K topics, and 5K topics per node.
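The blend of deterministic ring propagation and random shortcuts described in the abstract above can be illustrated with a toy simulation. This is a minimal sketch and not the PolderCast protocol: the function name `disseminate`, the `shortcuts` parameter, and the single static ring are assumptions made for illustration, and shortcuts are drawn uniformly at random rather than by Cyclon or Vicinity.

```python
import random
from collections import deque


def disseminate(num_nodes, shortcuts=1, seed=0):
    """Flood one publication from node 0 over a ring plus random shortcuts.

    Hypothetical toy model: every node forwards the publication exactly once,
    to both ring neighbours and to `shortcuts` uniformly random nodes.
    Returns (hop count at which the last node is reached, messages sent).
    """
    rng = random.Random(seed)
    first_hop = {0: 0}          # node id -> hop at which it first received
    queue = deque([0])          # FIFO order keeps hop counts non-decreasing
    messages = 0
    while queue:
        node = queue.popleft()
        # Deterministic part: the two ring neighbours guarantee full coverage.
        targets = [(node - 1) % num_nodes, (node + 1) % num_nodes]
        # Probabilistic part: random shortcuts can only shorten the path.
        targets += [rng.randrange(num_nodes) for _ in range(shortcuts)]
        for target in targets:
            messages += 1
            if target not in first_hop:
                first_hop[target] = first_hop[node] + 1
                queue.append(target)
    return max(first_hop.values()), messages


ring_only = disseminate(100, shortcuts=0)
with_shortcuts = disseminate(100, shortcuts=1)
print(ring_only, with_shortcuts)
```

In this model the ring alone needs 50 hops to reach all 100 nodes, while adding shortcuts never increases the hop count and costs one extra message per node per shortcut. That is the latency versus overhead trade-off the abstract refers to.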
Spotify is a peer-assisted music streaming service that has gained worldwide popularity. Apart from providing instant access to over 20 million music tracks, Spotify also enhances its users' music experience by providing various features for social interaction. These are realized by a system using the widely-adopted pub/sub paradigm. In this paper we provide an interesting case study of a hybrid pub/sub system designed for real-time as well as offline notifications for Spotify users. We firstly describe a multitude of use cases where pub/sub is applied. Secondly, we study the design of its pub/sub system used for matching, disseminating and persisting billions of publications every day. Finally, we study pub/sub traffic collected from the production system, derive characterizations of the pub/sub workload, and show some interesting findings and trends.
There has been significant recent progress in unsupervised network representation learning (UNRL) approaches over graphs, with flexible random-walk approaches, new optimization objectives and deep architectures. However, there is no common ground for systematic comparison of embeddings to understand their behavior for different graphs and tasks. We argue that most of the UNRL approaches model and exploit either the neighborhood or what we call the context information of a node. These methods largely differ in their definitions and exploitation of context. Consequently, we propose a framework that casts a variety of approaches (random-walk based, matrix factorization and deep learning based) into a unified context-based optimization function. We systematically group the methods based on their similarities and differences. We study their differences, which we later use to explain their performance differences on downstream tasks. We conduct a large-scale empirical study considering 9 popular and recent UNRL techniques and 11 real-world datasets with varying structural properties and two common tasks: node classification and link prediction. We find that for non-attributed graphs there is no single method that is a clear winner and that the choice of a suitable method is dictated by certain properties of the embedding methods, the task, and the structural properties of the underlying graph. In addition, we report the common pitfalls in the evaluation of UNRL methods and offer suggestions for experimental design and interpretation of results. Comprehensive Experimental Evaluation. In our evaluation of UNRL methods we investigate the conceptual differences between the embedding approaches that result in performance differences on downstream tasks. First, using graphs with diverse structural characteristics, we argue about the utility of several approaches. (arXiv:1903.07902v5 [cs.LG])
MapReduce is a computing paradigm that has gained a lot of attention in recent years from industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapReduce processes tasks in a scan-oriented fashion. Hence, the performance of Hadoop, an open-source implementation of MapReduce, often does not match that of a well-configured parallel DBMS. In this paper we propose a new type of system named Hadoop++: it boosts task performance without changing the Hadoop framework at all (Hadoop does not even 'notice it'). To reach this goal, rather than changing a working system (Hadoop), we inject our technology at the right places through UDFs only and affect Hadoop from inside. This has three important consequences: First, Hadoop++ significantly outperforms Hadoop. Second, any future changes of Hadoop may directly be used with Hadoop++ without rewriting any glue code. Third, Hadoop++ does not need to change the Hadoop interface. Our experiments show the superiority of Hadoop++ over both Hadoop and HadoopDB for tasks related to indexing and join processing.
Misinformation such as fake news has drawn a lot of attention in recent years. It has serious consequences for society, politics and the economy. This has led to a rise of manual fact-checking websites such as Snopes and Politifact. However, the scale of misinformation limits their ability to verify claims. In this demonstration, we propose BRENDA, a browser extension that can be used to automate the entire process of credibility assessment of false claims. Behind the scenes, BRENDA uses a tested deep neural network architecture to automatically identify fact-check-worthy claims, classify them, and present the result along with evidence to the user. Since BRENDA is a browser extension, it facilitates fast automated fact checking for the end user without having to leave the web page.
Publish/subscribe (pub/sub) is a popular communication paradigm in the design of large-scale distributed systems. A provider of a pub/sub service (whether centralized, peer-assisted, or based on a federated organization of cooperatively managed servers) commonly faces a fundamental challenge: given limited resources, how to maximize the satisfaction of subscribers? We provide, to the best of our knowledge, the first formal treatment of this problem by introducing two metrics that capture subscriber satisfaction in the presence of limited resources. This allows us to formulate matters as two new flavors of maximum coverage optimization problems. Unfortunately, both variants of the problem prove to be NP-hard. By subsequently providing formal approximation bounds and heuristics, we show, however, that efficient approximations can be attained. We validate our approach using real-world traces from Spotify and show that our solutions can be executed periodically in real-time in order to adapt to workload variations.
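For readers unfamiliar with maximum coverage, the textbook greedy heuristic for the classic version of the problem is sketched below. It is not the algorithm or the satisfaction metrics from the paper, which the abstract does not spell out. The function name `greedy_max_coverage`, the notion of a budget counted in topics served, and the toy data are assumptions for illustration only.

```python
def greedy_max_coverage(topic_subscribers, budget):
    """Pick up to `budget` topics that together cover the most subscribers.

    Hypothetical model: `topic_subscribers` maps a topic name to the set of
    subscriber ids it serves; serving one topic costs one unit of budget.
    This is the standard greedy heuristic for classic maximum coverage,
    which is known to achieve a (1 - 1/e) approximation for that problem.
    """
    remaining = dict(topic_subscribers)
    covered = set()
    chosen = []
    for _ in range(budget):
        if not remaining:
            break
        # Take the topic that adds the most not-yet-covered subscribers.
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:
            break  # no remaining topic adds anything new
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Made-up toy workload: four topics, seven subscribers.
topics = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}
print(greedy_max_coverage(topics, budget=2))
```

With a budget of two topics, the sketch first selects `c` (four new subscribers) and then `a` (three new subscribers), covering all seven subscribers in this toy workload.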
Publish/subscribe (pub/sub) is a popular communication paradigm in the design of large-scale distributed systems. A fundamental challenge in deploying pub/sub systems on a data center or a cloud infrastructure is efficient and cost-effective resource allocation that would allow delivery of notifications to all subscribers. In this paper, we provide answers to the following three fundamental questions: Given a pub/sub workload, (1) what is the minimum amount of resources needed to satisfy all the subscribers, (2) what is a cost-effective way to allocate resources for the given workload, and (3) what is the cost of hosting it on a public Infrastructure-as-a-Service (IaaS) provider like Amazon EC2? To answer these questions, we formulate a problem coined Minimum Cost Subscriber Satisfaction (MCSS). We prove MCSS to be NP-hard and provide an efficient heuristic solution based on a combination of optimizations. We evaluate the solution experimentally using real traces from Spotify and Twitter along with a pricing model from Amazon. We show the impact of each optimization using a naive solution as the baseline. Using a variety of practical scenarios for each dataset, we also show that our solution scales well for millions of subscribers and runs fast.