Consider the problem of estimating the Shannon entropy of a distribution over k elements from n independent samples. We show that the minimax mean-square error is within universal multiplicative constant factors of (k/(n log k))^2 + (log^2 k)/n if n exceeds a constant factor of k/log k; otherwise there exists no consistent estimator. This refines the recent result of Valiant-Valiant [VV11a] that the minimal sample size for consistent entropy estimation scales according to Θ(k/log k). The apparatus of best polynomial approximation plays a key role in both the construction of optimal estimators and, via a duality argument, the minimax lower bound.
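The minimax-optimal polynomial-approximation estimator is involved; as a point of reference, here is a minimal Python sketch of the naive plug-in estimator (whose bias the polynomial machinery is designed to reduce), together with the classical Miller-Madow bias correction. This is a baseline for illustration, not the estimator constructed in the paper.

```python
import math
from collections import Counter

def plugin_entropy(samples):
    """Plug-in (MLE) entropy estimate in nats: H(p_hat) for the empirical p_hat."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def miller_madow_entropy(samples):
    """Plug-in estimate with the first-order Miller-Madow bias correction,
    which adds (observed support size - 1) / (2n)."""
    n = len(samples)
    k_obs = len(set(samples))
    return plugin_entropy(samples) + (k_obs - 1) / (2 * n)

samples = ['a', 'a', 'b', 'b']
print(plugin_entropy(samples))       # log 2, the entropy of the empirical distribution
```

The plug-in estimator is biased downward, and in the regime n ≍ k/log k this bias dominates; that is the gap the polynomial-approximation construction closes.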
We consider the problem of estimating the support size of a discrete distribution whose minimum non-zero mass is at least 1/k. Under the independent sampling model, we show that the sample complexity, i.e., the minimal sample size to achieve an additive error of εk with probability at least 0.1, is within universal constant factors of (k/log k) log^2(1/ε), which improves the state-of-the-art result of k/(ε^2 log k) in [VV13]. A similar characterization of the minimax risk is also obtained. Our procedure is a linear estimator based on the Chebyshev polynomial and its approximation-theoretic properties, which can be evaluated in O(n + log^2 k) time and attains the sample complexity within a factor of six asymptotically. The superiority of the proposed estimator in terms of accuracy, computational efficiency, and scalability is demonstrated on a variety of synthetic and real datasets.
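A linear estimator of this kind acts on the fingerprint of the sample (the number of symbols seen exactly j times). A minimal sketch of that structure, with a placeholder coefficient function `g` standing in for the paper's Chebyshev-derived coefficients:

```python
from collections import Counter

def fingerprint(samples):
    """Fingerprint (profile) Phi, where Phi[j] = number of distinct
    symbols appearing exactly j times in the sample."""
    freq = Counter(samples)         # symbol -> count
    return dict(Counter(freq.values()))

def linear_estimate(samples, g):
    """Generic linear estimator: sum over j of g(j) * Phi[j].
    g(j) = 1 recovers the naive count of observed distinct symbols;
    the paper instead derives g from Chebyshev polynomial coefficients."""
    return sum(g(j) * mult for j, mult in fingerprint(samples).items())

samples = ['a', 'b', 'b', 'c', 'c', 'c']
print(fingerprint(samples))                    # {1: 1, 2: 1, 3: 1}
print(linear_estimate(samples, lambda j: 1))   # 3 distinct symbols observed
```

Since the fingerprint is computable in one pass and has at most O(sqrt(n)) distinct frequencies, estimators of this form are cheap to evaluate, consistent with the O(n + log^2 k) running time quoted above.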
Big data analysis has found applications in many industries due to its ability to turn huge amounts of data into insights for informed business and operational decisions. Advanced data mining techniques have been applied in many sectors of supply chains in the food industry. However, previous work has mainly focused on the analysis of instrument‐generated data such as those from hyperspectral imaging, spectroscopy, and biometric receptors. The importance of digital text data in food and nutrition has only recently gained attention due to advancements in big data analytics. The purpose of this review is to provide an overview of the data sources, computational methods, and applications of text data in the food industry. Text mining techniques such as word‐level analysis (e.g., frequency analysis), word association analysis (e.g., network analysis), and advanced techniques (e.g., text classification, text clustering, topic modeling, information retrieval, and sentiment analysis) will be discussed. Applications of text data analysis will be illustrated with respect to food safety and food fraud surveillance, dietary pattern characterization, consumer‐opinion mining, new‐product development, food knowledge discovery, food supply‐chain management, and online food services. The goal is to provide insights for intelligent decision‐making to improve food production, food safety, and human nutrition.
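Of the techniques listed, word-level frequency analysis is the simplest; a minimal sketch on a short, made-up snippet of review text (the text and the tokenizer are illustrative assumptions, not from the review):

```python
import re
from collections import Counter

def word_frequencies(text, top=5):
    """Word-level frequency analysis: lowercase, tokenize on letters, count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top)

reviews = "The soup was salty. Too salty, and the salty broth overwhelmed."
print(word_frequencies(reviews, top=2))   # [('salty', 3), ('the', 2)]
```

In practice a stop-word list would remove function words like "the" before counting, so that domain terms dominate the frequency table.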
Clos-based networks, including Fat-tree and VL2, are being built in data centers, but existing per-flow routing causes low network utilization and a long latency tail. In this paper, by studying the structural properties of Fat-tree and VL2, we propose a per-packet round-robin routing algorithm called Digit-Reversal Bouncing (DRB). DRB achieves perfect packet interleaving. Our analysis and simulations show that, compared with randomized load-balancing algorithms, DRB results in smaller, bounded queues even when the traffic load approaches 100%, and it uses a smaller re-sequencing buffer to absorb out-of-order packet arrivals. Our implementation demonstrates that the design can be readily built with commodity switches. Experiments on our testbed, a Fat-tree with 54 servers, confirm our analysis and simulations, and further show that the design handles network failures within 1-2 seconds and degrades gracefully in performance.
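The digit-reversal idea underlying DRB's path selection can be sketched as follows. This is an illustrative reimplementation of the digit-reversal permutation only (assuming radix-2 uplinks and 3 digit positions, i.e., 8 paths), not the paper's full routing logic:

```python
def digit_reverse(i, radix, num_digits):
    """Reverse the base-`radix` digits of i, padded to num_digits digits.
    E.g., with radix=2, num_digits=3: 1 (001) -> 4 (100)."""
    rev = 0
    for _ in range(num_digits):
        i, d = divmod(i, radix)
        rev = rev * radix + d
    return rev

# A sender stepping its per-packet counter through digit-reversed indices
# spreads consecutive packets maximally far apart across the radix^num_digits
# candidate paths, instead of cycling through adjacent paths in order.
paths = [digit_reverse(i, radix=2, num_digits=3) for i in range(8)]
print(paths)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Because consecutive packets of a flow take widely separated paths, queues stay small and bounded, which is the interleaving property the analysis above relies on.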
We consider the distinct elements problem, where the goal is to estimate the number of distinct colors in an urn containing k balls based on n samples drawn with replacement. Based on discrete polynomial approximation and interpolation, we propose an estimator with an additive error guarantee that achieves the optimal sample complexity within O(log log k) factors, and in fact within constant factors for most cases. The estimator can be computed in O(n) time. The result also applies to sampling without replacement provided the sample size is a vanishing fraction of the urn size. One of the key auxiliary results is a sharp bound on the minimum singular value of a real rectangular Vandermonde matrix, which might be of independent interest. Mathematics Subject Classification (2010): 62G05; 41A05, 41A10, 62C20, 62D05.
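The Vandermonde quantity in the auxiliary result can be probed numerically; a minimal sketch using NumPy's SVD (a numerical check, not the paper's analytic bound):

```python
import numpy as np

def min_singular_vandermonde(nodes, m):
    """Smallest singular value of the m x len(nodes) real rectangular
    Vandermonde matrix V[i, j] = nodes[j] ** i, for i = 0 .. m-1."""
    # np.vander with increasing=True has rows (x^0, ..., x^(m-1)); transpose
    # so that rows index the power and columns index the node.
    V = np.vander(np.asarray(nodes, dtype=float), N=m, increasing=True).T
    return np.linalg.svd(V, compute_uv=False)[-1]

# Distinct nodes with m >= number of nodes give a full-column-rank matrix,
# hence a strictly positive minimum singular value.
print(min_singular_vandermonde([0.0, 0.5, 1.0], m=3))
```

How fast this minimum singular value can decay as the nodes cluster and m grows is exactly what the paper's sharp bound controls.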