When applying aggregating strategies to Prediction with Expert Advice, the learning rate must be adaptively tuned. The natural choice of complexity/current loss renders the analysis of Weighted Majority derivatives quite complicated; in particular, no results have been proven so far for arbitrary weights. The analysis of the alternative "Follow the Perturbed Leader" (FPL) algorithm from [KV03], which is based on Hannan's algorithm, is easier. We derive loss bounds for an adaptive learning rate, both for finite expert classes with uniform weights and for countable expert classes with arbitrary weights. For the former setup our loss bounds match the best known results so far, while for the latter our results are, to our knowledge, new.
Keywords: Prediction with Expert Advice, Follow the Perturbed Leader, general weights, adaptive learning rate, hierarchy of experts, expected and high probability bounds, general alphabet and loss, online sequential prediction.
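As a rough illustration of the algorithm discussed in this abstract, the following is a minimal sketch of FPL with an adaptive learning rate, here for n experts with uniform weights and the illustrative schedule eta_t = sqrt(ln n / t); the loss matrix and all names below are hypothetical, and the sketch is not the paper's exact algorithm or tuning.

```python
import numpy as np

def fpl_predict(cumulative_loss, eta, rng):
    """Follow the Perturbed Leader: pick the expert whose cumulative loss,
    perturbed by independent exponential noise scaled by 1/eta, is smallest."""
    n = len(cumulative_loss)
    perturbation = rng.exponential(scale=1.0, size=n) / eta
    return int(np.argmin(cumulative_loss - perturbation))

def run_fpl(loss_matrix, seed=0):
    """Online loop over a T x n matrix of per-step expert losses in [0, 1],
    using the adaptive schedule eta_t = sqrt(ln n / t) (an illustrative choice)."""
    rng = np.random.default_rng(seed)
    T, n = loss_matrix.shape
    cumulative_loss = np.zeros(n)
    total = 0.0
    for t in range(1, T + 1):
        eta_t = np.sqrt(np.log(n) / t)         # adaptive learning rate
        i = fpl_predict(cumulative_loss, eta_t, rng)
        total += loss_matrix[t - 1, i]          # suffer the chosen expert's loss
        cumulative_loss += loss_matrix[t - 1]   # then observe all experts' losses
    return total

# Hypothetical example: 5 experts, 1000 steps of random losses.
losses = np.random.default_rng(1).uniform(size=(1000, 5))
print(run_fpl(losses), losses.sum(axis=0).min())  # learner's loss vs best expert
```

Subtracting exponentially distributed perturbations from the cumulative losses before taking the minimum is what distinguishes FPL from plain Follow the Leader.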
For a strongly continuous semigroup $(T(t))_{t \ge 0}$ with generator $A$ we introduce its critical spectrum $\sigma_{\mathrm{crit}}(T(t))$. This yields in an optimal way the spectral mapping theorem $\sigma(T(t)) = e^{t\sigma(A)} \cup \sigma_{\mathrm{crit}}(T(t))$ and improves classical stability results.
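For context (our gloss, not part of the abstract): for a general $C_0$-semigroup only the spectral inclusion below is guaranteed, which is why the spectral bound of $A$ need not control the growth of $T(t)$; the critical spectrum is introduced to turn the inclusion into an identity.

```latex
% Classical spectral inclusion, valid for every C_0-semigroup:
%   e^{t\sigma(A)} \subseteq \sigma(T(t)),  t \ge 0,
% but equality can fail in general. Adding the critical spectrum gives the identity
\[
  \sigma(T(t)) \;=\; e^{t\sigma(A)} \,\cup\, \sigma_{\mathrm{crit}}(T(t)),
  \qquad t \ge 0 .
\]
```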
We study the properties of the Minimum Description Length (MDL) principle for sequence prediction, considering a two-part MDL estimator which is chosen from a countable class of models. This applies in particular to the important case of universal sequence prediction, where the model class corresponds to all algorithms for some fixed universal Turing machine (the correspondence is via enumerable semimeasures, hence the resulting models are stochastic). We prove convergence theorems similar to Solomonoff's theorem of universal induction, which also holds for general Bayes mixtures. The bound characterizing the convergence speed for MDL predictions is exponentially larger than the corresponding bound for Bayes mixtures. We observe that there are at least three different ways of using MDL for prediction. One of these has worse prediction properties: its predictions converge only if the MDL estimator stabilizes, and we establish sufficient conditions for this to occur. Finally, we prove some immediate consequences for complexity relations and randomness criteria.
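For reference, a schematic statement of the two-part MDL estimator and the induced predictor, in our own notation (with $w_\nu$ the prior weight of model $\nu$ and $x_{1:t}$ the observed sequence); this is the standard formulation of two-part MDL, not a quotation from the paper.

```latex
% Two-part MDL: minimize the code length of the model plus the code length of
% the data given the model, then predict with the selected model's conditional.
\[
  \hat\nu_t \;=\; \arg\min_{\nu \in \mathcal{M}}
    \bigl[\, -\log w_\nu \;-\; \log \nu(x_{1:t}) \,\bigr],
  \qquad
  \hat\nu_t(x_{t+1} \mid x_{1:t}) \;=\; \frac{\hat\nu_t(x_{1:t} x_{t+1})}{\hat\nu_t(x_{1:t})} .
\]
```

Roughly, "exponentially larger" refers to bounds of order $w_\mu^{-1}$ for MDL in place of $\ln w_\mu^{-1}$ for the Bayes mixture, where $\mu$ is the true model.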
Clustering algorithms based on a matrix of pairwise similarities (a kernel matrix) are widely known and used, a particularly popular class being spectral clustering algorithms. In contrast, algorithms working directly with the pairwise distance matrix have rarely been studied for clustering. This is surprising, since in many applications distances are directly given, and computing similarities from them involves an additional step that, although computationally cheap, is error-prone because the kernel has to be chosen appropriately. This paper proposes a clustering algorithm based on the SDP relaxation of the max-k-cut of the graph of pairwise distances, following the work of Frieze and Jerrum. We compare the algorithm with Yu and Shi's algorithm, which is based on a spectral relaxation of the normalized k-cut. Moreover, we propose a simple heuristic for dealing with missing data, i.e., the case where some of the pairwise distances or similarities are not known. We evaluate the algorithms on the task of clustering natural language terms with the Google distance, a semantic distance recently introduced by Cilibrasi and Vitányi that uses relative frequency counts from WWW queries and is based on the theory of Kolmogorov complexity.
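Since the evaluation uses the Google distance, here is a minimal sketch of the normalized Google distance (NGD) of Cilibrasi and Vitányi computed from raw page counts; the counts and the index size below are made-up inputs, and the resulting pairwise distance matrix is what the clustering algorithms above would consume.

```python
from math import log

def ngd(f_x, f_y, f_xy, n_pages):
    """Normalized Google Distance from raw page counts:
    f_x, f_y  -- number of pages containing term x resp. y
    f_xy      -- number of pages containing both terms
    n_pages   -- (rough) total number of indexed pages."""
    lx, ly, lxy = log(f_x), log(f_y), log(f_xy)
    return (max(lx, ly) - lxy) / (log(n_pages) - min(lx, ly))

# Hypothetical counts for a related and a less related pair of terms.
print(ngd(f_x=120_000, f_y=90_000, f_xy=40_000, n_pages=10**10))  # smaller: more related
print(ngd(f_x=120_000, f_y=90_000, f_xy=500,    n_pages=10**10))  # larger: less related
```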
Minimum description length (MDL) is an important principle for induction and prediction, with strong relations to optimal Bayesian learning. This paper deals with learning independent and identically distributed (i.i.d.) processes by means of two-part MDL, where the underlying model class is countable. We consider the online learning framework, i.e., observations come in one by one, and the predictor is allowed to update its state of mind after each time step. We identify two ways of predicting by MDL for this setup, namely a static and a dynamic one. (A third variant, hybrid MDL, will turn out to be inferior.) We prove that, under the sole assumption that the data is generated by a distribution contained in the model class, the MDL predictions converge to the true values almost surely. This is accomplished by proving finite bounds on the quadratic, the Hellinger, and the Kullback-Leibler loss of the MDL learner, which are, however, exponentially worse than for Bayesian prediction. We demonstrate that these bounds are sharp, even for model classes containing only Bernoulli distributions. We show how these bounds imply regret bounds for arbitrary loss functions. Our results apply to a wide range of setups, namely sequence prediction, pattern classification, regression, and universal induction in the sense of algorithmic information theory, among others.
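To make the two-part selection concrete, the following is a minimal sketch of online prediction by two-part MDL over a (truncated) countable class of Bernoulli models, in the spirit of the static variant; the parameter grid, the prior weights, and the truncation are illustrative choices of ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Truncation of a countable Bernoulli model class: parameters k/denom with prior
# weights w_k proportional to 1/k^2 (an arbitrary summable weighting).
denom = 100
params = np.array([k / denom for k in range(1, denom)])
weights = 1.0 / (np.arange(1, denom) ** 2)
weights = weights / weights.sum()

true_p = 0.3
data = rng.random(2000) < true_p          # i.i.d. Bernoulli(true_p) observations

log_w = np.log(weights)
log_lik = np.zeros_like(params)           # log nu(x_{1:t}) for each model nu
selected = []
for x in data:
    log_lik += np.log(params) if x else np.log(1.0 - params)
    # Two-part MDL: minimize codelength  -log w_nu - log nu(x_{1:t})
    k = int(np.argmax(log_w + log_lik))
    selected.append(params[k])            # predict x_{t+1}=1 with prob params[k]

print("selected parameter after 10 / 100 / 2000 steps:",
      selected[9], selected[99], selected[-1])
```

At each step the selected parameter is the model minimizing the two-part code length on the data seen so far; on this class it typically settles near the true parameter as the sample grows.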