Given training sequences generated by two distinct but unknown distributions sharing a common alphabet, we study the problem of determining whether a third test sequence was generated according to the first or second distribution using only the training data. To better model sources such as natural language, for which the underlying distributions are difficult to learn, we allow the alphabet size to grow, and therefore the probability distributions to change, with the blocklength. Our primary focus is the situation in which the underlying probabilities are all of the same order. In this regime we give conditions on the alphabet growth rate and distributions guaranteeing the existence of universally consistent tests, i.e. tests having a probability of error tending to zero with the blocklength for any underlying distributions. We show that some commonly used statistical tests are universally consistent provided the alphabet size grows sub-linearly in the blocklength, but that these tests are inconsistent for linear growth rates. We then propose a classifier that is universally consistent with up-to quadratic alphabet growth, and show that no classifier is universally consistent when the alphabet grows quadratically or faster. If the tester is given the underlying distributions in place of the training data, we prove that consistent testing is possible regardless of the growth of the underlying alphabet. Our results are then used to illuminate the problem of classifying arbitrary (i.e. non-homogeneous) distributions on growing alphabets.
Given training sequences generated by two distinct but unknown distributions sharing a common alphabet, we seek a classifier that can correctly decide whether a third test sequence is generated by the first or second distribution using only the training data. To model 'limited learning' we allow the alphabet size to grow, and therefore the probability distributions to change, with the blocklength. We prove that a natural choice, namely a generalized likelihood ratio test, is universally consistent (has a probability of error tending to zero with the blocklength for all underlying distributions) when the alphabet size is sub-linear in the blocklength, but inconsistent for linear alphabet growth. For up-to quadratic alphabet growth, in a regime where all probabilities are of the same order, we prove the universal consistency of a new test and show that there are no such tests when the alphabet grows quadratically or faster.
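To make the setting concrete, the following is a minimal sketch of one common form of generalized likelihood ratio classifier for this problem. The abstract does not spell out the exact statistic, so the formulation here is an assumption: each hypothesis pools the test sequence with one training sequence, the unknown distributions are replaced by empirical (maximum likelihood) estimates, and the hypothesis with the larger maximized likelihood wins. The function names `max_log_likelihood` and `glrt_classify` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def max_log_likelihood(counts):
    """Log-likelihood of an i.i.d. sequence under its own empirical distribution.

    `counts` maps each symbol to its number of occurrences. The empirical
    distribution is the maximum likelihood estimate, so this is the
    maximized log-likelihood of the sequence.
    """
    n = sum(counts.values())
    return sum(c * math.log(c / n) for c in counts.values() if c > 0)


def glrt_classify(train1, train2, test):
    """Return 1 or 2 according to which training source better explains `test`.

    Assumed formulation (not taken from the abstract):
      hypothesis 1 pools `test` with `train1`,
      hypothesis 2 pools `test` with `train2`,
    and the larger maximized log-likelihood is chosen (ties go to 1).
    """
    c1, c2, ct = Counter(train1), Counter(train2), Counter(test)
    h1 = max_log_likelihood(c1 + ct) + max_log_likelihood(c2)
    h2 = max_log_likelihood(c1) + max_log_likelihood(c2 + ct)
    return 1 if h1 >= h2 else 2
```

For example, `glrt_classify("aaab" * 50, "abbb" * 50, "aaab" * 10)` assigns the test sequence to the first source. With a fixed two-symbol alphabet like this, the empirical estimates are reliable; the regime studied above is the harder one in which the alphabet grows with the blocklength and many symbols are seen rarely or never in training.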