Coreset (or core-set) in this paper is a small weighted subset Q of the input set P with respect to a given monotonic function f : ℝ → ℝ that provably approximates its fitting loss ∑_{p∈P} f(p · x) for any given x ∈ ℝ^d. Using Q we can obtain an approximation of x* that minimizes this loss by running existing optimization algorithms on Q. We provide: (I) a lower bound proving that there are sets with no coreset smaller than n = |P|; (II) a proof that a coreset of size near-logarithmic in n exists for any input P, under a natural assumption that holds, e.g., for logistic regression and the sigmoid activation function; (III) a generic algorithm that computes Q in O(nd + n log n) expected time; (IV) extensive experimental results with open code and benchmarks showing that the coresets are even smaller in practice. Existing papers (e.g. [18]) suggested only specific coresets for specific input sets.
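To make the notion concrete, the following is a minimal sketch of how a weighted subset Q approximates the loss ∑_{p∈P} f(p · x), here with the logistic-loss choice f(t) = log(1 + e^t). It uses plain uniform sampling with inverse-probability weights, not the paper's near-logarithmic construction; the function names `total_loss` and `uniform_coreset` are illustrative, not from the paper.

```python
import numpy as np

def total_loss(P, x, weights=None):
    # f(t) = log(1 + e^t), applied to every inner product p . x
    vals = np.logaddexp(0.0, P @ x)
    return vals.sum() if weights is None else weights @ vals

def uniform_coreset(P, m, rng):
    # Sample m points uniformly; weight n/m makes the weighted
    # loss an unbiased estimator of the full loss.
    idx = rng.choice(len(P), size=m, replace=False)
    return P[idx], np.full(m, len(P) / m)

rng = np.random.default_rng(0)
P = rng.normal(size=(10_000, 3))      # n = 10,000 points in R^3
x = np.array([0.5, -1.0, 0.25])       # an arbitrary query

Q, w = uniform_coreset(P, 500, rng)   # |Q| = 500 << n
rel_err = abs(total_loss(Q, x, w) - total_loss(P, x)) / total_loss(P, x)
```

On well-spread synthetic data the relative error of a 5% uniform sample is typically a few percent, but uniform sampling has no worst-case guarantee: a single outlier that dominates the loss is missed with high probability, which is exactly why sensitivity-based coresets are needed for provable bounds.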
Motivation

Traditional algorithms in computer science and machine learning are usually tailored to handle only an offline, finite data set that is stored in memory. However, many modern systems do not use this computation model. For example, GPS data from millions of smartphones, high-definition images, YouTube videos, Twitter's text tweets, or audio signals from music or speech recognition arrive in a streaming fashion. The era of the Internet of Things (IoT) provides us with wearable devices and mini-computers that collect data sets gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), genome sequencing, cameras, microphones, radio-frequency identification chips, finance (e.g., stock) logs, internet search, and wireless sensor networks [17,22,11].

Limited memory. In such systems the input is an infinite stream that may grow in practice to petabytes of data and cannot be stored in memory. The data may arrive in real time rather than being read from a hard drive, so only one pass over the data and small memory are allowed.

Parallel computations. Even if we have streaming algorithms to maintain and learn Big Data in memory from millions of users, it is not reasonable to apply them on a laptop, and a large set of computation machines is used instead. However, using, for example, GPUs that run thousands of threads in parallel requires us to design parallel versions of our algorithms, which may be very hard to design and debug.
Distributed computations. If the data set is distributed among many machines, e.g. over a network or in the "cloud", there is the additional problem of non-shared memory, which may be replaced only by expensive and slow communication between the machines.
Limited computation power. Modern computation devices such as GPUs pose additional challenges since, in order to run efficiently in parallel, unlike CPUs, only a limited set of simple commands and algorithms may be used. However, unlike modern GPU cards that are plugged into expensive and powerful servers, on