Abstract: Modern large-scale data analysis has an especially strong need to prioritize samples for manual inspection. For example, the inspection could target important mislabeled samples or key vulnerabilities exploitable by an adversarial attack. To solve the "needle in the haystack" problem of which samples to inspect, we develop a new scalable version of Cook's distance, a classical statistical technique for identifying samples that have an unusually strong impact on the fit of a regression model (and on its downstream predictions). To scale this technique up to very large and high-dimensional datasets, we introduce a new algorithm which we call "influence sketching." Influence sketching embeds random projections within the influence computation; in particular, the influence score is calculated using the randomly projected pseudo-dataset from the post-convergence Generalized Linear Model (GLM). We validate that influence sketching can reliably discover influential samples by applying the technique to a malware detection dataset of over 2 million executable files, each represented with almost 100,000 features. For example, we find that randomly deleting approximately 10% of training samples reduces predictive accuracy only slightly, from 99.47% to 99.45%, whereas deleting the same number of samples with high influence sketch scores reduces predictive accuracy all the way down to 90.24%. Moreover, we find that influential samples are especially likely to be mislabeled. In the case study, we manually inspect the most influential samples and find that influence sketching pointed us to new, previously unidentified pieces of malware.
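The idea of computing Cook's distance on a randomly projected pseudo-dataset can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a logistic-regression GLM whose coefficients `beta` have already been fitted, a dense Gaussian projection, and division of the score by the sketch size `k`. The function name `influence_sketch` and all parameter names are hypothetical.

```python
import numpy as np

def influence_sketch(X, y, beta, k, seed=0):
    """Approximate Cook's distance per sample for a fitted logistic regression.

    Assumptions (not from the source): logistic link, Gaussian random
    projection, and normalization by the sketch size k.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape

    # Fitted probabilities and GLM variance weights at convergence.
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    v = np.clip(mu * (1.0 - mu), 1e-12, None)

    # Random projection from p features down to k dimensions.
    omega = rng.standard_normal((p, k)) / np.sqrt(k)

    # Projected, variance-reweighted pseudo-dataset (n x k).
    Z = np.sqrt(v)[:, None] * (X @ omega)

    # Leverage h_i = z_i^T (Z^T Z)^+ z_i, computed without forming an n x n matrix.
    G_inv = np.linalg.pinv(Z.T @ Z)
    h = np.einsum("ij,jk,ik->i", Z, G_inv, Z)
    h = np.clip(h, 0.0, 1.0 - 1e-12)

    # Pearson residuals of the fitted model.
    r = (y - mu) / np.sqrt(v)

    # Cook's-distance-style score: large residual combined with high leverage.
    return (r ** 2 / k) * h / (1.0 - h) ** 2
```

Samples can then be ranked for manual inspection with `np.argsort(-scores)`. Only the k x k matrix `Z.T @ Z` is inverted, which is what keeps the computation tractable when the original feature count is very large.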
We discuss experimental and numerical studies of the effects of Lagrangian chaos (chaotic advection) on the stretching of a drop of an immiscible impurity in a flow. We argue that the standard capillary number used to describe this process is inadequate, since it does not account for advection of a drop between regions of the flow with varying velocity gradient. Consequently, we propose a Lagrangian-generalized capillary number C_L based on finite-time Lyapunov exponents. We present preliminary tests of this formalism for the stretching of a single drop of oil in an oscillating vortex flow, which has been shown previously to exhibit Lagrangian chaos. Probability distribution functions (PDFs) of the stretching of this drop have features that are similar to PDFs of C_L. We also discuss ongoing experiments on drop stretching in a blinking vortex flow.