1986
DOI: 10.1214/ss/1177013622
Influential Observations, High Leverage Points, and Outliers in Linear Regression

Cited by 655 publications (420 citation statements)
References 25 publications
“…3 are, up to scaling, equal to the diagonal elements of the so-called "hat matrix," i.e., the projection matrix onto the span of the top k right singular vectors of A (19,20). As such, they have a natural statistical interpretation as a "leverage score" or "influence score" associated with each of the data points (19)(20)(21). In particular, π j quantifies the amount of leverage or influence exerted by the jth column of A on its optimal low-rank approximation.…”
Section: Statistical Leverage and Improved Matrix Decompositions
Citation type: mentioning; confidence: 99%
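The leverage-score computation this passage describes can be sketched as follows. The matrix `A` and the target rank `k` are illustrative assumptions, not values from the cited works:

```python
# Sketch: statistical leverage scores of the columns of A, computed from the
# top-k right singular vectors as in the quoted passage. A and k are invented
# for illustration.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))   # 50 x 8 data matrix (illustrative)
k = 3                              # target rank (illustrative)

# SVD: A = U @ diag(s) @ Vt; the rows of Vt are the right singular vectors.
_, _, Vt = np.linalg.svd(A, full_matrices=False)
Vk = Vt[:k, :]                     # top-k right singular vectors, shape (k, n)

# The leverage score of column j is the squared Euclidean norm of the j-th
# column of Vk: the j-th diagonal entry of the "hat" matrix Vk.T @ Vk.
leverage = np.sum(Vk**2, axis=0)

# Scores sum to k, and each lies in [0, 1].
print(leverage)
```

Dividing `leverage` by `k` gives the scaled scores π_j that sum to one, matching the "up to scaling" remark in the passage.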
“…Concepts used in regression diagnostics such as influential observations, leverage points and hat matrices are solely based on the design matrix A (see e.g. Belsley et al. 1980; Chatterjee and Hadi 1986; Barnett and Lewis 1994). Fortunately, the need of practical application of these concepts has led geodesists to incorporate the weight matrix into the measures related to these concepts (see e.g.…”
Section: The Effect of Weights of Observations on Robustness
Citation type: mentioning; confidence: 99%
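One standard way a weight matrix W of the observations enters these design-matrix diagnostics is through the weighted hat matrix H = A(AᵀWA)⁻¹AᵀW, whose diagonal gives the leverage values. The sketch below assumes an invented design matrix and weights; it is not taken from the cited geodesy literature:

```python
# Sketch: leverage values from the weighted hat matrix
# H = A (A^T W A)^{-1} A^T W. Design matrix A and weights w are illustrative.
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 4))          # design matrix (20 obs, 4 params)
w = rng.uniform(0.5, 2.0, 20)             # positive observation weights
W = np.diag(w)

# Solve (A^T W A) X = A^T W for X, then H = A @ X.
H = A @ np.linalg.solve(A.T @ W @ A, A.T @ W)
leverage = np.diag(H)

# As in the unweighted case, trace(H) equals the number of parameters.
print(leverage)
```

Setting `W` to the identity recovers the ordinary hat matrix based on the design matrix alone.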
“…Identification and removal of contaminated data have been attempted and realized in two different ways, either (1) by first cleaning the data and then applying the classical least squares criterion to the remaining data (see e.g. Anscombe 1960; Baarda 1968; Pope 1976; Belsley et al. 1980; Chatterjee and Hadi 1986; Barnett and Lewis 1994); or (2) by designing robust estimation criteria and applying them directly to contaminated data (see e.g. Huber 1981; Hampel et al. 1986; Jurecková and Sen 1996; Koch 1999).…”
Section: Introduction
Citation type: mentioning; confidence: 99%
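The two strategies contrasted in this passage can be illustrated side by side. The data, the residual-based cleaning rule, and the Huber tuning constant below are illustrative assumptions, not procedures from the cited works:

```python
# Sketch: (1) clean-then-least-squares vs. (2) robust (Huber-type) estimation
# applied directly to contaminated data. All data and thresholds are invented.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
y = 2.0 * x + 1.0 + rng.normal(0, 0.2, x.size)
y[::10] += 15.0                              # inject gross outliers
X = np.column_stack([np.ones_like(x), x])    # intercept + slope design

# Strategy 1: flag large residuals from an initial fit (MAD-based rule),
# then ordinary least squares on the remaining data.
beta0, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta0
keep = np.abs(resid) < 2.5 * np.median(np.abs(resid)) / 0.6745
beta_clean, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)

# Strategy 2: Huber M-estimation via iteratively reweighted least squares.
def huber_fit(X, y, c=1.345, iters=50):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    for _ in range(iters):
        r = y - X @ beta
        s = np.median(np.abs(r)) / 0.6745 or 1.0       # robust scale
        w = np.minimum(1.0, c * s / np.maximum(np.abs(r), 1e-12))
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta

beta_huber = huber_fit(X, y)
print(beta_clean, beta_huber)   # both should be near [1.0, 2.0]
```

Both routes recover roughly the true line; the robust fit avoids the hard accept/reject decision by smoothly downweighting large residuals instead.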
“…Feature extraction and dimension reduction can be combined in one step by using a "multiple linear regression" model, which performs a mapping of the multi-regime data to a lower-dimensional space in such a way that the variance of the measurements in the low-dimensional representation is maximized. Multiple linear regression calculates the relationship between different explanatory variables and a target variable by fitting a linear equation to observed data [15, 16]. This model is based on:…”
Section: Signal Processing and Dimensionality Reduction
Citation type: mentioning; confidence: 99%
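A minimal sketch of the multiple linear regression fit this passage describes, i.e. fitting a linear equation relating several explanatory variables to one target by least squares. The data and coefficients are illustrative, not from the cited application:

```python
# Sketch: multiple linear regression via least squares. Data are synthetic.
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
X = rng.standard_normal((n, p))                   # explanatory variables
true_beta = np.array([1.5, -2.0, 0.5])            # illustrative coefficients
y = 4.0 + X @ true_beta + rng.normal(0, 0.1, n)   # target variable

Xd = np.column_stack([np.ones(n), X])             # prepend intercept column
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)     # least-squares fit

y_hat = Xd @ beta                                 # fitted values
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(beta, r2)
```

The fitted `beta` recovers the intercept and coefficients closely, and `r2` measures how much of the target's variance the linear equation explains.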