Model Checking in Large-Scale Dataset via Structure-Adaptive-Sampling

Han, Yi; Ma, Pibo; Ren, Haojie; Wang, Zhaojun

doi:10.5705/ss.202020.0303

Cited by 3 publications

(1 citation statement)

References 40 publications


“…For example, the statistical leveraging framework (Drineas et al, 2012; Ma et al, 2015b, 2022; Li and Meng, 2020) has achieved great success in large-scale ordinary least squares regression. More recently, optimal subsampling procedures have been also established for various statistical models, including logistic regression (Wang et al, 2018), generalized linear models (Ai et al, 2018; Yu et al, 2022), quantile regression (Wang and Ma, 2021), nonparametric regression (Ma et al, 2015a; Meng et al, 2020, 2021), and designed for testing problems (Ren et al, 2022; Han et al, 2023). However, none of the existing can be directly applied to SVM due to its distinguishing geometric feature.…”

Section: Introduction (mentioning)

confidence: 99%

Leverage Classifier: Another Look at Support Vector Machine

Han, Yu, Zhang et al. 2025

STAT SINICA


Support vector machine (SVM) is a popular classifier known for accuracy, flexibility, and robustness. However, its intensive computation has hindered its application to large-scale datasets. In this paper, we propose a new optimal leverage classifier based on linear SVM under a nonseparable setting. Our classifier aims to select an informative subset of the training sample to reduce data size, enabling efficient computation while maintaining high accuracy. We take a novel view of SVM under the general subsampling framework and rigorously investigate the statistical properties. We propose a two-step subsampling procedure consisting of a pilot estimation of the optimal subsampling probabilities and a subsampling step to construct the classifier. We develop a new Bahadur representation of the SVM coefficients and derive unconditional asymptotic distribution and optimal subsampling probabilities without giving the full sample. Numerical results demonstrate that our classifiers outperform the existing methods in terms of estimation, computation, and prediction.
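The two-step procedure described in the abstract (a pilot estimate, then a probability-weighted subsample used to fit the classifier) can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not state the optimal subsampling probabilities, so the score used here (larger for points that violate the pilot margin, mixed with a uniform component) is a placeholder assumption, and the names `fit_linear_svm`, `two_step_subsample_svm`, `r0`, `r`, and `mix` are hypothetical. The solver is plain subgradient descent on a weighted hinge loss, chosen only to keep the example dependency-free.

```python
import math
import random


def fit_linear_svm(X, y, w, lam=0.01, epochs=200, lr=0.1):
    """Minimise a weighted, L2-penalised hinge loss by subgradient descent."""
    d = len(X[0])
    beta, b = [0.0] * d, 0.0
    total = sum(w)
    for t in range(1, epochs + 1):
        g, gb = [lam * bj for bj in beta], 0.0
        for xi, yi, wi in zip(X, y, w):
            margin = yi * (sum(bj * xj for bj, xj in zip(beta, xi)) + b)
            if margin < 1:  # hinge subgradient is active only inside the margin
                for j in range(d):
                    g[j] -= wi * yi * xi[j] / total
                gb -= wi * yi / total
        step = lr / math.sqrt(t)  # decaying step size
        beta = [bj - step * gj for bj, gj in zip(beta, g)]
        b -= step * gb
    return beta, b


def two_step_subsample_svm(X, y, r0, r, seed=0, mix=0.2):
    """Pilot fit on a uniform subsample, then a weighted fit on a second subsample."""
    rng = random.Random(seed)
    n = len(X)

    # Step 1: uniform pilot subsample of size r0 and an unweighted pilot fit.
    pilot = [rng.randrange(n) for _ in range(r0)]
    beta0, b0 = fit_linear_svm(
        [X[i] for i in pilot], [y[i] for i in pilot], [1.0] * r0
    )

    # Placeholder scores (an assumption, NOT the paper's optimal probabilities):
    # points violating the pilot margin get a score equal to the norm of (1, x).
    scores = []
    for xi, yi in zip(X, y):
        margin = yi * (sum(bj * xj for bj, xj in zip(beta0, xi)) + b0)
        norm = math.sqrt(1.0 + sum(xj * xj for xj in xi))
        scores.append(norm if margin < 1 else 0.0)
    s = sum(scores)
    # Mix with a uniform component so every probability is strictly positive.
    probs = [(1 - mix) * (sc / s if s > 0 else 1.0 / n) + mix / n for sc in scores]

    # Step 2: sample r points with replacement and fit with
    # inverse-probability weights 1 / (n * p_i).
    idx = rng.choices(range(n), weights=probs, k=r)
    weights = [1.0 / (n * probs[i]) for i in idx]
    beta, b = fit_linear_svm([X[i] for i in idx], [y[i] for i in idx], weights)
    return beta, b, probs
```

Calling `two_step_subsample_svm(X, y, r0=50, r=100)` on a list of feature vectors `X` and labels `y` in {-1, +1} returns the fitted coefficients, the intercept, and the sampling probabilities. The inverse-probability weights in the second step are the standard correction for non-uniform sampling; any resemblance to the estimator analysed in the paper should not be assumed.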
