Abstract: Propensity score (PS) methods are increasingly popular for adjusting for confounding in observational studies. PS methods have theoretical advantages over traditional covariate adjustment, but their relative performance in real-world scenarios is poorly characterized. We used datasets from four large-scale cardiovascular observational studies (PROMETHEUS, ADAPT-DES, THIN, and CHARM) to compare the performance of traditional covariate adjustment and four commonly used PS methods: matching, stratification, inverse probability weighting, and use of the propensity score as a covariate. We found that stratification performed poorly when there were few outcome events, and that inverse probability weighting gave imprecise estimates of treatment effect and allowed a small number of observations to exert undue influence when substantial confounding was present. Covariate adjustment and matching performed well in all of our examples, although matching gave less precise estimates in some cases. PS methods are not necessarily superior to traditional covariate adjustment, and care should be taken to select the most suitable method.