Data-Driven Metric Development for Online Controlled Experiments

Deng, Alex; Shi, Xiaolin

doi:10.1145/2939672.2939700

Cited by 66 publications

(33 citation statements)

References 28 publications


“…While offline metrics are especially valuable when evaluating a system in prior to its deployment [13,34], online metrics have been widely adopted for modern search engines because such metrics are calculated based on the interactions between practical users and systems. Inspired by previous research on metrics meta-evaluation [9,11,15,19], we compare the evaluation performance of some most widely-used online metrics, including:…”

Section: Comparison Across O Line Metrics

mentioning

confidence: 99%


Meta-evaluation of Online and Offline Web Search Evaluation Metrics

Chen, Zhou, Liu, et al. 2017

Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval


As in most information retrieval (IR) studies, evaluation plays an essential part in Web search research. Both offline and online evaluation metrics are adopted in measuring the performance of search engines. Offline metrics are usually based on relevance judgments of query-document pairs from assessors while online metrics exploit the user behavior data, such as clicks, collected from search engines to compare search algorithms. Although both types of IR evaluation metrics have achieved success, to what extent can they predict user satisfaction still remains under-investigated. To shed light on this research question, we meta-evaluate a series of existing online and offline metrics to study how well they infer actual search user satisfaction in different search scenarios. We find that both types of evaluation metrics significantly correlate with user satisfaction while they reflect satisfaction from different perspectives for different search tasks. Offline metrics better align with user satisfaction in homogeneous search (i.e. ten blue links) whereas online metrics outperform when vertical results are federated. Finally, we also propose to incorporate mouse hover information into existing online evaluation metrics, and empirically show that they better align with search user satisfaction than click-based online metrics.



“…Recent studies show that assessors' judgments may significantly differ from users' assessments [31]. The second problem is that the evaluation results based on offline metrics can be biased because they are usually generated with a small and incomplete dataset [13].…”

mentioning

confidence: 99%

Meta-evaluation of Online and Offline Web Search Evaluation Metrics

Chen, Zhou, Liu, et al. 2017

Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval


“…Measuring change is a common theme in applied data science. In online A/B testing [15,17,36,37,44], we estimate the average treatment effect (ATE) by the difference of the same metric measured from treatment and control groups, respectively. In time series analyses and longitudinal studies, we often track a metric over time and monitor changes between different time points.…”

Section: Inferring Percent Changes 21 Percent Change and Fieller Int

mentioning

confidence: 99%

Applying the Delta Method in Metric Analytics

Deng, Knoblich 2018

Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Self Cite


During the last decade, the information technology industry has adopted a data-driven culture, relying on online metrics to measure and monitor business performance. Under the setting of big data, the majority of such metrics approximately follow normal distributions, opening up potential opportunities to model them directly without extra model assumptions and solve big data problems via closed-form formulas using distributed algorithms at a fraction of the cost of simulation-based procedures like bootstrap. However, certain attributes of the metrics, such as their corresponding data generating processes and aggregation levels, pose numerous challenges for constructing trustworthy estimation and inference procedures. Motivated by four real-life examples in metric development and analytics for large-scale A/B testing, we provide a practical guide to applying the Delta method, one of the most important tools from the classic statistics literature, to address the aforementioned challenges. We emphasize the central role of the Delta method in metric analytics by highlighting both its classic and novel applications.
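As a rough illustration of the kind of closed-form calculation this abstract alludes to, the sketch below applies a first-order Delta method approximation to the percent change between a treatment mean and a control mean. It is not taken from the cited paper: the function name `percent_change_ci`, the assumption that the two groups are independent samples of i.i.d. observations, and the normal-approximation multiplier `z=1.96` are all assumptions made here for illustration.

```python
import math
from statistics import mean, variance


def percent_change_ci(treatment, control, z=1.96):
    """Approximate CI for (mean_t / mean_c - 1) via a first-order Delta method.

    Assumes (for this sketch only) that the two groups are independent and
    that observations within each group are i.i.d.
    """
    n_t, n_c = len(treatment), len(control)
    m_t, m_c = mean(treatment), mean(control)

    # Variance of each sample mean (sample variance divided by sample size).
    var_mt = variance(treatment) / n_t
    var_mc = variance(control) / n_c

    # Delta method for the ratio m_t / m_c with zero covariance between groups:
    # Var(ratio) ~= Var(m_t) / m_c^2 + m_t^2 * Var(m_c) / m_c^4
    var_ratio = var_mt / m_c**2 + (m_t**2 / m_c**4) * var_mc

    pct = m_t / m_c - 1.0
    half_width = z * math.sqrt(var_ratio)
    return pct, (pct - half_width, pct + half_width)


if __name__ == "__main__":
    pct, (lo, hi) = percent_change_ci([2, 4, 6], [1, 2, 3])
    print(f"percent change = {pct:.2%}, approx. 95% CI = ({lo:.2%}, {hi:.2%})")
```

The approximation is only as good as its assumptions: it can be poor when the control mean is close to zero, and clustered or otherwise dependent observations would need a different variance estimate for the two sample means.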


“…Several previous papers discuss the adjustment of metric variance in A/B testing. The delta method [16] and Bootstrap method [17] are two variance estimation approaches that can be applied to correct the variance without the assumption of independence. These two methods work well in theory, however, they require storing raw data (e.g.…”

Section: B Variance Estimation

mentioning

confidence: 99%

Safely and Quickly Deploying New Features with a Staged Rollout Framework Using Sequential Test and Adaptive Experimental Design

Zhao, Liu, Deb 2018

2018 3rd International Conference on Computational Intelligence and Applications (ICCIA)


During the rapid development cycle for Internet products (websites and mobile apps), new features are developed and rolled out to users constantly. Features with code defects or design flaws can cause outages and significant degradation of user experience. The traditional method of code review and change management can be time-consuming and error-prone. In order to make the feature rollout process safe and fast, this paper proposes a methodology for rolling out features in an automated way using an adaptive experimental design. Under this framework, a feature is gradually ramped up from a small proportion of users to a larger population based on real-time evaluation of the performance of important metrics. If any regression is detected during a ramp-up step, the ramp-up process stops and the feature developer is alerted. There are two main algorithm components powering this framework: 1) a continuous monitoring algorithm, which uses a variant of the sequential probability ratio test (SPRT) to monitor the feature performance metrics and alert feature developers when a metric degradation is detected; 2) an automated ramp-up algorithm, which decides when and how to ramp up to the next stage with larger sample size. This paper presents one monitoring algorithm and three ramping-up algorithms including time-based, power-based, and risk-based (a Bayesian approach) schedules. These algorithms are evaluated and compared on both simulated data and real data. There are three benefits provided by this framework for feature rollout: 1) for defective features, it can detect the regression early and reduce negative effect, 2) for healthy features, it rolls out the feature quickly, 3) it reduces the need for manual intervention via the automation of the feature rollout process.
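For readers unfamiliar with the sequential probability ratio test mentioned in this abstract, here is a minimal sketch of the classic Wald SPRT for a Bernoulli metric (for example, a per-request error indicator). The abstract only says the paper uses "a variant" of SPRT, so this is the textbook form and not the paper's algorithm; the function name `sprt_bernoulli`, the hypothesized rates `p0` and `p1`, and the default error levels are illustrative assumptions.

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.2):
    """Classic Wald SPRT for H0: p = p0 versus H1: p = p1 on 0/1 observations.

    Returns (decision, n_used), where decision is "reject_h0", "accept_h0",
    or "continue" if neither boundary was crossed by the end of the data.
    """
    # Wald's approximate decision boundaries on the log-likelihood ratio.
    upper = math.log((1 - beta) / alpha)
    lower = math.log(beta / (1 - alpha))

    llr = 0.0
    n_used = 0
    for x in observations:
        n_used += 1
        # Add this observation's log-likelihood ratio contribution.
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject_h0", n_used
        if llr <= lower:
            return "accept_h0", n_used
    return "continue", n_used


if __name__ == "__main__":
    # Hypothetical setup: baseline error rate 1%, degraded error rate 5%.
    print(sprt_bernoulli([1, 1, 0, 0], p0=0.01, p1=0.05))
    print(sprt_bernoulli([0] * 100, p0=0.01, p1=0.05))
```

In a staged rollout, a "reject_h0" decision would correspond to stopping the ramp-up and alerting the developer, while "continue" means more data is needed before deciding either way.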

