Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2012
DOI: 10.1145/2339530.2339653
Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained

Cited by 184 publications (71 citation statements). References 5 publications.
“…Data that is excellent but incomplete may provide insights under one marketing channel but would fail to inform the retailer of the total effect of a marketing action. Finally, data that is big data but does not contain exogenous sources of variation can be misleading to the retailer and suggests why experimental methods (A/B tests, e.g., Kohavi et al 2012) and/or instrumental variables methods (Conley et al 2008) have become popular tools to "learn from data". Next, we describe more relevant data.…”
Section: Big Data Versus Better Data and "Better" Models
confidence: 99%
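The "A/B tests" referenced above compare a metric between randomly assigned control and treatment groups. A minimal sketch of such a comparison, assuming per-user binary outcomes (all data and names here are hypothetical illustrations, not from the cited works):

```python
# Minimal A/B test sketch: two-sample comparison of a per-user metric.
# Synthetic data; in practice each user is randomly assigned to a variant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.binomial(1, 0.10, size=10_000)    # e.g., conversion flag per user
treatment = rng.binomial(1, 0.11, size=10_000)

# Difference in means plus a Welch two-sample t-test on the per-user outcomes.
lift = treatment.mean() - control.mean()
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"lift = {lift:.4f}, t = {t_stat:.2f}, p = {p_value:.3f}")
```

Because assignment is randomized, the observed difference carries the exogenous variation the excerpt contrasts with purely observational "big data".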
“…For example, "Profit" is not a good OEC, as short-term theatrics (e.g., raising prices) can increase short-term profit, but hurt it in the long run. As we showed in Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained [25], market share can be a long-term goal, but it is a terrible short-term criterion: making a search engine worse forces people to issue more queries to find an answer, but, like hiking prices, users will find better alternatives long-term. Sessions per user, or repeat visits, is a much better factor in the OEC, and one that we use at Bing.…”
Section: Tenet
confidence: 96%
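A sessions-per-user OEC of the kind described above can be computed directly from a session log. A minimal sketch, assuming a log with variant, user, and session identifiers (the schema and data are hypothetical):

```python
# Sketch: sessions-per-user as an OEC, computed from a session log.
# Column names and values are illustrative, not a prescribed schema.
import pandas as pd

log = pd.DataFrame({
    "variant":    ["control", "control", "control", "treatment", "treatment"],
    "user_id":    [1, 1, 2, 3, 3],
    "session_id": ["a", "b", "c", "d", "e"],
})

# Count distinct sessions per user, then average within each variant.
# The analysis unit (user) matches the usual randomization unit.
sessions_per_user = log.groupby(["variant", "user_id"])["session_id"].nunique()
oec = sessions_per_user.groupby(level="variant").mean()
print(oec)
```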
“…To address the multiple outcomes issue, we standardized our success criteria to use a small set of metrics, such as sessions/user [25].…”
Section: False Positives
confidence: 99%
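The multiple-outcomes issue arises because testing many metrics per experiment inflates the chance that at least one appears "significant" by luck. Standardizing on a small metric set, as the excerpt describes, is one remedy; adjusting p-values is a complementary one. A sketch of the latter using a Holm-Bonferroni correction (the p-values are illustrative, not from the cited work):

```python
# Holm-Bonferroni adjustment for p-values from several metrics tested in
# one experiment, controlling the family-wise false-positive rate.
from statsmodels.stats.multitest import multipletests

p_values = [0.003, 0.02, 0.04, 0.30, 0.75]  # hypothetical per-metric p-values
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
for p, p_adj, r in zip(p_values, p_adjusted, reject):
    print(f"p={p:.3f}  adjusted={p_adj:.3f}  significant={r}")
```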
“…Namely, an experimental platform usually has standardized success criteria, which use a small set of key metrics to make the final decision on the treatment variant of the service [15]. These metrics are usually selected with respect to business-related criteria of the considered service and are aligned with its long-term goals (like the number of sessions per user for a search engine [14]). Hence, finding an alternative for them is non-trivial and challenging [4], which is why a modification of an existing standardized metric is preferred.…”
Section: Introduction
confidence: 99%
“…User engagement reflects how often the user satisfies her needs (e.g., searching for something) by means of the considered service (e.g., a search engine). On the one hand, these metrics are measurable within a short-term experiment period and, on the other hand, they are predictive of the long-term success of the company [14,15,16,25]. That is why engagement metrics are often considered the most appropriate for online evaluation.…”
Section: Introduction
confidence: 99%