Multiple teams at Facebook are tasked with monitoring compute and memory utilization metrics that are important for managing the efficiency of the codebase. An efficiency regression occurs when the CPU utilization or queries per second (QPS) pattern of a function or endpoint shows an unexpected increase over its prior baseline. If the code changes responsible for these regressions propagate to Facebook's fleet of web servers, the impact of the inefficient code is compounded over billions of executions per day, with potential ramifications for Facebook's scaling efforts and the quality of the user experience. With a codebase ingesting in excess of 1,000 diffs across multiple pushes per day, it is important to have a real-time solution for detecting regressions that is not only scalable and high in recall, but also highly precise, in order to avoid overrunning the remediation queue with thousands of false positives. This paper describes the end-to-end regression detection system designed and used at Facebook. The main detection algorithm is based on sequential statistics supplemented by signal processing transformations, and the performance of the algorithm was assessed with a mixture of online and offline tests across different use cases. We compare the performance of our algorithm against a simple benchmark as well as a commercial anomaly detection software solution.
CCS CONCEPTS • Computer systems organization → Maintainability and maintenance; Real-time systems; • Software and its engineering → Operational analysis; • Information systems → Decision support systems; • Computing methodologies → Anomaly detection; • Mathematics of computing → Time series analysis;
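The abstract above does not specify which sequential statistic is used, so the following is only a minimal sketch of one classical sequential detector, a one-sided CUSUM, applied to a synthetic utilization series. The function name, the parameters `baseline_mean`, `slack`, and `threshold`, and the sample values are all assumptions made for illustration; this is not the system described in the paper.

```python
from typing import Optional, Sequence


def cusum_first_alarm(
    values: Sequence[float],
    baseline_mean: float,
    slack: float,
    threshold: float,
) -> Optional[int]:
    """Return the index of the first upward-shift alarm, or None.

    One-sided CUSUM (an assumed stand-in, not the paper's algorithm):
    accumulate deviations above baseline_mean + slack, reset the sum to
    zero whenever it would go negative, and alarm once it exceeds threshold.
    """
    running = 0.0
    for i, x in enumerate(values):
        running = max(0.0, running + (x - baseline_mean - slack))
        if running > threshold:
            return i
    return None


# Hypothetical CPU-utilization samples: flat near 10, then a step up to about 12.
cpu = [10.0, 10.2, 9.8, 10.1, 9.9, 12.0, 12.1, 11.9, 12.2, 12.0]
alarm_index = cusum_first_alarm(cpu, baseline_mean=10.0, slack=0.5, threshold=3.0)
print(alarm_index)
```

In this sketch a larger `slack` or `threshold` trades recall for precision, which is the tension the abstract highlights: a detector that fires too readily would flood the remediation queue with false positives.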
Moderating content on social media platforms is a formidable challenge due to the unprecedented scale of such systems, which typically handle billions of posts per day. Some of the largest platforms, such as Facebook, blend machine learning with manual review of platform content by thousands of reviewers. Operating a large-scale human review system poses interesting and challenging methodological questions that can be addressed with operations research techniques. We investigate the problem of optimally operating such a review system at scale using ideas from queueing theory and simulation.
CCS CONCEPTS • Computing methodologies → Machine learning; • Information systems → Social networks; • Mathematics of computing → Queueing theory.
In today's digital world, interaction with online platforms is ubiquitous, and thus content moderation is important for protecting users from content that does not comply with pre-established community guidelines. Given the vast volume of content generated online daily, having a robust content moderation system throughout every stage of planning is particularly important. We study the short-term planning problem of allocating human content reviewers to different harmful content categories. We use tools from fair division and study the application of competitive equilibrium and leximin allocation rules for addressing this problem. On top of the traditional Fisher market setup, we additionally incorporate novel aspects that are of practical importance. The first aspect is the forecasted workload of different content categories, which puts constraints on the allocation chosen by the planner. We show how a formulation that is inspired by the celebrated Eisenberg-Gale program allows us to find an allocation that not only satisfies the forecasted workload, but also fairly allocates the remaining working hours from the content reviewers among all content categories. The resulting allocation is also robust in the sense that the additional allocation provides a guardrail in cases where the actual workload deviates from the predicted workload. The second practical consideration is time-dependent allocation, which is motivated by the fact that partners need scheduling guidance for the reviewers across days to achieve efficiency. To address the time component, we introduce new extensions of the various fair allocation approaches for the single-time-period setting, and we show that many properties extend in essence, albeit with some modifications. Lastly, related to the time component, we additionally investigate how to satisfy markets' desire for smooth allocation; that is, partners for content reviewers prefer an allocation that does not vary much from time to time, so that the switch in staffing is minimized. We demonstrate the performance of our proposed approaches through real-world data obtained from Meta.
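For orientation, the classical Eisenberg-Gale convex program for a Fisher market with linear utilities can be written as below. The notation is an assumption introduced here, not taken from the paper: buyers $i$ have budgets $B_i$ and valuations $v_{ij}$ for divisible goods $j$ of unit supply, and $x_{ij}$ is the fraction of good $j$ allocated to buyer $i$. The abstract says only that its formulation is inspired by this program; the workload-constrained and time-dependent variants are not reproduced.

```latex
\begin{aligned}
\max_{x \ge 0} \quad & \sum_{i=1}^{n} B_i \log u_i \\
\text{s.t.} \quad & u_i = \sum_{j=1}^{m} v_{ij}\, x_{ij}, && i = 1, \dots, n, \\
& \sum_{i=1}^{n} x_{ij} \le 1, && j = 1, \dots, m.
\end{aligned}
```

One plausible reading, again an assumption, is that content categories play the role of buyers and reviewer hours the role of goods. The optimal solution of the classical program corresponds to a competitive equilibrium of the Fisher market, which is why it is a natural starting point for fair allocation.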