ACM Conference on Fairness, Accountability, and Transparency, 2022
DOI: 10.1145/3531146.3533233

Evaluation Gaps in Machine Learning Practice

Cited by 20 publications (13 citation statements)
References 97 publications
“…Example metrics that participants proposed included error rates for determinations that a piece of evidence was inconclusive and ranges for the possible values that false positive software outputs could take on (Section 5.2.3). Together, these findings echo gaps between evaluation design and real-world use contexts highlighted in prior work (e.g., [42,72,81,92]), and, importantly, demonstrate the valuable insights that public defenders develop through their everyday encounters with CFS in the U.S. criminal legal system, further motivating growing efforts in HCI to engage downstream stakeholders in designing performance evaluations of AI systems (e.g., [28,55,76,81]).…”
Section: Contextualize Design Of Performance Evaluations In Real Worl... (supporting)
confidence: 68%
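To ground the metrics those participants proposed, below is a minimal sketch of how an error rate for inconclusive determinations and a range of false positive output values might be computed. Everything here is a hypothetical illustration: the reference determinations, software outputs, and score values are synthetic placeholders, not data or code from the cited study.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
labels = np.array(["match", "non-match", "inconclusive"])

# Hypothetical validation data: reference determinations from a ground-truth
# study, the software's determinations, and a numeric score the software
# reports alongside each determination. All values are synthetic.
reference = rng.choice(labels, size=500)
software = rng.choice(labels, size=500)
scores = rng.uniform(0.0, 1.0, size=500)

# Metric 1: error rate for "inconclusive" determinations -- among cases the
# software called inconclusive, the fraction the reference standard says
# were actually conclusive (a match or a non-match).
called_inconclusive = software == "inconclusive"
inconclusive_error_rate = np.mean(reference[called_inconclusive] != "inconclusive")

# Metric 2: range of values that false positive outputs take on -- the scores
# reported on cases the software called a match but the reference standard
# calls a non-match.
false_positive = (software == "match") & (reference == "non-match")
fp_low, fp_high = scores[false_positive].min(), scores[false_positive].max()

print(f"Inconclusive error rate: {inconclusive_error_rate:.1%}")
print(f"False positive scores fall in [{fp_low:.2f}, {fp_high:.2f}]")
```

In a real evaluation, the reference column would come from a validation study with known ground truth, and the score would be whatever numeric output the forensic software reports; the second metric matters because a bare false positive rate hides how extreme the erroneous outputs themselves can be.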
“…Metcalf et al [136], drawing from fields adjacent to ML, found that algorithmic impact assessments, intended to highlight risks of AI system deployment, can instead be co-opted by firms developing such systems to further their interests. Meanwhile, recent work has also explored and critiqued evaluation practices of AI systems more broadly, highlighting the implications of decontextualization when evaluating AI systems [103,135] and raising attention to the risks of corporate capture when private actors evaluate their own systems [219]. Our aim is to support improvements in the design and development of RAI tools through an analysis of existing evaluation practices for RAI tools.…”
Section: Evaluating RAI Tools (mentioning)
confidence: 99%
“…In natural language processing (NLP), for instance, researchers have surveyed existing NLP model evaluation methods, finding no standardised evaluation practices [79,220]. Relatedly, efforts are underway to develop standards for evaluation of ML applications [88] and models [103]. Our focus complements these efforts, by attending to the evaluation of interventions in ML production, rather than the evaluation of the outputs of ML production, such as trained models or new AI systems.…”
Section: Evaluation Goals and Approaches Outside of HCI (mentioning)
confidence: 99%
“…can vary substantially across disciplines [34,77,98]. In practice, these properties lend themselves to communication breakdowns and ineffective collaboration around AI fairness [27,41,56,67,68]. Passi and Barocas found that misalignments around problem formulation between data scientists and business teams can contribute to fundamental fairness issues from the early problem formulation phases of a project [67].…”
Section: Background and Related Work (mentioning)
confidence: 99%
“…For instance, abstraction has been highlighted as an important skill for collaborating and communicating in software engineering and data analysis in cross-functional teams [4,55,64], although with the risk of losing the nuance of particular contexts [cf. 41,77]. However, in the context of collaboration on AI fairness work, these abstractions that were intended to facilitate conversations across roles often resulted in other team members not fully understanding and appreciating the labor hidden behind the efforts individuals invested in enabling the collaboration in AI fairness (Section 4.3).…”
Section: Making Invisible Labor Visible and Valuable (mentioning)
confidence: 99%