“…Many benchmarks have focused on code generation with APIs. Benchmarks like DS-1000 (Lai et al., 2023), ARCADE (Yin et al., 2022), NumpyEval, and PandasEval (Jain et al., 2022) focus on data science APIs. Other benchmarks cover broader APIs or general software engineering tasks, such as JuICe (Agashe et al., 2019), APIBench (Patil et al., 2023), RepoBench, ODEX (Wang et al., 2022b), SWE-Bench (Jimenez et al., 2023), GoogleCodeRepo (Shrivastava et al., 2023), RepoEval, and Cocomic-Data.…”