“…Several works created synthetic datasets to evaluate gender bias (Kiritchenko and Mohammad, 2018; González et al., 2020; Renduchintala and Williams, 2021), e.g., in the context of coreference (Rudinger et al., 2017; Zhao et al., 2018) and machine translation (Stanovsky et al., 2019; Prates et al., 2019; Kocmi et al., 2020), and some works used synthetic datasets to debias models (Saunders et al., 2020; Zhao et al., 2018). Webster et al. (2018) and Gonen and Webster (2020) collected natural medium-scale (4.4K sentences) datasets from Wikipedia and Reddit, respectively, and used them to evaluate gender bias in models of coreference resolution and machine translation.…”