“…Evaluation metrics: Recent transformer-based metrics such as BERTScore (Zhang et al., 2020) and MoverScore (Zhao et al., 2019) build on BERT-based models. Extensions include BARTScore (Yuan et al., 2021), which reads off probability estimates from text generation systems and uses them directly as metric scores, and MENLI (Chen and Eger, 2023), which uses probabilities from models fine-tuned on the Natural Language Inference task. These metrics are either reference-based (comparing the MT output to a human reference), like BERTScore and MoverScore, or reference-free (comparing the MT output to the source text), like XMoverScore (Zhao et al., 2020) and SentSim (Song et al., 2021). Some are trained (fine-tuned on human scores), like COMET (Rei et al., 2020), while others are untrained, like BERTScore.…”