“…However, recent research has highlighted the limitations of subword-level tokenization, including poor generalization for out-of-vocabulary words and domains due to their reliance on a fixed vocabulary (Bostrom & Durrett, 2020; Klein & Tsarfaty, 2020; Hofmann et al., 2021; Dong et al., 2020; Xu et al., 2021). This limitation is particularly problematic for forensic NLP models used to detect covert criminal communications (CCC) that employ unusual characters and subwords for obfuscation (Bromberg et al., 2020; Pei & Cheng, 2022; Tong et al., 2017; Wagner et al., 2020; Wang et al., 2019; Zhu et al., 2019).…”
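The fragmentation problem described above can be illustrated with a minimal sketch of greedy longest-match subword segmentation over a fixed vocabulary. The vocabulary and the obfuscated spelling `m0n3y` below are hypothetical, chosen only to show how a word outside the fixed vocabulary splinters into many short pieces while its standard spelling stays intact:

```python
# Hypothetical fixed subword vocabulary for illustration only.
VOCAB = {"money", "mon", "ey", "m", "o", "n", "e", "y", "0", "3"}

def tokenize(word, vocab=VOCAB):
    """Greedy longest-match segmentation over a fixed subword vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append("<unk>")  # no vocabulary entry covers this character
            i += 1
    return tokens

print(tokenize("money"))   # in-vocabulary spelling: a single token
print(tokenize("m0n3y"))   # obfuscated spelling: five single-character tokens
```

Because the obfuscated form never appears in the fixed vocabulary, the tokenizer falls back to character-level fragments, discarding the word-level signal a downstream classifier would otherwise exploit.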