“…One of the most promising tools for AI alignment, Reinforcement Learning from Human Feedback (RLHF, or Preference-based Reinforcement Learning), has delivered significant empirical success in fields such as game playing, robot training, stock prediction, recommender systems, clinical trials, and large language models (Novoseller et al., 2019; Sadigh et al., 2017; Christiano et al., 2017b; Kupcsik et al., 2018; Jain et al., 2013; Wirth et al., 2017; Knox and Stone, 2008; MacGlashan et al., 2017; Christiano et al., 2017a; Warnell et al., 2018; Brown et al., 2019; Shin et al., 2023; Ziegler et al., 2019; Stiennon et al., 2020; Nakano et al., 2021; Ouyang et al., 2022; Menick et al., 2022; Glaese et al., 2022; Gao et al., 2022; Bai et al., 2022a; Ganguli et al., 2022; Ramamurthy et al., 2022). Notably, the language model application ChatGPT is based on RLHF, which underlies several of its skills: answering follow-up questions, admitting its mistakes, challenging incorrect premises, and rejecting inappropriate requests.…”