“…As a general choice for the RL algorithm in Fig. 1D, we consider a hybrid of model-based and model-free policy [19,32,52,53]. The model-free (MF) component uses the sequence of states $s_{1:t}$, actions $a_{1:t}$, extrinsic rewards $r_{\text{ext},1:t}$, and intrinsic rewards $r_{\text{int},1:t}$ (in the two parallel branches in Fig.…”