Abstract: Task-based measures that capture neurocognitive processes can help bridge the gap between brain and behavior. To transfer tasks to clinical application, reliability is a crucial benchmark because it poses an upper bound on potential correlations with other variables (e.g., symptom or brain data). However, the reliability of many task readouts is low. In this study, we scrutinized the test-retest reliability of a probabilistic reversal learning task (PRLT) that is frequently used to characterize cognitive flexibility…
“…There are multiple factors that may have contributed to these findings. Firstly, these complex measures are interactions between multiple noisy task measures, which is known to lead to a larger overall measurement noise [50,51]. Secondly, we found that multiple model-derived predictors showed a high degree of co-linearity and thus directly affected how well the impact of these metrics could be measured when used in the same model (cf.…”
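The first point quoted above — that interactions between noisy measures are noisier than the measures themselves — can be illustrated with a small simulation. The sketch below is hypothetical (all values are chosen for illustration, not taken from the study): two independent zero-mean "true" task measures are each observed in two sessions with noise calibrated so that single-measure test-retest reliability is about 0.70; the reliability of their per-session product then drops to roughly the product of the single-measure reliabilities.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: two independent zero-mean true measures, each observed
# twice with noise variance chosen so that r = 1 / (1 + noise_var) = 0.70.
n = 200_000
noise_var = 3 / 7
t1, t2 = rng.standard_normal(n), rng.standard_normal(n)

def observe(t):
    """One noisy session-level observation of a true score."""
    return t + np.sqrt(noise_var) * rng.standard_normal(n)

x1a, x1b = observe(t1), observe(t1)   # measure 1, sessions a and b
x2a, x2b = observe(t2), observe(t2)   # measure 2, sessions a and b

# Test-retest reliability of a single measure (~0.70 by construction).
r_single = np.corrcoef(x1a, x1b)[0, 1]

# Test-retest reliability of the interaction (product) of the two measures.
# For independent zero-mean components this equals the product of the
# single-measure reliabilities (~0.49 here) — i.e., markedly lower.
r_product = np.corrcoef(x1a * x2a, x1b * x2b)[0, 1]
```

Under these assumptions the interaction's reliability is bounded by the product of its components' reliabilities, which is one concrete mechanism behind the quoted finding.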
Rapid adaptation to sudden changes in the environment is a hallmark of human behaviour. Many computational, neuroimaging, and even clinical investigations that capture this ability have relied on a behavioural paradigm known as the predictive-inference task. However, the psychometric quality of this task has never been examined, leaving unanswered whether it is indeed suited to capture behavioural variation on a within- and between-subject level. Using a large-scale test-retest design (N=330), we assessed the internal (internal consistency) and temporal (test-retest reliability) stability of the task’s relevant measures. We show that while the main measures capturing flexible adaptation yield good internal consistency and overall satisfying test-retest reliability, more complex markers of flexible behaviour lack convincing psychometric quality. Our findings have implications for the large corpus of previous studies using this task and provide clear guidance as to which measures should and should not be used in future studies.
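The two psychometric quantities assessed in this abstract — internal consistency (here via an odd/even split-half correlation with the Spearman-Brown correction) and test-retest reliability (a correlation of session-level scores) — can be sketched on simulated data. Everything below is a hypothetical illustration (subject counts, trial counts, and noise levels are made up), not the study's analysis pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1,000 subjects, 100 trials per session; each trial-level
# score is a stable subject ability plus independent trial noise.
n_sub, n_trials, noise_sd = 1000, 100, 3.0
ability = rng.standard_normal(n_sub)
session1 = ability[:, None] + noise_sd * rng.standard_normal((n_sub, n_trials))
session2 = ability[:, None] + noise_sd * rng.standard_normal((n_sub, n_trials))

# Internal consistency: correlate odd- and even-trial means within session 1,
# then step up to full-test length with the Spearman-Brown prophecy formula.
odd_mean = session1[:, ::2].mean(axis=1)
even_mean = session1[:, 1::2].mean(axis=1)
r_half = np.corrcoef(odd_mean, even_mean)[0, 1]
r_sb = 2 * r_half / (1 + r_half)   # Spearman-Brown corrected reliability

# Temporal stability: test-retest correlation of the per-session mean scores.
r_retest = np.corrcoef(session1.mean(axis=1), session2.mean(axis=1))[0, 1]
```

Note that the Spearman-Brown correction always exceeds the raw split-half correlation (for positive r), which is why split-half estimates are reported in corrected form.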
“…Furthermore, parameter generalizability is naturally bounded by parameter reliability, i.e., the stability of parameter estimates when participants perform the same task twice (test-retest reliability) or when estimating parameters from different subsets of the same dataset (split-half reliability). The reliability of RL models has recently become the focus of several parallel investigations [46, 47, 71, 48], some employing very similar tasks to ours [72]. The investigations collectively suggest that excellent reliability can often be achieved with the right methods, most notably by using hierarchical model fitting.…”
Section: Appendix 2 — table
“…Lastly, model parameter reliability might play a crucial role for our results: If parameters lack consistency between two instantiations of the same task (reliability), generalization between different tasks would necessarily be low as well. A recent wave of research, however, has convincingly demonstrated that good reliability is possible for several common RL models [47, 71, 48, 72], and we employ the recommended methods here [61, 53]. In addition, our simulation analysis shows that our approach can detect generalization.…”
Section: Discussion
“…The investigations collectively suggest that excellent reliability can often be achieved with the right methods, most notably by using hierarchical model fitting. Reliability might still differ between tasks or models, potentially being lower for learning rates than other RL parameters [72], and differing between tasks (e.g., compare [46] to [47]). In this study, we used hierarchical fitting for tasks A and B and assessed a range of qualitative and quantitative measures of model fit for each task [53, 49, 61], boosting our confidence in high reliability of our parameter estimates, and the conclusion that the lack of between-task parameter correlations was not due to a lack of parameter reliability, but a lack of generalizability.…”
Reinforcement Learning (RL) has revolutionized the cognitive and brain sciences, explaining behavior from simple conditioning to problem solving, across the life span, and anchored in brain function. However, discrepancies in results are increasingly apparent between studies, particularly in the developmental literature. To better understand these, we investigated to what extent parameters generalize between tasks and models, and capture specific and uniquely interpretable (neuro)cognitive processes. 291 participants aged 8-30 years completed three learning tasks in a single session, and their data were fitted using state-of-the-art RL models. RL decision noise/exploration parameters generalized well between tasks, decreasing between ages 8-17. Learning rates for negative feedback did not generalize, and learning rates for positive feedback showed intermediate generalizability, dependent on task similarity. These findings can explain discrepancies in the existing literature. Future research therefore needs to carefully consider task characteristics when relating findings across studies, and to develop strategies for computationally modeling how context impacts behavior.
“…approach in which sessions are modeled jointly. The latter has recently been shown to yield superior reliability estimates in theory and practice in other cognitive tasks (for details, see below; Brown, 2020; Haines, 2021; Waltmann et al., 2021).…”
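One intuition for why joint (hierarchical) modeling of sessions improves reliability can be sketched with a toy empirical-Bayes simulation. This is not the cited papers' actual model — all quantities (between-subject SD `tau`, per-subject standard errors `se`) are made up for illustration. The idea: when subjects' point estimates have heterogeneous, known noise, shrinking noisier estimates harder toward the group mean keeps their noise from leaking into the between-subject test-retest correlation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy sketch: per-subject parameter estimates from two sessions, with
# heterogeneous (known) estimation noise per subject.
n = 5000
tau = 1.0                                   # between-subject SD of true parameter
true = tau * rng.standard_normal(n)
se = rng.uniform(0.2, 2.0, n)               # hypothetical per-subject standard errors
est_a = true + se * rng.standard_normal(n)  # session A point estimates
est_b = true + se * rng.standard_normal(n)  # session B point estimates

# Separate fitting: correlate the raw per-session point estimates.
r_raw = np.corrcoef(est_a, est_b)[0, 1]

# Hierarchical flavor: empirical-Bayes shrinkage toward the group mean (0),
# with weights w = tau^2 / (tau^2 + se^2) — noisier subjects shrink harder.
w = tau**2 / (tau**2 + se**2)
r_shrunk = np.corrcoef(w * est_a, w * est_b)[0, 1]
# r_shrunk exceeds r_raw because high-noise subjects contribute less weight.
```

Full hierarchical fitting (as recommended in the quoted work) goes further by estimating the group distribution and individual parameters jointly, but the shrinkage mechanism above is the core reason reliability estimates improve.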
Self-regulation, the ability to guide behavior according to one’s goals, plays an integral role in understanding loss-of-control behaviors, a pertinent example being substance use disorders (SUD). Yet, experimental tasks that measure processes underlying self-regulation are not easy to deploy in contexts where such behaviors often occur, namely in real-life situations outside the laboratory. Moreover, lab-based experimental tasks are criticized for poor test–retest reliability and a lack of construct validity. These concerns might in part explain why the ecological validity of experimental measures—their ability to predict real-life behavior—is low. To address these shortcomings, we assessed the reliability and construct validity of four smartphone-based experimental tasks designed to measure cognitive control and decision-making. To facilitate future clinical applicability, we recruited a large (N=488) sample of individuals with SUD. Joint modeling of measurement sessions increased the reliability of task measures from moderate to good and often excellent levels. In line with theories of cognitive control and motivation, three latent factors reflecting cognitive control and decision-making in the context of gains and losses best described the data. As proof of concept, we show that a latent cognitive control score based on joint modeling yielded stronger correlations with drinking behavior than single task scores based on separate modeling. These findings indicate that in individuals with SUD, smartphone-based ambulatory experimental assessments can reliably index functions of cognitive control and decision-making, with plausible construct validity. Our findings provide evidence for rich possibilities arising from longitudinal experimental studies in SUD as well as in psychiatry, neuroscience, and psychology more generally.