Replication studies in psychological science sometimes fail to reproduce prior findings. If these studies use methods that are unfaithful to the original study or ineffective in eliciting the phenomenon of interest, then a failure to replicate may be a failure of the protocol rather than a challenge to the original finding. Formal pre-data-collection peer review by experts may address shortcomings and increase replicability rates. We selected 10 replication studies from the Reproducibility Project: Psychology (RP:P; Open Science Collaboration, 2015) for which the original authors had expressed concerns about the replication designs before data collection; only one of these studies had yielded a statistically significant effect (p < .05). Commenters suggested that lack of adherence to expert review and low-powered tests were the reasons that most of these RP:P studies failed to replicate the original effects. We revised the replication protocols and received formal peer review prior to conducting new replication studies. We administered the RP:P and revised protocols in multiple laboratories (median number of laboratories per original study = 6.5, range = 3–9; median total sample = 1,279.5, range = 276–3,512) for high-powered tests of each original finding with both protocols. Overall, following the preregistered analysis plan, we found that the revised protocols produced effect sizes similar to those of the RP:P protocols (Δr = .002 or .014, depending on analytic approach). The median effect size for the revised protocols (r = .05) was similar to that of the RP:P protocols (r = .04) and the original RP:P replications (r = .11), and smaller than that of the original studies (r = .37).
Analysis of the cumulative evidence across the original studies and the corresponding three replication attempts provided very precise estimates of the 10 tested effects and indicated that their effect sizes (median r = .07, range = .00–.15) were 78% smaller, on average, than the original effect sizes (median r = .37, range = .19–.50).
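Pooling correlational effect sizes across laboratories and replication attempts, as in the cumulative analysis above, is typically done on the Fisher-z scale with inverse-variance weights. The sketch below illustrates a minimal fixed-effect version of that pooling in Python; the correlations and sample sizes passed in at the bottom are hypothetical, not the paper's data:

```python
import math

def pool_correlations(rs, ns):
    """Fixed-effect meta-analysis of correlations on the Fisher-z scale.

    rs: per-study correlation coefficients
    ns: per-study sample sizes
    Returns the pooled correlation and its 95% confidence interval.
    """
    zs = [math.atanh(r) for r in rs]      # Fisher z-transform of each r
    ws = [n - 3 for n in ns]              # inverse-variance weights (var of z = 1/(n - 3))
    z_pooled = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    se = math.sqrt(1 / sum(ws))
    lo, hi = z_pooled - 1.96 * se, z_pooled + 1.96 * se
    # back-transform to the correlation scale
    return math.tanh(z_pooled), (math.tanh(lo), math.tanh(hi))

# hypothetical inputs: three replication attempts of one finding
r_pooled, ci = pool_correlations([0.04, 0.05, 0.11], [1280, 1100, 140])
```

Because the large samples dominate the weights, the pooled estimate sits close to the larger studies' small correlations; a random-effects version would additionally estimate between-study heterogeneity before weighting.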
Replication efforts in psychological science sometimes fail to replicate prior findings. If replications use methods that are unfaithful to the original study or ineffective in eliciting the phenomenon of interest, then a failure to replicate may be a failure of the replication protocol rather than a challenge to the original finding. Formal pre-data-collection peer review by experts may address shortcomings and increase replicability rates. We selected 10 replications from the Reproducibility Project: Psychology (RP:P; Open Science Collaboration, 2015) in which the original authors had expressed concerns about the replication designs before data collection, and only one of which was "statistically significant" (p < .05). Commenters on RP:P suggested that lack of adherence to expert review and low-powered tests were the reasons that most of these studies failed to replicate (Gilbert et al., 2016). We revised the replication protocols and received formal peer review prior to conducting new replications. We administered the RP:P and Revised replication protocols in multiple laboratories (median number of laboratories per original study = XX, range = XX–YY; median total sample = XX, range = XX–YY) for high-powered tests of each original finding with both protocols. Overall, XX of 10 RP:P protocols and XX of 10 Revised protocols showed significant evidence in the same direction as the original finding (p < .05), compared with an expected XX. The median effect size was [larger/smaller/similar] for Revised protocols (ES = .XX) compared with RP:P protocols (ES = .XX), [larger/smaller/similar] compared with the original studies (ES = .XX), and [larger/smaller/similar] compared with the original RP:P replications (ES = .XX). Overall, Revised protocols produced [much larger/somewhat larger/similar] effect sizes compared with RP:P protocols (ES = .XX). We also elicited peer beliefs about the replications through prediction markets and surveys of a group of researchers in psychology.
The peer researchers predicted that the Revised protocols would [decrease/not affect/increase] the replication rate, [consistent with/not consistent with] the observed replication results. The results suggest that the lack of replicability of these findings observed in RP:P was [partly/completely/not] due to discrepancies in the RP:P protocols that could be resolved with expert peer review.
The 2013 release of two remakes of low-budget horror cinema (A Morte do Demônio [Evil Dead] and O Massacre da Serra Elétrica [Texas Chainsaw]) confronts us with a question tied to the construction of value in cinema. We discuss how the horror genre built up the idea that low-budget films underwrite a premise of fear. We examine the two remakes through the notion of the "cosmetics of blood," that is, the attempt to emulate a low-budget look in productions with substantial resources.
Risen and Gilovich (2008) found that subjects believed that "tempting fate" would be punished with ironic bad outcomes (a main effect), and that this effect was magnified when subjects were under cognitive load (an interaction). A previous replication study (Frank & Mathur, 2016) that used an online implementation of the protocol on Amazon Mechanical Turk failed to replicate both the main effect and the interaction. Before this replication was run, the authors of the original study expressed concern that the cognitive-load manipulation may be less effective when implemented online than when implemented in the lab and that subjects recruited online may also respond differently to the specific experimental scenario chosen for the replication. A later, large replication project, Many Labs 2 (Klein et al., 2018), replicated the main effect (though the effect size was smaller than in the original study), but the interaction was not assessed. Attempting to replicate the interaction while addressing the original authors' concerns regarding the protocol for the first replication study, we developed a new protocol in collaboration with the original authors. We used four university sites (N = 754) chosen for similarity to the site of the original study to conduct a high-powered, preregistered replication focused primarily on the interaction effect. Results from these sites did not support the interaction or the main effect and were comparable to results obtained at six additional universities that were less similar to the original site. Post hoc analyses did not provide strong evidence for statistical inconsistency between the original study's estimates and our estimates; that is, the original study's results would not have been extremely unlikely in the estimated distribution of population effects in our sites.
We also collected data from a new Mechanical Turk sample under the first replication study’s protocol, and results were not meaningfully different from those obtained with the new protocol at universities similar to the original site. Secondary analyses failed to support proposed substantive mechanisms for the failure to replicate.
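The post hoc consistency check described above asks whether the original estimate would be surprising given the replications' estimated distribution of population effects. A minimal sketch of one such metric (the two-sided probability of an estimate at least as extreme as the original, in the style of Mathur and VanderWeele's multisite-replication metrics; all numeric inputs below are hypothetical, not the paper's data):

```python
import math
from statistics import NormalDist

def p_orig(y_orig, se_orig, mu_hat, tau_hat, se_mu):
    """Two-sided probability that an estimate as extreme as the original
    would arise from the replications' estimated effect distribution.

    y_orig, se_orig: original point estimate and its standard error
    mu_hat, se_mu:   pooled replication estimate and its standard error
    tau_hat:         estimated between-site heterogeneity (SD)
    """
    z = (y_orig - mu_hat) / math.sqrt(tau_hat**2 + se_orig**2 + se_mu**2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# hypothetical numbers: original effect 0.50 (SE 0.20) vs. pooled
# replication estimate 0.05 (SE 0.03) with heterogeneity 0.10
p = p_orig(0.50, 0.20, 0.05, 0.10, 0.03)
```

A small value would indicate the original result is statistically inconsistent with the replications; a value near 1 indicates no detectable conflict.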
Background Test anxiety is a crucial factor in academic outcomes: it can lead to poor cognitive performance, academic underachievement, and psychological distress, interfering specifically with students' ability to think and perform during tests. The main objective of this study was to explore the applicability and psychometric properties of a Portuguese version of the Reactions to Tests (RTT) scale in a sample of medical students. Method A sample of 672 medical students completed the RTT. The sample was randomly split in half to allow an independent exploratory factor analysis (EFA) on one half and a confirmatory factor analysis (CFA) of the best-fitting model on the other. CFA was used to test both the first-order factor structure (four subscales) and a second-order structure in which the four subscales relate to a general factor, test anxiety. The internal consistency of the RTT was assessed through Cronbach's alpha, composite reliability (CR), and average variance extracted (AVE) for the total scale and each of the four subscales. Convergent validity was evaluated through the correlation between the RTT and the State-Trait Anxiety Inventory (STAI-Y). To explore the comparability of measured attributes across subgroups of respondents, measurement invariance was also studied. Results Exploratory and confirmatory factor analyses showed acceptable fits for the Portuguese RTT version. Concerning internal consistency, the RTT was found to be reliable for measuring test anxiety in this sample. Convergent validity of the RTT with both the state and trait subscales of the STAI-Y was also shown. Moreover, multigroup analyses showed metric invariance across gender and curriculum phase. Conclusion Our results suggest that the RTT is a valid and reliable instrument for measuring test anxiety among Portuguese medical students.
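Of the internal-consistency indices mentioned above, Cronbach's alpha is the most straightforward to compute: it compares the sum of the item variances with the variance of the total score. A minimal Python sketch (the item scores at the bottom are hypothetical, not the study's data):

```python
def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    items: list of item-score columns, where each inner list holds one
    item's scores across the same respondents, in the same order.
    """
    k = len(items)                      # number of items
    n = len(items[0])                   # number of respondents

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    sum_item_vars = sum(var(col) for col in items)
    totals = [sum(col[i] for col in items) for i in range(n)]
    return k / (k - 1) * (1 - sum_item_vars / var(totals))

# hypothetical data: 3 items answered by 4 respondents
alpha = cronbach_alpha([[1, 2, 3, 4], [2, 2, 4, 4], [1, 3, 3, 5]])
```

Values around .70 or above are conventionally read as acceptable internal consistency, though the threshold depends on the scale's purpose.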
Background Current demand for multiple-choice questions (MCQs) in medical assessment exceeds the supply, so new item-development methods are urgently needed. Automatic Item Generation (AIG) promises to ease this burden by generating calibrated items through computer algorithms. Despite this promising scenario, there is still no evidence to encourage a general application of AIG in medical assessment; it is therefore important to evaluate AIG regarding its feasibility, validity, and item quality. Objective To provide a narrative review of the feasibility, validity, and item quality of AIG in medical assessment. Methods Electronic databases were searched for peer-reviewed, English-language articles published between 2000 and 2021 using the terms 'Automatic Item Generation', 'Automated Item Generation', 'AIG', 'medical assessment', and 'medical education'. Reviewers screened 119 records, and 13 full texts were checked against the inclusion criteria. A validity framework was applied to the included studies to draw conclusions regarding the validity of AIG. Results A total of 10 articles were included in the review. The synthesized data suggest that AIG is a valid and feasible method capable of generating high-quality items. Conclusions AIG can solve current problems in item development. It reveals itself as an auspicious next-generation technique for the future of medical assessment, promising high-quality items produced both quickly and economically.
Progress tests (PTs) are a popular type of longitudinal assessment used to evaluate clinical knowledge retention and lifelong learning in health-professions education. Most PTs consist of multiple-choice questions (MCQs), whose development is costly and time-consuming. Automatic Item Generation (AIG) produces test items through algorithms, promising to ease this burden. However, it remains unclear how AIG items behave in formative assessment (FA) modalities such as PTs compared with manually written items. The purpose of this study was to compare the quality and validity of AIG items versus manually written items. Responses to 126 (23 automatically generated) dichotomously scored, single-best-answer, five-option MCQs retrieved from the 2021 University of Minho medical PT were analyzed. Procedures based on item response theory (IRT), dimensionality testing, item fit, reliability, differential item functioning (DIF), and distractor analysis were used. Qualitative assessment was conducted through expert review. Validity evidence for AIG items was assessed using hierarchical linear modeling (HLM). The PT proved to be a viable tool for assessing medical students' cognitive competencies. AIG items were parallel to manually written items, presenting similar indices of difficulty and information. The proportion of functional distractors was similar for AIG and manually written items. Evidence of validity for AIG items was found, and the AIG items showed higher levels of item quality. AIG items functioned as intended and were appropriate for evaluating medical students at various levels of the knowledge spectrum.
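The IRT indices of difficulty and information referred to above can be illustrated with the two-parameter logistic (2PL) model, in which each item has a discrimination a and a difficulty b. A minimal sketch (the parameter values in the usage lines are hypothetical, not estimates from the study):

```python
import math

def prob_correct_2pl(theta, a, b):
    """2PL item response function: probability that an examinee of
    ability theta answers the item correctly, given discrimination a
    and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information the item provides at ability theta; for the
    2PL model it is a^2 * p * (1 - p), peaking where theta == b."""
    p = prob_correct_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# hypothetical item: discrimination 1.2, difficulty 0.4
p_at_b = prob_correct_2pl(0.4, 1.2, 0.4)      # 0.5 by construction
info_peak = item_information(0.4, 1.2, 0.4)   # maximum information
```

Comparing the difficulty and information curves of automatically generated items against those of manually written items is one way to judge whether the two item pools behave as parallel measures.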
Background and objective Recent developments in Europe and Portugal provide fertile ground for the rise of populism. Despite growing interest in the topic, there is to date no reliable tool to gauge Portuguese citizens' populist attitudes. The Populist Attitudes Scale (POP-AS), developed by Akkerman et al. [1], is one of the best-known instruments for measuring populist attitudes; however, no version for use with the Portuguese population is available. This paper describes the psychometric validation of the POP-AS for the Portuguese population. Methods We followed the measures of validity suggested by Boateng et al. [2] to address the psychometric features of the POP-AS. A robust psychometric pipeline evaluated the reliability, construct validity, cross-national/educational validity, and internal validity of the POP-AS. Results The Portuguese version of the POP-AS exhibited sound internal consistency and demonstrated adequate validity: a one-factor model was obtained, providing evidence of construct validity; invariance was ensured for education and partially ensured across countries; and all items showed relatively good discrimination and contributed adequately to the total score of the scale, providing evidence of internal validity. Conclusion Psychometric analysis supports the POP-AS as a valid and reliable instrument for measuring populist attitudes among Portuguese citizens. A validation framework for measurement instruments in political science is proposed. Implications of the findings are discussed.