We investigate the extent to which language versions (English, French and Arabic) of the same science test are comparable in terms of item difficulty and demands. We argue that language is an inextricable part of the scientific literacy construct, whether or not the examiner intends it to be. This argument has considerable implications for the methodologies used to address the equivalence of multiple language versions of the same assessment, including in the context of international assessment, where cross-cultural fairness is a concern. We also argue that none of the available statistical or qualitative techniques is capable of teasing out the language variable and neutralising its potential effects on item difficulty and demands. Exploring the use of automated text analysis tools at the quality control stage may help address some of these challenges.
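As a concrete illustration of what such automated text analysis might look like, the sketch below compares simple surface-level features (word count, mean sentence length, mean word length) of parallel item stems in different language versions. It is a minimal sketch, not a description of any specific tool used in the study; the item stems and the feature set are hypothetical, and a production workflow would rely on richer, language-aware measures.

```python
import re

def surface_features(text: str) -> dict:
    """Compute simple surface-level linguistic features of an item stem."""
    # Split into sentences on ., ? and ! (a crude heuristic, adequate for short stems).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    return {
        "word_count": len(words),
        "mean_sentence_length": len(words) / max(len(sentences), 1),
        "mean_word_length": sum(len(w) for w in words) / max(len(words), 1),
    }

# Hypothetical parallel versions of the same item stem.
versions = {
    "en": "Which of the following materials conducts electricity?",
    "fr": "Lequel des matériaux suivants conduit l'électricité ?",
}

for lang, stem in versions.items():
    print(lang, surface_features(stem))
```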
In large-scale educational assessments, it is generally required that tests are composed of items that function invariantly across the groups to be compared. Despite efforts to ensure invariance in the item construction phase, for a range of reasons (including the security of items) it is often necessary to account for differential item functioning (DIF) post hoc. This typically requires a choice among retaining an item as it is despite its DIF, deleting the item, or resolving (splitting) the item by creating a distinct item for each group. These options involve a trade-off between model fit and the invariance of item parameters, and each option can be valid depending on whether the source of DIF is relevant or irrelevant to the variable being assessed. We argue that making this choice requires a careful analysis of statistical DIF and its substantive source. We illustrate our argument by analysing PISA 2006 science data from three countries (the UK, France and Jordan) using the Rasch model, which was the model used for the analyses of all PISA 2006 data. We identify items with real DIF across countries and examine the implications for model fit, invariance, and the validity of cross-country comparisons when these items are eliminated, resolved or retained.
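For readers who want to experiment with DIF screening on their own data, the sketch below computes a Mantel-Haenszel DIF index for a single item. This is a common screening statistic rather than the Rasch-based procedure used in the paper, and the binary response matrix and group coding shown at the end are simulated assumptions for the sake of illustration.

```python
import numpy as np

def mantel_haenszel_dif(responses: np.ndarray, group: np.ndarray, item: int) -> float:
    """Mantel-Haenszel DIF index for one item.

    responses : (persons, items) binary matrix of scored answers
    group     : length-persons array, 0 = reference group, 1 = focal group
    item      : column index of the studied item

    Returns the ETS delta-scale value (negative = item favours the reference group).
    """
    # Matching criterion: total score on the remaining items (rest score).
    total = responses.sum(axis=1) - responses[:, item]
    num, den = 0.0, 0.0
    for k in np.unique(total):
        stratum = total == k
        ref, foc = stratum & (group == 0), stratum & (group == 1)
        a = responses[ref, item].sum()      # reference correct
        b = ref.sum() - a                   # reference incorrect
        c = responses[foc, item].sum()      # focal correct
        d = foc.sum() - c                   # focal incorrect
        n = ref.sum() + foc.sum()
        if n > 0:
            num += a * d / n
            den += b * c / n
    alpha = num / den if den > 0 else np.nan
    return -2.35 * np.log(alpha)            # ETS delta-MH scale

# Example with simulated data: 500 persons, 10 binary items, two groups.
rng = np.random.default_rng(0)
responses = (rng.random((500, 10)) < 0.6).astype(int)
group = (rng.random(500) < 0.5).astype(int)
print(mantel_haenszel_dif(responses, group, item=0))
```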
Predicting item difficulty is highly important in education for both teachers and item writers. Despite the identification of a large number of explanatory variables, predicting item difficulty remains a challenge in educational assessment, with empirical attempts rarely explaining more than 25% of the variance. This paper analyses 216 science items from key stage 2 tests, which are national sampling assessments administered to eleven-year-olds in England. Potential predictors (topic, subtopic, concept, question type, nature of stimulus, depth of knowledge and linguistic variables) were considered in the analysis. Coding frameworks employed in similar studies were adapted and used by two coders to rate the items independently. Linguistic demands were gauged using a computational linguistic facility. The stepwise regression models predicted 23% of the variance, with extended constructed questions and photos being the main predictors of item difficulty. While a substantial part of the unexplained variance could be attributed to the unpredictable interaction of variables, we argue that progress in this area requires improvement in the theories and methods employed. Future research needs to be centred on improving coding frameworks as well as developing systematic training protocols for coders. These technical advances would pave the way to improved task design and reduced development costs of assessments.
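As an illustration of the kind of analysis described here, the sketch below runs a forward stepwise (sequential) selection of item-level predictors of difficulty using scikit-learn. The coded features and difficulty values are simulated placeholders, and cross-validated R² is used as the selection criterion, which is one common way to approximate a stepwise procedure rather than the exact method used in the paper.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical coded item features (e.g. dummy-coded question type, stimulus type,
# depth of knowledge, linguistic measures) and empirical item difficulties.
n_items, n_features = 216, 12
X = rng.normal(size=(n_items, n_features))
difficulty = 0.6 * X[:, 0] + 0.4 * X[:, 3] + rng.normal(scale=1.0, size=n_items)

# Forward stepwise selection of predictors, scored by cross-validated R^2.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="forward", cv=5
)
selector.fit(X, difficulty)

# Refit on the selected predictors and report the variance explained.
selected = selector.get_support()
model = LinearRegression().fit(X[:, selected], difficulty)
print("selected columns:", np.flatnonzero(selected))
print("R^2 on selected predictors:", model.score(X[:, selected], difficulty))
```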