Demands for accountability have led to the implementation of large-scale testing programs in Australia and internationally. There is, however, a growing body of evidence that externally imposed testing programs do not have a sustained impact on student achievement. It has been argued that teacher assessment is more effective in raising student achievement levels; however, it is also often argued that teacher assessments are less reliable than the results of testing programs. This paper presents a study in which teachers judged writing scripts using pairwise comparison to generate a scale. The analysis showed high internal consistency of the teacher judgements, and the scale locations from the pairwise comparisons were highly correlated with scale estimates for the same students from a large-scale testing program. The results demonstrate that it is possible to obtain highly reliable and valid teacher judgements efficiently using pairwise comparison. Reliability indices are also reported for a series of small-scale assessments that applied the same methodology in a range of other domains; these results support the findings of the main study. The article discusses the benefits of using the method to supplement and validate results from large-scale testing programs.
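The abstract above does not specify how the pairwise judgements are turned into a scale, so the following is a minimal sketch assuming a Bradley-Terry-style (Rasch pairwise) model, P(script i preferred to script j) = exp(b_i − b_j) / (1 + exp(b_i − b_j)); the variable names and the numbers of scripts and comparisons are illustrative, not drawn from the study.

```python
# Minimal sketch: simulate judge decisions under a Bradley-Terry-style model,
# recover scale locations by maximum likelihood, and check recovery.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_scripts, n_comparisons = 50, 2000

true_quality = rng.normal(0.0, 1.0, n_scripts)            # latent script quality
pairs = rng.choice(n_scripts, size=(n_comparisons, 2))     # random pairings of scripts
pairs = pairs[pairs[:, 0] != pairs[:, 1]]                  # drop self-pairings
diff_true = true_quality[pairs[:, 0]] - true_quality[pairs[:, 1]]
winner_is_first = rng.random(len(pairs)) < 1 / (1 + np.exp(-diff_true))

def neg_log_lik(b):
    # Log-likelihood of each observed decision under the pairwise model
    diff = b[pairs[:, 0]] - b[pairs[:, 1]]
    ll = np.where(winner_is_first,
                  -np.logaddexp(0.0, -diff),               # first script preferred
                  -np.logaddexp(0.0, diff))                # second script preferred
    return -ll.sum()

est = minimize(neg_log_lik, np.zeros(n_scripts), method="L-BFGS-B").x
est -= est.mean()                                          # centre the scale

# With enough comparisons, the estimated locations track the generating qualities;
# in practice a separation/reliability index would also be computed from the
# estimates and their standard errors.
print("correlation with generating scale:", np.corrcoef(est, true_quality)[0, 1].round(3))
```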
Andersen (1995, 2002) proves a theorem relating the variances of parameter estimates from samples and subsamples and shows its use as an adjunct to standard statistical analyses. The authors present an application in which the theorem is central to the hypothesis tested, namely, whether random guessing on multiple-choice items affects their difficulty estimates in the Rasch model. Taking random guessing to be a function of the difficulty of an item relative to the proficiency of a person, the authors describe a method for creating a subsample of responses that is least likely to be affected by guessing. Then, using Andersen’s theorem, they assess the difference in difficulty estimates between the whole sample and the subsample for each item. To demonstrate the effectiveness of the procedure, data are simulated according to a class of models in which random guessing is a function of the proficiency of a person relative to the difficulty of an item. The procedure is also applied to an empirical data set from Raven’s Advanced Progressive Matrices, with the results indicating that guessing is present in a substantial number of items. One especially important application in which correct relative difficulty estimates are required is where items will form part of an item bank and will subsequently be administered interactively; in that case, items too difficult for a person are not administered and are therefore unlikely to attract random guessing.
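The tailoring idea described above can be illustrated with a small simulation. The sketch below is not the authors' procedure or code: it assumes a 0.25 guessing rate on items well above a person's proficiency, an arbitrary 1.5-logit tailoring cutoff, and a crude joint estimation of Rasch difficulties, and it only computes the full-versus-tailored difference that Andersen's theorem would then be used to test.

```python
# Sketch: simulate Rasch data with guessing on hard items, estimate item
# difficulties from all responses and from a tailored subsample, compare shifts.
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_items = 1000, 20
theta = rng.normal(0.0, 1.0, n_persons)                    # person proficiencies
b = np.linspace(-2.0, 2.0, n_items)                        # generating difficulties

p_rasch = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
guess = 0.25 * (b[None, :] - theta[:, None] > 1.0)         # guessing only when item >> person
p = p_rasch + (1 - p_rasch) * guess
X = (rng.random((n_persons, n_items)) < p).astype(float)

def rasch_difficulties(X, mask, n_iter=300, step=1.0):
    """Crude joint estimation of Rasch item difficulties using only responses with mask=1."""
    th, d = np.zeros(X.shape[0]), np.zeros(X.shape[1])
    for _ in range(n_iter):
        pr = 1 / (1 + np.exp(-(th[:, None] - d[None, :])))
        resid = (X - pr) * mask
        th += step * resid.sum(axis=1) / np.maximum(mask.sum(axis=1), 1)
        d -= step * resid.sum(axis=0) / np.maximum(mask.sum(axis=0), 1)
        d -= d.mean()                                       # anchor the scale at mean zero
    return d

full = rasch_difficulties(X, np.ones_like(X))

# Tailored subsample: drop a response when the item's full-sample difficulty sits
# more than 1.5 logits above a rough proficiency estimate for that person.
prop = X.mean(axis=1)
th_rough = np.log((prop + 0.01) / (1 - prop + 0.01))
keep = (full[None, :] - th_rough[:, None]) < 1.5
tailored = rasch_difficulties(X, keep.astype(float))

# Guessing makes hard items look easier in the full data; the per-item difference
# below is what Andersen's theorem allows one to test (using the variances of the
# two sets of estimates), a step this sketch does not reproduce.
print(np.round(full - tailored, 2))
```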
Models of modern test theory imply statistical independence among responses, generally referred to as local independence. One violation of local independence occurs when the response to one item governs the response to a subsequent item. Expanding on a formulation of this kind of violation as a process in the dichotomous Rasch model, this article generalizes the dependence process to the unidimensional, polytomous Rasch model. It then shows how the magnitude of the violation can be estimated as a change in the location of the thresholds separating adjacent categories of the second item, caused by response dependence on the first. As in the dichotomous case, it is suggested that this index is more readily interpretable than other indices of dependence, which take the form of either a weight on an interaction term in a model or a correlation coefficient. One function of this method of assessing dependence is in the development of tests and assessment formats, where evidence of the magnitude of dependence of one item on another in a pilot study can inform decisions about which items to retain in the final version of a test or which formats need to be reconstructed. A second function is to identify the magnitude of response dependence that may then need to be taken into account in some other way, for example by applying a model that accounts for the dependence.
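For reference, the threshold-shift idea can be written down explicitly. The parameterization below is an assumed sketch, restricted to dependence on a preceding dichotomous item, and is not necessarily the exact formulation used in the article.

```latex
% Polytomous (partial credit) Rasch model for item j with thresholds \delta_{jk},
% where the empty sum for x = 0 is defined as zero:
\[
  P(X_{nj}=x \mid \theta_n) \;=\;
  \frac{\exp\!\Big(\sum_{k=1}^{x}\big(\theta_n-\delta_{jk}\big)\Big)}
       {\sum_{h=0}^{m_j}\exp\!\Big(\sum_{k=1}^{h}\big(\theta_n-\delta_{jk}\big)\Big)},
  \qquad x = 0,\dots,m_j .
\]
% Response dependence of item j on a preceding dichotomous item i can be encoded
% as a common shift d of item j's thresholds, with the sign set by the earlier
% response (easier after success, harder after failure):
\[
  \delta_{jk}^{*} \;=\; \delta_{jk} - d\,(2x_{ni}-1), \qquad k = 1,\dots,m_j ,
\]
% so that d = 0 recovers local independence and the estimated d gives the
% magnitude of the dependence in logits on the threshold scale.
```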
Rubrics for assessing student performance are often seen as providing rich information about complex skills. Despite their widespread use, however, little empirical research has examined whether rubrics can validly meet their intended purposes. The authors examine a rubric used to assess students’ writing in a large-scale testing program and present empirical evidence for a potentially widespread threat to the validity of rubric assessments arising from design features. The research adopted an iterative tryout-redesign-tryout approach, and it casts doubt on whether rubrics with structurally aligned categories can validly assess complex skills. A solution is proposed that involves rethinking the structural design of the rubric to mitigate the threat to validity. Broader implications are discussed.