This article develops a validity argument approach for use with observation protocols currently employed to assess teacher quality for high-stakes personnel and professional development decisions. After defining the teaching quality domain, we articulate an interpretive argument for observation protocols. To illustrate the types of evidence that might compose a validity argument, we draw on data from a validity study of the Classroom Assessment Scoring System for secondary classrooms. Based on data from 82 Algebra classrooms, we illustrate how data from observation scores, value-added models, generalizability studies, and measures of teacher knowledge, student achievement, and teacher and student beliefs could be used to build a validity argument for observation protocols. Strengths and limitations of the validity argument approach, as well as the issues the approach raises for observation protocol validity research, are considered.
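To make the generalizability-study strand of such an argument concrete, the following sketch simulates a crossed persons (teachers) × occasions (observed lessons) design and estimates variance components and generalizability coefficients. It is a minimal illustration under assumed values, not an analysis of the CLASS-S data; the design facets and component sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical crossed p x o design: 82 teachers, 4 observed lessons each.
# The variance components below are assumptions, not CLASS-S estimates.
n_p, n_o = 82, 4
var_p, var_o, var_pe = 0.50, 0.10, 0.40
scores = (rng.normal(0, np.sqrt(var_p), (n_p, 1)) +     # teacher effects
          rng.normal(0, np.sqrt(var_o), (1, n_o)) +     # occasion effects
          rng.normal(0, np.sqrt(var_pe), (n_p, n_o)))   # interaction/error

# Random-effects ANOVA mean squares for the crossed design.
grand = scores.mean()
p_means = scores.mean(axis=1)
o_means = scores.mean(axis=0)
ms_p = n_o * ((p_means - grand) ** 2).sum() / (n_p - 1)
ms_o = n_p * ((o_means - grand) ** 2).sum() / (n_o - 1)
resid = scores - p_means[:, None] - o_means[None, :] + grand
ms_po = (resid ** 2).sum() / ((n_p - 1) * (n_o - 1))

# Estimated variance components (negative estimates truncated at zero).
sig2_p = max((ms_p - ms_po) / n_o, 0.0)
sig2_o = max((ms_o - ms_po) / n_p, 0.0)
sig2_po = ms_po
print(f"sigma^2: teacher={sig2_p:.2f}, occasion={sig2_o:.2f}, "
      f"interaction/error={sig2_po:.2f}")

# D-study: generalizability coefficient for relative decisions as a
# function of the number of lessons averaged per teacher.
for n_prime in (1, 2, 4, 8):
    e_rho2 = sig2_p / (sig2_p + sig2_po / n_prime)
    print(f"lessons per teacher = {n_prime}: E(rho^2) = {e_rho2:.2f}")
```

The D-study loop makes explicit how the dependability of protocol scores for relative decisions about teachers depends on the number of lessons observed, one of the design questions a validity argument of this kind must address.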
Computer simulation methods were used to examine the sensitivity of model fit criteria to misspecification of the first-level error structure in two-level models of change, and then to examine the impact of misspecification on estimates of the variance parameters, estimates of the fixed effects, and tests of the fixed effects. Fit criteria frequently failed to identify the correct model when series lengths were short. Misspecification led to substantially biased estimates of the variance parameters. Estimates of the fixed effects, however, remained unbiased under most conditions, and tests of the fixed effects were robust to misspecification under most conditions. The exceptions occurred when nonlinear growth trajectories were coupled with measurement occasions that were unequally spaced by different amounts for different individuals.
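A minimal sketch of this kind of simulation, assuming a balanced linear-growth design: the code below generates two-level change data whose level-1 errors follow an AR(1) process, then compares AIC for a correctly specified AR(1) residual structure against a misspecified independent-errors structure. All design values (sample size, series length, variance components) are illustrative.

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(0)

# Balanced design: n persons, t equally spaced occasions, linear growth,
# random intercepts/slopes, AR(1) level-1 errors (values illustrative).
n, t = 200, 5
times = np.arange(t, dtype=float)
X = np.column_stack([np.ones(t), times])        # fixed-effects design
beta_true = np.array([10.0, 2.0])               # intercept, slope
G_true = np.array([[4.0, 0.5], [0.5, 1.0]])     # random-effects covariance
sig2_true, rho_true = 2.0, 0.6

def ar1_cov(sig2, rho):
    lags = np.abs(np.subtract.outer(np.arange(t), np.arange(t)))
    return sig2 * rho ** lags

V_true = X @ G_true @ X.T + ar1_cov(sig2_true, rho_true)
Y = rng.multivariate_normal(X @ beta_true, V_true, size=n)   # n x t

def neg2ll(params, ar1):
    # Log/atanh transforms keep the variance parameters admissible.
    lg1, lg2, g12, ls2 = params[:4]
    rho = np.tanh(params[4]) if ar1 else 0.0
    G = np.array([[np.exp(lg1), g12], [g12, np.exp(lg2)]])
    V = X @ G @ X.T + ar1_cov(np.exp(ls2), rho)
    sign, logdet = np.linalg.slogdet(V)
    if sign <= 0:
        return 1e12                              # reject non-PD proposals
    Vinv = np.linalg.inv(V)
    # Profile out the fixed effects: GLS estimate given the current V.
    beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ Y.mean(axis=0))
    resid = Y - X @ beta
    quad = np.einsum('it,ts,is->', resid, Vinv, resid)
    return n * t * np.log(2 * np.pi) + n * logdet + quad

for label, ar1 in [("independent errors (misspecified)", False),
                   ("AR(1) errors (correct)", True)]:
    x0 = np.zeros(5 if ar1 else 4)
    fit = optimize.minimize(neg2ll, x0, args=(ar1,), method="Nelder-Mead",
                            options={"maxiter": 20000, "fatol": 1e-8})
    k = len(x0) + 2                    # variance params + 2 fixed effects
    print(f"{label}: AIC = {fit.fun + 2 * k:.1f}")
```

Repeating such a run across many replications, series lengths, and spacing patterns is what allows the selection rates of fit criteria and the bias in parameter estimates to be tabulated.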
The purpose of this study was to compare and evaluate five online pretest item-calibration/scaling methods in computerized adaptive testing (CAT): marginal maximum likelihood estimation with one EM cycle (OEM), marginal maximum likelihood estimation with multiple EM cycles (MEM), Stocking's Method A, Stocking's Method B, and BILOG/Prior. The five methods were evaluated in terms of item-parameter recovery at three sample sizes (300, 1,000, and 3,000). The MEM method appeared to be the best choice because it produced the smallest parameter-estimation errors at all sample sizes. Although MEM and OEM are mathematically similar, the OEM method produced larger errors; MEM is therefore preferable to OEM unless the time required for iterative computation is a concern. Stocking's Method B also worked very well, but it requires anchor items, which would either lengthen the test or require larger sample sizes, depending on the test administration design. Until more appropriate ways of handling sparse data are devised, the BILOG/Prior method may not be a reasonable choice for small sample sizes. Stocking's Method A had the largest weighted total error as well as a theoretical weakness (it treats estimated ability as true ability); thus there appears to be little reason to use it.
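A rough sketch of the OEM/MEM distinction for a single pretest item under a 2PL model, with simplifying assumptions: each examinee's ability posterior is approximated as a normal centered at their CAT ability estimate (standing in for the operational-item likelihood the actual methods use), and all data are simulated.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)

# Simulated examinees and their CAT ability estimates (values illustrative).
N = 1000
theta = rng.normal(0, 1, N)
theta_hat = theta + rng.normal(0, 0.3, N)      # CAT estimates, SE ~ 0.3
a_true, b_true = 1.2, 0.4                      # pretest item's 2PL parameters
u = rng.binomial(1, 1 / (1 + np.exp(-a_true * (theta - b_true))))

# Quadrature grid; each examinee's ability posterior is approximated as
# normal around the CAT estimate (an assumption made for this sketch).
q = np.linspace(-4, 4, 41)
W0 = norm.pdf(q[None, :], loc=theta_hat[:, None], scale=0.3)
W0 /= W0.sum(axis=1, keepdims=True)

def irf(a, b):                                 # 2PL item response function
    return 1 / (1 + np.exp(-a * (q - b)))

def m_step(nk, rk):
    # Maximize the expected complete-data log-likelihood over (a, b).
    def nll(p):
        P = np.clip(irf(p[0], p[1]), 1e-9, 1 - 1e-9)
        return -(rk * np.log(P) + (nk - rk) * np.log(1 - P)).sum()
    return minimize(nll, x0=[1.0, 0.0], method="Nelder-Mead").x

# OEM-like: one E-step from the operational posteriors, one M-step.
nk, rk = W0.sum(axis=0), (W0 * u[:, None]).sum(axis=0)
a, b = m_step(nk, rk)
print(f"one EM cycle:    a = {a:.2f}, b = {b:.2f}")

# MEM-like: iterate, folding the pretest response into the posterior.
for _ in range(20):
    P = np.clip(irf(a, b), 1e-9, 1 - 1e-9)
    like = np.where(u[:, None] == 1, P[None, :], 1 - P[None, :])
    W = W0 * like
    W /= W.sum(axis=1, keepdims=True)
    nk, rk = W.sum(axis=0), (W * u[:, None]).sum(axis=0)
    a, b = m_step(nk, rk)
print(f"multiple cycles: a = {a:.2f}, b = {b:.2f}")
```

The extra cycles let the pretest item's own responses sharpen each examinee's posterior, which is consistent with MEM's smaller estimation errors at the cost of the iterative computation noted above.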