This is the first attempt to evaluate the use of comprehensive workplace assessment across the medical specialties in the UK. The methods are feasible to conduct and can make reliable distinctions between doctors' performances. With adaptation, they may be appropriate for assessing the workplace performance of other grades and specialties of doctor. This may be helpful in informing foundation assessment.
CONTEXT Assessment in the workplace is important, but many evaluations have shown that assessor agreement and discrimination are poor. Training discussions suggest that assessors find conventional scales invalid. We evaluate scales constructed to reflect developing clinical sophistication and independence in parallel with conventional scales.METHODS A valid scale should reduce assessor disagreement and increase assessor discrimination. We compare conventional and construct-aligned scales used in parallel to assess approximately 2000 medical trainees by each of three methods of workplace-based assessment (WBA): the mini-clinical evaluation exercise (mini-CEX); the acute care assessment tool (ACAT), and the case-based discussion (CBD). We evaluate how scores reflect assessor disagreement (V j and V j*p ) and assessor discrimination (V p ), and we model reliability using generalisability theory.RESULTS In all three cases the conventional scale gave a performance similar to that in previous evaluations, but the construct-aligned scales substantially reduced assessor disagreement and substantially increased assessor discrimination. Reliability modelling shows that, using the new scales, the number of assessors required to achieve a generalisability coefficient ‡ 0.70 fell from six to three for the mini-CEX, from eight to three for the CBD, from 10 to nine for 'on-take' ACAT, and from 30 to 12 for 'post-take' ACAT.
CONCLUSIONSThe results indicate that construct-aligned scales have greater utility, both because they are more reliable and because that reliability provides evidence of greater validity. There is also a wider implication: the disappointing reliability of existing WBA methods may reflect not assessors' differing assessments of performance, but, rather, different interpretations of poorly aligned scales. Scales aligned to the expertise of clinician-assessors and the developing independence of trainees may improve confidence in WBA.assessment
Four general principles emerge: the response scale should be aligned to the reality map of the judges; judgements rather than objective observations should be sought; the assessment should focus on competencies that are central to the activity observed, and the assessors who are best-placed to judge performance should be asked to participate.
The MMI is a moderately reliable method of assessment. The largest source of error relates to aspects of interviewer subjectivity, suggesting interviewer training would be beneficial. Candidate performance on 1 question does not correlate strongly with performance on another question, demonstrating the importance of context specificity. The MMI needs to be sufficiently long for precise comparison for ranking purposes. We supported the validity of the MMI by showing a small positive correlation with GAMSAT section scores.
A G-study uses variance component analysis to measure the contributions that all relevant factors make to the result (observer, situation, case, assessee and their interactions). This information can be combined to reflect the reliability of a single observation as a reflection of all possible measurements - a true reflection of reliability. It can also be used to estimate the reliability of a combined sample of several different observations, or to predict how many observations are required with different test formats to achieve a given level of reliability. Worked examples are used to illustrate the concepts.
It is essential to clarify the purpose of the assessment in question because this drives every aspect of its design. The intended focus for the assessment should be defined as specifically as possible. The scope of situations over which the result is intended to generalize should be established. Blueprinting may help the test designer to select a representative sample of practice across all the relevant aspects of performance and may also be used to inform the selection of appropriate assessment methods. An appropriately designed pilot study enables the test designer to evaluate feasibility, acceptability, validity (with respect to the intended focus) and reliability (with respect to the intended scope of generalization).
PBA demonstrated good overall validity and acceptability, and exceptionally high reliability. Trainees should be assessed adequately for each given procedure.
The Mini-CEX in anaesthesia has strengths and weaknesses. Strengths include: its perceived very positive educational impact and its relative feasibility. Variable assessor stringency means that large numbers of assessors are required to produce reliable scores.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.