Elicited imitation (EI) has been widely used to examine second language (L2) proficiency and development and was an especially popular method in the 1970s and early 1980s. However, as the field embraced more communicative approaches to both instruction and assessment, the use of EI diminished, and the construct-related validity of EI scores as a representation of language proficiency was called into question. Current uses of EI, while not discounting the importance of communicative activities and assessments, tend to focus on the importance of processing and automaticity. This study presents a systematic review of EI in an effort to clarify the construct and usefulness of EI tasks in L2 research. The review comprised two phases: a narrative review and a meta-analysis. We surveyed 76 theoretical and empirical studies from 1970 to 2014 to investigate the use of EI, in particular with respect to the research/assessment context and task features. The results of the narrative review provided a theoretical basis for the meta-analysis. The meta-analysis utilized 24 independent effect sizes based on 1,089 participants obtained from 21 studies. To investigate evidence of construct-related validity for EI, we examined the following: (1) the ability of EI scores to distinguish speakers
This paper reports on a mixed-methods approach to evaluating rater performance on a local oral English proficiency test. Three types of reliability estimates were reported to examine rater performance from different perspectives. Quantitative results were also triangulated with qualitative rater comments to arrive at a more representative picture of rater performance and to inform rater training. Specifically, both quantitative (6338 valid rating scores) and qualitative data (506 sets of rater comments) were analyzed with respect to rater consistency, rater consensus, rater severity, rater interaction, and raters’ use of the rating scale. While raters achieved overall satisfactory inter-rater reliability (r = .73), they differed in severity and achieved relatively low exact score agreement. Disagreement in rating scores was largely explained by two significant main effects: (1) examinees’ oral English proficiency level, that is, raters tended to agree more at higher score levels than at lower score levels; and (2) raters’ differential severity due to their varied perceptions of the speech intelligibility of Indian and low-proficiency Chinese examinees. However, the effect sizes of raters’ differential severity on overall rater agreement were rather small, suggesting that varied perceptions of second language (L2) intelligibility among trained raters, though possible, are not likely to have a large impact on the overall evaluation of oral English proficiency. In contrast, at the lower score levels, examinees’ varied language proficiency profiles made rater alignment difficult. Rater disagreement at these levels accounted for most of the overall rater disagreement and thus should be a focus of rater training.
An implication of this study is that the interpretation of rater performance should not focus solely on identifying interactions between raters’ and examinees’ linguistic backgrounds but should also examine the impact of rater interactions across examinees’ language proficiency levels. The findings also indicate the effectiveness of triangulating different sources of data on rater performance through a mixed-methods approach, especially in local testing contexts.
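As a concrete illustration of two of the rater-performance indices discussed above, the sketch below computes inter-rater consistency (Pearson r between two raters' paired scores) and exact score agreement (the proportion of examinees given identical scores). This is not the study's analysis; the rater names and all scores are invented for demonstration.

```python
# Minimal sketch of two rater-performance indices: Pearson correlation
# (consistency) and exact agreement (consensus). Scores are hypothetical.
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two paired lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def exact_agreement(x, y):
    """Proportion of examinees awarded identical scores by both raters."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

# Hypothetical scale scores from two raters for the same eight examinees
rater_a = [3, 4, 2, 5, 4, 3, 2, 4]
rater_b = [3, 4, 3, 5, 3, 3, 2, 4]

print(round(pearson_r(rater_a, rater_b), 2))      # consistency
print(round(exact_agreement(rater_a, rater_b), 2))  # consensus
```

Note how the two indices can diverge: raters who rank examinees similarly (high r) may still assign identical scores only part of the time, which mirrors the abstract's finding of satisfactory reliability alongside relatively low exact agreement.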
This study examines the predictive validity of the TOEFL iBT with respect to academic achievement as measured by the first-year grade point average (GPA) of Chinese students at Purdue University, a large, public, Research I institution in Indiana, USA. Correlations between GPA and TOEFL iBT total and subsection scores were examined for 1,990 mainland Chinese students enrolled across three academic years (N2011 = 740, N2012 = 554, N2013 = 696). Subsequently, cluster analyses on the three cohorts’ TOEFL subsection scores were conducted to determine whether different score profiles might help explain the correlational patterns found between TOEFL subscale scores and GPA across the three student cohorts. For the 2011 and 2012 cohorts, speaking and writing subscale scores were positively correlated with GPA; however, negative correlations were observed for listening and reading. In contrast, for the 2013 cohort, the writing, reading, and total subscale scores were positively correlated with GPA, and the negative correlations disappeared. Results of the cluster analyses suggest that the negative correlations in the 2011 and 2012 cohorts were associated with a distinctive Reading/Listening versus Speaking/Writing discrepant score profile of a single Chinese subgroup. In 2013, this subgroup disappeared from the incoming class because of changes made to the University’s international undergraduate admissions policy. The uneven score profile has important implications for admissions policy, the provision of English language support, and broader effects on academic achievement.
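To make the clustering step concrete, the sketch below runs a plain k-means on invented (reading, speaking) subscore pairs and shows how a Reading/Listening versus Speaking/Writing discrepant profile could surface as its own cluster. This is not the study's data or procedure; all scores and starting centroids are hypothetical.

```python
# Minimal k-means sketch on 2-D subscore pairs; all values are invented.
def kmeans(points, centroids, iters=10):
    """Plain k-means on 2-D points; returns final centroids and labels."""
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        labels = [min(range(len(centroids)),
                      key=lambda k: (p[0] - centroids[k][0]) ** 2 +
                                    (p[1] - centroids[k][1]) ** 2)
                  for p in points]
        # Update step: move each centroid to the mean of its members
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, labels

# Hypothetical (reading, speaking) subscores: a balanced group and a
# group with high reading but low speaking (a discrepant profile)
scores = [(25, 24), (27, 26), (26, 25),
          (29, 15), (28, 14), (30, 16)]
centroids, labels = kmeans(scores, centroids=[(25.0, 25.0), (29.0, 15.0)])
print(labels)  # the discrepant profiles fall into their own cluster
```

In an analysis like the one described above, membership in such a discrepant cluster could then be related to GPA to probe why subscale–GPA correlations differ across cohorts.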
This paper provides a narrative review of empirical research on the assessment of speaking proficiency published in selected journals in the field of language assessment. A total of 104 published articles on speaking assessment were collected and systematically analyzed within an argument-based validation framework (Chapelle et al., 2008). We examined how the published research is represented in the six inferences of this framework, the topics that were covered by each article, and the research methods that were employed in collecting the backings to support the assumptions underlying each inference. Our analysis revealed that: (a) most of the collected articles could be categorized into the three inferences of evaluation, generalization, and explanation; (b) the topics most frequently explored by speaking assessment researchers included the constructs of speaking ability, rater effects, and factors that affect spoken performance, among others; and (c) quantitative methods were more frequently employed to interrogate the inferences of evaluation and generalization, whereas qualitative methods were more frequently utilized to investigate the explanation inference. The paper concludes with a discussion of the implications of this study in relation to gaining a more nuanced understanding of task- or domain-specific speaking abilities, understanding speaking assessment in classroom contexts, and strengthening the interfaces between speaking assessment and teaching and learning practices.