Tests for experiments with matched groups or repeated measures designs use error terms that involve the correlation between the measures as well as the variance of the data. The larger the correlation between the measures, the smaller the error and the larger the test statistic. If an effect size is computed from the test statistic without taking the correlation between the measures into account, effect size will be overestimated. Procedures for computing effect size appropriately from matched groups or repeated measures designs are discussed.
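As a concrete illustration of the correction this abstract points to, the sketch below converts a paired (repeated measures) t statistic into a between-groups effect size using the standard relation d = t·√(2(1 − r)/n). This is a minimal sketch under that assumption; the function name and example values are ours, not the authors' notation.

```python
import math

def d_from_paired_t(t_paired: float, r: float, n: int) -> float:
    """Effect size d from a repeated measures (paired) t statistic.

    Assumes the standard relation d = t * sqrt(2 * (1 - r) / n), where
    r is the correlation between the paired measures and n is the
    number of pairs. Dividing out the correlation-based shrinkage of
    the error term avoids the overestimation described above.
    """
    return t_paired * math.sqrt(2.0 * (1.0 - r) / n)

# Example (hypothetical numbers): t = 3.0 from n = 20 pairs, r = .75.
# Treating t as if it came from independent groups would give
# d = 3.0 * sqrt(2 / 20) ≈ 0.95; the corrected value is much smaller.
print(round(d_from_paired_t(3.0, 0.75, 20), 3))  # ≈ 0.474
```

The larger the correlation r, the larger the naive overestimate, which is exactly the pattern the abstract warns about.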
The concomitant proliferation of causal modeling and hypotheses of multiplicative effects has brought about a tremendous need for procedures that allow the testing of moderated structural equation models (MSEMs). As the social sciences have developed, the complexity of hypothesized relationships has increased steadily (Cortina, 1993). Two of the more obvious indicators of this complexity are the increasing frequency of hypotheses involving multiplicative effects (e.g., linear interaction effects, nonlinear effects) and the popularity of structural equation modeling (SEM). In spite of the preponderance of both multiplicative effects and structural equation models, there is considerable confusion about the appropriate methods for combining the two. In other words, there is confusion with respect to the manner in which multiplicative effects should be incorporated into covariance structure models (Hayduk, 1987; Mathieu, Tannenbaum, & Salas, 1992; Ping, 1995). Strangely, this confusion is not due to a lack of methodology. There are a variety of techniques available for testing structural equation models with multiplicative terms (moderated structural equation models [MSEMs]), each with its own strengths and weaknesses. Nevertheless, most of these techniques are unknown outside mathematical statistics.
The authors present guidelines for establishing a useful range for interrater agreement and a cutoff for acceptable interrater agreement when using Burke, Finkelstein, and Dusig’s average deviation (AD) index as well as critical values for tests of statistical significance with the AD index. Under the assumption that judges respond randomly to an item or set of items in a measure, the authors show that a criterion for acceptable interrater agreement or practical significance when using the AD index can be approximated as c/6, where c is the number of response options for a Likert-type item. The resulting values of 0.8, 1.2, 1.5, and 1.8 are discussed as standards for acceptable interrater agreement when using the AD index with 5-, 7-, 9-, and 11-point items, respectively. Using similar logic, the AD agreement index and interpretive standard are generalized to the case of a response scale that involves percentages or proportions, rather than discrete categories, or at the other extreme, the assessment of interrater agreement with respect to the rating of a single target on a dichotomous item (e.g., yes-no, agree-disagree, true-false item formats). Finally, the usefulness of these guidelines for judging acceptable levels of interrater agreement with respect to the metric (or units) of the original response scale is discussed.
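To make the criterion concrete, here is a minimal sketch (ours, not the authors' code) of the AD index for a single item together with the c/6 cutoff. The exact cutoffs are 0.83, 1.17, 1.50, and 1.83 for 5-, 7-, 9-, and 11-point items, which the abstract rounds to 0.8, 1.2, 1.5, and 1.8.

```python
from statistics import mean

def ad_index(ratings):
    """Average deviation (AD) index for one item: the mean absolute
    deviation of the judges' ratings from the item mean (the mean-based
    variant of Burke, Finkelstein, and Dusig's index)."""
    m = mean(ratings)
    return mean(abs(x - m) for x in ratings)

def ad_cutoff(c):
    """Approximate criterion for acceptable agreement: c / 6, where c
    is the number of response options for a Likert-type item."""
    return c / 6.0

ratings = [4, 5, 4, 3, 4, 5]  # six hypothetical judges, one 5-point item
print(round(ad_index(ratings), 2))        # 0.56
print(ad_index(ratings) <= ad_cutoff(5))  # True: acceptable agreement
```

Because the AD index is expressed in the units of the original response scale, the observed value (here 0.56 scale points) can be compared directly against the c/6 criterion.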
The authors demonstrated that the most common statistical significance test used with r(WG)-type interrater agreement indexes in applied psychology, based on the chi-square distribution, is flawed and inaccurate. The chi-square test is shown to be extremely conservative even for modest, standard significance levels (e.g., .05). The authors present an alternative statistical significance test, based on Monte Carlo procedures, that produces the equivalent of an approximate randomization test for the null hypothesis that the actual distribution of responding is rectangular and demonstrate its superiority to the chi-square test. Finally, the authors provide tables of critical values and offer downloadable software to implement the approximate randomization test for r(WG)-type and for average deviation (AD)-type interrater agreement indexes. The implications of these results for studying a broad range of interrater agreement problems in applied psychology are discussed.
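The approximate randomization logic can be sketched as follows, here for the AD index: simulate many groups of k judges responding uniformly at random over the c response options (the rectangular null) and take the lower alpha-quantile of the simulated index values as the critical value. This is our own reconstruction of the general procedure, not the authors' downloadable software; the helper names and defaults are assumptions.

```python
import random
from statistics import mean

def ad_index(ratings):
    """Mean absolute deviation of ratings from the item mean."""
    m = mean(ratings)
    return mean(abs(x - m) for x in ratings)

def ad_critical_value(k, c, alpha=0.05, reps=100_000, seed=1):
    """Monte Carlo critical value for AD under the rectangular null:
    k judges each pick one of c options uniformly at random. Smaller
    AD means closer agreement, so the critical value is the lower
    alpha-quantile; an observed AD at or below it is significant."""
    rng = random.Random(seed)
    null = sorted(
        ad_index([rng.randint(1, c) for _ in range(k)])
        for _ in range(reps)
    )
    return null[int(alpha * reps)]

# Example: 10 judges rating one 5-point item at alpha = .05.
crit = ad_critical_value(k=10, c=5)
observed = ad_index([4, 4, 5, 4, 3, 4, 4, 5, 4, 4])  # hypothetical data
print(round(crit, 3), round(observed, 3), observed <= crit)
```

The same scheme yields critical values for r(WG)-type indexes by swapping in that index's formula (with the quantile taken from the appropriate tail, since larger r(WG) indicates closer agreement).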
Experiments that find larger differences between groups than actually exist in the population are more likely to pass stringent tests of significance and be published than experiments that find smaller differences. Published measures of the magnitude of experimental effects will therefore tend to overestimate these effects. This bias was investigated as a function of sample size, actual population difference, and alpha level. The overestimation of experimental effects was found to be quite large with the commonly employed significance levels of 5 per cent and 1 per cent. Further, the recently recommended measure, ω², was found to depend much more heavily on the alpha level employed than on the true population ω² value. Hence, it was concluded that effect size estimation is impractical unless scientific journals drop the consideration of statistical significance as one of the criteria of publication.
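The bias described here is straightforward to reproduce by simulation. The sketch below "publishes" only those two-group experiments whose one-tailed z test reaches the criterion and reports the mean published effect estimate; the known-variance z test is a simplification of the original study's setup, and all names and parameter values are ours.

```python
import random
import statistics

def mean_published_effect(true_d, n, z_crit, reps=20_000, seed=1):
    """Simulate two-group experiments (n per group, unit variances,
    true standardized difference true_d), keep only those whose
    one-tailed z test exceeds z_crit, and return the mean published
    estimate of the mean difference."""
    rng = random.Random(seed)
    published = []
    for _ in range(reps):
        g1 = [rng.gauss(true_d, 1.0) for _ in range(n)]
        g2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
        d_hat = statistics.mean(g1) - statistics.mean(g2)
        z = d_hat / (2.0 / n) ** 0.5  # SE of a difference of two means
        if z > z_crit:                # significance filter = "publication"
            published.append(d_hat)
    return statistics.mean(published)

# True effect 0.2 with n = 20 per group. One-tailed critical values
# 1.645 and 2.326 correspond to the 5 and 1 per cent levels: the
# stricter the criterion, the larger the inflation of published effects.
print(round(mean_published_effect(0.2, 20, 1.645), 2))  # well above 0.2
print(round(mean_published_effect(0.2, 20, 2.326), 2))  # larger still
```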
Although simulator sickness is known to increase with protracted exposure and to diminish with repeated sessions, limited systematic research has been performed in these areas. This study reviewed the few studies with sufficient information available to determine the effect that exposure duration and repeated exposure have on motion sickness. This evaluation confirmed that longer exposures produce more symptoms and that total sickness subsides over repeated exposures. Additional evaluation was performed to investigate the precise form of this relationship and to determine whether the same form was generalizable across varied simulator environments. The results indicated that exposure duration and repeated exposures are significantly linearly related to sickness outcomes (duration being positively related and repetition negatively related to total sickness). This was true over diverse systems and large subject pools. This result verified the generalizability of the relationships among sickness, exposure duration, and repeated exposures. Additional research is indicated to determine the optimal length of a single exposure and the optimal intersession interval to facilitate adaptation.
There has been much recent attention given to the problems involved with the traditional approach to null hypothesis significance testing (NHST). Many have suggested that, perhaps, NHST should be abandoned altogether in favor of other bases for conclusions such as confidence intervals and effect size estimates (e.g., Schmidt, 1996). The purposes of this article are to (a) review the function that data analysis is supposed to serve in the social sciences, (b) examine the ways in which these functions are performed by NHST, (c) examine the case against NHST, and (d) evaluate interval-based estimation as an alternative to NHST. The topic of this article is null hypothesis significance testing (NHST; Cohen, 1994). By this we mean the process, common to the behavioral sciences, of rejecting or suspending judgment on a given null hypothesis based on a priori theoretical considerations and p values in an attempt to draw conclusions with respect to an alternative hypothesis. We should begin by saying that we agree with J. Cohen, G. Gigerenzer, D. Bakan, W. Rozeboom, and so on with respect to the notion that the logic of NHST is widely misunderstood and that the conclusions drawn from such tests are often unfounded or at least exaggerated.