We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows smaller improvements on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding. The first challenge, safety, involves ensuring that the model's responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of human values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.
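The candidate-filtering step described in this abstract (sample several responses, discard those a fine-tuned safety classifier rejects, then rank the survivors) can be sketched as follows. This is a minimal illustration, not LaMDA's actual pipeline: the scoring functions here are toy stand-ins, and all names (`safety_score`, `quality_score`, `pick_response`) are hypothetical.

```python
# Sketch of safety-classifier filtering of candidate responses.
# The scorers below are toy placeholders; in the paper's setting they
# would be classifier heads fine-tuned on crowdworker annotations.

SAFETY_THRESHOLD = 0.8

def safety_score(response: str) -> float:
    # Toy stand-in: flag a blocklisted token. A real system would run
    # a learned safety classifier over the candidate response.
    return 0.1 if "unsafe" in response else 0.95

def quality_score(response: str) -> float:
    # Toy stand-in: prefer longer responses as a proxy for quality.
    return len(response) / 100.0

def pick_response(candidates):
    # Filter out candidates the safety classifier rejects, then return
    # the highest-quality survivor (or None if nothing passes).
    safe = [c for c in candidates if safety_score(c) >= SAFETY_THRESHOLD]
    return max(safe, key=quality_score) if safe else None

best = pick_response(
    ["an unsafe suggestion", "a short reply", "a longer helpful reply"]
)
assert best == "a longer helpful reply"
```

The design point is that safety acts as a hard filter before quality ranking, so a high-quality but unsafe candidate can never win.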
Expert reviews are frequently used as a questionnaire evaluation method but have received little empirical attention. Questions from two surveys are evaluated by six expert reviewers using a standardized evaluation form. Each of the questions has validation data available from records. Large inconsistencies in ratings across the six experts are found. Despite the lack of reliability, the average expert ratings successfully identify questions that had higher item nonresponse rates and higher levels of inaccurate reporting. This article provides empirical evidence that experts are able to discern questions that manifest data quality problems, even if individual experts vary in what they rate as being problematic. Compared to a publicly available computerized question evaluation tool, ratings by the human experts positively predict questions with data quality problems, whereas the computerized tool varies in success in identifying these questions. These results indicate that expert reviews have value in identifying question problems that result in lower survey data quality.
A common hypothesis about practices to reduce survey nonresponse is that those persons brought into the respondent pool through persuasive efforts may provide data filled with measurement error. Two questions flow from this hypothesis. First, does the mean square error of a statistic increase when sample persons who are less likely to be contacted or cooperate are incorporated into the respondent pool? Second, do nonresponse bias estimates made on the respondents, using survey reports instead of records, provide accurate information about nonresponse bias? Using a unique data set, the Wisconsin Divorce Study, with divorce records as the frame and questions about the frame information included in the questionnaire, this article takes a first look into these two issues. We find that the relationship between nonresponse bias, measurement error bias, and response propensity is statistic-specific and specific to the type of nonresponse. Total bias tends to be lower on estimates calculated using all respondents, compared with those with only the highest contact and cooperation propensities, and nonresponse bias analyses based on respondents yield conclusions similar to those based on records. Finally, we find that error properties of statistics may differ from error properties of the individual variables used to calculate the statistics.
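The nonresponse bias that this abstract estimates against divorce records follows a standard deterministic identity: the bias of the respondent mean equals the nonresponse rate times the difference between respondent and nonrespondent means. A minimal sketch, with illustrative numbers rather than figures from the Wisconsin Divorce Study:

```python
# Nonresponse bias of a respondent mean:
#   bias(y_r) = (m/n) * (y_r - y_m)
# where m/n is the nonresponse rate, y_r the respondent mean, and
# y_m the nonrespondent mean (known here only because records exist).

def nonresponse_bias(resp_mean, nonresp_mean, nonresp_rate):
    return nonresp_rate * (resp_mean - nonresp_mean)

# Illustrative values: respondents average 5.2, nonrespondents 4.0,
# with 30% nonresponse.
bias = nonresponse_bias(5.2, 4.0, 0.30)

# Sanity check: the bias equals respondent mean minus full-sample mean.
full_mean = 0.70 * 5.2 + 0.30 * 4.0
assert abs((5.2 - full_mean) - bias) < 1e-12
```

The identity also makes the abstract's point concrete: the bias of a *statistic* depends on how respondent and nonrespondent distributions differ for that statistic, so different estimates from the same survey can carry very different nonresponse biases.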
Non-response weighting is a commonly used method to adjust for bias due to unit nonresponse in surveys. Theory and simulations show that, to reduce bias effectively without increasing variance, a covariate that is used for non-response weighting adjustment needs to be highly associated with both the response indicator and the survey outcome variable. In practice, these requirements pose a challenge that is often overlooked, because those covariates are often not observed or may not exist. Surveys have recently begun to collect supplementary data, such as interviewer observations and other proxy measures of key survey outcome variables. To the extent that these auxiliary variables are highly correlated with the actual outcomes, these variables are promising candidates for non-response adjustment. In the present study, we examine traditional covariates and new auxiliary variables for the National Survey of Family Growth, the Medical Expenditure Panel Survey, the American National Election Survey, the European Social Surveys, and the University of Michigan Transportation Research Institute survey. We provide empirical estimates of the association between proxy measures and response to the survey request as well as the actual survey outcome variables. We also compare unweighted and weighted estimates under various non-response models. Our results from multiple surveys with multiple recruitment protocols from multiple organizations on multiple topics show the difficulty of finding suitable covariates for non-response adjustment and the need to improve the quality of auxiliary data. (Kreuter et al., Journal of the Royal Statistical Society: Series A 173, 2010.)
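The adjustment this abstract evaluates can be illustrated with the simplest propensity-based scheme, a weighting-class adjustment: bin a covariate, estimate the response rate per bin, and weight respondents by the inverse of that rate. A minimal simulation sketch under assumed toy data (the variable names and the data-generating process are illustrative, not from any of the surveys named above):

```python
# Weighting-class nonresponse adjustment on simulated data. The covariate x
# drives both response propensity and the outcome y, which is exactly the
# condition the abstract says an adjustment covariate must satisfy.
import random

random.seed(0)
n = 10_000
x = [random.random() for _ in range(n)]
y = [2.0 * xi + random.gauss(0, 0.1) for xi in x]        # survey outcome
respond = [random.random() < 0.2 + 0.6 * xi for xi in x]  # response indicator

# Estimate response propensity within 5 equal-width bins of x.
BINS = 5
def bin_of(xi):
    return min(int(xi * BINS), BINS - 1)

resp_count, tot_count = [0] * BINS, [0] * BINS
for xi, ri in zip(x, respond):
    b = bin_of(xi)
    tot_count[b] += 1
    resp_count[b] += ri
phat = [resp_count[b] / tot_count[b] for b in range(BINS)]

# Unweighted respondent mean vs. inverse-propensity-weighted mean.
resp_y = [yi for yi, ri in zip(y, respond) if ri]
weights = [1.0 / phat[bin_of(xi)] for xi, ri in zip(x, respond) if ri]
unweighted = sum(resp_y) / len(resp_y)
weighted = sum(w * yi for w, yi in zip(weights, resp_y)) / sum(weights)
true_mean = sum(y) / n

# Respondents are selected toward high x (hence high y), so the
# unweighted mean is biased upward; weighting pulls it back.
assert abs(weighted - true_mean) < abs(unweighted - true_mean)
```

If `x` were only weakly related to `y` or to `respond`, the weights would add variance without removing bias, which is the practical difficulty the abstract reports.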
Traditional statistical analyses of interviewer effects on survey data do not examine whether these effects change over a field period. However, the nature of the survey interview is dynamic. Interviewers' behaviors and perceptions may evolve as they gain experience, thus potentially affecting data quality. This paper looks at how interview length and interviewer evaluations of respondents change over interviewers' workloads. Multilevel models with random interviewer effects are used to account for the clustering of cases within interviewers and individual interviewer characteristics in the 1984, 1988, and 2000 National Election Studies. The 1984 and 1988 NES released sample in four replicates, minimizing the confound between order in an interviewer's workload and sample composition. We find that over the course of the studies, both measures change significantly. Interviewers' prior survey experience was also significantly negatively related to the length of the interview. These findings have implications for interviewer training prior to and during studies, as well as suggesting future research to reveal why these behaviors and perceptions change.
Survey research has long grappled with the concept of survey mode preference: the idea that a respondent may prefer to participate in one survey mode over another. This article experimentally examines the effect of mode preference on response, contact, and cooperation rates; mode choice; and data collection efficiency. Respondents to a 2008 telephone survey (n = 1,811; AAPOR RR3 = 38 percent) were asked their mode preference for future survey participation. These respondents were subsequently followed up in 2009 with two independent survey requests. The first follow-up survey request was another telephone survey (n = 548; AAPOR RR2 = 55.5 percent). In the second follow-up survey (n = 565; AAPOR RR2 = 46.0 percent), respondents were randomly assigned to one of four mode treatments: Web only, mail only, Web followed by mail, and mail followed by Web. We find that mode preference predicts participation in Web and phone modes, cooperation in phone mode (where contact and cooperation can be disentangled), and the selection of a mode when given the option of two modes. We find weak and mixed evidence about the relationship between mode preference and reduction of field effort. We discuss the important implications these findings have for mixed mode surveys.
Kish's (1962) classical intra-interviewer correlation (ρ_int) provides survey researchers with an estimate of the effect of interviewers on variation in measurements of a survey variable of interest. This correlation is an undesirable product of the data collection process that can arise when answers from respondents interviewed by the same interviewer are more similar to each other than answers from other respondents, decreasing the precision of survey estimates. Estimation of this parameter, however, uses only respondent data. The potential contribution of variance in nonresponse errors between interviewers to the estimation of ρ_int has been largely ignored. Responses within interviewers may appear correlated because the interviewers successfully obtain cooperation from different pools of respondents, not because of systematic response deviations. This study takes a first step in filling this gap in the literature on interviewer effects by analyzing a unique survey data set, collected using computer-assisted telephone interviewing (CATI) from a sample of divorce records. This data set, which includes both true values and reported values for respondents and a CATI sample assignment that approximates an interpenetrated design.
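The quantity at the center of this abstract, ρ_int = σ²_b / (σ²_b + σ²_w), is the share of total response variance attributable to between-interviewer variation, and it is conventionally estimated from a one-way ANOVA of answers clustered by interviewer. A minimal sketch with illustrative data (the helper name `rho_int` and the toy answer lists are assumptions, not the paper's data):

```python
# One-way ANOVA estimator of Kish's intra-interviewer correlation.
# groups: list of lists, one inner list of answers per interviewer.

def rho_int(groups):
    k = len(groups)                          # number of interviewers
    n = sum(len(g) for g in groups)          # total respondents
    n0 = n / k                               # average workload (balanced case)
    grand = sum(sum(g) for g in groups) / n

    # Between- and within-interviewer mean squares.
    msb = sum(len(g) * (sum(g) / len(g) - grand) ** 2
              for g in groups) / (k - 1)
    msw = sum((v - sum(g) / len(g)) ** 2
              for g in groups for v in g) / (n - k)

    # Between-interviewer variance component (truncated at zero).
    s2b = max((msb - msw) / n0, 0.0)
    return s2b / (s2b + msw)

# Two interviewers with systematically different answer levels -> high rho.
high = rho_int([[5, 5, 6, 5], [1, 2, 1, 1]])
# Two interviewers with indistinguishable answers -> rho near zero.
low = rho_int([[3, 4, 3, 4], [4, 3, 4, 3]])
assert high > 0.8 and low < 0.1
```

The abstract's point is that this estimator cannot tell the two stories apart: answers can cluster within interviewers because interviewers influence *responses*, or because they recruit different *respondent pools*; ρ_int computed from respondent data alone conflates the two.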