We examine the problem of generating definite noun phrases that are appropriate referring expressions; that is, noun phrases that (a) successfully identify the intended referent to the hearer whilst (b) not conveying to him or her any false conversational implicatures (Grice, 1975). We review several possible computational interpretations of the conversational implicature maxims, with different computational costs, and argue that the simplest may be the best, because it seems to be closest to what human speakers do. We describe our recommended algorithm in detail, along with a specification of the resources a host system must provide in order to make use of the algorithm, and an implementation used in the natural language generation component of the IDAS system.
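The algorithm referred to here is incremental: attributes are considered in a fixed preference order and included whenever they rule out at least one distractor, stopping once the referent is uniquely identified. The following is a minimal sketch of that idea, assuming a toy attribute-value representation of entities and an invented preference order; it is an illustration of the general technique, not the IDAS implementation.

```python
# Minimal sketch of an incremental referring-expression algorithm.
# Entities are dicts of attribute -> value; PREFERENCE_ORDER is a
# hypothetical, domain-specific ordering assumed for this example.

PREFERENCE_ORDER = ["type", "colour", "size"]  # assumed, domain-specific

def generate_description(referent, distractors):
    """Return attribute-value pairs that distinguish the referent
    from all distractors, or None if no distinguishing set exists."""
    description = []
    remaining = list(distractors)
    for attr in PREFERENCE_ORDER:
        value = referent.get(attr)
        if value is None:
            continue
        ruled_out = [d for d in remaining if d.get(attr) != value]
        # Include the attribute if it rules out at least one distractor;
        # the head-noun 'type' attribute is always included.
        if ruled_out or attr == "type":
            description.append((attr, value))
            remaining = [d for d in remaining if d.get(attr) == value]
        if not remaining:
            return description
    return description if not remaining else None

# Example: pick out dog1 ("the small dog") among a larger dog and a cat.
dog1 = {"type": "dog", "colour": "black", "size": "small"}
dog2 = {"type": "dog", "colour": "black", "size": "large"}
cat1 = {"type": "cat", "colour": "black", "size": "small"}
print(generate_description(dog1, [dog2, cat1]))
# [('type', 'dog'), ('size', 'small')]
```

Note that colour is skipped even though it appears earlier in the preference order, because it rules out no distractors; this is the mechanism that avoids conveying false implicatures through over-description.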
In this article, we give an overview of Natural Language Generation (NLG) from an applied system-building perspective. The article includes a discussion of when NLG techniques should be used; suggestions for carrying out requirements analyses; and a description of the basic NLG tasks of content determination, discourse planning, sentence aggregation, lexicalization, referring expression generation, and linguistic realisation. Throughout, the emphasis is on established techniques that can be used to build simple but practical working systems now. We also provide pointers to techniques in the literature that are appropriate for more complicated scenarios.
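To make the task breakdown concrete, the toy sketch below runs a one-day weather record through the pipeline stages named above (referring-expression generation is elided for brevity). Every rule, threshold, and field name is invented for illustration; real systems implement each stage with domain-specific knowledge or statistical models.

```python
# Toy end-to-end illustration of the classic NLG pipeline.

def content_determination(data):
    """Decide which facts are worth reporting (here: simple thresholds)."""
    messages = []
    if data["rain_mm"] > 0:
        messages.append(("rain", data["rain_mm"]))
    messages.append(("temp", data["max_temp_c"]))
    return messages

def discourse_planning(messages):
    """Order messages: temperature first, then precipitation."""
    order = {"temp": 0, "rain": 1}
    return sorted(messages, key=lambda m: order[m[0]])

def sentence_aggregation(messages):
    """Pack all messages into a single sentence plan (a list of clauses)."""
    return [messages]

def lexicalize(message):
    """Map a message to words."""
    kind, value = message
    if kind == "temp":
        return f"a high of {value} degrees"
    return f"{value} mm of rain"

def realise(sentence_plan):
    """Linearize clauses into a grammatical sentence."""
    clauses = [lexicalize(m) for m in sentence_plan]
    return "Expect " + " and ".join(clauses) + "."

def generate(data):
    messages = content_determination(data)
    plan = discourse_planning(messages)
    sentences = sentence_aggregation(plan)
    return " ".join(realise(s) for s in sentences)

print(generate({"max_temp_c": 21, "rain_mm": 3}))
# Expect a high of 21 degrees and 3 mm of rain.
```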
The BLEU metric has been widely used in NLP for over 15 years to evaluate NLP systems, especially in machine translation and natural language generation. I present a structured review of the evidence on whether BLEU is a valid evaluation technique—in other words, whether BLEU scores correlate with real-world utility and user-satisfaction of NLP systems; this review covers 284 correlations reported in 34 papers. Overall, the evidence supports using BLEU for diagnostic evaluation of MT systems (which is what it was originally proposed for), but does not support using BLEU outside of MT, for evaluation of individual texts, or for scientific hypothesis testing.
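For readers unfamiliar with the metric under review, the sketch below computes a simplified single-reference BLEU score (modified n-gram precision with a brevity penalty, following Papineni et al., 2002). The smoothing constant is an assumption added to keep the toy version defined when higher-order n-grams fail to match; standard tooling such as sacreBLEU should be used in practice.

```python
# Simplified single-reference BLEU: modified n-gram precision
# combined with a brevity penalty. For illustration only.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # Clipped counts: a candidate n-gram is credited at most as
        # often as it appears in the reference.
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        # Assumed smoothing constant to avoid log(0).
        log_precisions.append(math.log(max(clipped, 0.1) / total))
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return brevity * math.exp(sum(log_precisions) / max_n)

print(round(bleu("the cat sat on the mat", "the cat is on the mat"), 3))
```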
Effective presentation of data for decision support is a major issue when large volumes of data are generated, as happens in the Intensive Care Unit (ICU). Although the most common approach is to present the data graphically, it has been shown that textual summarisation can lead to improved decision making. As part of the BabyTalk project, we present a prototype, called BT-45, which generates textual summaries of about 45 minutes of continuous physiological signals and discrete events (e.g., equipment settings and drug administration). Its architecture brings together techniques from the different areas of signal processing, medical reasoning, knowledge engineering, and natural language generation. A clinical off-ward experiment in a Neonatal ICU (NICU) showed that human expert textual descriptions of NICU data lead to better decision making than classical graphical visualisation, whereas texts generated by BT-45 lead to similar quality decision making as visualisations. Textual analysis showed that BT-45 texts were inferior to human expert texts in a number of ways, including not reporting temporal information as well and not producing good narratives. Despite these deficiencies, our work shows that it is possible for computer systems to generate effective textual summaries of complex continuous and discrete temporal clinical data.
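As a rough illustration of the kind of processing involved, the sketch below abstracts a toy physiological time series into trend events and verbalizes them alongside discrete clinical events. All thresholds, field names, and wording are invented for the example; the actual BT-45 architecture is far richer, combining signal analysis, medical ontologies, and full NLG.

```python
# Toy data-to-text sketch: abstract a heart-rate series into events,
# then verbalize them together with discrete clinical events.

def detect_trends(times, heart_rates, threshold=20):
    """Flag intervals where heart rate changes sharply between samples."""
    events = []
    for t0, t1, h0, h1 in zip(times, times[1:], heart_rates, heart_rates[1:]):
        if abs(h1 - h0) >= threshold:
            direction = "rose" if h1 > h0 else "fell"
            events.append((t0, f"HR {direction} from {h0} to {h1} bpm"))
    return events

def summarise(signal_events, discrete_events):
    """Merge signal-derived and discrete events by time and verbalize."""
    merged = sorted(signal_events + discrete_events)
    return " ".join(f"At {t}, {desc}." for t, desc in merged)

times = ["10:00", "10:15", "10:30", "10:45"]
hr = [145, 150, 178, 160]
discrete = [("10:20", "morphine was administered")]
print(summarise(detect_trends(times, hr), discrete))
# At 10:15, HR rose from 150 to 178 bpm. At 10:20, morphine was administered.
```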
There is growing interest in using automatically computed corpus-based evaluation metrics to evaluate Natural Language Generation (NLG) systems, because these are often considerably cheaper than the human-based evaluations which have traditionally been used in NLG. We review previous work on NLG evaluation and on validation of automatic metrics in NLP, and then present the results of two studies of how well some metrics which are popular in other areas of NLP (notably BLEU and ROUGE) correlate with human judgments in the domain of computer-generated weather forecasts. Our results suggest that, at least in this domain, metrics may provide a useful measure of language quality, although the evidence for this is not as strong as we would ideally like to see; however, they do not provide a useful measure of content quality. We also discuss a number of caveats which must be kept in mind when interpreting this and other validation studies.
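The validation methodology described here reduces to correlating metric scores with human judgments over the same system outputs. A minimal sketch, assuming fabricated toy scores and SciPy's standard correlation functions:

```python
# Correlate an automatic metric's scores with human judgments.
# The scores below are fabricated toy numbers for illustration.

from scipy.stats import pearsonr, spearmanr

metric_scores = [0.41, 0.55, 0.38, 0.62, 0.47]   # e.g. BLEU per system
human_scores  = [3.1, 3.8, 2.9, 4.2, 3.0]        # e.g. mean quality rating

r, p = pearsonr(metric_scores, human_scores)
rho, p_rho = spearmanr(metric_scores, human_scores)
print(f"Pearson r = {r:.2f} (p = {p:.3f}); Spearman rho = {rho:.2f}")
```

A high correlation on such data is evidence that the metric tracks the human judgments; validation studies of this kind underlie the conclusions reported above.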