Crowdsourcing the acquisition of natural language corpora: Methods and observations

Wang, William Yang; Bohus, Dan; Kamar, Ece; Horvitz, Eric

doi:10.1109/slt.2012.6424200

Cited by 39 publications

(44 citation statements)

References 8 publications

“…Such a workload distribution was previously described in (Wang et al, 2012) as appropriate for a between-subject design. Each batch corresponded to one of two conditions: the first batch contained only textual/logical MRs, and the second one used only pictorial MRs.…”

Section: Results: Collected Data (mentioning)

confidence: 99%

“…For example, (Zaidan and Callison-Burch, 2011) showed that crowdsourcing can result in datasets of comparable quality to those created by professional translators given appropriate quality control methods. (Mairesse et al, 2010) demonstrate that crowd workers can produce NL descriptions from abstract MRs, a method which also has shown success in related NLP tasks, such as Spoken Dialogue Systems (Wang et al, 2012) or Semantic Parsing. However, when collecting corpora for training NLG systems, new challenges arise: (1) How to ensure the required high quality of the collected data?…”

Section: Introduction (mentioning)

confidence: 99%

“…Williams and Young, 2007) for evaluating spoken dialogue systems. We compare these pictorial MRs to text-based MRs used by previous crowd-sourcing work (Mairesse et al, 2010; Wang et al, 2012). These text-based MRs take the form of Dialogue Acts, such as inform(type [hotel],pricerange[expensive]).…”

Section: Introduction (mentioning)

confidence: 99%

Proceedings of the 9th International Natural Language Generation conference

Kok

2016
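
The dialogue-act notation quoted above, inform(type [hotel],pricerange[expensive]), can be read as an act name followed by slot[value] pairs. The following is a minimal sketch of how such a string might be parsed, assuming exactly that surface form; the function name `parse_dialogue_act` is hypothetical and is not taken from any of the cited papers.

```python
import re


def parse_dialogue_act(mr: str):
    """Split 'act(slot[value],slot[value])' into (act, {slot: value}).

    Assumption: the MR follows the surface form quoted in the citation
    statement; real corpora may use other conventions.
    """
    match = re.fullmatch(r"\s*(\w+)\s*\((.*)\)\s*", mr)
    if match is None:
        raise ValueError(f"not a dialogue act: {mr!r}")
    act, body = match.group(1), match.group(2)
    # Each slot is a word, optional whitespace, then a bracketed value.
    slots = {
        slot: value.strip()
        for slot, value in re.findall(r"(\w+)\s*\[([^\]]*)\]", body)
    }
    return act, slots


if __name__ == "__main__":
    print(parse_dialogue_act("inform(type [hotel],pricerange[expensive])"))
```

A structured form like this is what a crowd worker would be asked to verbalise when text-based MRs are used as prompts.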

“…Crowdsourcing services such as Amazon Mechanical Turk have been used to collect paraphrase sets that serve as NLP benchmarks [8,9,24]. Essentially, workers worldwide are paid tiny amounts to paraphrase individual example sentences or concepts.…”

Section: Template Set Amplification (mentioning)

confidence: 99%

NLify

Han

Philipose

2013

Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing

This paper presents the design and implementation of a programming system that enables third-party developers to add spoken natural language (SNL) interfaces to standalone mobile applications. The central challenge is to create statistical recognition models that are accurate and resource-efficient in the face of the variety of natural language, while requiring little specialized knowledge from developers. We show that given a few examples from the developer, it is possible to elicit comprehensive sets of paraphrases of the examples using internet crowds. The exhaustive nature of these paraphrases allows us to use relatively simple, automatically derived statistical models for speech and language understanding that perform well without per-application tuning. We have realized our design fully as an extension to the Visual Studio IDE. Based on a new benchmark dataset with 3500 spoken instances of 27 commands from 20 subjects and a small developer study, we establish the promise of our approach and the impact of various design choices.

“…Crowdsourcing is a very popular method for various natural language and speech processing tasks [9,10,11]. Examples include sentence translation from one language to another or gathering annotations on bilingual lexical entries [12,13], as well as paraphrasing applications [14,15].…”

Section: Introduction (mentioning)

confidence: 99%

Spoken dialogue grammar induction from crowdsourced data

Palogiannidi

Klasinas

Potamianos

et al. 2014

2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

We design and evaluate various crowdsourcing tasks for eliciting spoken dialogue data. Task design is based on an array of parameters that quantify the basic characteristics of the elicitation questions, e.g., how open-ended is a question. The crowdsourced data are used for and evaluated on the unsupervised induction of semantic classes for speech understanding grammars. We show that grammar induction performance is significantly affected by the crowdsourcing task parameters, e.g., paraphrasing tasks prime high lexical entrainment and result in poor corpus/grammar quality. The task parameters along with perplexity filters are used for corpus selection achieving grammar induction performance that is comparable to that of using in-domain spoken dialogue data.
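
The abstract above mentions perplexity filters for corpus selection without giving details. As a rough illustration of the general idea only, the sketch below scores candidate sentences with an add-one-smoothed unigram model trained on in-domain text and keeps those under a threshold. The unigram model, the smoothing, the threshold, and all function names are assumptions made for this example, not the authors' method.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count lowercase whitespace tokens in the in-domain sentences."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    return counts, sum(counts.values())


def perplexity(sentence, counts, total):
    """Per-token perplexity under an add-one-smoothed unigram model."""
    tokens = sentence.lower().split()
    if not tokens:
        return float("inf")
    vocab = len(counts) + 1  # one extra type reserved for unseen words
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_prob / len(tokens))


def perplexity_filter(candidates, in_domain, threshold):
    """Keep candidates whose perplexity does not exceed the threshold."""
    counts, total = train_unigram(in_domain)
    return [s for s in candidates if perplexity(s, counts, total) <= threshold]


if __name__ == "__main__":
    in_domain = ["i want a cheap hotel", "book a hotel room"]
    crowd = ["i want a hotel", "quantum chromodynamics lecture"]
    print(perplexity_filter(crowd, in_domain, threshold=10.0))
```

In practice a stronger language model and a tuned threshold would be needed; the sketch only shows where the filter sits in the selection step.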
