Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1425

Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task

Abstract: We present Spider, a large-scale, complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 college students. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables, covering 138 different domains. We define a new complex and cross-domain semantic parsing and text-to-SQL task where different complex SQL queries and databases appear in train and test sets. In this way, the task requires the model to generalize well to both new SQL queries…
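
To make the dataset and task concrete, here is a minimal sketch (not the official Spider tooling) that pairs a natural-language question with a complex SQL query and runs it against a toy multi-table SQLite database. The record fields ("db_id", "question", "query"), the "concert_singer" database name, and the table contents are illustrative assumptions, not taken from the released data.

import sqlite3

# Illustrative Spider-style example: one question annotated with one SQL query
# over a database identified by "db_id". Field names and values are assumptions.
example = {
    "db_id": "concert_singer",
    "question": "How many singers are there in each country?",
    "query": "SELECT country, COUNT(*) FROM singer GROUP BY country",
}

# Tiny in-memory stand-in for a multi-table database (only one table shown).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
INSERT INTO singer VALUES (1, 'A', 'France'), (2, 'B', 'France'), (3, 'C', 'Japan');
""")

# Executing the annotated query is the basis for execution-based comparison
# of a model's predicted SQL against the gold SQL.
print(conn.execute(example["query"]).fetchall())  # e.g. [('France', 2), ('Japan', 1)]

Because the cross-domain split places entire databases in either the train or the test set, a model only ever sees a schema like singer(singer_id, name, country) at test time without having trained on it.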

Cited by 440 publications (562 citation statements); references 35 publications.
“…Recently, Yu et al (2018b) released a manually labelled dataset for parsing natural language questions into complex SQL, which facilitates related research. Yu et al (2018b)'s dataset is exclusive for English questions. Intuitively, the same semantic parsing task can be applied cross-lingual, since SQL is a universal semantic representation and database interface.…”
Section: Introduction (mentioning)
confidence: 99%
“…We investigate parsing Chinese questions to SQL by creating a first dataset, and empirically evaluating a strong baseline model on the dataset. In particular, we translate the Spider (Yu et al, 2018b) dataset into Chinese. Using the model of Yu et al (2018a), we compare several key model configurations.…”
Section: Introduction (mentioning)
confidence: 99%
“…(1) using more intelligent interaction designs (e.g., free-form text as user feedback) to speed up the hypothesis space searching globally, (2) strengthening the world model to nail down a smaller set of plausible hypotheses based on both the initial question and user feedback, and (3) training the agent to learn to improve the parsing accuracy while minimizing the number of required human interventions over time. Table 8 shows the extended lexicon entries and grammar rules in NLG for applying our MISP-SQL agent to generate more complex SQL queries, such as those on Spider (Yu et al, 2018c). In this dataset, a SQL query can associate with multiple tables.…”
Section: Discussion (mentioning)
confidence: 99%
“…In MISP-SQL, we consider four syntactic categories: AGG for aggregators, OP for operators, COL for columns and Q for generated questions. However, it can be extended with more lexicon entries and grammar rules to accommodate more complex SQL in Spider (Yu et al, 2018c), which we show in Appendix A.…”
Section: Actuator: An NL Generator (mentioning)
confidence: 99%
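
As an illustration only (not the authors' MISP-SQL code), the lexicon entries and grammar rules referred to in this excerpt can be pictured as a mapping from the AGG/OP/COL categories to natural-language phrases, plus a template that realises a clarification question Q; every phrasing and the generate_question helper below are hypothetical.

# Hedged sketch of MISP-SQL-style lexicon entries and one grammar rule.
# Category names (AGG, OP, COL, Q) come from the excerpt above; the phrases
# and the template are assumptions for illustration, not the published system.
LEXICON = {
    "AGG": {"COUNT": "the number of", "MAX": "the maximum of", "AVG": "the average of"},
    "OP": {">": "larger than", "=": "equal to", "LIKE": "containing"},
}

def generate_question(agg: str, col: str) -> str:
    """Rule Q -> 'Does the system need ' AGG COL '?' (hypothetical)."""
    return f"Does the system need {LEXICON['AGG'][agg]} {col}?"

print(generate_question("COUNT", "students"))
# -> Does the system need the number of students?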
“…We describe the process of estimating the correctness of collected QDMR annotations. Similar to previous works (Yu et al, 2018;Kwiatkowski et al, 2019) we use expert judgements, where the experts had prepared the guidelines for the annotation task. Given a question and its annotated QDMR, (q, s) the expert determines the correctness of s using one of the following categories:…”
Section: Quality Analysis (mentioning)
confidence: 99%