2021
DOI: 10.48550/arxiv.2106.05006
Preprint

Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data

Cited by 4 publications (5 citation statements) | References 18 publications
“…T5+Schema shows comparable performance to T5 in both databases. This result agrees with the recent finding in [12] that models trained in a single database setting do not effectively leverage schema information. Additional qualitative results are provided in Supplementary H, including SQL generation results by question complexity, time expressions, falsely executed results, and refused results.…”
Section: Results and Findings (supporting)
Confidence: 92%
“…KaggleDBQA [19] and SEDE [12] are designed to bridge the gap between academic datasets and practical usability by using real databases and naturally-occurring utterances. However, we have gone one step further where the question authors (the poll respondents) were not presented with the database schema (?Schema), which adds more reality to the dataset [12].…”
Section: EHRSQL and Other Datasets (mentioning)
Confidence: 99%
“…We also observe an array of datasets that focus on generating SQL queries from natural language. Some of these datasets are synthetic (Zhong et al, 2017), mined from Stack Overflow (Hazoom et al, 2021) and GitHub, and human-curated (Tang and Mooney, 2000; Popescu et al, 2003; Giordani and Moschitti, 2012; Li and Jagadish, 2014; Iyer et al, 2017; Yu et al, 2018; Yaghmazadeh et al, 2017; Finegan-Dollak et al, 2018; Yu et al, 2019b). Map Question-Answering.…”
Section: Datasets (mentioning)
Confidence: 99%