2021
DOI: 10.48550/arxiv.2112.13610
Preprint
CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark

Cited by 4 publications (6 citation statements); References 0 publications.
“…Comprising nine Chinese NLU tasks, the CLUE dataset evaluates LLMs in tasks like semantic matching, text classification, and reading comprehension. CUGE (Yao et al., 2021) is organized hierarchically by a language-task-dataset structure, using 21 sub-datasets to evaluate LLMs in language understanding, information retrieval, Q&A, and language generation. SentEval (Conneau and Kiela, 2018) aggregates NLU datasets for 21 sub-tasks.…”
Section: Natural Language Understanding
Confidence: 99%
“…For Chinese NLU, the CLUE benchmark is proposed with more than 10 tasks, covering most NLP problems. To evaluate the ability of pre-trained language models in both natural language understanding and generation, CUGE (Yao et al., 2021) is proposed, designed as a hierarchical framework with a multi-level scoring strategy. Meanwhile, to evaluate whether language models can learn linguistic phenomena of Chinese, Xiang et al. (2021) develop CLiMP, which covers 9 major Mandarin linguistic phenomena.…”
Section: Benchmarks for Pre-trained Language Models
Confidence: 99%
“…Meanwhile, FSPC (Shao et al., 2021) and CCMP are proposed for ancient poem understanding. While CUGE (Yao et al., 2021) uses CCMP as a sub-task for classical poetry matching, in this work we apply the FSPC dataset for poetry emotion recognition.…”
Section: Benchmarks for Pre-trained Language Models
Confidence: 99%
“…Chinese LLM Benchmarks. There have been important efforts, such as CLUE (Xu et al., 2020) and CUGE (Yao et al., 2021), to evaluate pre-trained language models on extensive tasks in the Chinese context, following the traditional taxonomy of natural language understanding and generation. As these benchmarks are restricted in their prediction formats and cannot fully measure the cross-task generalization of LLMs in free-form outputs, more recent studies (Huang et al., 2023b) propose to reformat the tasks into multi-choice question answering, mostly examining knowledge-based abilities in Chinese.…”
Section: Related Work
Confidence: 99%