Chinese, as a linguistic system rich in depth and complexity, is characterized by distinctive elements such as ancient poetry, proverbs, idioms, and other cultural constructs. However, current Large Language Models (LLMs) face limitations in these specialized domains, highlighting the need for comprehensive datasets that can assess, continuously update, and progressively improve these culturally grounded linguistic competencies through targeted training optimizations. To address this gap, we introduce CKnowEdit, the first Chinese knowledge editing dataset designed to correct linguistic, factual, and logical errors in LLMs. We collect seven types of knowledge from a wide range of sources, including classical texts, idioms, and content from Baidu Tieba Ruozhiba, taking into account the polyphony, antithesis, and logical structures unique to the Chinese language. By analyzing this dataset, we highlight the challenges current LLMs face in mastering Chinese. Furthermore, our evaluation of state-of-the-art knowledge editing techniques reveals opportunities to advance the correction of Chinese knowledge.
📚 CKnowEdit, which is uniquely characterized by its Chinese linguistic features and cultural depth, comprehensively explores Chinese-language distinctiveness and the challenges it poses to LLMs from three perspectives: Chinese Linguistics, Chinese Factual Knowledge, and Chinese Language-Specific Logic Traps.
🤖 CKnowEdit consists of a total of 1854 entries, divided into 3 major categories and 10 subcategories.
📊 The empirical results of recent knowledge editing baselines on CKnowEdit reveal their limitations when applied to Chinese literature, especially under our new evaluation paradigm.
Chinese Linguistics. Chinese linguistics studies the phonetics, vocabulary, semantics, and grammar of the Chinese language. The linguistic knowledge in CKnowEdit is categorized into five subtypes, each of which presents unique challenges for LLMs. This major category includes the following 5 subcategories: Pinyin, Ancient Poetry, Classical Chinese, Idiom, and Proverb.
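One concrete source of difficulty in the Pinyin subtype is polyphony (duoyinzi): a single character with several context-dependent readings. The characters and readings below are standard Mandarin examples chosen for illustration; they are not reproduced from the dataset itself.

```python
# Illustrative examples of Chinese polyphony: one character, several
# context-dependent Pinyin readings. An LLM must pick the right reading
# from context, which is exactly what the Pinyin subtype probes.
POLYPHONIC = {
    "行": ["xíng", "háng"],    # "to walk" vs. "row / profession"
    "重": ["zhòng", "chóng"],  # "heavy" vs. "again / repeat"
    "乐": ["lè", "yuè"],       # "happy" vs. "music"
}

def is_polyphonic(char: str) -> bool:
    """A character is polyphonic if it has more than one known reading."""
    return len(POLYPHONIC.get(char, [])) > 1
```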
Factual Knowledge. Factual knowledge in CKnowEdit covers key events and historical figures, regional landscapes, and unique local cultures across China. However, mainstream LLMs demonstrate notable gaps in this area. This major category includes the following 2 subcategories: History and Geography.
Chinese Language-Specific Logic Traps. This major category includes the following 3 subcategories: Phonetic Misunderstanding, Reasoning Error, and Wordplay.
Data Source. We collected data from 7 categories of sources: ancient poetry, Pinyin notation, idioms, proverbs, classical Chinese, factual knowledge, and Ruozhiba.
Data Preprocess. We initially collected 11,981 raw data entries and filtered them using an LLM (Qwen-7B-Chat).
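The filtering step can be sketched as below. Note the retention criterion (keeping only entries whose target knowledge the model fails to produce) is our assumption about the pipeline, and `model_answer` is a stand-in for an actual Qwen-7B-Chat query.

```python
def filter_hard_entries(entries, model_answer):
    """Keep only the entries the filtering model gets wrong.

    `entries`: list of dicts with "prompt" and "target" fields.
    `model_answer`: callable standing in for a Qwen-7B-Chat query.
    """
    kept = []
    for entry in entries:
        answer = model_answer(entry["prompt"])
        # An entry survives filtering if the model's answer does not
        # already contain the desired target knowledge.
        if entry["target"] not in answer:
            kept.append(entry)
    return kept
```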
Data Annotation. (1) The Target field is created either from the data source itself or generated by GPT and verified manually. (2) The Generalization field is created by rephrasing the prompt field. (3) The Portability field is implemented using two strategies: context switching and single-hop logic. (4) The Locality field in CKnowEdit differs from traditional knowledge editing datasets, as it selects knowledge that is different from the target but somewhat related.
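The four annotated fields can be pictured together as a single entry. The field names and nesting below are illustrative, not the dataset's literal schema; consult the released data files for the exact format.

```python
# A hypothetical CKnowEdit-style entry illustrating the four annotated
# fields described above (placeholder strings stand in for real content).
entry = {
    "prompt": "<query the model currently answers incorrectly>",
    "target": "<correct answer, sourced or GPT-generated then verified>",
    "generalization": "<rephrasing of the prompt, same target>",
    "portability": {
        "context_switch": "<same knowledge queried in a different context>",
        "single_hop": "<one reasoning step built on the edited knowledge>",
    },
    "locality": "<related-but-different knowledge that must stay unchanged>",
}
```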
Regarding the three main knowledge classifications in CKnowEdit, linguistic data accounts for the largest proportion at 48.40%, and logic-reasoning data accounts for 45.63%, because we found that knowledge highly characteristic of the Chinese language poses significant challenges for current LLMs.
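Combined with the total of 1854 entries, the stated percentages imply roughly the following per-category counts (the factual share, not stated above, is taken as the remainder, about 6%):

```python
total = 1854
linguistic = round(total * 0.4840)       # Chinese Linguistics share
logic = round(total * 0.4563)            # logic-reasoning share
factual = total - linguistic - logic     # remainder: factual knowledge
# linguistic ≈ 897, logic ≈ 846, factual ≈ 111 entries
```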
Settings. We select 4 LLMs: Qwen-7B-Chat, Qwen2-7B-Instruct, DeepSeek-LLM-7B-Chat and Baichuan2-7B-Chat. We investigate 5 model editing methods, including FT-M, AdaLoRA, ROME, GRACE and AlphaEdit.
Evaluation. Unlike traditional evaluation methods (token/logit-level metrics computed with teacher forcing), we utilize the LLM-as-a-judge paradigm to evaluate the open-ended text generated by models. The detailed evaluation procedure and an example case are shown above.
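A minimal sketch of the judging step, assuming the judge model returns a free-text verdict containing a numeric score; the prompt template and 0–10 scale are illustrative, not the exact rubric used in the paper.

```python
import re

def build_judge_prompt(question, reference, model_output):
    """Format an evaluation prompt for a judge LLM (template is illustrative)."""
    return (
        "You are a strict grader. Given the question, the reference answer, "
        "and the model's answer, rate the model's answer from 0 to 10.\n"
        f"Question: {question}\n"
        f"Reference: {reference}\n"
        f"Answer: {model_output}\n"
        "Reply with 'Score: <number>'."
    )

def parse_score(judge_reply):
    """Extract the first numeric score from the judge's free-text reply."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", judge_reply)
    return float(match.group(1)) if match else None
```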
Main Results. AdaLoRA achieves the highest Edit Success in over 70% of cases across 4 models, outperforming AlphaEdit and FT-M, which excel in 4 and 3 instances respectively but remain suboptimal overall. For Generalization and Portability metrics, AdaLoRA dominates with nearly 70% and 86% top scores, respectively, while AlphaEdit consistently performs suboptimally.
The Irreplaceability of Chinese. We selected 100 data samples from each of the three knowledge categories in CKnowEdit. These samples were first translated into English, then edited using AdaLoRA and ROME on four baseline models. The results were then translated back into Chinese and evaluated.
Language Functional Area Offset. After editing the target knowledge in English, queries are asked directly in Chinese to test cross-language generalization.
@misc{fang2025cknoweditnewchineseknowledge,
title={CKnowEdit: A New Chinese Knowledge Editing Dataset for Linguistics, Facts, and Logic Error Correction in LLMs},
author={Jizhan Fang and Tianhe Lu and Yunzhi Yao and Ziyan Jiang and Xin Xu and Ningyu Zhang and Huajun Chen},
year={2025},
eprint={2409.05806},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.05806},
}