⚠️ WARNING: This paper contains content that may be toxic or offensive in nature.
Large language models (LLMs) have increasingly been applied to automated harmful content detection, assisting moderators in identifying policy violations and improving the overall efficiency and accuracy of content review. However, existing resources for content moderation are predominantly English-centric, and Chinese datasets remain scarce and limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese harmful content detection, which covers six representative categories and is constructed entirely from real-world data. Our annotation process further yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in detecting Chinese harmful content. In addition, we propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large teacher models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs.
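As a rough illustration of how an explicit knowledge rule base can assist an LLM, the sketch below injects expert rules into a classification prompt before querying the model. This is a minimal sketch, not the paper's released pipeline: the category names and the `query_llm` helper are hypothetical placeholders.

```python
# Minimal sketch of knowledge-augmented prompting: explicit rules from a
# knowledge rule base are prepended to the classification prompt.
# NOTE: category names and `query_llm` are illustrative assumptions,
# not the benchmark's actual label set or interface.

CATEGORIES = ["gambling", "fraud", "pornography", "abuse", "illegal ads", "non-violation"]

def build_prompt(text: str, rules: list[str]) -> str:
    """Compose a prompt that grounds the model in expert knowledge rules."""
    rule_block = "\n".join(f"- {r}" for r in rules)
    return (
        "You are a content moderator. Decide which category the text falls "
        f"into, choosing exactly one of {CATEGORIES}.\n\n"
        f"Expert rules:\n{rule_block}\n\n"
        f"Text: {text}\n"
        "Category:"
    )

def classify(text: str, rule_base: list[str], query_llm) -> str:
    """Inject the rule base into the prompt and ask the LLM for a label."""
    prompt = build_prompt(text, rule_base)
    return query_llm(prompt).strip()
```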
Figure 1: The benchmark construction process. For more detailed procedures, please refer to our paper.
Table 1: Macro-F1 scores of various models on ChineseHarm-Bench across six violation categories. We report results for state-of-the-art LLMs, lightweight models (<1B parameters), and billion-scale LLMs (1–10B parameters) under both direct prompting and fine-tuning, with and without knowledge augmentation. Gray-highlighted columns indicate our proposed strong baseline models with knowledge augmentation.
In this work, we introduce a comprehensive real-world benchmark for Chinese harmful content detection that spans multiple violation categories and is accompanied by a professionally curated knowledge rule base. We further propose a knowledge-augmented strong baseline that integrates explicit knowledge rules with implicit knowledge distilled from large teacher models, enabling small models to match or even outperform much larger ones without sacrificing efficiency or accessibility. Together, these contributions support practical applications and pave the way for future research on LLMs for Chinese harmful content detection.
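For intuition only, the sketch below shows one common way implicit teacher knowledge can be collected as supervised fine-tuning data for a small student model. The `teacher_generate` callable, the prompt format, and the JSONL record layout are assumptions for illustration, not the authors' actual distillation pipeline.

```python
# Hypothetical sketch of harvesting implicit knowledge from a large teacher:
# the teacher labels unlabeled posts (with a short rationale), and the
# resulting instruction-output pairs become fine-tuning data for a small
# student model. `teacher_generate` is an assumed LLM-call helper.

import json

def build_distillation_set(posts, rules, teacher_generate):
    """Label unlabeled posts with a large teacher model, keeping its rationale."""
    rule_block = "\n".join(f"- {r}" for r in rules)
    records = []
    for post in posts:
        prompt = (
            f"Expert rules:\n{rule_block}\n\n"
            f"Text: {post}\n"
            "Give the violation category and a one-sentence rationale."
        )
        records.append({"instruction": prompt, "output": teacher_generate(prompt)})
    return records

def save_jsonl(records, path="distill_train.jsonl"):
    """Write instruction-output pairs for supervised fine-tuning of the student."""
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```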
@misc{liu2025chineseharmbenchchineseharmfulcontent,
  title={ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark},
  author={Kangwei Liu and Siyuan Cheng and Bozhong Tian and Xiaozhuan Liang and Yuyang Yin and Meng Han and Ningyu Zhang and Bryan Hooi and Xi Chen and Shumin Deng},
  year={2025},
  eprint={2506.10960},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.10960},
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.