⚠️ WARNING: This paper contains content that may be toxic or offensive in nature.
Large language models (LLMs) have increasingly been applied to automated harmful content detection, assisting moderators in identifying policy violations and improving the overall efficiency and accuracy of content review. However, existing resources for content moderation are predominantly English-centric, and Chinese datasets remain scarce and limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese harmful content detection, which covers six representative categories and is constructed entirely from real-world data. Our annotation process further yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in detecting Chinese harmful content. In addition, we propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large teacher models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs.
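As a rough illustration of how an explicit knowledge rule base can assist an LLM, the sketch below injects expert rules into a classification prompt before querying the model. This is a minimal sketch, not the paper's released pipeline: the category names and the `query_llm` helper are hypothetical placeholders.

```python
# Minimal sketch of knowledge-augmented prompting: explicit rules from a
# knowledge rule base are prepended to the classification prompt.
# NOTE: category names and `query_llm` are illustrative assumptions,
# not the benchmark's actual label set or interface.

CATEGORIES = ["gambling", "fraud", "pornography", "abuse", "illegal ads", "non-violation"]

def build_prompt(text: str, rules: list[str]) -> str:
    """Compose a prompt that grounds the model in expert knowledge rules."""
    rule_block = "\n".join(f"- {r}" for r in rules)
    return (
        "You are a content moderator. Decide which category the text falls "
        f"into, choosing exactly one of {CATEGORIES}.\n\n"
        f"Expert rules:\n{rule_block}\n\n"
        f"Text: {text}\n"
        "Category:"
    )

def classify(text: str, rule_base: list[str], query_llm) -> str:
    """Inject the rule base into the prompt and ask the LLM for a label."""
    prompt = build_prompt(text, rule_base)
    return query_llm(prompt).strip()
```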
Figure 1: The benchmark construction process. For more detailed procedures, please refer to our paper.
Table 1: Macro-F1 scores of various models on ChineseHarm-Bench across six violation categories. We report results for state-of-the-art LLMs, lightweight models (<1B parameters), and billion-scale LLMs (1–10B parameters) under both direct prompting and fine-tuning, with and without knowledge augmentation. Gray-highlighted columns indicate our proposed strong baseline models with knowledge augmentation.
In this work, we introduce a comprehensive real-world benchmark for Chinese harmful content detection that spans multiple violation categories and is accompanied by a professionally curated knowledge rule base. We further propose a knowledge-augmented strong baseline that integrates explicit knowledge rules with implicit knowledge distilled from large teacher models, enabling small models to match or even outperform much larger ones without sacrificing efficiency or accessibility. Together, these contributions support practical applications and pave the way for future research on LLMs for Chinese harmful content detection.
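For intuition only, the sketch below shows one common way implicit teacher knowledge can be collected as supervised fine-tuning data for a small student model. The `teacher_generate` callable, the prompt format, and the JSONL record layout are assumptions for illustration, not the authors' actual distillation pipeline.

```python
# Hypothetical sketch of harvesting implicit knowledge from a large teacher:
# the teacher labels unlabeled posts (with a short rationale), and the
# resulting instruction-output pairs become fine-tuning data for a small
# student model. `teacher_generate` is an assumed LLM-call helper.

import json

def build_distillation_set(posts, rules, teacher_generate):
    """Label unlabeled posts with a large teacher model, keeping its rationale."""
    rule_block = "\n".join(f"- {r}" for r in rules)
    records = []
    for post in posts:
        prompt = (
            f"Expert rules:\n{rule_block}\n\n"
            f"Text: {post}\n"
            "Give the violation category and a one-sentence rationale."
        )
        records.append({"instruction": prompt, "output": teacher_generate(prompt)})
    return records

def save_jsonl(records, path="distill_train.jsonl"):
    """Write instruction-output pairs for supervised fine-tuning of the student."""
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```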
@misc{liu2025chineseharmbenchchineseharmfulcontent,
  title={ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark},
  author={Kangwei Liu and Siyuan Cheng and Bozhong Tian and Xiaozhuan Liang and Yuyang Yin and Meng Han and Ningyu Zhang and Bryan Hooi and Xi Chen and Shumin Deng},
  year={2025},
  eprint={2506.10960},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.10960},
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.