BiasEdit: Debiasing Stereotyped Language Models via Model Editing

Xin Xu1 , Wei Xu2 , Ningyu Zhang1 ,

1Zhejiang University 2Georgia Institute of Technology

BiasEdit is an efficient model editing method that eliminates stereotyped bias from language models with small editor networks. It combines a debiasing loss, which guides edits on a small subset of parameters, with a remaining loss, which maintains the model's language modeling abilities during editing. Experiments show strong performance on debiasing and language ability preservation, as well as robustness to gender-attribute reversal and semantic generality.

Abstract

Previous studies have established that pre-trained language models inherently manifest various social biases. Although several debiasing strategies have been introduced, such as retraining a whole model with counterfactual data, prompt tuning, and representation projection, they often fall short of efficiently eliminating bias or directly altering the models' internal biased representations. To address these issues, we propose BiasEdit, an efficient model editing method that removes stereotyped bias from language models with small editor networks. It contains a debiasing loss, which guides the editor networks to conduct local edits on a small subset of parameters for debiasing, and a remaining loss, which preserves the original language modeling abilities of the model during editing. Experiments demonstrate the effectiveness and robustness of BiasEdit in eliminating bias compared to classical debiasing baselines, with little impact on the language modeling and general capabilities of the edited models. In addition, we conduct bias tracing and explore the effects of bias and of debiasing via editing on language models.



BiasEdit

Figure 1: Debiasing a language model with BiasEdit. s: stereotyped. a: anti-stereotyped. m: meaningless.


As shown in Figure 1, BiasEdit uses trained editor networks to produce parameter shifts that edit a small subset of a language model's parameters. During debiasing, the debiasing loss guides the editor networks to produce these parameter edits, while the remaining loss preserves the model's original language modeling abilities. After editing, the resulting unbiased language model remains robust in its general capabilities, under gender-attribute reversal, and across semantically equivalent contexts.
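The two objectives above can be sketched in PyTorch. This is a minimal illustration, not the paper's exact formulation: it assumes the debiasing loss equalizes the model's likelihoods of the stereotyped and anti-stereotyped continuations, and that the remaining loss is a KL divergence anchoring the edited model's predictions to the original model's. All function names and the weighting term `lam` are hypothetical.

```python
import torch
import torch.nn.functional as F

def debiasing_loss(logp_stereo, logp_anti):
    # Hypothetical form: push the edited model to assign equal
    # log-likelihood to stereotyped and anti-stereotyped continuations.
    return (logp_stereo - logp_anti).pow(2).mean()

def remaining_loss(edited_logits, original_logits):
    # Hypothetical form: KL divergence keeps the edited model's
    # next-token distribution close to the original model's
    # on neutral (e.g. meaningless-option) text.
    return F.kl_div(
        F.log_softmax(edited_logits, dim=-1),
        F.softmax(original_logits, dim=-1),
        reduction="batchmean",
    )

def total_loss(logp_s, logp_a, edited_logits, original_logits, lam=1.0):
    # Combined editing objective: debias while retaining language modeling.
    return debiasing_loss(logp_s, logp_a) + lam * remaining_loss(
        edited_logits, original_logits
    )
```

In this sketch the editor networks would be trained by backpropagating `total_loss` through the parameter shifts they produce; when the two continuations are equally likely and the edited model matches the original on neutral text, the loss is zero.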


Main Results





Ablation Study on the Remaining Loss




Edits on Different Blocks




Impacts on General Capabilities




Reversing Gender Attribute Words




Semantic Generality

BibTeX


@article{xin24biasedit,
  author       = {Xin Xu and Wei Xu and Ningyu Zhang},
  title        = {BiasEdit: Debiasing Stereotyped Language Models via Model Editing},
  year         = {2024},
  url          = {https://github.com/xxupiano/BiasEdit}
}

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.