Previous studies have established that pre-trained language models inherently manifest various biases. Although several debiasing strategies have been introduced, such as retraining a whole model on counterfactual data, prompt tuning, and representation projection, they often fall short of efficiently eliminating bias or of directly altering the models' biased internal representations. To address these issues, we propose BiasEdit, an efficient model editing method that removes stereotyped bias from language models with small editor networks. It combines a debiasing loss, which guides the editor networks to conduct local edits on a subset of parameters for debiasing, with a retention loss that preserves the original language modeling abilities of the language model during editing. Experiments demonstrate that BiasEdit eliminates bias more effectively and robustly than classical debiasing baselines, with little impact on the models' language modeling and general capabilities. In addition, we conduct bias tracing and explore how bias, and debiasing via editing, affect language models.
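To make the two objectives concrete, here is a minimal PyTorch sketch, not the released implementation: `EditorNet`, `debiasing_loss`, `retention_loss`, and `lambda_retain` are illustrative names. It assumes the debiasing term penalizes the gap between the model's log-likelihoods of stereotyped and anti-stereotyped continuations, and the retention term is a KL anchor to the frozen original model.

```python
import torch
import torch.nn.functional as F

class EditorNet(torch.nn.Module):
    """Hypothetical editor network: maps a flattened gradient of one
    target weight matrix to a low-rank parameter shift (in the spirit
    of hypernetwork editors such as MEND)."""
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.down = torch.nn.Linear(dim, rank)
        self.up = torch.nn.Linear(rank, dim)

    def forward(self, grad: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(grad))  # low-rank edit delta

def debiasing_loss(logp_stereo: torch.Tensor, logp_anti: torch.Tensor) -> torch.Tensor:
    # Penalize the gap between the log-likelihoods of the stereotyped (s)
    # and anti-stereotyped (a) continuations so the edited model scores
    # them symmetrically.
    return (logp_stereo - logp_anti).pow(2).mean()

def retention_loss(logits_edited: torch.Tensor, logits_orig: torch.Tensor) -> torch.Tensor:
    # KL term keeping the edited model's next-token distribution close
    # to the frozen original on bias-unrelated text.
    return F.kl_div(
        F.log_softmax(logits_edited, dim=-1),
        F.softmax(logits_orig, dim=-1),
        reduction="batchmean",
    )

def editor_objective(logp_s, logp_a, logits_edited, logits_orig, lambda_retain=1.0):
    # Combined objective for training the editor networks:
    # debias while preserving language modeling. lambda_retain is an
    # assumed trade-off weight.
    return debiasing_loss(logp_s, logp_a) + lambda_retain * retention_loss(
        logits_edited, logits_orig
    )
```

Under this sketch, only the few weight matrices selected for editing receive the editor's delta while the rest of the model stays frozen, which is what keeps the edit local and inexpensive.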
Figure 1: Debiasing a language model with BiasEdit. s: stereotyped; a: anti-stereotyped; m: meaningless.
@article{xin24biasedit,
  author = {Xin Xu and Wei Xu and Ningyu Zhang},
  title = {BiasEdit: Debiasing Stereotyped Language Models via Model Editing},
  year = {2024},
  url = {https://github.com/xxupiano/BiasEdit}
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.