Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic

Rishabh Bhardwaj, Duc Anh Do, Soujanya Poria


Abstract
We propose RESTA to re-align LLMs toward safety, which is compromised by downstream task fine-tuning. RESTA stands for REstoring Safety through Task Arithmetic. At its core, it involves a simple arithmetic addition of a safety vector to the weights of the compromised model. We demonstrate the effectiveness of RESTA in both parameter-efficient and full fine-tuning, covering a wide range of downstream tasks, including instruction following in Chinese, English, and Hindi, as well as problem-solving capabilities in Code and Math. We also showcase the generalizability of RESTA on three existing safety evaluation benchmarks and on a multilingual benchmark dataset proposed as part of this work, consisting of 550 harmful questions covering 11 categories, each with 5 sub-categories of harm. Overall, RESTA decreases the harmfulness of the compromised model from 18.6% to 5.1% in parameter-efficient fine-tuning and from 9.2% to 1.5% in full fine-tuning, while maintaining most of the model’s performance on the task. We release the source code at: https://github.com/declare-lab/resta.
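To make the "arithmetic addition of a safety vector" concrete, below is a minimal sketch of the idea in PyTorch. It is not the authors' implementation: it assumes the safety vector is the per-parameter difference between an aligned checkpoint and an unaligned counterpart, and the scaling coefficient `lam` is a hypothetical knob; neither detail is specified in the abstract itself.

```python
import torch

def restore_safety(finetuned, aligned, unaligned, lam=1.0):
    """Shift a fine-tuned model's weights back toward safety.

    Assumed construction (not spelled out in the abstract):
        safety_vector = aligned - unaligned   # one tensor per parameter
        restored      = finetuned + lam * safety_vector
    All three arguments are state_dicts with identical keys and shapes.
    """
    return {
        name: w + lam * (aligned[name] - unaligned[name])
        for name, w in finetuned.items()
    }

# Toy usage with a single 2x2 "parameter"; in practice these would be
# the state_dicts of full LLM checkpoints sharing one architecture.
ft = {"w": torch.randn(2, 2)}   # task-fine-tuned (compromised) weights
al = {"w": torch.randn(2, 2)}   # safety-aligned base weights
un = {"w": torch.randn(2, 2)}   # unaligned counterpart
restored = restore_safety(ft, al, un, lam=0.5)
```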
Anthology ID: 2024.acl-long.762
Volume: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 14138–14149
URL: https://aclanthology.org/2024.acl-long.762
DOI: 10.18653/v1/2024.acl-long.762
Cite (ACL): Rishabh Bhardwaj, Duc Anh Do, and Soujanya Poria. 2024. Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14138–14149, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic (Bhardwaj et al., ACL 2024)
PDF: https://aclanthology.org/2024.acl-long.762.pdf