Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging
Journal
Findings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)
Start Page
16589
End Page
16602
ISBN (of the container)
979-8-89176-335-7
Date Issued
2025-11-04
Author(s)
Abstract
Fine-tuning large language models (LLMs) for downstream tasks often leads to catastrophic forgetting, notably degrading the safety of originally aligned models. While some existing methods attempt to restore safety by incorporating additional safety data, the quality of such data typically falls short of that used in the original alignment process. Moreover, these high-quality safety data are generally inaccessible, making it difficult to fully recover the model’s original safety. We ask: How can we preserve safety while improving downstream task performance without additional safety data? We show that simply merging the weights of pre- and post-fine-tuned models effectively mitigates safety degradation while enhancing performance. Experiments across different downstream tasks and models validate the method’s practicality and effectiveness.
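The abstract describes merging the weights of the pre- and post-fine-tuned models. The record does not give the exact merging scheme, but the simplest instance of such a merge is a linear interpolation of corresponding parameters. The sketch below assumes both checkpoints are flat dictionaries mapping parameter names to lists of floats; the function name `merge_state_dicts` and the coefficient `alpha` are illustrative, not taken from the paper.

```python
def merge_state_dicts(pre, post, alpha=0.5):
    """Linearly interpolate two checkpoints with matching parameter names.

    alpha = 0.0 returns the pre-fine-tuning (aligned) weights,
    alpha = 1.0 returns the post-fine-tuning (task-adapted) weights.
    Intermediate values trade off safety retention against task gains.
    """
    merged = {}
    for name, pre_param in pre.items():
        post_param = post[name]
        # Element-wise convex combination of the two parameter tensors.
        merged[name] = [(1 - alpha) * a + alpha * b
                        for a, b in zip(pre_param, post_param)]
    return merged


# Toy usage: a single "layer" with two scalar weights.
merged = merge_state_dicts({"w": [0.0, 2.0]}, {"w": [1.0, 4.0]}, alpha=0.5)
```

In practice the same loop would run over `torch` tensors from `model.state_dict()`, with `alpha` chosen on a validation set to balance downstream accuracy against safety metrics.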
Event(s)
30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025
Publisher
Association for Computational Linguistics
Type
conference paper
