Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging
Journal
Findings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)
Start Page
16589
End Page
16602
ISBN (of the container)
979-8-89176-335-7
Date Issued
2025-11-04
Author(s)
Abstract
Fine-tuning large language models (LLMs) for downstream tasks often leads to catastrophic forgetting, notably degrading the safety of originally aligned models. While some existing methods attempt to restore safety by incorporating additional safety data, the quality of such data typically falls short of that used in the original alignment process. Moreover, these high-quality safety data are generally inaccessible, making it difficult to fully recover the model’s original safety. We ask: How can we preserve safety while improving downstream task performance without additional safety data? We show that simply merging the weights of pre- and post-fine-tuned models effectively mitigates safety degradation while enhancing performance. Experiments across different downstream tasks and models validate the method’s practicality and effectiveness.
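The abstract describes merging the weights of the pre- and post-fine-tuned models. The record does not give the exact merging scheme, but the simplest instance of such a merge is a linear interpolation of corresponding parameters. The sketch below assumes both checkpoints are flat dictionaries mapping parameter names to lists of floats; the function name `merge_state_dicts` and the coefficient `alpha` are illustrative, not taken from the paper.

```python
def merge_state_dicts(pre, post, alpha=0.5):
    """Linearly interpolate two checkpoints with matching parameter names.

    alpha = 0.0 returns the pre-fine-tuning (aligned) weights,
    alpha = 1.0 returns the post-fine-tuning (task-adapted) weights.
    Intermediate values trade off safety retention against task gains.
    """
    merged = {}
    for name, pre_param in pre.items():
        post_param = post[name]
        # Element-wise convex combination of the two parameter tensors.
        merged[name] = [(1 - alpha) * a + alpha * b
                        for a, b in zip(pre_param, post_param)]
    return merged


# Toy usage: a single "layer" with two scalar weights.
merged = merge_state_dicts({"w": [0.0, 2.0]}, {"w": [1.0, 4.0]}, alpha=0.5)
```

In practice the same loop would run over `torch` tensors from `model.state_dict()`, with `alpha` chosen on a validation set to balance downstream accuracy against safety metrics.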
Event(s)
30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025
Publisher
Association for Computational Linguistics
Type
conference paper
