LLMs and RLHF: Translation and Commentary on "A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More"
"A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More": Translation and Commentary
Link
Paper: https://www.arxiv.org/abs/2407.16216
Date
July 23, 2024
Authors
Zhichao Wang*, Bin Bi*, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Zixu (James) Zhu, Xiang-Bo Mao, Sitaram Asur, Na (Claire) Cheng
Salesforce
Summary
Background and pain points: Despite progress in self-supervised learning and instruction fine-tuning, large language models (LLMs) can still produce untruthful, toxic, or unhelpful responses that conflict with human intent, because the quality of their training data is uneven. Existing evaluation metrics such as BLEU, ROUGE, and BERTScore do not capture human preferences over LLM outputs well, so LLMs need to be aligned with human values to avoid generating inappropriate content.
Proposed solutions: Reinforcement Learning from Human Feedback (RLHF) adjusts the model with human feedback so that its outputs better match human expectations: a human preference dataset (triplets of prompt, desired response, and undesired response) is collected, and a reward model plus an RL policy are trained on it. Reinforcement Learning from AI Feedback (RLAIF) uses AI-generated feedback to reduce the cost of human annotation.
Core ideas and steps
>> Reward model: use an explicit or implicit reward model to score generated responses; rewards can be assigned at the response level or at the token level. Following the Bradley-Terry model, a pointwise reward function rφ(x, y) is trained on human preference data to predict, given a prompt x and a response y, the probability that humans prefer that response (see the sketch at the end of this summary).
>> Feedback: collect preference feedback or binary feedback, in pairwise or listwise form, provided by humans or by AI.
>> RL policy: reference-model-based RL with length control over outputs; different divergence measures such as KL divergence; on-policy or off-policy training. The LLM acts as the agent and the reward model as the environment; the objective is to maximize reward while minimizing the KL divergence from the reference model, avoiding the "alignment tax" (a drop in downstream-task performance).
The survey examines different reward models (explicit/implicit, pointwise/preference-wise, etc.), feedback types (preference/binary, pairwise/listwise, etc.), RL objectives (reference-based/reference-free, etc.), and optimization schemes (online/offline, etc.).
>> Optimization: iterative/online preference optimization versus non-iterative/offline preference optimization; keeping instruction fine-tuning (SFT) and alignment separate, or merging them.
Advantages: incorporating human preferences directly into model fine-tuning improves the consistency between LLM outputs and human intent. RLHF models such as InstructGPT outperform baselines such as GPT-3 on truthfulness and harmlessness. The survey also covers many extensions of the RLHF framework, laying a foundation for further alignment research.
>> Cost efficiency: RLAIF reduces the reliance on expensive human feedback.
>> Flexibility: multiple choices of feedback type and reward model suit different application scenarios.
>> Safety and reliability: the alignment process reduces the risk of generating inappropriate content.
Overall, the survey systematically organizes the main advances in LLM alignment techniques over the past two years, summarizes the open challenges, the proposed solutions, and their pros and cons, and provides a comprehensive overview for follow-up research in this area.
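To make the pipeline sketched above concrete, the following is a minimal illustrative sketch (not the survey's implementation) of the two quantities this summary refers to: the Bradley-Terry loss used to train a pointwise reward model rφ(x, y), and the KL-regularized reward that the RL stage maximizes against a frozen reference model. The tensor names and the β = 0.1 coefficient are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood for a pointwise reward model.

    reward_chosen / reward_rejected are r_phi(x, y_w) and r_phi(x, y_l)
    scored on a batch of preference triplets, and
    P(y_w preferred over y_l | x) = sigmoid(r_phi(x, y_w) - r_phi(x, y_l)).
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def kl_regularized_reward(reward: torch.Tensor,
                          logprobs_policy: torch.Tensor,
                          logprobs_reference: torch.Tensor,
                          beta: float = 0.1) -> torch.Tensor:
    """Per-sample RL objective: reward minus a KL penalty toward the frozen
    reference (SFT) model, which limits the 'alignment tax'.
    logprobs_* are summed token log-probabilities of the sampled response.
    """
    kl_estimate = logprobs_policy - logprobs_reference  # simple KL estimator
    return reward - beta * kl_estimate

# Toy usage with random scores standing in for model outputs.
if __name__ == "__main__":
    r_w, r_l = torch.randn(8), torch.randn(8)
    print("BT loss:", bradley_terry_loss(r_w, r_l).item())
    shaped = kl_regularized_reward(r_w, torch.randn(8), torch.randn(8))
    print("KL-shaped rewards:", shaped.shape)
```

In practice the KL term is usually computed token by token over the sampled response; the sequence-level estimator here is only meant to show the shape of the objective.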
Abstract
With advancements in self-supervised learning, the availability of trillions of tokens in a pre-training corpus, instruction fine-tuning, and the development of large Transformers with billions of parameters, large language models (LLMs) are now capable of generating factual and coherent responses to human queries. However, the mixed quality of training data can lead to the generation of undesired responses, presenting a significant challenge. Over the past two years, various methods have been proposed from different perspectives to enhance LLMs, particularly in aligning them with human expectations. Despite these efforts, there has not been a comprehensive survey paper that categorizes and details these approaches. In this work, we aim to address this gap by categorizing these papers into distinct topics and providing detailed explanations of each alignment method, thereby helping readers gain a thorough understanding of the current state of the field.
1 Introduction
Over the past decades, the pretraining of LLMs through self-supervised learning [1] has seen significant advancements. These improvements have been driven by the development of larger decoder-only Transformers, the utilization of trillions of tokens, and the parallelization of computations across multiple GPUs. Following the pretraining phase, instruction tuning was employed to guide LLMs in responding to human queries. Despite these advancements, a critical issue remains unresolved: LLMs can generate undesired responses, such as providing instructions on how to commit illegal activities. To mitigate this risk, it is essential to align LLMs with human values.
Reinforcement Learning from Human Feedback (RLHF) [2, 3] has emerged as a groundbreaking technique for aligning LLMs. This approach has led to the development of powerful models such as GPT-4 [4], Claude [5], and Gemini [6]. Following the introduction of RLHF, numerous studies have explored various approaches to further align LLMs. However, there has not yet been a comprehensive review of methods for aligning LLMs with human preferences. This paper aims to fill that gap by categorically reviewing existing literature and providing detailed analyses of individual papers.
In this paper, we have structured our review into four main topics: 1. Reward Model; 2. Feedback; 3. Reinforcement Learning (RL); and 4. Optimization. Each topic was further divided into subtopics as shown in Figure 1. For the Reward Model, the subtopics were: 1. Explicit Reward Model vs. Implicit Reward Model; 2. Pointwise Reward Model vs. Preference Model; 3. Response-Level Reward vs. Token-Level Reward; and 4. Negative Preference Optimization. Regarding Feedback, the subtopics included: 1. Preference Feedback vs. Binary Feedback; 2. Pairwise Feedback vs. Listwise Feedback; and 3. Human Feedback vs. AI Feedback. In the RL section, the subtopics were: 1. Reference-Based RL vs. Reference-Free RL; 2. Length-Control RL; 3. Different Divergences in RL; and 4. On-Policy RL vs. Off-Policy RL. For Optimization, the subtopics were: 1. Online/Iterative Preference Optimization vs. Offline/Non-iterative Preference Optimization; and 2. Separating SFT and Alignment vs. Merging SFT and Alignment. Table 1 provided an analysis of all the papers reviewed in detail using these 13 evaluation metrics.
Figure 1: The 13 categorical directions for xPO to align an LLM with human preference
4 Future Directions
Based on the analysis of the reviewed papers, several research problems have been identified for further exploration.
4.1 General Tasks for Alignment Evaluation
When reviewing various papers, different tasks were used to evaluate the performance of these methods. However, some tasks, like GSM8K [65], which focused more on reasoning, might not be suitable for assessing alignment performance. In contrast, tasks like TruthfulQA [45] or those addressing toxicity should be prioritized for evaluating the toxicity of fine-tuned LLMs. There should be an effort to combine these tasks and create a unified leaderboard for alignment evaluation.
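As one hypothetical illustration of such a unified leaderboard, the sketch below simply takes a weighted average of per-task scores, weighting alignment-oriented tasks (e.g., TruthfulQA, toxicity) more heavily than reasoning-only tasks; the task names, weights, and normalization are assumptions, not something prescribed by the survey.

```python
from typing import Dict

# Hypothetical weights emphasizing alignment-oriented tasks such as
# TruthfulQA and toxicity benchmarks over pure reasoning benchmarks.
TASK_WEIGHTS: Dict[str, float] = {
    "truthfulqa": 0.4,
    "toxicity": 0.4,
    "helpfulness": 0.2,
}

def alignment_leaderboard_score(task_scores: Dict[str, float]) -> float:
    """Weighted average of per-task scores, each normalized to [0, 1]."""
    total = sum(TASK_WEIGHTS[t] * task_scores[t] for t in TASK_WEIGHTS)
    return total / sum(TASK_WEIGHTS.values())

print(alignment_leaderboard_score(
    {"truthfulqa": 0.62, "toxicity": 0.91, "helpfulness": 0.74}))
```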
4.2 Apply Implicit Reward Models, Listwise Preference and Nash Learning to Larger Scale LMs
Currently, implicit reward model methods have been applied only to models with up to 70B parameters. Extending these methods to even larger models, such as those the size of GPT-4 and Claude-3, can provide insights into their effectiveness compared to RLHF/PPO. Similarly, the listwise preference model warrants further investigation. In RLHF, preference datasets were collected using listwise preference but were subsequently transformed into multiple pairs of pairwise preferences. The potential issues associated with applying listwise preference models at larger scales remain to be addressed. Lastly, Nash learning can address the inconsistency among human labelers. Incorporating a Nash learning model into larger-scale LLMs can demonstrate its ability to capture the complexity of human nature.
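To make the listwise-to-pairwise point concrete, here is a minimal sketch of the conversion described above: a ranking of k responses (best first) is expanded into k(k-1)/2 (prompt, chosen, rejected) pairs. The record format is an illustrative assumption.

```python
from itertools import combinations
from typing import List, Tuple

def listwise_to_pairwise(prompt: str,
                         ranked_responses: List[str]
                         ) -> List[Tuple[str, str, str]]:
    """Expand a ranking (best first) into (prompt, chosen, rejected) pairs."""
    pairs = []
    for i, j in combinations(range(len(ranked_responses)), 2):
        # Response i is ranked above response j, so i is chosen, j rejected.
        pairs.append((prompt, ranked_responses[i], ranked_responses[j]))
    return pairs

# A ranking of 4 responses yields 6 pairwise preferences.
print(len(listwise_to_pairwise("prompt", ["a", "b", "c", "d"])))
```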
4.3 Experiments on Binary Feedbacks
Both KTO and DRO utilized binary feedback mechanisms, such as "thumbs up" and "thumbs down", instead of pairwise preferences. These binary feedbacks were derived from preference datasets, where desired responses were marked as positive and undesired responses as negative. Further research is needed on realistic binary datasets. Additionally, binary datasets are easier to collect compared to pairwise preference data, making it feasible to use larger-scale binary feedback datasets for alignment. However, the noise in binary feedback may be more pronounced than in preference datasets, raising the intriguing question of how to effectively filter out noisy data.
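A minimal sketch of the derivation described above, turning each preference triplet into two binary ("thumbs up" / "thumbs down") examples; the field names are illustrative assumptions.

```python
from typing import Dict, List

def preferences_to_binary(preference_data: List[Dict[str, str]]
                          ) -> List[Dict[str, object]]:
    """Split (prompt, chosen, rejected) records into labeled binary examples."""
    binary = []
    for record in preference_data:
        binary.append({"prompt": record["prompt"],
                       "response": record["chosen"], "label": 1})   # thumbs up
        binary.append({"prompt": record["prompt"],
                       "response": record["rejected"], "label": 0})  # thumbs down
    return binary

example = [{"prompt": "p", "chosen": "good answer", "rejected": "bad answer"}]
print(preferences_to_binary(example))
```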
4.4 Experiments on Helpful AI Feedback
Current AI feedback primarily includes harmless feedback in RLAIF and feedback ranking in iterative DPO. However, in RLAIF, helpful feedback is still provided by human labelers. This approach is reasonable, as generating helpful responses is significantly more challenging than identifying harmful ones. An intriguing future direction involves using LLMs to generate helpful feedback, thereby enabling LLMs to self-improve.
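One way the suggested self-improvement loop could look is sketched below: the policy samples several responses and an LLM judge scores them for helpfulness, yielding AI-labeled (chosen, rejected) pairs for iterative preference optimization. `generate` and `judge_helpfulness` are placeholder callables, not an API defined in the survey.

```python
from typing import Callable, List, Tuple

def collect_helpfulness_feedback(
    prompt: str,
    generate: Callable[[str], str],                   # policy model: prompt -> response
    judge_helpfulness: Callable[[str, str], float],   # judge: (prompt, response) -> score
    num_samples: int = 4,
) -> Tuple[str, str]:
    """Sample several responses and return the judged best/worst pair,
    which can then serve as an AI-labeled preference for iterative DPO."""
    candidates: List[Tuple[float, str]] = []
    for _ in range(num_samples):
        response = generate(prompt)
        candidates.append((judge_helpfulness(prompt, response), response))
    candidates.sort(reverse=True)
    return candidates[0][1], candidates[-1][1]  # (chosen, rejected)
```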
4.5 Speeding up Nash Learning
The proposed Nash learning method effectively modeled pairwise preferences and addressed inconsistencies arising from human labeling. However, it necessitated multiple iterations to converge to the optimal policy. Although the authors did not specify the time required for alignment, it was presumed to be significantly slower compared to implicit reward models such as DPO. This area warrants further research attention to speed up the Nash learning process.
4.6 Termination of Iterative/Online Learning
When applying iterative or online training, determining when to terminate the iteration is crucial. Previous research has noted that iterative learning can sometimes degrade the performance of LLMs on specific tasks, which can be a sign of overfitting. However, identifying a reasonable epoch for stopping the iteration remains an unexplored area.
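As a simple illustration of a possible stopping rule (not one proposed in the survey), the sketch below applies patience-based early stopping to the per-iteration validation score of the tasks being monitored.

```python
from typing import List

def should_stop(validation_scores: List[float], patience: int = 2) -> bool:
    """Stop iterative/online preference optimization once the validation
    score has failed to improve for `patience` consecutive iterations,
    a common proxy for the overfitting/degradation noted above."""
    if len(validation_scores) <= patience:
        return False
    best_so_far = max(validation_scores[:-patience])
    return all(s <= best_so_far for s in validation_scores[-patience:])

print(should_stop([0.61, 0.64, 0.66, 0.65, 0.63]))  # True: no gain for 2 iterations
```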
4.7 Simplify SFT + Alignment
Current methodologies typically implemented SFT and alignment in a consecutive manner. However, this approach often resulted in catastrophic forgetting and rendered the training process laborious. The PAFT method mitigated catastrophic forgetting by fine-tuning SFT and alignment separately before merging them, albeit at the cost of increased complexity. Conversely, the ORPO technique integrated both processes simultaneously, but this led to a decline in performance. Thus, the challenge of effectively combining SFT and alignment to achieve high performance while maintaining efficiency remains unresolved.
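To illustrate the "merging" side of this trade-off, here is a hedged sketch of an ORPO-style single-stage objective that adds a log-odds-ratio preference term to the SFT negative log-likelihood; the length-normalized log-probability inputs and the λ = 0.1 weight are assumptions of this sketch rather than details taken from the survey.

```python
import torch
import torch.nn.functional as F

def orpo_style_loss(sft_nll: torch.Tensor,
                    avg_logp_chosen: torch.Tensor,
                    avg_logp_rejected: torch.Tensor,
                    lam: float = 0.1) -> torch.Tensor:
    """Single-stage objective in the spirit of ORPO: the usual SFT negative
    log-likelihood on the chosen response plus a log-odds-ratio penalty that
    pushes chosen responses above rejected ones.

    avg_logp_* are length-normalized log-probabilities log P(y|x) under the
    model being trained (an assumption of this sketch).
    """
    def log_odds(avg_logp: torch.Tensor) -> torch.Tensor:
        # log(p / (1 - p)) computed from log p
        return avg_logp - torch.log1p(-torch.exp(avg_logp))

    odds_ratio_term = -F.logsigmoid(log_odds(avg_logp_chosen)
                                    - log_odds(avg_logp_rejected))
    return (sft_nll + lam * odds_ratio_term).mean()

# Toy usage with plausible magnitudes (log-probabilities are negative).
sft_nll = torch.tensor([2.1, 1.8])
loss = orpo_style_loss(sft_nll,
                       torch.tensor([-1.0, -0.8]),
                       torch.tensor([-2.0, -1.5]))
print(loss.item())
```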