Paper:《Instruction Tuning for Large Language Models: A Survey—大型語(yǔ)言模型的指令調(diào)優(yōu)的綜述》翻譯與解讀
導(dǎo)讀:2023年8月21日,浙江大學(xué)等團(tuán)隊(duì),發(fā)布了《Instruction Tuning for Large Language Models: A Survey》。指令微調(diào)是在大規(guī)模語(yǔ)言模型的基礎(chǔ)上,使用由(指令,輸出)對(duì)組成的監(jiān)督數(shù)據(jù)進(jìn)行進(jìn)一步訓(xùn)練,以彌合模型原有的下一個(gè)詞預(yù)測(cè)目標(biāo)與用戶希望模型遵循指令這一目標(biāo)之間的差距。其目的是增強(qiáng)模型的能力和可控性。
>> 指令微調(diào)的方法,包括構(gòu)建指令數(shù)據(jù)集、進(jìn)行指令微調(diào)等。構(gòu)建指令數(shù)據(jù)集可基于現(xiàn)有數(shù)據(jù)集轉(zhuǎn)換,也可以使用語(yǔ)言模型自動(dòng)生成。指令微調(diào)則是在指令數(shù)據(jù)集上進(jìn)行監(jiān)督訓(xùn)練。
>> 指令數(shù)據(jù)集的類型,包括自然指令、非自然指令、跨語(yǔ)言指令、對(duì)話指令等多種類型。
>> 應(yīng)用指令微調(diào)的語(yǔ)言模型,如InstructGPT、Alpaca、Vicuna等在大型預(yù)訓(xùn)練語(yǔ)言模型基礎(chǔ)上進(jìn)行指令微調(diào)的模型。
>> 指令微調(diào)的效果評(píng)估、分析和批評(píng),需要關(guān)注指令數(shù)據(jù)集的質(zhì)量、指令學(xué)習(xí)是否只停留在表面模仿等問(wèn)題。
>> 提高指令微調(diào)效率的方法,如基于適配器、重參數(shù)化等方法來(lái)進(jìn)行高效微調(diào)。
LLMs指令微調(diào)技術(shù)通過(guò)構(gòu)建豐富的指令數(shù)據(jù)集并采用有監(jiān)督學(xué)習(xí)的方式,能有效提升開(kāi)源LLMs的能力和可控性。主要技術(shù)點(diǎn)包括:構(gòu)建多種類型的指令數(shù)據(jù)集(自然指令、非自然指令以及多模態(tài)指令等);在GPT、T5、LLaMA等骨干模型上進(jìn)行指令微調(diào);以及采用LOMO、Delta微調(diào)等高效微調(diào)技術(shù)。指令微調(diào)取得了很好的效果,但其是否只是學(xué)習(xí)了表面模式尚存在爭(zhēng)議,未來(lái)應(yīng)注重提升指令數(shù)據(jù)質(zhì)量并進(jìn)行多方面評(píng)估。
相關(guān)文章
LLMs之Data:指令微調(diào)的簡(jiǎn)介、Self Instruction思想(一種生成指令數(shù)據(jù)集的方法論—主要用在指令微調(diào)階段)的簡(jiǎn)介、Alpaca/BELLE應(yīng)用、實(shí)戰(zhàn)案例代碼實(shí)現(xiàn)之詳細(xì)攻略
《Instruction Tuning for Large Language Models: A Survey—大型語(yǔ)言模型的指令調(diào)優(yōu)的綜述》翻譯與解讀
地址
論文地址:https://arxiv.org/abs/2308.10792
文章地址:Instruction Tuning for Large Language Models: A Survey | Papers With Code
文章地址:Instruction Tuning for Large Language Models: A Survey - AMiner
時(shí)間
2023年8月21日
作者
浙江大學(xué)等
Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu and Guoyin Wang
Abstract摘要
指令微調(diào)技術(shù)(增強(qiáng)LLM的能力和可控性):在(指令,輸出)對(duì)上對(duì)LLM進(jìn)行有監(jiān)督的進(jìn)一步訓(xùn)練
This paper surveys research works in the quickly advancing field of instruction tuning (IT), a crucial technique to enhance the capabilities and controllability of large language models (LLMs). Instruction tuning refers to the process of further training LLMs on a dataset consisting of (Instruction, Output) pairs in a supervised fashion, which bridges the gap between the next-word prediction objective of LLMs and the users’ objective of having LLMs adhere to human instructions. In this work, we make a systematic review of the literature, including the general methodology of IT, the construction of IT datasets, the training of IT models, and applications to different modalities, domains and applications, along with analysis on aspects that influence the outcome of IT (e.g., generation of instruction outputs, size of the instruction dataset, etc.). We also review the potential pitfalls of IT along with criticism against it, point out current deficiencies of existing strategies, and suggest some avenues for fruitful research.
本文調(diào)查了指令微調(diào)(IT)領(lǐng)域中的研究工作,這是一種關(guān)鍵技術(shù),用于增強(qiáng)大型語(yǔ)言模型(LLM)的能力和可控性。指令微調(diào)是指以監(jiān)督方式進(jìn)一步訓(xùn)練LLM,使用由(Instruction, Output)對(duì)組成的數(shù)據(jù)集,從而彌合LLM的下一個(gè)詞預(yù)測(cè)目標(biāo)與用戶要求LLM遵循人類指令的目標(biāo)之間的差距。
在本工作中,我們對(duì)文獻(xiàn)進(jìn)行了系統(tǒng)回顧,包括IT的一般方法、IT數(shù)據(jù)集的構(gòu)建、IT模型的訓(xùn)練,以及在不同模態(tài)、領(lǐng)域和應(yīng)用場(chǎng)景中的應(yīng)用,并分析了影響IT效果的因素(例如,指令輸出的生成方式、指令數(shù)據(jù)集的大小等)。我們還回顧了IT的潛在缺陷及針對(duì)它的批評(píng),指出了現(xiàn)有策略的不足之處,并提出了一些有前景的研究方向。
1 Introduction引言
LLM顯著進(jìn)展(GPT-3→PaLM→LLaMA)、當(dāng)前痛點(diǎn)(訓(xùn)練目標(biāo)與用戶目標(biāo)間的不匹配)、
The field of large language models (LLMs) has witnessed remarkable progress in recent years. LLMs such as GPT-3 (Brown et al., 2020b), PaLM (Chowdhery et al., 2022), and LLaMA (Touvron et al., 2023a) have demonstrated impressive capabilities across a wide range of natural language tasks (Zhao et al., 2021; Wang et al., 2022b, 2023a; Wan et al., 2023; Sun et al., 2023c; Wei et al., 2023; Li et al., 2023a; Gao et al., 2023a; Yao et al., 2023; Yang et al., 2022a; Qian et al., 2022; Lee et al., 2022; Yang et al., 2022b; Gao et al., 2023b; Ning et al., 2023; Liu et al., 2021b; Wiegreffe et al., 2021; Sun et al., 2023b,a;Adlakha et al., 2023; Chen et al., 2023). One of the major issues with LLMs is the mismatch between the training objective and users’ objective: LLMs are typically trained on minimizing the contextual word prediction error on large corpora; while users want the model to "follow their instructions helpfully and safely" (Radford et al., 2019; Brown et al., 2020a; Fedus et al., 2021; Rae et al., 2021; Thoppilan et al., 2022)
近年來(lái),大型語(yǔ)言模型(LLM)領(lǐng)域取得了顯著進(jìn)展。諸如GPT-3(Brown等,2020b)、PaLM(Chowdhery等,2022)和LLaMA(Touvron等,2023a)等LLM在各種自然語(yǔ)言任務(wù)中展示了令人印象深刻的能力。
LLM的一個(gè)主要問(wèn)題是訓(xùn)練目標(biāo)與用戶目標(biāo)之間的不匹配:LLM通常在最小化大型語(yǔ)料庫(kù)上的上下文詞預(yù)測(cè)誤差的基礎(chǔ)上進(jìn)行訓(xùn)練,而用戶希望模型“有助于并安全地遵循他們的指令”(Radford等,2019;Brown等,2020a;Fedus等,2021;Rae等,2021;Thoppilan等,2022)。
提出指令微調(diào)技術(shù)(解決不匹配)、指令微調(diào)的3個(gè)好處(彌合差距+為人類提供介入模型行為的渠道+計(jì)算高效)
To address this mismatch, instruction tuning (IT) is proposed, serving as an effective technique to enhance the capabilities and controllability of large language models. It involves further training LLMs using (Instruction, Output) pairs, where INSTRUCTION denotes the human instruction for the model, and OUTPUT denotes the desired output that follows the INSTRUCTION. The benefits of IT are threefold: (1) Finetuning an LLM on the instruction dataset bridges the gap between the next-word prediction objective of LLMs and the users’ objective of instruction following; (2) IT allows for a more controllable and predictable model behavior compared to standard LLMs. The instructions serve to constrain the model’s outputs to align with the desired response characteristics or domain knowledge, providing a channel for humans to intervene with the model’s behaviors; and (3) IT is computationally efficient and can help LLMs rapidly adapt to a specific domain without extensive retraining or architectural changes.
為了解決這種不匹配,提出了指令微調(diào)(IT),作為增強(qiáng)大型語(yǔ)言模型能力和可控性的有效技術(shù)。它涉及使用(Instruction, Output)對(duì)進(jìn)一步訓(xùn)練LLM,其中指令表示模型的人類指令,輸出表示遵循指令的所需輸出。
IT的好處有三個(gè):
(1)在指令數(shù)據(jù)集上微調(diào)LLM彌合了LLM的下一個(gè)詞預(yù)測(cè)目標(biāo)與用戶遵循指令目標(biāo)之間的差距;
(2)與標(biāo)準(zhǔn)LLM相比,IT允許模型行為更可控和可預(yù)測(cè)。指令用于限制模型的輸出,使其與期望的響應(yīng)特性或領(lǐng)域知識(shí)保持一致,為人類提供介入模型行為的渠道;
(3)IT在計(jì)算上是高效的,并且可以幫助LLM在不需要大量重新訓(xùn)練或架構(gòu)更改的情況下迅速適應(yīng)特定領(lǐng)域。
指令微調(diào)的3大挑戰(zhàn):高質(zhì)量性、改善嚴(yán)重依賴數(shù)據(jù)性、可能只學(xué)皮毛性
Despite its effectiveness, IT also poses challenges: (1) Crafting high-quality instructions that properly cover the desired target behaviors is non-trivial: existing instruction datasets are usually limited in quantity, diversity, and creativity; (2) there has been an increasing concern that IT only improves on tasks that are heavily supported in the IT training dataset (Gudibande et al., 2023); and (3) there has been an intense criticism that IT only captures surface-level patterns and styles (e.g., the output format) rather than comprehending and learning the task (Kung and Peng, 2023). Improving instruction adherence and handling unanticipated model responses remain open research problems. These challenges highlight the importance of further investigations, analysis, and summarization in this field, to optimize the fine-tuning process and better understand the behavior of instruction fine-tuned LLMs.
盡管IT十分有效,它也帶來(lái)了挑戰(zhàn):
(1)制定高質(zhì)量的指令以正確覆蓋所需的目標(biāo)行為并不容易:現(xiàn)有的指令數(shù)據(jù)集通常在數(shù)量、多樣性和創(chuàng)意方面受限;
(2)越來(lái)越多的人擔(dān)心,IT只會(huì)改善那些在IT訓(xùn)練數(shù)據(jù)集中得到大量支持的任務(wù)(Gudibande et al., 2023);
(3)有人強(qiáng)烈批評(píng)IT只捕獲表面模式和樣式(例如,輸出格式),而不是理解和學(xué)習(xí)任務(wù)(Kung和Peng,2023)。
改進(jìn)指令遵循和處理意外模型響應(yīng)仍然是未解決的研究問(wèn)題。
這些挑戰(zhàn)強(qiáng)調(diào)了進(jìn)一步調(diào)查、分析和總結(jié)在這一領(lǐng)域的重要性,以優(yōu)化微調(diào)過(guò)程并更好地理解經(jīng)過(guò)指令微調(diào)的LLM的行為。
In the literature, there has been an increasing research interest in analysis and discussions on LLMs, including pre-training methods (Zhao et al., 2023), reasoning abilities (Huang and Chang, 2022), downstream applications (Yang et al., 2023; Sun et al., 2023b), but rarely on the topic of LLM instruction finetuning. This survey attempts to fill this blank, organizing the most up-to-date state of knowledge on this quickly advancing field. Specifically,
>>Section 2 presents the general methodology employed in instruction fine-tuning.
>>Section 3 outlines the construction process of commonly-used IT representative datasets.
>>Section 4 presents representative instruction-finetuned models.
>>Section 5 reviews multi-modality techniques and datasets for instruction tuning, including images, speech, and video.
>>Section 6 reviews efforts to adapt LLMs to different domains and applications using the IT strategy.
>>Section 7 reviews explorations to make instruction fine-tuning more efficient, reducing the computational and time costs associated with adapting large models.
>>Section 8 presents the evaluation of IT models, analysis on them, along with criticism against them.
在文獻(xiàn)中,人們?cè)絹?lái)越關(guān)注對(duì)LLM進(jìn)行分析和討論,包括預(yù)訓(xùn)練方法(Zhao等,2023),推理能力(Huang和Chang,2022),下游應(yīng)用(Yang等,2023;Sun等,2023b),但很少涉及LLM指令微調(diào)這個(gè)主題。本調(diào)查試圖填補(bǔ)這一空白,整理關(guān)于這一快速發(fā)展領(lǐng)域的最新知識(shí)狀態(tài)。具體而言,
第2節(jié)介紹了指令微調(diào)中采用的一般方法。
第3節(jié)概述了常用IT代表性數(shù)據(jù)集的構(gòu)建過(guò)程。
第4節(jié)介紹了代表性的經(jīng)過(guò)指令微調(diào)的模型。
第5節(jié)回顧了用于指令微調(diào)的多模態(tài)技術(shù)和數(shù)據(jù)集,包括圖像、語(yǔ)音和視頻。
第6節(jié)回顧了使用IT策略將LLM調(diào)整為不同領(lǐng)域和應(yīng)用的努力。
第7節(jié)回顧了使指令微調(diào)更高效的探索,減少與調(diào)整大型模型相關(guān)的計(jì)算和時(shí)間成本。
第8節(jié)介紹了對(duì)IT模型的評(píng)估、分析以及對(duì)它們的批評(píng)。
2、Methodology方法
In this section, we describe the general pipeline employed in instruction tuning.
在本節(jié)中,我們描述了指令微調(diào)中采用的一般流程。
2.1、Instruction Dataset Construction指令數(shù)據(jù)集構(gòu)建:
數(shù)據(jù)實(shí)例三元素:instruction【指定任務(wù)】、input【補(bǔ)充上下文】、output【預(yù)期輸出】
Each instance in an instruction dataset consists of three elements: an instruction, which is a natural language text sequence to specify the task (e.g., write a thank-you letter to XX for XX, write a blog on the topic of XX, etc); an optional input which provides supplementary information for context; and an anticipated output based on the instruction and the input.
指令數(shù)據(jù)集中的每個(gè)實(shí)例包含三個(gè)元素:
instruction:一個(gè)instruction,是一系列自然語(yǔ)言文本序列,用于指定任務(wù)(例如,為XX寫(xiě)一封感謝信,為XX寫(xiě)一篇關(guān)于XX主題的博客等);
input :可選的input ,為上下文提供補(bǔ)充信息;
output:以及基于指令和輸入預(yù)期的output 。
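為直觀起見(jiàn),下面給出一條指令數(shù)據(jù)實(shí)例的極簡(jiǎn)示意(Python表示;字段名采用社區(qū)常見(jiàn)寫(xiě)法,具體取值為筆者虛構(gòu),僅用于說(shuō)明三元素的組織方式,并非論文原文):

```python
# 一條指令微調(diào)數(shù)據(jù)實(shí)例的示意結(jié)構(gòu)(字段名為常見(jiàn)約定,取值為虛構(gòu)示例)
instance = {
    "instruction": "請(qǐng)為下列產(chǎn)品寫(xiě)一封感謝信。",              # instruction:指定任務(wù)
    "input": "產(chǎn)品:XX降噪耳機(jī);收件人:XX公司",               # input:可選的補(bǔ)充上下文
    "output": "尊敬的XX公司:感謝貴司提供的降噪耳機(jī)……",        # output:預(yù)期輸出
}

# 訓(xùn)練時(shí)通常把 instruction 與 input 拼接作為模型輸入,output 作為監(jiān)督目標(biāo)
prompt = instance["instruction"] + "\n" + instance["input"]
target = instance["output"]
print(prompt, "->", target)
```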
兩種方法構(gòu)建:T1基于現(xiàn)有數(shù)據(jù)集成策略法(Flan/P3)、T2基于指令收集【手動(dòng)/自動(dòng),如使用LLM的小型手寫(xiě)種子指令進(jìn)行擴(kuò)展】采用LLM【如GPT-3.5-Turbo/GPT4】自動(dòng)生成法(InstructWild/Self-Instruct)
There are generally two methods for constructing instruction datasets:
>>Data integration from annotated natural language datasets. In this approach, (Instruction, Output) pairs are collected from existing annotated natural language datasets by using templates to transform text-label pairs to (Instruction, Output) pairs. Datasets such as Flan (Longpre et al., 2023) and P3 (Sanh et al., 2021) are constructed based on the data integration strategy.
>>Generating outputs using LLMs: An alternate way to quickly gather the desired outputs to given instructions is to employ LLMs such as GPT-3.5-Turbo or GPT4 instead of manually collecting the outputs. Instructions can come from two sources: (1) manually collected; or (2) expanded from a small set of handwritten seed instructions using LLMs. Next, the collected instructions are fed to LLMs to obtain outputs. Datasets such as InstructWild (Xue et al., 2023) and Self-Instruct (Wang et al., 2022c) are generated following this approach.
通常有兩種方法用于構(gòu)建指令數(shù)據(jù)集:
>> 基于現(xiàn)有數(shù)據(jù)集成策略法—從帶注釋的自然語(yǔ)言數(shù)據(jù)集中集成數(shù)據(jù)。在這種方法中,通過(guò)使用模板將文本-標(biāo)簽對(duì)轉(zhuǎn)換為(Instruction, Output)對(duì),從現(xiàn)有的帶注釋的自然語(yǔ)言數(shù)據(jù)集中收集(Instruction, Output)對(duì)。Flan(Longpre等,2023)和P3(Sanh等,2021)等數(shù)據(jù)集是基于數(shù)據(jù)集集成策略構(gòu)建的。
>> 采用LLM自動(dòng)生成法—使用LLM生成輸出:一種快速獲取給定指令所需輸出的替代方法是使用LLM,例如GPT-3.5-Turbo或GPT4,而不是手動(dòng)收集輸出。指令可以來(lái)自兩個(gè)來(lái)源:(1)手動(dòng)收集;或(2)基于少量手寫(xiě)的種子指令,利用LLM進(jìn)行擴(kuò)展。接下來(lái),收集到的指令被輸入LLM以獲得輸出。InstructWild(Xue等,2023)和Self-Instruct(Wang等,2022c)等數(shù)據(jù)集是按照這種方法生成的。
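下面是第一種方法(基于現(xiàn)有標(biāo)注數(shù)據(jù)集+模板轉(zhuǎn)換)的一個(gè)簡(jiǎn)化示意(假設(shè)一個(gè)虛構(gòu)的情感分類數(shù)據(jù)集與兩條手寫(xiě)模板,并非Flan/P3的真實(shí)數(shù)據(jù)或模板):

```python
import random

# 假設(shè)的已標(biāo)注情感分類數(shù)據(jù)(僅作演示,并非 Flan/P3 的真實(shí)數(shù)據(jù))
labeled_data = [
    {"text": "He likes the cat.", "label": "positive"},
    {"text": "The movie was a waste of time.", "label": "negative"},
]

# 手寫(xiě)的指令模板:把“文本-標(biāo)簽”對(duì)改寫(xiě)成(Instruction, Output)對(duì)
templates = [
    "Determine whether the sentiment of the sentence '{text}' is positive or negative.",
    "Is the following review positive or negative? {text}",
]

def to_instruction_pair(example):
    template = random.choice(templates)   # 每條數(shù)據(jù)隨機(jī)選一個(gè)模板,以增加指令多樣性
    return {"instruction": template.format(text=example["text"]),
            "output": example["label"]}

instruction_dataset = [to_instruction_pair(x) for x in labeled_data]
for pair in instruction_dataset:
    print(pair)
```

實(shí)際的Flan/P3通常為每個(gè)數(shù)據(jù)集準(zhǔn)備多條人工模板以提升指令多樣性;第二種方法(LLM自動(dòng)生成)的流程可參見(jiàn)后文Self-Instruct一節(jié)的示意。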
多輪對(duì)話微調(diào)數(shù)據(jù)集:讓LLM扮演兩個(gè)對(duì)立角色來(lái)生成
For multi-turn conversational IT datasets, we can have large language models self-play different roles (user and AI assistant) to generate messages in a conversational format (Xu et al., 2023b).
對(duì)于多輪對(duì)話型的指令微調(diào)數(shù)據(jù)集,我們可以讓大型語(yǔ)言模型扮演不同角色(用戶和AI助手),以生成對(duì)話格式的消息(Xu等,2023b)。
2.2、Instruction Tuning指令微調(diào):有監(jiān)督的訓(xùn)練
Based on the collected IT dataset, a pretrained model can be directly fine-tuned in a fully-supervised manner, where given the instruction and the input, the model is trained by predicting each token in the output sequentially.
基于收集到的指令微調(diào)數(shù)據(jù)集,可以以完全監(jiān)督的方式直接微調(diào)預(yù)訓(xùn)練模型,其中在給定指令和輸入的情況下,模型通過(guò)逐個(gè)預(yù)測(cè)輸出中的每個(gè)令牌來(lái)進(jìn)行訓(xùn)練。
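下面用PyTorch風(fēng)格給出一個(gè)最小示意,說(shuō)明“在給定指令和輸入的情況下逐個(gè)預(yù)測(cè)輸出token”在實(shí)現(xiàn)上的常見(jiàn)做法:把prompt部分的標(biāo)簽置為-100、只在output部分計(jì)算交叉熵?fù)p失(函數(shù)與變量名為筆者假設(shè),并非某個(gè)具體庫(kù)的固定接口):

```python
import torch

def build_sft_example(prompt_ids, output_ids, pad_id=0, max_len=32):
    """把 (prompt, output) 的 token id 拼成一條訓(xùn)練樣本,只監(jiān)督 output 部分。"""
    input_ids = prompt_ids + output_ids
    labels = [-100] * len(prompt_ids) + output_ids        # prompt 部分不計(jì)損失
    pad = max(0, max_len - len(input_ids))
    input_ids = (input_ids + [pad_id] * pad)[:max_len]
    labels = (labels + [-100] * pad)[:max_len]
    return torch.tensor([input_ids]), torch.tensor([labels])

input_ids, labels = build_sft_example(prompt_ids=[11, 12, 13], output_ids=[21, 22])
print(input_ids.shape, labels)
# 若使用 Hugging Face 風(fēng)格的因果語(yǔ)言模型,可直接:
# loss = model(input_ids=input_ids, labels=labels).loss   # 逐 token 交叉熵
```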
3、Datasets數(shù)據(jù)集:大多都是英文指令,Natural Instructions/Unnatural Instructions/Super-Natural Instructions、P3/xP3、Flan 2021、Self-Instruct、Evol-Instruct、LIMA、Dolly、OpenAssistant Conversations、Baize
In this section, we detail widely-used instruction tuning datasets in the community. Table 1 gives an overview of the datasets.
在本節(jié)中,我們?cè)敿?xì)介紹了社區(qū)中廣泛使用的指令微調(diào)數(shù)據(jù)集。表格1提供了數(shù)據(jù)集的概述。
3.1、Natural Instructions自然指令:包含193K個(gè)實(shí)例,來(lái)自61個(gè)NLP任務(wù),2元組{輸入,輸出}
Natural Instructions (Mishra et al., 2021) is a human-crafted English instruction dataset consisting of 193K instances, coming from 61 distinct NLP tasks. The dataset is comprised of "instructions" and "instances". Each instance in the "instructions" is a task description consisting of 7 components: title, definition, things to avoid, emphasis/caution, prompt, positive example, and negative example. Subfigure (a) in Figure 2 gives an example of the "instructions". "Instances" consists of ("input", "output") pairs, which are the input data and textual result that follows the given instruction correctly. Subfigure (b) in Figure 2 gives an example of the instances.
The data comes from existing NLP datasets of 61 tasks. The authors collected the "instructions" by referring to the dataset annotating instruction file. Next, the authors constructed the "instances" by unifying data instances across all NLP datasets to ("input", "output") pairs.
Natural Instructions(Mishra等,2021)是一個(gè)人工創(chuàng)建的英語(yǔ)指令數(shù)據(jù)集,包含了193K個(gè)實(shí)例,來(lái)自61個(gè)不同的自然語(yǔ)言處理任務(wù)。數(shù)據(jù)集由“指令”和“實(shí)例”組成。
在“指令”中,每個(gè)實(shí)例是一個(gè)任務(wù)描述,包括7個(gè)組成部分:標(biāo)題、定義、需避免的事項(xiàng)、強(qiáng)調(diào)/注意事項(xiàng)、提示、正面示例和負(fù)面示例。
圖2(a)中的子圖示例展示了“指令”的一個(gè)示例。而“實(shí)例”由(“輸入”,“輸出”)對(duì)組成,即輸入數(shù)據(jù)和按照給定指令正確生成的文本結(jié)果。圖2(b)中的子圖示例展示了“實(shí)例”的一個(gè)示例。
這些數(shù)據(jù)來(lái)自61個(gè)任務(wù)的現(xiàn)有自然語(yǔ)言處理數(shù)據(jù)集。作者通過(guò)參考數(shù)據(jù)集的指令注釋文件來(lái)收集“指令”。接下來(lái),作者通過(guò)將所有NLP數(shù)據(jù)集中的數(shù)據(jù)實(shí)例統(tǒng)一為(“輸入”,“輸出”)對(duì)來(lái)構(gòu)建“實(shí)例”。
3.2、P3公共提示池:整合170個(gè)英語(yǔ)NLP數(shù)據(jù)集和2052個(gè)英語(yǔ)提示,三元組{“輸入”【描述任務(wù)】+“答案選擇”【響應(yīng)列表】+“目標(biāo)”【正確響應(yīng)】}
P3 (Public Pool of Prompts) (Sanh et al., 2021) is an instruction fine-tuning dataset constructed by integrating 170 English NLP datasets and 2,052 English prompts. Prompts, which are sometimes named task templates, are functions that map a data instance in a conventional NLP task (e.g., question answering, text classification) to a natural language input-output pair.
Each instance in P3 has three components: "inputs", "answer_choices", and "targets". "Inputs" is a sequence of text that describes the task in natural language (e.g., "If he like Mary is true, is it also true that he like Mary’s cat?"). "Answer choices" is a list of text strings that are applicable responses to the given task (e.g., ["yes", "no", "undetermined"]). "Targets" is a text string that is the correct response to the given "inputs" (e.g., "yes"). The authors built PromptSource, a tool for creating high-quality prompts collaboratively and an archive for open-sourcing high-quality prompts. The P3 dataset was built by randomly sampling a prompt from multiple prompts in PromptSource and mapping each instance into an ("inputs", "answer choices", "targets") triplet.
P3(Public Pool of Prompts)(Sanh等,2021)是一個(gè)指令微調(diào)數(shù)據(jù)集,通過(guò)整合170個(gè)英語(yǔ)自然語(yǔ)言處理數(shù)據(jù)集和2052個(gè)英語(yǔ)提示來(lái)構(gòu)建。提示有時(shí)被稱為任務(wù)模板,是一種將傳統(tǒng)自然語(yǔ)言處理任務(wù)(例如,問(wèn)題回答、文本分類)的數(shù)據(jù)實(shí)例映射到自然語(yǔ)言輸入-輸出對(duì)的功能。
P3中的每個(gè)實(shí)例有三個(gè)組成部分:“輸入”,“答案選擇”和“目標(biāo)”。 “輸入”是一系列以自然語(yǔ)言描述任務(wù)的文本序列(例如,“如果他喜歡瑪麗是真的,那么他是否也喜歡瑪麗的貓?”)。 “答案選擇”是一個(gè)文本字符串列表,是給定任務(wù)的適用響應(yīng)(例如,“是”,“否”,“不確定”)。 “目標(biāo)”是文本字符串,是給定“輸入”的正確響應(yīng)(例如,“是”)。
作者構(gòu)建了PromptSource,這是一個(gè)協(xié)作創(chuàng)建高質(zhì)量提示的工具,也是一個(gè)開(kāi)源高質(zhì)量提示的存檔。P3數(shù)據(jù)集是通過(guò)從PromptSource中隨機(jī)抽樣選擇一個(gè)提示,將每個(gè)實(shí)例映射為一個(gè)(“輸入”,“答案選擇”,“目標(biāo)”)三元組而構(gòu)建的。
3.3、xP3跨語(yǔ)言公共提示池:46種語(yǔ)言中16類NLP任務(wù),2元組{輸入和目標(biāo)}
xP3 (Crosslingual Public Pool of Prompts) (Muennighoff et al., 2022) is a multilingual instruction dataset consisting of 16 diverse natural language tasks in 46 languages. Each instance in the dataset has two components: "inputs" and "targets". "Inputs" is a task description in natural language. "Targets" is the textual result that follows the "inputs" instruction correctly.
The original data in xP3 comes from three sources: the English instruction dataset P3, 4 English tasks unseen in P3 (e.g., translation, program synthesis), and 30 multilingual NLP datasets. The authors built the xP3 dataset by sampling human-written task templates from PromptSource and then filling templates to transform diverse NLP tasks into a unified formalization. For example, a task template for the natural language inference task is as follows: "If Premise is true, is it also true that Hypothesis?", where the labels "yes", "maybe", and "no" correspond to the original task labels "entailment (0)", "neutral (1)" and "contradiction (2)".
xP3(Crosslingual Public Pool of Prompts)(Muennighoff等,2022)是一個(gè)多語(yǔ)言指令數(shù)據(jù)集,包含46種語(yǔ)言中16個(gè)不同的自然語(yǔ)言處理任務(wù)。
數(shù)據(jù)集中的每個(gè)實(shí)例有兩個(gè)組成部分:“輸入”和“目標(biāo)”。 “輸入”是自然語(yǔ)言中的任務(wù)描述。 “目標(biāo)”是按照“輸入”指令正確生成的文本結(jié)果。
xP3中的原始數(shù)據(jù)來(lái)自三個(gè)來(lái)源:英語(yǔ)指令數(shù)據(jù)集P3,P3中的4個(gè)英語(yǔ)未見(jiàn)過(guò)的任務(wù)(例如,翻譯、程序合成)以及30個(gè)多語(yǔ)言自然語(yǔ)言處理數(shù)據(jù)集。作者通過(guò)從PromptSource中隨機(jī)抽樣選擇人工編寫(xiě)的任務(wù)模板,然后填充模板,將不同的自然語(yǔ)言處理任務(wù)轉(zhuǎn)換為統(tǒng)一的形式,從而構(gòu)建了xP3數(shù)據(jù)集。
3.4、Flan 2021:將62個(gè)NLP基準(zhǔn)轉(zhuǎn)換為輸入-輸出對(duì)進(jìn)而構(gòu)建,2元組{輸入+目標(biāo)}
Flan 2021 (Longpre et al., 2023) is an English instruction dataset constructed by transforming 62 widely-used NLP benchmarks (e.g., SST-2, SNLI, AG News, MultiRC) into language input- output pairs. Each instance in the Flan 2021 has "input" and "target" components. "Input" is a sequence of text that describes a task via a natural language instruction (e.g., "determine the sentiment of the sentence ’He likes the cat.’ is positive or negative?"). "Target" is a textual result that executes the "input" instruction correctly (e.g., "positive"). The authors transformed conventional NLP datasets into input-target pairs by: Step 1: manually composing instruction and target templates; Step 2: filling templates with data instances from the dataset.
Flan 2021(Longpre等,2023)是一個(gè)英語(yǔ)指令數(shù)據(jù)集,通過(guò)將62個(gè)廣泛使用的自然語(yǔ)言處理基準(zhǔn)(例如,SST-2、SNLI、AG News、MultiRC)轉(zhuǎn)換為語(yǔ)言輸入-輸出對(duì)來(lái)構(gòu)建。Flan 2021中的每個(gè)實(shí)例包含“輸入”和“目標(biāo)”兩個(gè)組成部分。“輸入”是描述任務(wù)的自然語(yǔ)言指令序列(例如,“確定句子'他喜歡貓。'的情感是積極還是消極?”)。 “目標(biāo)”是正確執(zhí)行“輸入”指令的文本結(jié)果(例如,“積極”)。作者通過(guò)以下步驟將傳統(tǒng)的自然語(yǔ)言處理數(shù)據(jù)集轉(zhuǎn)換為輸入-目標(biāo)對(duì):
步驟1:手動(dòng)組合指令和目標(biāo)模板;
步驟2:使用數(shù)據(jù)集中的數(shù)據(jù)實(shí)例填充模板。
3.5、Unnatural Instructions非自然指令:基于InstructGPT構(gòu)建的24萬(wàn)個(gè)實(shí)例,4元組{指令+輸入+約束+輸出}
Unnatural Instructions (Honovich et al., 2022) is an instruction dataset with approximately 240,000 instances, constructed using InstructGPT (text-davinci-002) (Ouyang et al., 2022). Each instance in the dataset has four components: INSTRUCTION, INPUT, CONSTRAINTS, and OUTPUT. "Instruction" is a description of the instructing task in natural language. "Input" is an argument in natural language that instantiates the instruction task.
非自然指令(Honovich等,2022)是一個(gè)包含約24萬(wàn)個(gè)實(shí)例的指令數(shù)據(jù)集,使用InstructGPT(text-davinci-002)(Ouyang等,2022)構(gòu)建而成。數(shù)據(jù)集中的每個(gè)實(shí)例有四個(gè)組成部分:指令、輸入、約束和輸出。 “指令”是自然語(yǔ)言中的指令任務(wù)描述。 “輸入”是實(shí)例化指令任務(wù)的自然語(yǔ)言參數(shù)。
3.6、Self-Instruct
LLMs之Data:指令微調(diào)的簡(jiǎn)介、Self Instruction思想(一種生成指令數(shù)據(jù)集的方法論—主要用在指令微調(diào)階段)的簡(jiǎn)介、Alpaca/BELLE應(yīng)用、實(shí)戰(zhàn)案例代碼實(shí)現(xiàn)之詳細(xì)攻略
包含基于InstructGPT的52K個(gè)訓(xùn)練指令和252個(gè)評(píng)估指令,3元組{“指令”【定義任務(wù)】+“輸入”【指令的內(nèi)容補(bǔ)充】+“輸出”【正確結(jié)果】}
Self-Instruct (Wang et al., 2022c) is an English instruction dataset with 52K training instructions and 252 evaluation instructions, constructed using InstructGPT (Ouyang et al., 2022). Each data instance consists of "instruction", "input" and "output". "Instruction" is a task definition in natural language (e.g., "Please answer the following question."). "Input" is optional and is used as supplementary content for the instruction (e.g., "Which country’s capital is Beijing?"), and "output" is the textual result that follows the instruction correctly (e.g., "China").
自我指導(dǎo)(Self-Instruct)(Wang等,2022c)是一個(gè)英語(yǔ)指令數(shù)據(jù)集,包含52K個(gè)訓(xùn)練指令和252個(gè)評(píng)估指令,使用InstructGPT(Ouyang等,2022)構(gòu)建而成。每個(gè)數(shù)據(jù)實(shí)例包括“指令”、“輸入”和“輸出”三個(gè)部分。“指令”是自然語(yǔ)言中的任務(wù)定義(例如,“請(qǐng)回答以下問(wèn)題。”)。“輸入”是可選的,用作指令的補(bǔ)充內(nèi)容(例如,“哪個(gè)國(guó)家的首都是北京?”),而“輸出”是正確遵循指令生成的文本結(jié)果(例如,“中國(guó)”)。
生成四步驟:構(gòu)建示例(175個(gè)種子任務(wù)來(lái)抽樣8個(gè)自然語(yǔ)言指令)來(lái)提示InstructGPT生成更多指令→判斷是否分類任務(wù)+基于給定的“指令”提示InstructGPT生成“輸入”再結(jié)合生成“輸出”→為相應(yīng)的指令任務(wù)生成“輸入”和“輸出”→后處理(過(guò)濾和刪除重復(fù))→最終得到52K個(gè)英語(yǔ)指令
The full dataset is generated based on the following steps: Step 1. The authors randomly sampled 8 natural language instructions from the 175 seed tasks as examples and prompted InstructGPT to generate more task instructions.
Step 2. The authors determined whether the instructions generated in Step 1 is a classification task. If yes, they asked InstructGPT to generate all possible options for the output based on the given instruction and randomly selected a particular output category to prompt InstructGPT to generate the corresponding "input" content. For Instructions that do not belong to a classification task, there should be countless "output" options. The authors proposed to use the Input-first strategy, where InstructGPT was prompted to generate the "input" based on the given "instruction" first and then generate the "output" according to the "instruction" and the generated "input".
Step 3. Based on results of step-2, the authors used InstructGPT to generate the "input" and "output" for corresponding instruction tasks using the output-first or input-first strategy.
Step 4. The authors post-processed (e.g., filtering out similar instructions and removing duplicate data for input and output) the generated instruction tasks and got a final number of 52K English instructions.
整個(gè)數(shù)據(jù)集是通過(guò)以下步驟生成的:
步驟1:作者隨機(jī)從175個(gè)種子任務(wù)中抽樣8個(gè)自然語(yǔ)言指令作為示例,并提示InstructGPT生成更多的任務(wù)指令。
步驟2:作者確定步驟1中生成的指令是否是分類任務(wù)。如果是,他們要求InstructGPT基于給定的指令生成所有可能的輸出選項(xiàng),并隨機(jī)選擇一個(gè)特定的輸出類別,以促使InstructGPT生成相應(yīng)的“輸入”內(nèi)容。對(duì)于不屬于分類任務(wù)的指令,應(yīng)該有無(wú)數(shù)個(gè)“輸出”選項(xiàng)。作者提出了首先生成“輸入”的策略,即首先基于給定的“指令”提示InstructGPT生成“輸入”,然后根據(jù)“指令”和生成的“輸入”生成“輸出”。
步驟3:根據(jù)步驟2的結(jié)果,作者使用InstructGPT基于輸出優(yōu)先或輸入優(yōu)先策略為相應(yīng)的指令任務(wù)生成“輸入”和“輸出”。
步驟4:作者對(duì)生成的指令任務(wù)進(jìn)行后處理(例如,過(guò)濾相似指令,刪除輸入和輸出的重復(fù)數(shù)據(jù)),得到最終的52K個(gè)英語(yǔ)指令。
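下面給出Self-Instruct生成流程的一個(gè)簡(jiǎn)化骨架(`call_llm`為占位函數(shù),代表調(diào)用InstructGPT等接口;提示語(yǔ)為筆者概括,去重在原文中使用ROUGE相似度過(guò)濾,這里僅以字符串去重示意):

```python
import random

def call_llm(prompt: str) -> str:
    """占位:實(shí)際中調(diào)用 InstructGPT / GPT-3.5 等接口,此處僅返回示意文本。"""
    return f"<LLM 對(duì)「{prompt[:20]}…」的生成結(jié)果>"

def self_instruct(seed_instructions, num_rounds=3):
    pool = list(seed_instructions)
    for _ in range(num_rounds):
        examples = random.sample(pool, min(8, len(pool)))            # 步驟1:抽8條示例提示LLM生成新指令
        new_inst = call_llm("請(qǐng)仿照以下指令,寫(xiě)一條新的任務(wù)指令:\n" + "\n".join(examples))
        is_cls = "是" in call_llm(f"指令「{new_inst}」是否為分類任務(wù)?回答是/否")
        if is_cls:                                                    # 步驟2/3:分類任務(wù)用 output-first
            out = call_llm(f"給出指令「{new_inst}」的一個(gè)可能輸出類別")
            inp = call_llm(f"為輸出類別「{out}」生成一個(gè)符合該指令的輸入")
        else:                                                         # 非分類任務(wù)用 input-first
            inp = call_llm(f"為指令「{new_inst}」生成一個(gè)輸入")
            out = call_llm(f"根據(jù)指令「{new_inst}」和輸入「{inp}」生成輸出")
        if new_inst not in pool:                                      # 步驟4:后處理(實(shí)際用相似度過(guò)濾)
            pool.append(new_inst)
    return pool

print(len(self_instruct(["請(qǐng)回答以下問(wèn)題。", "把下面的句子翻譯成英文。"])))
```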
3.7、Evol-Instruct:包含基于ChatGPT采用進(jìn)化策略(添加約束、增加推理步驟、復(fù)雜化輸入等)構(gòu)建的52K個(gè)訓(xùn)練指令和218個(gè)評(píng)估指令,二元組{instruction, response}
形成過(guò)程:基于52K的初始集→隨機(jī)選擇1個(gè)進(jìn)化策略讓ChatGPT重寫(xiě)指令→過(guò)濾未進(jìn)化的指令對(duì)(利用ChatGPT和規(guī)則)→利用新生成進(jìn)化指令對(duì)更新數(shù)據(jù)集→重復(fù)上述四次→收集了25萬(wàn)個(gè)指令對(duì)
Evol-Instruct (Xu et al., 2023a) is an English instruction dataset consisting of a training set with 52K instructions and an evaluation set with 218 instructions. The authors prompted ChatGPT (OpenAI, 2022) to rewrite instructions using the in-depth and in-breadth evolving strategies. The in-depth evolving strategy contains five types of operations, e.g., adding constraints, increasing reasoning steps, complicating input, etc. The in-breadth evolving strategy upgrades the simple instruction to a more complex one or directly generates a new instruction to increase diversity. The authors first used 52K (instruction, response) pairs as the initial set. Then they randomly sampled an evolving strategy and asked ChatGPT to rewrite the initial instruction based on the chosen evolving strategy. The authors employed ChatGPT and rules to filter out non-evolved instruction pairs and updated the dataset with newly generated evolved instruction pairs. After repeating the above process 4 times, the authors collected 250K instruction pairs. Besides the train set, the authors collected 218 human-generated instructions from real scenarios (e.g., open-source projects, platforms, and forums), called the Evol-Instruct test set.
Evol-Instruct(Xu等,2023a)是一個(gè)英語(yǔ)指令數(shù)據(jù)集,包含一個(gè)含52K條指令的訓(xùn)練集和一個(gè)含218條指令的評(píng)估集。作者使用ChatGPT(OpenAI,2022),采用深度(in-depth)和廣度(in-breadth)兩類進(jìn)化策略重寫(xiě)指令來(lái)構(gòu)建這個(gè)數(shù)據(jù)集。深度進(jìn)化策略包含五種類型的操作,例如添加約束、增加推理步驟、復(fù)雜化輸入等。廣度進(jìn)化策略將簡(jiǎn)單指令升級(jí)為更復(fù)雜的指令,或直接生成新的指令以增加多樣性。
作者首先使用52K個(gè)?(instruction, response)對(duì)作為初始集。然后隨機(jī)選擇一個(gè)進(jìn)化策略,要求ChatGPT根據(jù)選擇的進(jìn)化策略重寫(xiě)初始指令。作者使用ChatGPT和規(guī)則來(lái)過(guò)濾掉未進(jìn)化的指令對(duì),并使用新生成的進(jìn)化指令對(duì)更新數(shù)據(jù)集。在重復(fù)上述過(guò)程4次之后,作者收集了25萬(wàn)個(gè)指令對(duì)。除了訓(xùn)練集之外,作者還從真實(shí)場(chǎng)景(例如,開(kāi)源項(xiàng)目、平臺(tái)和論壇)中收集了218個(gè)人工生成的指令,稱為Evol-Instruct測(cè)試集。
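下面是Evol-Instruct進(jìn)化循環(huán)的一個(gè)簡(jiǎn)化骨架(`call_chatgpt`為占位函數(shù);進(jìn)化提示語(yǔ)為筆者概括,真實(shí)的深度/廣度進(jìn)化提示詞與過(guò)濾規(guī)則見(jiàn)原論文):

```python
import random

def call_chatgpt(prompt: str) -> str:
    """占位:實(shí)際中調(diào)用 ChatGPT,此處僅返回示意文本。"""
    return prompt[-30:] + " <改寫(xiě)結(jié)果>"

DEPTH_OPS = [
    "請(qǐng)給下面的指令增加一條約束:",              # 添加約束
    "請(qǐng)把下面的指令改寫(xiě)得需要更多推理步驟:",      # 增加推理步驟
    "請(qǐng)把下面指令中的輸入變得更復(fù)雜:",            # 復(fù)雜化輸入
]
BREADTH_OP = "請(qǐng)圍繞下面指令的主題,寫(xiě)一條全新的、更少見(jiàn)的指令:"

def evolve(pairs, rounds=4):
    """pairs: [(instruction, response), ...];每輪隨機(jī)選策略讓 ChatGPT 重寫(xiě)指令。"""
    for _ in range(rounds):                                    # 原文重復(fù)4輪
        new_pairs = []
        for inst, _resp in pairs:
            op = random.choice(DEPTH_OPS + [BREADTH_OP])
            new_inst = call_chatgpt(op + "\n" + inst)
            if new_inst and new_inst != inst:                   # 簡(jiǎn)化的“未進(jìn)化”過(guò)濾
                new_pairs.append((new_inst, call_chatgpt(new_inst)))
        pairs = pairs + new_pairs                               # 用進(jìn)化后的指令對(duì)更新數(shù)據(jù)集
    return pairs

print(len(evolve([("寫(xiě)一首關(guān)于春天的詩(shī)。", "……")], rounds=2)))
```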
3.8、LIMA:包含1K數(shù)據(jù)實(shí)例的訓(xùn)練集(75%源自3個(gè)社區(qū)問(wèn)答網(wǎng)站)和300個(gè)實(shí)例的測(cè)試集,二元組{instruction, response}
LIMA (Zhou et al., 2023) is an English instruction dataset consisting of a train set with 1K data instances and a test set with 300 instances. The train set contains 1K ("instruction", "response") pairs. For the training data, 75% are sampled from three community question & answers websites (i.e., Stack Exchange, wikiHow, and the Pushshift Reddit Dataset (Baumgartner et al., 2020)); 20% are manually written by a set of the authors (referred Group A) inspired by their interests; 5% are sampled from the Super-Natural Instructions dataset (Wang et al., 2022d). As for the valid set, the authors sampled 50 instances from the Group A author-written set. The test set contains 300 examples, with 76.7% written by another group (Group B) of authors and 23.3% sampled from the Pushshift Reddit Dataset (Baumgartner et al., 2020), which is a collection of questions & answers within the Reddit community.
LIMA(Zhou等,2023)是一個(gè)英語(yǔ)指令數(shù)據(jù)集,包含一個(gè)包含1K個(gè)數(shù)據(jù)實(shí)例的訓(xùn)練集和一個(gè)包含300個(gè)實(shí)例的測(cè)試集。訓(xùn)練集包含1K個(gè)(instruction, response)對(duì)。對(duì)于訓(xùn)練數(shù)據(jù),其中75%來(lái)自三個(gè)社區(qū)問(wèn)答網(wǎng)站(即Stack Exchange、wikiHow和Pushshift Reddit數(shù)據(jù)集(Baumgartner等,2020));20%由一組作者(Group A)手動(dòng)編寫(xiě),受到他們興趣的啟發(fā);5%來(lái)自Super-Natural Instructions數(shù)據(jù)集(Wang等,2022d)。至于驗(yàn)證集,作者從Group A作者編寫(xiě)的集合中抽樣了50個(gè)實(shí)例。測(cè)試集包含300個(gè)示例,其中76.7%由另一組作者(Group B)編寫(xiě),23.3%來(lái)自Pushshift Reddit數(shù)據(jù)集(Baumgartner等,2020),這是Reddit社區(qū)中的問(wèn)題和回答的集合。
3.9、Super-Natural Instructions超級(jí)自然指令:包含1616個(gè)NLP任務(wù)和500萬(wàn)個(gè)任務(wù)實(shí)例+涵蓋76種任務(wù)類型和55種語(yǔ)言,二元組(“指令”和“任務(wù)實(shí)例”)
Super Natural Instructions (Wang et al., 2022f) is a multilingual instruction collection composed of 1,616 NLP tasks and 5M task instances, covering 76 distinct task types (e.g., text classification, information extraction, text rewriting, text composition, etc.) and 55 languages. Each task in the dataset consists of an "instruction" and "task instances". Specifically, "instruction" has three components: a "definition" that describes the task in natural language; "positive examples" that are samples of inputs and correct outputs, along with a short explanation for each; and "negative examples" that are samples of inputs and undesired outputs, along with a short explanation for each, as shown in Figure 2 (a). "Task instances" are data instances comprised of textual input and a list of acceptable textual outputs, as shown in Figure 2 (b). The original data in Super Natural Instructions comes from three sources: (1) existing public NLP datasets (e.g., CommonsenseQA); (2) applicable intermediate annotations that are generated through a crowdsourcing process (e.g., paraphrasing results to a given question during a crowdsourcing QA dataset); (3) synthetic tasks that are transformed from symbolic tasks and rephrased in a few sentences (e.g., algebraic operations like number comparison).
超級(jí)自然指令(Super Natural Instructions)(Wang等,2022f)是一個(gè)多語(yǔ)言指令收集,包含1616個(gè)自然語(yǔ)言處理任務(wù)和500萬(wàn)個(gè)任務(wù)實(shí)例,涵蓋76種不同的任務(wù)類型(例如,文本分類、信息提取、文本改寫(xiě)、文本組成等)和55種語(yǔ)言。數(shù)據(jù)集中的每個(gè)任務(wù)包括“指令”和“任務(wù)實(shí)例”兩個(gè)部分。
具體來(lái)說(shuō),“指令”有三個(gè)組成部分:以自然語(yǔ)言描述任務(wù)的“定義”;“正面示例”,它是輸入和正確輸出的示例,每個(gè)示例都附有簡(jiǎn)短的解釋;“負(fù)面示例”,它是輸入和不希望的輸出的示例,每個(gè)示例都附有簡(jiǎn)短的解釋,如圖2(a)所示。
“任務(wù)實(shí)例”是由文本輸入和可接受的文本輸出列表組成的數(shù)據(jù)實(shí)例,如圖2(b)所示。
超級(jí)自然指令中的原始數(shù)據(jù)來(lái)自三個(gè)來(lái)源:(1)現(xiàn)有的公共自然語(yǔ)言處理數(shù)據(jù)集(例如,CommonsenseQA);(2)通過(guò)眾包過(guò)程生成的適用中間注釋(例如,在眾包問(wèn)答數(shù)據(jù)集中對(duì)給定問(wèn)題進(jìn)行釋義);(3)從符號(hào)任務(wù)轉(zhuǎn)換而來(lái)且經(jīng)過(guò)重新表述的合成任務(wù),這些任務(wù)在幾句話中重新表述(例如,代數(shù)運(yùn)算,如數(shù)字比較)。
3.10、Dolly:包含15000個(gè)人工生成英語(yǔ)指令+7種特定類型
Dolly (Conover et al., 2023a) is an English instruction dataset with 15,000 human-generated data instances designed to enable LLMs to interact with users akin to ChatGPT. The dataset is designed for simulating a wide range of human behaviors, covering 7 specific types: open Q&A, closed Q&A, extracting information from Wikipedia, summarizing information from Wikipedia, brainstorming, classification, and creative writing. Examples of each task type in the dataset are shown in Table 2.
Dolly(Conover等,2023a)是一個(gè)包含15000個(gè)人工生成的數(shù)據(jù)實(shí)例的英語(yǔ)指令數(shù)據(jù)集,旨在使大型語(yǔ)言模型能夠與用戶進(jìn)行類似于ChatGPT的互動(dòng)。該數(shù)據(jù)集旨在模擬各種人類行為,涵蓋7種特定類型:開(kāi)放式問(wèn)答、封閉式問(wèn)答、從維基百科中提取信息、從維基百科中總結(jié)信息、頭腦風(fēng)暴、分類和創(chuàng)意寫(xiě)作。數(shù)據(jù)集中每種任務(wù)類型的示例如表2所示。
3.11、OpenAssistant Conversations
包含161K條消息(92K個(gè)用戶提示+70K個(gè)助手回復(fù)),來(lái)自35種語(yǔ)言中的66K個(gè)對(duì)話樹(shù),另有461K個(gè)人工注釋的質(zhì)量評(píng)分,對(duì)話樹(shù)(節(jié)點(diǎn),路徑/線程)
OpenAssistant Conversations (K?pf et al., 2023) is a human-crafted multilingual assistant-style conversation corpus consisting of 161,443 messages (i.e., 91,829 user prompts, 69,614 assistant replies) from 66,497 conversation trees in 35 languages, along with 461,292 human-annotated quality ratings. Each instance in the dataset is a conversation tree (CT). Specifically, each node in a conversation tree denotes a message generated by roles (i.e., prompter, assistant) in the conversation. A CT’s root node represents an initial prompt from the prompter, while other nodes denote replies from a prompter or an assistant. A path from the root to any node in a CT represents a valid conversation between the prompter and assistant in turns and is referred to as a thread. Figure 4 shows an example of a conversation tree consisting of 12 messages in 6 threads.
OpenAssistant Conversations(K?pf等,2023)是一個(gè)人工創(chuàng)建的多語(yǔ)言助手風(fēng)格對(duì)話語(yǔ)料庫(kù),包含161443條消息(即91829個(gè)用戶提示,69614個(gè)助手回復(fù)),來(lái)自35種語(yǔ)言中66497個(gè)對(duì)話樹(shù),同時(shí)還包含461292個(gè)人工注釋的質(zhì)量評(píng)分。
數(shù)據(jù)集中的每個(gè)實(shí)例是一個(gè)對(duì)話樹(shù)(CT)。具體來(lái)說(shuō),對(duì)話樹(shù)中的每個(gè)節(jié)點(diǎn)表示會(huì)話中角色(即提示者、助手)生成的消息。CT的根節(jié)點(diǎn)表示提示者的初始提示,而其他節(jié)點(diǎn)表示提示者或助手的回復(fù)。從根節(jié)點(diǎn)到CT中任何節(jié)點(diǎn)的路徑表示提示者和助手之間的有效會(huì)話,稱為線程。圖4顯示了一個(gè)由12條消息組成的對(duì)話樹(shù)的示例,其中包含6個(gè)線程。
五步流程收集對(duì)話樹(shù):提示者→標(biāo)記提示→擴(kuò)展樹(shù)節(jié)點(diǎn)→標(biāo)記回復(fù)→排名
The authors first collected conversation trees based on the five-step pipeline:
Step 1. prompting: contributors performed as the prompter and crafted initial prompts;
Step 2. labeling prompts: contributors rated scores to initial prompts from step 1, and the authors chose high-quality prompts as root nodes with a balanced sampling strategy;
Step 3. expanding tree nodes: contributors added reply messages as prompter or assistant;
Step 4. labeling replies: contributors assigned scores to existing node replies;
Step 5. ranking: contributors ranked assistant replies referring to the contributor guidelines.
The tree state machine managed and tracked the state (e.g., initial state, growing state, end state) throughout the conversation crafting process. Subsequently, the OpenAssistant Conversations dataset was built by filtering out offensive and inappropriate conversation trees.
作者首先根據(jù)以下五步流程收集了對(duì)話樹(shù):
步驟1:提示者:貢獻(xiàn)者扮演提示者的角色,創(chuàng)建初始提示;
步驟2:標(biāo)記提示:貢獻(xiàn)者對(duì)步驟1中的初始提示進(jìn)行評(píng)分,作者使用平衡的抽樣策略選擇高質(zhì)量的提示作為根節(jié)點(diǎn);
步驟3:擴(kuò)展樹(shù)節(jié)點(diǎn):貢獻(xiàn)者添加提示者或助手的回復(fù)消息;
步驟4:標(biāo)記回復(fù):貢獻(xiàn)者對(duì)現(xiàn)有節(jié)點(diǎn)的回復(fù)分配分?jǐn)?shù);
步驟5:排名:貢獻(xiàn)者根據(jù)貢獻(xiàn)者指南對(duì)助手的回復(fù)進(jìn)行排名。
樹(shù)狀態(tài)機(jī)在整個(gè)對(duì)話創(chuàng)作過(guò)程中管理和跟蹤狀態(tài)(例如,初始狀態(tài)、增長(zhǎng)狀態(tài)、結(jié)束狀態(tài))。隨后,通過(guò)過(guò)濾掉冒犯性和不適當(dāng)?shù)膶?duì)話樹(shù),構(gòu)建了OpenAssistant Conversations數(shù)據(jù)集。
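下面用一個(gè)極簡(jiǎn)的數(shù)據(jù)結(jié)構(gòu)示意對(duì)話樹(shù)(CT)與線程(thread)的關(guān)系(類名與字段為筆者假設(shè),僅說(shuō)明“根節(jié)點(diǎn)為初始提示、根到任意葉子的路徑即一個(gè)線程”這種組織方式):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    role: str                        # "prompter" 或 "assistant"
    text: str
    rating: Optional[float] = None   # 人工標(biāo)注的質(zhì)量評(píng)分
    children: List["Node"] = field(default_factory=list)

def threads(node, prefix=()):
    """枚舉從根節(jié)點(diǎn)到每個(gè)葉子節(jié)點(diǎn)的所有線程。"""
    path = prefix + ((node.role, node.text),)
    if not node.children:
        yield path
    for child in node.children:
        yield from threads(child, path)

root = Node("prompter", "如何學(xué)習(xí)一門(mén)新語(yǔ)言?")          # 根節(jié)點(diǎn):提示者的初始提示
root.children = [Node("assistant", "可以從高頻詞匯和日常對(duì)話開(kāi)始……"),
                 Node("assistant", "建議制定每天30分鐘的學(xué)習(xí)計(jì)劃……")]
for t in threads(root):
    print(t)
```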
3.12、Baize:基于ChatGPT(self-chat思想)構(gòu)建的111.5K個(gè)實(shí)例多輪(3.4輪)聊天語(yǔ)料庫(kù),二元組{prompt,response}
Baize (Xu et al., 2023b) is an English multi-turn chat corpus with 111.5K instances constructed using ChatGPT. Each turn consists of a user’s prompt and a response from the assistant. Each instance in Baize v1 contains 3.4 turns of conversations.
To create the Baize dataset, the authors proposed self-chat, where ChatGPT plays roles of the user and the AI assistant in turns and generates messages in a conversational format. Specifically, the authors first crafted a task template that defines the roles and tasks for ChatGPT (as shown in Table 3). Next, they sampled questions (e.g., "How do you fix a Google Play Store account that isn’t working?") from Quora and Stack Overflow datasets as conversation seeds (e.g., topics). Subsequently, they prompted ChatGPT with the template and the sampled seed. ChatGPT continuously generates messages for both sides until a natural stopping point is reached.
Baize(Xu等,2023b)是一個(gè)包含111.5K個(gè)實(shí)例的英語(yǔ)多輪聊天語(yǔ)料庫(kù),使用ChatGPT構(gòu)建。每個(gè)輪次包括用戶的提示和助手的回復(fù)。Baize v1中的每個(gè)實(shí)例平均包含3.4輪對(duì)話。
為了創(chuàng)建Baize數(shù)據(jù)集,作者提出了自我對(duì)話的概念,其中ChatGPT在輪流扮演用戶和AI助手的角色,以會(huì)話格式生成消息。具體來(lái)說(shuō),作者首先創(chuàng)建了一個(gè)任務(wù)模板,定義了ChatGPT的角色和任務(wù)(如表3所示)。接下來(lái),他們從Quora和Stack Overflow數(shù)據(jù)集中抽樣問(wèn)題(例如,“如何修復(fù)不工作的Google Play Store賬戶?”)作為會(huì)話種子(例如,話題)。隨后,他們使用模板和抽樣的種子提示ChatGPT。ChatGPT持續(xù)地為雙方生成消息,直到達(dá)到自然停止點(diǎn)為止。
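下面是self-chat思路的一個(gè)簡(jiǎn)化骨架(`call_chatgpt`為占位函數(shù),模板文字為筆者概括;原文中ChatGPT是按模板在單次生成中連續(xù)輸出雙方消息,這里為便于演示改寫(xiě)成按輪交替調(diào)用):

```python
def call_chatgpt(messages):
    """占位:實(shí)際中調(diào)用 ChatGPT 接口,此處僅返回示意文本。"""
    return "<生成的回復(fù)>"

SELF_CHAT_TEMPLATE = (
    "忘記你是一個(gè)AI助手。現(xiàn)在你同時(shí)扮演提問(wèn)的用戶[Human]和回答的助手[AI],"
    "圍繞下面的種子話題連續(xù)對(duì)話,直到話題自然結(jié)束。\n種子:{seed}"
)

def self_chat(seed, max_turns=4):
    transcript = [SELF_CHAT_TEMPLATE.format(seed=seed)]
    dialogue = []
    for _ in range(max_turns):                                    # Baize v1 平均約 3.4 輪
        user_msg = call_chatgpt(transcript)                       # ChatGPT 以用戶身份提問(wèn)
        assistant_msg = call_chatgpt(transcript + [user_msg])     # 再以助手身份回答
        dialogue.append({"prompt": user_msg, "response": assistant_msg})
        transcript += [user_msg, assistant_msg]
    return dialogue

print(self_chat("How do you fix a Google Play Store account that isn't working?"))
```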
4、Instruction Fine-tuned LLMs指導(dǎo)微調(diào)的LLM模型
In this section, we detail widely-used LLM models in the community that are trained through instruction fine-tuning.
在本節(jié)中,我們?cè)敿?xì)介紹社區(qū)中廣泛使用的通過(guò)指導(dǎo)微調(diào)訓(xùn)練的LLM模型。
4.1、InstructGPT:基于GPT-3模型+人類指導(dǎo)微調(diào)
LLMs之InstructGPT:《Training language models to follow instructions with human feedback》翻譯與解讀
微調(diào)三步驟(基于人類篩選指令進(jìn)行SFT→基于一個(gè)instruction多個(gè)降序的responses來(lái)訓(xùn)練RM模型→利用RL的PPO策略優(yōu)化RM模型)
InstructGPT (175B) (Ouyang et al., 2022) is initialized with GPT-3 (175B) (Brown et al., 2020b) and then fine-tuned on human instructions. The fine-tuning procedure is composed of the following three steps: (1) supervised fine-tuning (SFT) on the human-filtered instruction dataset, which is collected from Playground API history records; (2) training a reward model to predict human preferences based on an annotated dataset, which is constructed through human labor by sampling multiple responses for one instruction and ranking them from the best to the worst; (3) further optimizing the model from Step 1 with new instructions and the trained reward model in step (2). Parameters are updated using the proximal policy optimization (PPO) (Schulman et al., 2017) method, a policy gradient reinforcement learning method. Steps (2) and (3) are alternated multiple times until the model performance does not significantly improve.
InstructGPT(175B)(Ouyang等,2022)以GPT-3(175B)(Brown等,2020b)為初始模型,然后在人類指令數(shù)據(jù)上進(jìn)行微調(diào)。
微調(diào)過(guò)程包括以下三個(gè)步驟:
(1)在人類篩選的指令數(shù)據(jù)集上進(jìn)行監(jiān)督微調(diào)(SFT),該數(shù)據(jù)集從Playground API歷史記錄中收集;
(2)訓(xùn)練獎(jiǎng)勵(lì)模型以預(yù)測(cè)人類偏好:該帶注釋數(shù)據(jù)集通過(guò)人工標(biāo)注構(gòu)建,即為同一條指令采樣多個(gè)響應(yīng),并由標(biāo)注者將其從最佳到最差進(jìn)行排序;
(3)利用新指令和步驟(2)中訓(xùn)練好的獎(jiǎng)勵(lì)模型,進(jìn)一步優(yōu)化步驟(1)得到的模型。參數(shù)使用近端策略優(yōu)化(PPO)(Schulman等,2017)方法進(jìn)行更新,這是一種策略梯度強(qiáng)化學(xué)習(xí)方法。步驟(2)和(3)多次交替進(jìn)行,直到模型性能不再顯著提高為止。
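第(2)步訓(xùn)練獎(jiǎng)勵(lì)模型時(shí),常見(jiàn)的做法是把人工排序的多個(gè)回復(fù)兩兩組對(duì),最大化“較好回復(fù)的得分高于較差回復(fù)”的對(duì)數(shù)幾率。下面用PyTorch給出這種成對(duì)排序損失的最小示意(此處用隨機(jī)分?jǐn)?shù)代替真實(shí)獎(jiǎng)勵(lì)模型的打分,僅演示損失形式,并非OpenAI的原始實(shí)現(xiàn)):

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(scores_ranked):
    """scores_ranked: 獎(jiǎng)勵(lì)模型對(duì)同一條指令的 K 個(gè)回復(fù)的打分,已按人工偏好從好到差排序。"""
    loss, num_pairs = 0.0, 0
    k = scores_ranked.shape[0]
    for i in range(k):
        for j in range(i + 1, k):
            loss = loss - F.logsigmoid(scores_ranked[i] - scores_ranked[j])  # 期望 score_i > score_j
            num_pairs += 1
    return loss / num_pairs

scores = torch.randn(4, requires_grad=True)   # 假設(shè)某條指令有4個(gè)已排序的回復(fù)
loss = pairwise_ranking_loss(scores)
loss.backward()                                # 第(3)步再用 PPO 依據(jù)獎(jiǎng)勵(lì)模型優(yōu)化策略
print(loss.item())
```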
InstructGPT的真實(shí)性、毒性、模型性能等表現(xiàn)非常出色
Overall, InstructGPT outperforms GPT-3. For automatic evaluations, InstructGPT outperforms GPT-3 by 10% on the TruthfulQA (Lin et al., 2021) dataset in terms of truthfulness and by 7% on the RealToxicityPrompts (Gehman et al., 2020) in terms of toxicity. On NLP datasets (i.e., WSC), InstructGPT achieves comparable performance to GPT-3. For human evaluations, regarding four different aspects, including following correct instructions, following explicit constraints, fewer hallucinations, and generating appropriate responses, InstructGPT outperforms GPT-3 +10%, +20%, -20%, and +10%, respectively.
總體而言,InstructGPT優(yōu)于GPT-3。在自動(dòng)評(píng)估方面,InstructGPT在TruthfulQA數(shù)據(jù)集(Lin等,2021)上的真實(shí)性比GPT-3高10%,在評(píng)估生成文本毒性的RealToxicityPrompts數(shù)據(jù)集(Gehman等,2020)上改善了7%。在自然語(yǔ)言處理數(shù)據(jù)集(例如WSC)上,InstructGPT的性能與GPT-3相當(dāng)。在人類評(píng)估方面,涉及遵循正確指令、遵循明確約束、幻覺(jué)較少以及生成適當(dāng)響應(yīng)等四個(gè)不同方面,InstructGPT分別優(yōu)于GPT-3 +10%、+20%、-20%和+10%。
4.2、BLOOMZ:基于BLOOM模型+指令數(shù)據(jù)集xP3,多種任務(wù)及其數(shù)據(jù)集上表現(xiàn)均超于BLOOM
LLMs:《BLOOM: A 176B-Parameter Open-Access Multilingual Language Model》翻譯與解讀
BLOOMZ (176B) (Muennighoff et al., 2022) is initialized with BLOOM (176B) (Scao et al., 2022), and then fine-tuned on the instruction dataset xP3 (Muennighoff et al., 2022), a collection of human-instruction datasets in 46 languages, coming from two sources: (1) P3, which is a collection of (English instruction, English response) pairs; and (2) an (English instruction, Multilingual response) set which is transformed from multilingual NLP datasets (e.g., Chinese benchmarks) by filling task templates with pre-defined English instructions.
For automatic evaluation, BLOOMZ performs better than BLOOM in the zero-shot setting by +10.4%, 20.5%, and 9.8% on coreference resolution, sentence completion and natural language inference datasets, respectively. For the HumanEval benchmark (Chen et al., 2021), BLOOMZ outperforms BLOOM by 10% in terms of the Pass@100 metric. For generative tasks, BLOOMZ receives +9% BLEU improvement compared to BLOOM on the lm-evaluation-harness benchmark.
BLOOMZ(176B)(Muennighoff等,2022)以BLOOM(176B)(Scao等,2022)為初始模型,然后在指令數(shù)據(jù)集xP3(Muennighoff等,2022)上進(jìn)行微調(diào)。xP3是一個(gè)包含46種語(yǔ)言的人類指令數(shù)據(jù)集的集合,來(lái)自兩個(gè)來(lái)源:
(1)P3,其中包含(英文指令,英文響應(yīng))對(duì);
(2)一個(gè)(英文指令,多語(yǔ)言響應(yīng))集,通過(guò)在多語(yǔ)言自然語(yǔ)言處理數(shù)據(jù)集(例如中文基準(zhǔn))中使用預(yù)定義的英文指令填充任務(wù)模板而轉(zhuǎn)化而來(lái)。
對(duì)于自動(dòng)評(píng)估,BLOOMZ在zero-shot設(shè)置下在共指消解、句子補(bǔ)全和自然語(yǔ)言推理數(shù)據(jù)集上分別比BLOOM提高了10.4%、20.5%和9.8%。對(duì)于HumanEval基準(zhǔn)(Chen等,2021),BLOOMZ在Pass@100度量上優(yōu)于BLOOM 10%。對(duì)于生成任務(wù),BLOOMZ在lm-evaluation-harness基準(zhǔn)上比BLOOM的BLEU分?jǐn)?shù)提高了9%。
"Pass@100" 是一種評(píng)估指標(biāo),用于衡量生成式模型在生成任務(wù)中的性能。通常,生成式模型會(huì)根據(jù)輸入生成相應(yīng)的文本輸出。
T1、BLEU指標(biāo):在文本生成任務(wù)中,一種評(píng)估方式是將生成的文本與人工提供的參考文本進(jìn)行比較,以測(cè)量生成文本的質(zhì)量。"BLEU"(Bilingual Evaluation Understudy,雙語(yǔ)評(píng)估候補(bǔ))是一種常用的自動(dòng)評(píng)估指標(biāo),用于衡量生成文本與參考文本之間的相似性。
T2、Pass@K指標(biāo):在代碼生成等生成式任務(wù)(如HumanEval基準(zhǔn))中常用的指標(biāo),表示模型對(duì)同一問(wèn)題采樣生成K個(gè)候選答案時(shí),至少有一個(gè)候選通過(guò)全部單元測(cè)試(或被判定為正確)的概率。例如,"Pass@100" 表示采樣100個(gè)候選中至少有一個(gè)正確的概率。
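作為補(bǔ)充,下面給出HumanEval所用pass@k無(wú)偏估計(jì)的計(jì)算方式(公式來(lái)自Codex論文的通用做法;示例數(shù)值為虛構(gòu)):

```python
from math import comb

def pass_at_k(n, c, k):
    """pass@k 的無(wú)偏估計(jì):n 為每道題采樣的代碼數(shù),c 為其中通過(guò)全部單元測(cè)試的數(shù)目,k 為預(yù)算。
    返回“k 個(gè)樣本中至少有一個(gè)通過(guò)測(cè)試”的概率估計(jì)。"""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=5, k=100))   # 示例數(shù)值,非論文結(jié)果
```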
4.3、Flan-T5:基于T5模型+FLAN數(shù)據(jù)集微調(diào),基于JAX的T5X框架+128*TPU v4=37小時(shí)
Flan-T5 (11B) is a large language model initialized with T5 (11B) (Raffel et al., 2019), and then fine-tuned on the FLAN dataset (Longpre et al., 2023). The FLAN dataset is a collection of (instruction, output) pairs, constructed from 62 datasets of 12 NLP tasks (e.g., natural language inference, commonsense reasoning, paraphrase generation) by filling templates with various instructions under a unified task formalization.
During fine-tuning, FLAN-T5 adapts the JAX-based T5X framework and selects the best model evaluated on the held-out tasks every 2k steps. Compared with T5’s pre-training stage, fine-tuning costs only 0.2% of the computational resources (approximately 128 TPU v4 chips for 37 hours).
For evaluation, FLAN-T5 (11B) outperforms T5 (11B), and achieves comparable results to larger models, including PaLM (60B) (Chowdhery et al., 2022) in the few-shot setting. FLAN- T5 outperforms T5 by +18.9%, +12.3%, +4.1%, +5.8%, +2.1%, and +8% on MMLU (Hendrycks et al., 2020), BBH (Suzgun et al., 2022), TyDiQA (Clark et al., 2020), MGSM (Shi et al., 2022), open-ended generation, and RealToxicityPrompts (Gehman et al., 2020), respectively. In few-shot settings, FLAN-T5 outperforms PaLM +1.4% and +1.2% on the BBH and TyDiQA datasets.
Flan-T5(11B)是一種大型語(yǔ)言模型,其初始化采用T5(11B)(Raffel等,2019),并在FLAN數(shù)據(jù)集(Longpre等,2023)上進(jìn)行微調(diào)。FLAN數(shù)據(jù)集是一個(gè)包含(instruction, output)對(duì)的集合,通過(guò)在統(tǒng)一任務(wù)規(guī)范下使用各種指令填充模板,從12個(gè)自然語(yǔ)言處理任務(wù)的62個(gè)數(shù)據(jù)集構(gòu)建而成(例如,自然語(yǔ)言推理、常識(shí)推理、釋義生成)。
在微調(diào)過(guò)程中,FLAN-T5采用基于JAX的T5X框架,并在每2k步時(shí)選擇在預(yù)留任務(wù)上評(píng)估的最佳模型。與T5的預(yù)訓(xùn)練階段相比,微調(diào)過(guò)程消耗0.2%的計(jì)算資源(大約128個(gè)TPU v4芯片,耗時(shí)37小時(shí))。
對(duì)于評(píng)估,FLAN-T5(11B)優(yōu)于T5(11B),在少樣本設(shè)置中實(shí)現(xiàn)了與更大模型(如PaLM(60B)(Chowdhery等,2022))相當(dāng)?shù)慕Y(jié)果。FLAN-T5在MMLU(Hendrycks等,2020)、BBH(Suzgun等,2022)、TyDiQA(Clark等,2020)、MGSM(Shi等,2022)、開(kāi)放式生成以及RealToxicityPrompts(Gehman等,2020)方面分別優(yōu)于T5 +18.9%、+12.3%、+4.1%、+5.8%、+2.1%和+8%。在少樣本設(shè)置中,FLAN-T5在BBH和TyDiQA數(shù)據(jù)集上分別優(yōu)于PaLM +1.4%和+1.2%。
4.4、Alpaca:基于LLaMA模型+利用InstructGPT生成指令數(shù)據(jù)集進(jìn)行微調(diào),8*A100-80G設(shè)備+混合精度AMP+DP=3小時(shí)
LLMs之Alpaca:《Alpaca: A Strong, Replicable Instruction-Following Model》翻譯與解讀
Alpaca (7B) (Taori et al., 2023) is a language model trained by fine-tuning LLaMA (7B) (Touvron et al., 2023a) on the constructed instruction dataset generated by InstructGPT (175B, text-davinci-003) (Ouyang et al., 2022). The fine-tuning process takes around 3 hours on an 8-card 80GB A100 device with mixed precision training and fully shared data parallelism.
Alpaca (7B) achieves comparable performances to InstructGPT (175B,text-davinci-003) in terms of human evaluation. Specifically, Alpaca outperforms InstructGPT on the self-instruct dataset, garnering 90 instances of victories compared to 89 instances.
Alpaca(7B)(Taori等,2023)是一種語(yǔ)言模型,通過(guò)對(duì)由InstructGPT(175B,text-davinci-003)(Ouyang等,2022)生成的構(gòu)建指令數(shù)據(jù)集進(jìn)行微調(diào),使用LLaMA(7B)(Touvron等,2023a)完成微調(diào)。微調(diào)過(guò)程在8卡80GB A100設(shè)備上進(jìn)行,使用混合精度訓(xùn)練和完全共享的數(shù)據(jù)并行技術(shù),大約耗時(shí)3小時(shí)。
Alpaca(7B)在人類評(píng)估方面表現(xiàn)與InstructGPT(175B,text-davinci-003)相當(dāng)。具體來(lái)說(shuō),Alpaca在自我指導(dǎo)數(shù)據(jù)集上優(yōu)于InstructGPT,獲得了90次勝利,而InstructGPT獲得了89次。
4.5、Vicuna:基于LLaMA模型+利用ShareGPT的ChatGPT生成對(duì)話數(shù)據(jù)集(過(guò)濾低質(zhì)得70K)進(jìn)行微調(diào),上下文擴(kuò)到2K+GradientCheckpointing和FlashAttention(降低GPU成本)+8*A100-80G=24小時(shí)
LLMs之Vicuna:《Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality》翻譯與解讀
Vicuna (13B) (Chiang et al., 2023) is a language model trained by fine-tuning LLaMA (13B) (Touvron et al., 2023a) on the conversational dataset generated by ChatGPT.
The authors gathered user-shared ChatGPT conversations from ShareGPT.com, and got 70K conversation records after filtering out low-quality samples. LLaMA (13B) was fine-tuned on the constructed conversation dataset using a modified loss function tailored to multi-turn conversations. To better understand long context across multiple- turn dialog, the authors expanded the max context length from 512 to 2048. For training, the authors adopted the gradient checkpointing and flash attention (Dao et al., 2022) techniques to reduce the GPU memory cost in the fine-tuning process. The fine-tuning process takes 24 hours on an 8 × 80GB A100 device with fully shared data parallelism.
The authors built a test set used exclusively to measure chatbots’ performances. They collected a test set composed of 8 question categories, such as Fermi problems, role play scenarios, coding/math tasks, etc., and then asked GPT-4 (OpenAI, 2023) to rate models’ responses considering helpfulness, relevance, accuracy, and detail. On the constructed test set, Vicuna (13B) outperforms Alpaca (13B) (Taori et al., 2023) and LLaMA (13B) in 90% of the test questions, and generates equal or better rating responses compared to ChatGPT in 45% of the questions.
Vicuna(13B)(Chiang等,2023)是一種語(yǔ)言模型,通過(guò)對(duì)由ChatGPT生成的對(duì)話數(shù)據(jù)集進(jìn)行微調(diào),使用LLaMA(13B)(Touvron等,2023a)完成微調(diào)。
作者從ShareGPT.com收集了用戶分享的ChatGPT對(duì)話,并在濾除低質(zhì)量樣本后獲得了70K個(gè)對(duì)話記錄。使用經(jīng)過(guò)修改的適用于多輪對(duì)話的損失函數(shù)對(duì)LLaMA(13B)進(jìn)行了微調(diào)。
為了更好地理解多輪對(duì)話中的長(zhǎng)上下文,作者將最大上下文長(zhǎng)度從512擴(kuò)展到2048。在訓(xùn)練過(guò)程中,作者采用了GradientCheckpointing和FlashAttention(Dao等,2022)技術(shù),以減少微調(diào)過(guò)程中的GPU內(nèi)存成本。微調(diào)過(guò)程在8個(gè)80GB A100設(shè)備上進(jìn)行,使用完全共享的數(shù)據(jù)并行技術(shù),耗時(shí)24小時(shí)。
作者構(gòu)建了一個(gè)專門(mén)用于衡量聊天機(jī)器人表現(xiàn)的測(cè)試集。他們收集了一個(gè)由8個(gè)問(wèn)題類別組成的測(cè)試集,例如費(fèi)米問(wèn)題、角色扮演情景、編碼/數(shù)學(xué)任務(wù)等,然后要求GPT-4(OpenAI,2023)根據(jù)有用性、相關(guān)性、準(zhǔn)確性和細(xì)節(jié)對(duì)模型的響應(yīng)進(jìn)行評(píng)分。在構(gòu)建的測(cè)試集上,Vicuna(13B)在90%的測(cè)試問(wèn)題中優(yōu)于Alpaca(13B)和LLaMA(13B),并在45%的問(wèn)題中生成與ChatGPT相等或更好的評(píng)分響應(yīng)。
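Vicuna采用的“GPT-4評(píng)審”思路大致是:把問(wèn)題與兩個(gè)模型的回答拼進(jìn)評(píng)審提示,讓GPT-4按有用性、相關(guān)性、準(zhǔn)確性、細(xì)節(jié)打分,再解析分?jǐn)?shù)。下面是一個(gè)簡(jiǎn)化示意(`ask_gpt4`為占位函數(shù),評(píng)審提示詞為筆者概括,并非官方模板):

```python
import re

def ask_gpt4(prompt):
    """占位:實(shí)際中調(diào)用 GPT-4 接口,此處僅返回示意文本。"""
    return "Assistant 1: 8\nAssistant 2: 9\n理由:……"

JUDGE_TEMPLATE = (
    "請(qǐng)從有用性、相關(guān)性、準(zhǔn)確性和細(xì)節(jié)四個(gè)方面,分別給兩個(gè)助手的回答打1-10分。\n"
    "[問(wèn)題]\n{question}\n[助手1回答]\n{answer_a}\n[助手2回答]\n{answer_b}\n"
    "請(qǐng)先輸出兩行分?jǐn)?shù)(格式:Assistant 1: x / Assistant 2: y),再給出理由。"
)

def judge(question, answer_a, answer_b):
    reply = ask_gpt4(JUDGE_TEMPLATE.format(question=question, answer_a=answer_a, answer_b=answer_b))
    scores = re.findall(r"Assistant \d: (\d+)", reply)   # 從回復(fù)中解析兩個(gè)分?jǐn)?shù)
    return tuple(int(s) for s in scores[:2])

print(judge("What is the speed of light?", "約 3×10^8 m/s。", "大概 300 km/s。"))
```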
4.6、GPT-4-LLM:基于LLaMA模型+利用Alpaca的指令和GPT-4生成指令數(shù)據(jù)集進(jìn)行有監(jiān)督微調(diào)→基于構(gòu)建比較數(shù)據(jù)集(收集GPT-4、InstructGPT 等多個(gè)大模型的指令響應(yīng)+GPT-4對(duì)響應(yīng)評(píng)分1~10分)訓(xùn)練RM模型(PPO優(yōu)化),8*A100-80G+AMP+DP=3小時(shí)
AIGC之GPT-4:GPT-4的簡(jiǎn)介(核心原理/意義/亮點(diǎn)/技術(shù)點(diǎn)/缺點(diǎn)/使用建議)、使用方法、案例應(yīng)用(計(jì)算能力/代碼能力/看圖能力等)之詳細(xì)攻略
GPT-4-LLM (7B) (Peng et al., 2023) is a language model trained by fine-tuning LLaMA (7B) (Touvron et al., 2023a) on the GPT-4 (OpenAI, 2023) generated instruction dataset. GPT-4-LLM is initialized with LLaMA, then fine-tuned in the following two steps: (1) supervised fine- tuning on the constructed instruction dataset. The authors used the instructions from Alpaca (Taori et al., 2023), and then collected responses using GPT-4. LLaMA is fine-tuned on the GPT-4 generated dataset. The fine-tuning process takes approximately three hours on an 8*80GB A100 machine with mixed precision and fully shared data parallelism. (2) optimizing the step-1 model using the proximal policy optimization (PPO) (Schulman et al., 2017) method, the authors first built a comparison dataset by collecting responses from GPT-4, InstructGPT (Ouyang et al., 2022), and OPT-IML (Iyer et al., 2022) to a collection of instructions and then asked GPT-4 to rate each response from 1 to 10. Using the ratings, a reward model is trained based on OPT (Zhang et al., 2022a). The fine-tuned model from Step 1 is optimized by using the reward model to compute the policy gradient.?
For evaluations, GPT-4-LLM (7B) outperforms not only the baseline model Alpaca (7B), but also larger models including Alpaca (13B) and LLAMA (13B). For automated evaluation, GPT- 4-LLM (7B) outperforms Alpaca by 0.2, 0.5, and 0.7 on User-Oriented-Instructions-252 (Wang et al., 2022c), Vicuna-Instructions (Chiang et al., 2023), and Unnatural Instructions (Honovich et al., 2022) datasets, respectively. For human evaluation, regarding aspects including helpfulness, honesty, and harmlessness, GPT-4-LLM outperforms Alpaca by 11.7, 20.9, and 28.6 respectively.
GPT-4-LLM(7B)(Peng等,2023)是一種語(yǔ)言模型,通過(guò)對(duì)GPT-4(OpenAI,2023)生成的指令數(shù)據(jù)集進(jìn)行微調(diào),使用LLaMA(7B)(Touvron等,2023a)完成微調(diào)。
GPT-4-LLM首先使用LLaMA進(jìn)行初始化,然后在以下兩個(gè)步驟中進(jìn)行微調(diào):
(1)在構(gòu)建的指令數(shù)據(jù)集上進(jìn)行監(jiān)督微調(diào)。作者使用了Alpaca的指令,然后使用GPT-4生成了響應(yīng)。LLaMA在由GPT-4生成的數(shù)據(jù)集上進(jìn)行微調(diào)。微調(diào)過(guò)程在8個(gè)80GB A100設(shè)備上使用混合精度和完全共享的數(shù)據(jù)并行技術(shù),大約耗時(shí)三小時(shí)。
(2)使用近端策略優(yōu)化(PPO) (Schulman et al., 2017)方法優(yōu)化step-1模型,作者首先通過(guò)收集GPT-4、InstructGPT (Ouyang et al., 2022)和OPT-IML (Iyer et al., 2022)對(duì)指令集合的響應(yīng)構(gòu)建比較數(shù)據(jù)集,然后要求GPT-4對(duì)每個(gè)響應(yīng)進(jìn)行1到10的評(píng)分。使用評(píng)級(jí),基于OPT訓(xùn)練獎(jiǎng)勵(lì)模型(Zhang et al., 2022a)。通過(guò)使用獎(jiǎng)勵(lì)模型來(lái)計(jì)算策略梯度,對(duì)步驟1的微調(diào)模型進(jìn)行優(yōu)化。
在評(píng)估方面,GPT-4-LLM(7B)不僅優(yōu)于基準(zhǔn)模型Alpaca(7B),還優(yōu)于更大的模型,包括Alpaca(13B)和LLAMA(13B)。在自動(dòng)評(píng)估方面,GPT-4-LLM(7B)在用戶導(dǎo)向的指令-252(Wang等,2022c)、Vicuna-指令(Chiang等,2023)和非自然指令(Honovich等,2022)數(shù)據(jù)集上分別優(yōu)于Alpaca 0.2、0.5和0.7。在人類評(píng)估方面,在有用性、誠(chéng)實(shí)性和無(wú)害性三個(gè)方面,GPT-4-LLM分別優(yōu)于Alpaca 11.7、20.9和28.6。
4.7、Claude:基于數(shù)據(jù)集(52K指令和GPT-4生成的響應(yīng)配對(duì))進(jìn)行SFT→基于構(gòu)建比較數(shù)據(jù)集(收集GPT-3等多個(gè)大模型的指令響應(yīng)+GPT-4對(duì)響應(yīng)評(píng)分)訓(xùn)練RM模型(PPO優(yōu)化),8*A100-80G+AMP+DP=8小時(shí)
Claude is a language model trained by fine-tuning the pre-trained language model on an instruction dataset, aiming to generate helpful and harmless responses. The fine-tuning process consists of two stages: (1) supervised fine-tuning on the instruction dataset. The authors created an instruction dataset by collecting 52K different instructions, paired with responses generated by GPT-4. The fine- tuning process takes approximately eight hours on an 8-card 80GB A100 machine with mixed precision and fully shared data parallelism. (2) optimizing the step-1 model with the proximal policy optimization (Schulman et al., 2017) method. The authors first built a comparison dataset by collecting responses from multiple large language models (e.g., GPT-3 (Brown et al., 2020b)) to the given collection of instructions and then asking GPT-4 (OpenAI, 2023) to rate each response. Using the ratings, a reward model is trained. Then, the fine-tuned model from Step 1 is optimized using the reward model with the proximal policy optimization method.
Claude generates more helpful and harmless responses compared to the backbone model. For automatic evaluations, Claude outperforms GPT-3 by 7% on the RealToxicityPrompts (Gehman et al., 2020) in terms of toxicity. For human evaluations, regarding four different aspects, including following correct instructions, following explicit constraints, fewer hallucinations, and generating appropriate responses, Claude outperforms GPT-3 (Brown et al., 2020b) by +10%, +20%, -20%, and +10%, respectively.
Claude是一種語(yǔ)言模型,通過(guò)對(duì)預(yù)訓(xùn)練語(yǔ)言模型在指令數(shù)據(jù)集上進(jìn)行微調(diào),旨在生成有幫助且無(wú)害的響應(yīng)。微調(diào)過(guò)程包括兩個(gè)階段:
(1)在指令數(shù)據(jù)集上進(jìn)行監(jiān)督微調(diào)。作者通過(guò)收集了52K個(gè)不同的指令,并與GPT-4生成的響應(yīng)配對(duì),創(chuàng)建了一個(gè)指令數(shù)據(jù)集。微調(diào)過(guò)程在8卡80GB A100設(shè)備上使用混合精度和完全共享的數(shù)據(jù)并行技術(shù),大約耗時(shí)八小時(shí)。
(2)使用近端策略優(yōu)化(Schulman等,2017)方法優(yōu)化步驟1中的模型。作者首先通過(guò)收集多個(gè)大型語(yǔ)言模型(如GPT-3(Brown等,2020b))對(duì)給定指令的響應(yīng),并要求GPT-4對(duì)每個(gè)響應(yīng)進(jìn)行評(píng)分,來(lái)構(gòu)建比較數(shù)據(jù)集。使用這些評(píng)分,訓(xùn)練了一個(gè)獎(jiǎng)勵(lì)模型。然后,使用獎(jiǎng)勵(lì)模型使用近端策略優(yōu)化方法優(yōu)化步驟1中的微調(diào)模型。
與骨干模型相比,Claude生成的響應(yīng)更有幫助且無(wú)害。在自動(dòng)評(píng)估方面,Claude在RealToxicityPrompts(Gehman等,2020)方面優(yōu)于GPT-3 7%。在人類評(píng)估方面,關(guān)于遵循正確指令、遵循明確約束、幻覺(jué)較少以及生成適當(dāng)響應(yīng)等四個(gè)不同方面,Claude分別優(yōu)于GPT-3 +10%、+20%、-20%和+10%。
4.8、WizardLM:基于LLaMA模型+Evol-Instruct指令數(shù)據(jù)集(ChatGPT生成)微調(diào),8*V100 GPU+Deepspeed Zero-3技術(shù)+3個(gè)epochs =70小時(shí)
WizardLM (7B) (Xu et al., 2023a) is a language model trained by fine-tuning LLaMA (7B) (Touvron et al., 2023a) on the instruction dataset Evol-Instruct generated by ChatGPT (details see Section 3.7). It is fine-tuned on a subset (with 70K) of Evol-Instruct to enable a fair comparison with Vicuna (Chiang et al., 2023). The fine-tuning process takes approximately 70 hours on 3 epochs based on an 8 V100 GPU with the Deepspeed Zero-3 (Rasley et al., 2020) technique. During inference, the max generation length is 2048.
To evaluate LLMs’ performances on complex instructions, the authors collected 218 human- generated instructions from real scenarios (e.g., open-source projects, platforms, and forums), called Evol-Instruct testset.
Evaluations are conducted on the Evol-Instruct testset and Vicuna’s testset. For human evaluation, WizardLM outperforms Alpaca (7B) (Taori et al., 2023) and Vicuna (7B) by a large margin, and generates equal or better responses on 67% of test samples compared to ChatGPT. Automatic evaluation is conducted by asking GPT-4 to rate LLMs’ responses. Specifically, WizardLM gains performance boosts compared to Alpaca of +6.2% and +5.3% on the Evol-Instruct testset and Vicuna’s test set, respectively. WizardLM outperforms Vicuna by +5.8% on the Evol-Instruct testset and +1.7% on Vicuna’s test set.
WizardLM(7B)(Xu等,2023a)是一種語(yǔ)言模型,通過(guò)對(duì)由ChatGPT生成的Evol-Instruct指令數(shù)據(jù)集進(jìn)行微調(diào),使用LLaMA(7B)(Touvron等,2023a)完成微調(diào)(詳見(jiàn)第3.7節(jié))。它在Evol-Instruct的一個(gè)子集(含70K)上進(jìn)行微調(diào),以便與Vicuna(Chiang等,2023)進(jìn)行公平比較。微調(diào)過(guò)程基于8個(gè)V100 GPU和Deepspeed Zero-3(Rasley等,2020)技術(shù),在3個(gè)epoch內(nèi)耗時(shí)約70小時(shí)。推理過(guò)程中,最大生成長(zhǎng)度為2048。
為了評(píng)估LLM在復(fù)雜指令上的性能,作者從實(shí)際情境(例如開(kāi)源項(xiàng)目、平臺(tái)和論壇)中收集了218個(gè)人工生成的指令,稱為Evol-Instruct測(cè)試集。評(píng)估在Evol-Instruct測(cè)試集和Vicuna的測(cè)試集上進(jìn)行。在人類評(píng)估中,WizardLM大幅優(yōu)于Alpaca(7B)(Taori等,2023)和Vicuna(7B),并且與ChatGPT相比,在67%的測(cè)試樣本上生成相等或更好的響應(yīng)。自動(dòng)評(píng)估通過(guò)讓GPT-4對(duì)LLM的響應(yīng)進(jìn)行評(píng)分來(lái)完成。具體來(lái)說(shuō),WizardLM在Evol-Instruct測(cè)試集和Vicuna的測(cè)試集上分別比Alpaca高+6.2%和+5.3%;在Evol-Instruct測(cè)試集上比Vicuna高+5.8%,在Vicuna的測(cè)試集上比Vicuna高+1.7%。
4.9、ChatGLM2:基于GLM模型+中英文指令(1:1)的雙語(yǔ)數(shù)據(jù)集(1.4T的tokens),類似InstructGPT的三步微調(diào)策略+上下文長(zhǎng)度擴(kuò)展到32K+MQA/CM策略(降GPU成本)+需13GB的顯存(INT4量化后需6GB)
LLMs之ChatGLM2:ChatGLM2-6B的簡(jiǎn)介、安裝、使用方法之詳細(xì)攻略
ChatGLM2 (6B) (Du et al., 2022) is a language model trained by fine-tuning GLM (6B) (Du et al., 2022) on a bilingual dataset that contains both English and Chinese instructions. The bilingual instruction dataset contains 1.4T tokens, with a 1:1 ratio of Chinese to English. Instructions in the dataset are sampled from the question-answering and dialogue completion tasks. ChatGLM2 is initialized with GLM, then trained by the three-step fine-tuning strategy, which is akin to InstructGPT (Ouyang et al., 2022). To better model contextual information across multi-turn conversations, the authors expanded the maximum context length from 1024 to 32K. To reduce GPU memory cost in the fine-tuning stage, the authors employed multi-query attention and causal mask strategies. During inference, ChatGLM2 requires 13GB GPU memory with FP16 and supports conversations up to 8K in length with 6GB GPU memory using the INT4 model quantization technique.
Evaluations are conducted on four English and Chinese benchmarks, including MMLU (English) (Hendrycks et al., 2020), C-Eval (Chinese) (Huang et al., 2023), GSM8K (Math) (Cobbe et al., 2021), and BBH (English) (Suzgun et al., 2022). ChatGLM2 (6B) outperforms GLM (6B) and the baseline model ChatGLM (6B) on all benchmarks. Specifically, ChatGLM2 outperforms GLM by +3.1 on MMLU, +5.0 on C-Eval, +8.6 on GSM8K, and +2.2 on BBH. ChatGLM2 also achieves better performance than ChatGLM by +2.1, +1.2, +0.4, and +0.8 on MMLU, C-Eval, GSM8K and BBH, respectively.
ChatGLM2(6B)(Du等,2022)是在包含英文和中文指令的雙語(yǔ)數(shù)據(jù)集上對(duì)GLM(6B)(Du等,2022)進(jìn)行微調(diào)得到的語(yǔ)言模型。雙語(yǔ)指令數(shù)據(jù)集包含1.4T個(gè)標(biāo)記,中英比例為1:1。數(shù)據(jù)集中的指令采樣自問(wèn)答和對(duì)話補(bǔ)全任務(wù)。ChatGLM2以GLM初始化,然后通過(guò)類似于InstructGPT(Ouyang等,2022)的三步微調(diào)策略進(jìn)行訓(xùn)練。
為了更好地對(duì)多輪對(duì)話中的上下文信息進(jìn)行建模,作者將最大上下文長(zhǎng)度從1024擴(kuò)展到32K。為了在微調(diào)階段降低GPU內(nèi)存成本,作者采用了多查詢注意力MQA和因果掩碼CM策略。在推理過(guò)程中,ChatGLM2需要13GB的GPU內(nèi)存,使用FP16支持最大長(zhǎng)度為8K的對(duì)話,使用INT4模型量化技術(shù)時(shí)只需要6GB的GPU內(nèi)存。
評(píng)估在四個(gè)英文和中文基準(zhǔn)數(shù)據(jù)集上進(jìn)行,包括MMLU(英文)(Hendrycks等,2020)、C-Eval(中文)(Huang等,2023)、GSM8K(數(shù)學(xué))(Cobbe等,2021)和BBH(英文)(Suzgun等,2022)。ChatGLM2(6B)在所有基準(zhǔn)數(shù)據(jù)集上優(yōu)于GLM(6B)和基準(zhǔn)模型ChatGLM(6B)。具體來(lái)說(shuō),ChatGLM2在MMLU上優(yōu)于GLM +3.1,在C-Eval上優(yōu)于GLM +5.0,在GSM8K上優(yōu)于GLM +8.6,在BBH上優(yōu)于GLM +2.2。ChatGLM2在MMLU、C-Eval、GSM8K和BBH上的性能也優(yōu)于ChatGLM +2.1、+1.2、+0.4、+0.8。
4.10、LIMA:基于LLaMA模型+基于表面對(duì)齊假設(shè)構(gòu)建的指令數(shù)據(jù)集,提出了表面對(duì)齊假設(shè)并驗(yàn)證了其效果
LIMA (65B) (Zhou et al., 2023) is a large language model trained by fine-tuning LLaMA (65B) (Touvron et al., 2023a) on an instruction dataset, which is constructed based on the proposed superficial alignment hypothesis.
The superficial alignment hypothesis refers to the idea that the knowledge and capabilities of a model are almost entirely acquired during pre-training, while alignment training (e.g., instruction fine-tuning) teaches the model to generate responses in user-preferred formats. Based on the superficial alignment hypothesis, the authors claimed that large language models can generate user-satisfying responses when fine-tuned on only a small amount of instruction data. Therefore, the authors built instruction train/valid/test sets to verify this hypothesis.
Evaluations are conducted on the constructed test set. For human evaluation, LIMA outperforms InstructGPT and Alpaca by 17% and 19%, respectively. Additionally, LIMA achieves comparable results to BARD, Claude, and GPT-4. For automatic evaluation, which is conducted by asking GPT-4 to rate responses, with a higher score denoting better performance, LIMA outperforms InstructGPT and Alpaca by 20% and 36%, respectively, achieving comparable results to BARD, while underperforming Claude and GPT-4. Experimental results verify the proposed superficial alignment hypothesis.
LIMA(65B)(Zhou等,2023)是在基于所提出的表面對(duì)齊假設(shè)構(gòu)建的指令數(shù)據(jù)集上,對(duì)LLaMA(65B)(Touvron等,2023a)進(jìn)行微調(diào)得到的大型語(yǔ)言模型。表面對(duì)齊假設(shè)指的是模型的知識(shí)和能力幾乎都在預(yù)訓(xùn)練階段獲得,而對(duì)齊訓(xùn)練(例如指令微調(diào))則教導(dǎo)模型以用戶偏好的格式生成響應(yīng)?;谶@一表面對(duì)齊假設(shè),作者聲稱大型語(yǔ)言模型只需在少量指令數(shù)據(jù)上進(jìn)行微調(diào),就能生成令用戶滿意的響應(yīng)。因此,作者構(gòu)建了指令訓(xùn)練/驗(yàn)證/測(cè)試集來(lái)驗(yàn)證這一假設(shè)。
評(píng)估在構(gòu)建的測(cè)試集上進(jìn)行。在人類評(píng)估中,LIMA分別以17%和19%的優(yōu)勢(shì)優(yōu)于InstructGPT和Alpaca,并取得了與BARD、Claude和GPT-4相當(dāng)?shù)慕Y(jié)果。在自動(dòng)評(píng)估中(通過(guò)要求GPT-4對(duì)響應(yīng)進(jìn)行評(píng)分,得分越高表示性能越好),LIMA分別以20%和36%的優(yōu)勢(shì)優(yōu)于InstructGPT和Alpaca,與BARD相當(dāng),但不如Claude和GPT-4。實(shí)驗(yàn)結(jié)果驗(yàn)證了所提出的表面對(duì)齊假設(shè)。
4.11、Others
OPT-IML:基于OPT模型+微調(diào)IML數(shù)據(jù)集
LLMs:《OPT: Open Pre-trained Transformer Language Models》翻譯與解讀
LLMs:《OPT: Open Pre-trained Transformer Language Models》翻譯與解讀_csv數(shù)據(jù)集下載_一個(gè)處女座的程序猿的博客-CSDN博客
Dolly 2:基于Pythia模型+微調(diào)databricks-dolly-15k指令數(shù)據(jù)集
OPT-IML (175B) (Iyer et al., 2022) is a large language model trained by fine-tuning the OPT (175B) (Zhang et al., 2022a) model on the constructed Instruction Meta-Learning (IML) dataset, which consists of over 1500 NLP tasks from 8 publicly available benchmarks such as PromptSource (Bach et al., 2022), FLAN (Longpre et al., 2023), and Super-NaturalInstructions (Wang et al., 2022d). After fine-tuning, OPT-IML outperforms OPT across all benchmarks.
Dolly 2.0 (12B) (Conover et al., 2023a) is initialized with the pre-trained language model Pythia (12B) (Biderman et al., 2023), and fine-tuned on the instruction dataset databricks-dolly-15k, which contains 7 categories of NLP tasks such as text classification and information extraction. After fine-tuning, Dolly 2.0 (12B) outperforms Pythia (12B) on the EleutherAI LLM Evaluation Harness benchmark (Gao et al., 2021) by a large margin, and achieves comparable performance to GPT-NEOX (20B) (Black et al., 2022), which has two times more parameters than Dolly 2.0 (12B).
OPT-IML(175B)(Iyer等,2022)是一種大型語(yǔ)言模型,通過(guò)對(duì)構(gòu)建的Instruction Meta-Learning(IML)數(shù)據(jù)集上的OPT(175B)(Zhang等,2022a)模型進(jìn)行微調(diào),該數(shù)據(jù)集包含來(lái)自8個(gè)公開(kāi)可用基準(zhǔn)數(shù)據(jù)集的1500多個(gè)NLP任務(wù),如PromptSource(Bach等,2022)、FLAN(Longpre等,2023)和Super-NaturalInstructions(Wang等,2022d)。微調(diào)后,OPT-IML在所有基準(zhǔn)數(shù)據(jù)集上優(yōu)于OPT。
Dolly 2.0(12B)(Conover等,2023a)通過(guò)在databricks-dolly-15k指令數(shù)據(jù)集上進(jìn)行微調(diào),使用Pythia(12B)(Biderman等,2023)進(jìn)行初始化,該數(shù)據(jù)集包含文本分類和信息提取等7類NLP任務(wù)。微調(diào)后,Dolly 2.0(12B)在EleutherAI LLM 評(píng)估套件基準(zhǔn)(Gao等,2021)上遠(yuǎn)遠(yuǎn)優(yōu)于Pythia(12B),并在性能上與擁有兩倍參數(shù)的GPT-NEOX(20B)(Black等,2022)達(dá)到相當(dāng)?shù)男阅堋?div style="height:15px;">
Falcon-Instruct:基于Falcon模型+微調(diào)英語(yǔ)對(duì)話數(shù)據(jù)集(Baize數(shù)據(jù)集150M/1.5億tokens+RefinedWeb數(shù)據(jù)集),降內(nèi)存(Flash Attention+MQ)
LLMs之Data:《The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only》翻譯與解讀
https://yunyaniu.blog.csdn.net/article/details/131137560
Guanaco:基于LLaMA+微調(diào)多語(yǔ)言對(duì)話數(shù)據(jù)集(源自包含52K英文指令數(shù)據(jù)對(duì)的Alpaca+534K的多輪對(duì)話的多語(yǔ)言)
LLMs之Guanaco:《QLoRA:Efficient Finetuning of Quantized LLMs》翻譯與解讀
LLMs之Guanaco:《QLoRA:Efficient Finetuning of Quantized LLMs》翻譯與解讀_一個(gè)處女座的程序猿的博客-CSDN博客
Falcon-Instruct (40B) (Almazrouei et al., 2023a) is a large language model trained by fine-tuning Falcon (40B) (Almazrouei et al., 2023b) on an English dialogue dataset, which contains 150 million tokens from the Baize dataset (Xu et al., 2023c), with an additional 5% of the data coming from the RefinedWeb dataset (Penedo et al., 2023). To reduce memory usage, the authors employed flash attention (Dao et al., 2022) and multi-query techniques. For evaluation, Falcon-Instruct (40B) achieves better performance on the Open LLM Leaderboard (Beeching et al., 2023) than the baseline model Falcon (40B), and outperforms Guanaco (65B), which has more model parameters.
Guanaco (7B) (JosephusCheung, 2021) is a multi-turn dialogue language model trained by fine-tuning LLaMA (7B) (Touvron et al., 2023a) on a constructed multilingual dialogue dataset. The multilingual dialogue dataset comes from two sources: Alpaca (Taori et al., 2023), which contains 52K English instruction data pairs; and a multilingual (e.g., Simplified Chinese, Traditional Chinese, Japanese, German) dialogue dataset, which contains 534K+ multi-turn conversations. After fine-tuning, Guanaco is able to generate role-specific responses and continuous responses on a given topic in multi-turn conversations.
Falcon-Instruct (40B) (Almazrouei等人,2023a)是一個(gè)大型語(yǔ)言模型,它是通過(guò)對(duì)Falcon (40B) (Almazrouei等人,2023b)在英語(yǔ)對(duì)話數(shù)據(jù)集上進(jìn)行微調(diào)訓(xùn)練而成的,該數(shù)據(jù)集包含來(lái)自Baize數(shù)據(jù)集(Xu等人,2023c)的1.5億個(gè)令牌,以及來(lái)自RefinedWeb數(shù)據(jù)集(Penedo等人,2023)的額外5%的數(shù)據(jù)。為了減少內(nèi)存使用,作者采用了Flash Attention (Dao et al., 2022)和多查詢技術(shù)。在評(píng)估中,Falcon-Instruct (40B)在Open LLM排行榜(Beeching et al., 2023)上的表現(xiàn)優(yōu)于基線模型Falcon (40B),也優(yōu)于模型參數(shù)更多的Guanaco (65B)。
Guanaco(7B)(JosephusCheung,2021)是一種多輪對(duì)話語(yǔ)言模型,通過(guò)在構(gòu)建的多語(yǔ)言對(duì)話數(shù)據(jù)集上進(jìn)行微調(diào),使用LLaMA(7B)(Touvron等,2023a)進(jìn)行初始化。多語(yǔ)言對(duì)話數(shù)據(jù)集來(lái)自兩個(gè)來(lái)源:包含52K英文指令數(shù)據(jù)對(duì)的Alpaca(Taori等,2023);以及包含534K+多輪對(duì)話的多語(yǔ)言(例如簡(jiǎn)體中文、繁體中文、日語(yǔ)、德語(yǔ))對(duì)話數(shù)據(jù)。微調(diào)后,Guanaco用于在多輪對(duì)話中生成針對(duì)角色的響應(yīng)和給定主題的連續(xù)響應(yīng)。
Minotaur:基于Starcoder Plus模型+微調(diào)WizardLM和GPTeacher-General-Instruct指令數(shù)據(jù)集
Nous-Herme:基于LLaMA模型+微調(diào)BiologyPhysicsChemistry子集的300K個(gè)指令
Minotaur (15B) is a large language model trained by fine-tuning the Starcoder Plus (15B) (Li et al., 2023f) on open-source instruction datasets including WizardLM (Xu et al., 2023a) and GPTeacher-General-Instruct. For model inference, Minotaur supports a maximum context length of 18K tokens.
Nous-Herme (13B) is a large language model trained by fine-tuning LLaMA (13B) (Touvron et al., 2023a) on an instruction dataset, which contains over 300k instructions, sampled from GPTeacher, CodeAlpaca (Chaudhary, 2023), GPT-4-LLM (Peng et al., 2023), Unnatural Instructions (Honovich et al., 2022), and the BiologyPhysicsChemistry subsets in Camel-AI (Li et al., 2023c). Responses are generated by GPT-4. For evaluation, Nous-Herme (13B) achieves comparable performance to GPT-3.5-turbo on multiple tasks like the ARC challenge (Clark et al., 2018) and BoolQ (Clark et al., 2019).
Minotaur(15B)是一種大型語(yǔ)言模型,通過(guò)在包括WizardLM(Xu等,2023a)和GPTeacher-General-Instruct在內(nèi)的開(kāi)源指令數(shù)據(jù)集上,微調(diào)Starcoder Plus(15B)(Li等,2023f)。在模型推理階段,Minotaur支持最大上下文長(zhǎng)度為18K標(biāo)記。
Nous-Herme(13B)是在一個(gè)包含超過(guò)300K條指令的指令數(shù)據(jù)集上對(duì)LLaMA(13B)(Touvron等,2023a)進(jìn)行微調(diào)得到的大型語(yǔ)言模型,這些指令采樣自GPTeacher、CodeAlpaca(Chaudhary,2023)、GPT-4-LLM(Peng等,2023)、Unnatural Instructions(Honovich等,2022)以及Camel-AI(Li等,2023c)中的BiologyPhysicsChemistry子集,響應(yīng)由GPT-4生成。評(píng)估結(jié)果顯示,Nous-Herme(13B)在多個(gè)任務(wù)(如ARC挑戰(zhàn)和BoolQ)上與GPT-3.5-turbo的性能相當(dāng)。
TüLU :基于OPT 模型+微調(diào)混合指令數(shù)據(jù)集
YuLan-Chat:基于LLaMA模型+微調(diào)雙語(yǔ)數(shù)據(jù)集(25萬(wàn)個(gè)中英文指令對(duì))
TüLU (6.7B) (Wang et al., 2023c) is a large language model trained by fine-tuning OPT (6.7B) (Zhang et al., 2022a) on a mixed instruction dataset, which contains FLAN V2 (Longpre et al., 2023), CoT (Wei et al., 2022), Dolly (Conover et al., 2023a), Open Assistant-1, GPT4-Alpaca, Code-Alpaca (Chaudhary, 2023), and ShareGPT. After fine-tuning, TüLU (6.7B) reaches on average 83% of ChatGPT’s performance and 68% of GPT-4’s performance.
YuLan-Chat (13B) (YuLan-Chat-Team, 2023) is a language model trained by fine-tuning LLaMA (13B) (Touvron et al., 2023a) on a constructed bilingual dataset, which contains 250,000 Chinese-English instruction pairs. After fine-tuning, YuLan-Chat-13B achieves comparable results to the state-of-the-art open-source model ChatGLM (6B) (Du et al., 2022), and outperforms Vicuna (13B) (Chiang et al., 2023) on the English BBH3K dataset (BBH3K is a subset of the BBH benchmark (Srivastava et al., 2022)).
TüLU (6.7B) (Wang等人,2023c)是在混合指令數(shù)據(jù)集上通過(guò)對(duì)OPT (6.7B) (Zhang等人,2022a)進(jìn)行微調(diào)而訓(xùn)練的大型語(yǔ)言模型,該數(shù)據(jù)集包含F(xiàn)LAN V2 (Longpre等人,2023)、CoT (Wei等人,2022)、Dolly (Conover等人,2023a)、Open Assistant-1、GPT4-Alpaca、Code-Alpaca (Chaudhary, 2023)和ShareGPT。經(jīng)過(guò)微調(diào),TüLU (6.7B)平均達(dá)到ChatGPT性能的83%和GPT-4性能的68%。
YuLan-Chat (13B) (YuLan-Chat-Team, 2023)是通過(guò)微調(diào)LLaMA (13B) (Touvron et al., 2023a)在包含25萬(wàn)個(gè)中英文指令對(duì)的構(gòu)建雙語(yǔ)數(shù)據(jù)集上訓(xùn)練的語(yǔ)言模型。經(jīng)過(guò)微調(diào),YuLan-Chat-13B取得了與最先進(jìn)的開(kāi)源模型ChatGLM (6B) (Du等人,2022)相當(dāng)?shù)慕Y(jié)果,并在英語(yǔ)BBH3K數(shù)據(jù)集(BBH3K是BBH基準(zhǔn)(Srivastava et al., 2022)的一個(gè)子集)上優(yōu)于Vicuna (13B) (Chiang等人,2023)。
MOSS:微調(diào)對(duì)話指令的雙語(yǔ)對(duì)話語(yǔ)言模型
Airoboros:基于LLaMA+微調(diào)Self-instruct數(shù)據(jù)集
UltraLM:基于LLaMA模型+微調(diào)
MOSS (16B) is a bilingual dialogue language model, which aims to engage in multi-turn conversations and utilize various plugins, trained by fine-tuning on dialogue instructions. After fine-tuning, MOSS outperforms the backbone model and generates responses that better align with human preferences.
Airoboros (13B) is a large language model trained by fine-tuning LLAMA (13B) (Touvron et al., 2023a) on the Self-instruct dataset (Wang et al., 2022c). After fine-tuning, Airoboros significantly outperforms LLAMA (13B) (Touvron et al., 2023a) on all benchmarks and achieves highly comparable results to models fine-tuned specifically for certain benchmarks.
UltraLM (13B) (Ding et al., 2023a) is a large language model trained by fine-tuning LLaMA (13B) (Touvron et al., 2023a). For evaluation, UltraLM (13B) outperforms Dolly (12B) (Conover et al., 2023a) and achieves a winning rate of up to 98%. Additionally, it surpasses the previous best open-source models (i.e., Vicuna (Chiang et al., 2023) and WizardLM (Xu et al., 2023a)) with winning rates of 9% and 28%, respectively.
MOSS(16B)是一種雙語(yǔ)對(duì)話語(yǔ)言模型,旨在進(jìn)行多輪對(duì)話并利用各種插件,通過(guò)在對(duì)話指令上進(jìn)行微調(diào)訓(xùn)練得到。微調(diào)后,MOSS優(yōu)于骨干模型,并生成與人類偏好更加一致的響應(yīng)。
Airoboros(13B)通過(guò)在Self-instruct數(shù)據(jù)集上進(jìn)行微調(diào),使用LLaMA(13B)(Touvron等,2023a)進(jìn)行初始化。微調(diào)后,Airoboros在所有基準(zhǔn)數(shù)據(jù)集上明顯優(yōu)于LLAMA(13B),并且與專門(mén)針對(duì)某些基準(zhǔn)測(cè)試進(jìn)行微調(diào)的模型取得了高度可比性的結(jié)果。
UltraLM(13B)(Ding等,2023a)通過(guò)對(duì)LLAMA(13B)(Touvron等,2023a)進(jìn)行微調(diào)獲得,微調(diào)后在性能上優(yōu)于Dolly(12B)(Conover等,2023a)并達(dá)到98%的勝率。此外,它在性能上超越了之前的最佳開(kāi)源模型(即Vicuna和WizardLM),其勝率分別為9%和28%。
5、Multi-modality Instruction Fine-tuning多模態(tài)指令微調(diào)
5.1、Multi-modality Datasets多模態(tài)數(shù)據(jù)集
MULTIINSTRUCT—多模態(tài)指令微調(diào)數(shù)據(jù)集—OFA模型:由62個(gè)不同的多模態(tài)任務(wù)組成+統(tǒng)一的序列到序列格式
MULTIINSTRUCT (Xu et al., 2022) is a multimodal instruction tuning dataset consisting of 62 diverse multimodal tasks in a unified seq-to-seq format. This dataset covers 10 broad categories and its tasks are derived from 21 existing open-source datasets. Each task is equipped with 5 expert-written instructions. For the existing tasks, the authors use the input/output pairs from their available open-source datasets to create instances, while for each new task, the authors create 5k to 5M instances by extracting the necessary information from instances of existing tasks or reformulating them. The MULTIINSTRUCT dataset has demonstrated its efficiency in enhancing various transfer learning techniques. For example, fine-tuning the OFA model (930M) (Wang et al., 2022a) on MULTIINSTRUCT with various transfer learning strategies, such as Mixed Instruction Tuning and Sequential Instruction Tuning, improves the zero-shot performance across all unseen tasks. On the commonsense VQA task, OFA fine-tuned on MULTIINSTRUCT achieves 50.60 RougeL and 31.17 accuracy, while the original OFA achieves 14.97 RougeL and 0.40 accuracy.
MULTIINSTRUCT(Xu等,2022)是一個(gè)多模態(tài)指令微調(diào)數(shù)據(jù)集,由62個(gè)不同的多模態(tài)任務(wù)組成,以統(tǒng)一的序列到序列格式呈現(xiàn)。該數(shù)據(jù)集涵蓋10個(gè)廣泛的類別,其任務(wù)來(lái)自21個(gè)現(xiàn)有的開(kāi)源數(shù)據(jù)集。每個(gè)任務(wù)配備了5個(gè)專家編寫(xiě)的指令。
>> 對(duì)于現(xiàn)有任務(wù),作者使用其可用的開(kāi)源數(shù)據(jù)集中的輸入/輸出對(duì)創(chuàng)建實(shí)例。
>> 而對(duì)于每個(gè)新任務(wù),作者通過(guò)從現(xiàn)有任務(wù)的實(shí)例中提取必要信息或重新構(gòu)建它們來(lái)創(chuàng)建5k到5M個(gè)實(shí)例。
MULTIINSTRUCT數(shù)據(jù)集已經(jīng)證明了其在增強(qiáng)各種遷移學(xué)習(xí)技術(shù)方面的有效性。例如,使用Mixed Instruction Tuning和Sequential Instruction Tuning等各種遷移學(xué)習(xí)策略對(duì)OFA模型(930M)(Wang等,2022a)在MULTIINSTRUCT上進(jìn)行微調(diào),改進(jìn)了所有未見(jiàn)任務(wù)的零樣本性能。在常識(shí)視覺(jué)問(wèn)答任務(wù)上,經(jīng)過(guò)MULTIINSTRUCT微調(diào)的OFA在RougeL上達(dá)到50.60,在準(zhǔn)確率上達(dá)到31.17,而原始OFA在RougeL上只有14.97,在準(zhǔn)確率上只有0.40。
PMC-VQA—大規(guī)模的醫(yī)學(xué)視覺(jué)問(wèn)答數(shù)據(jù)集—MedVInT模型:227k個(gè)圖像-問(wèn)題對(duì)和149k個(gè)圖像,從PMC-OA收集圖像-標(biāo)題對(duì)+ChatGPT生成問(wèn)題-答案對(duì)+手工驗(yàn)證
PMC-VQA (Zhang et al., 2023c) is a large-scale medical visual question-answering dataset that comprises 227k image-question pairs covering 149k images and various modalities or diseases. The dataset can be used for both open-ended and multiple-choice tasks. The pipeline for generating the PMC-VQA dataset involves collecting image-caption pairs from the PMC-OA (Lin et al., 2023) dataset, using ChatGPT to generate question-answer pairs, and manually verifying a subset of the dataset for quality. The authors propose a generative model, MedVInT, for medical visual understanding by aligning visual information with a large language model. MedVInT pretrained on PMC-VQA achieves state-of-the-art performance and outperforms existing models on the VQA-RAD (Lau et al., 2018) and SLAKE (Liu et al., 2021a) benchmarks, with 81.6% accuracy on VQA-RAD and 88.0% accuracy on SLAKE.
PMC-VQA(Zhang等,2023c)是一個(gè)大規(guī)模的醫(yī)學(xué)視覺(jué)問(wèn)答數(shù)據(jù)集,包括涵蓋149k張圖像的227k個(gè)圖像-問(wèn)題對(duì),覆蓋各種模態(tài)和疾病。該數(shù)據(jù)集可用于開(kāi)放式和多項(xiàng)選擇任務(wù)。生成PMC-VQA數(shù)據(jù)集的流程涉及從PMC-OA(Lin等,2023)數(shù)據(jù)集中收集圖像-標(biāo)題對(duì),使用ChatGPT生成問(wèn)題-答案對(duì),并對(duì)數(shù)據(jù)集的子集進(jìn)行手工驗(yàn)證以確保質(zhì)量。作者提出了一種生成式模型MedVInT,通過(guò)將視覺(jué)信息與大型語(yǔ)言模型對(duì)齊,實(shí)現(xiàn)醫(yī)學(xué)視覺(jué)理解。在PMC-VQA上預(yù)訓(xùn)練的MedVInT取得了最先進(jìn)的性能,并在VQA-RAD(Lau等,2018)和SLAKE(Liu等,2021a)基準(zhǔn)上優(yōu)于現(xiàn)有模型,VQA-RAD上的準(zhǔn)確率為81.6%,SLAKE上的準(zhǔn)確率為88.0%。
LAMM—2D圖像和3D點(diǎn)云理解:包含186K個(gè)語(yǔ)言-圖像指令-響應(yīng)對(duì),以及10K個(gè)語(yǔ)言-點(diǎn)云指令-響應(yīng)對(duì)
LAMM (Yin et al., 2023) is a comprehensive multi-modal instruction tuning dataset for 2D image and 3D point cloud understanding. LAMM contains 186K language-image instruction-response pairs, and 10K language-point cloud instruction-response pairs. The authors collect images and point clouds from publicly available datasets and use the GPT-API and self-instruction methods to generate instructions and responses based on the original labels from these datasets. The LAMM-Dataset also includes data pairs for commonsense knowledge question answering by incorporating the hierarchical knowledge graph label system from the Bamboo (Zhang et al., 2022b) dataset and the corresponding Wikipedia descriptions. The authors also propose the LAMM-Benchmark, which evaluates existing multi-modal language models (MLLMs) on various computer vision tasks and includes 9 common image tasks and 3 common point cloud tasks, and the LAMM-Framework, a primary MLLM training framework that differentiates the encoder, projector, and LLM fine-tuning blocks for different modalities to avoid modality conflicts.
LAMM(Yin等,2023)是一個(gè)全面的多模態(tài)指令微調(diào)數(shù)據(jù)集,用于2D圖像和3D點(diǎn)云理解。LAMM包含186K個(gè)語(yǔ)言-圖像指令-響應(yīng)對(duì),以及10K個(gè)語(yǔ)言-點(diǎn)云指令-響應(yīng)對(duì)。作者從公開(kāi)可用的數(shù)據(jù)集中收集圖像和點(diǎn)云,并使用GPT-API和自我指導(dǎo)方法,根據(jù)這些數(shù)據(jù)集的原始標(biāo)簽生成指令和響應(yīng)。LAMM-Dataset還通過(guò)整合Bamboo(Zhang等,2022b)數(shù)據(jù)集的分層知識(shí)圖譜標(biāo)簽系統(tǒng)和相應(yīng)的維基百科描述,納入了常識(shí)知識(shí)問(wèn)答的數(shù)據(jù)對(duì)。作者還提出了LAMM-Benchmark,用于評(píng)估現(xiàn)有的多模態(tài)語(yǔ)言模型(MLLM)在各種計(jì)算機(jī)視覺(jué)任務(wù)上的性能,其中包括9個(gè)常見(jiàn)的圖像任務(wù)和3個(gè)常見(jiàn)的點(diǎn)云任務(wù);以及LAMM-Framework,一個(gè)基礎(chǔ)的MLLM訓(xùn)練框架,為不同的模態(tài)區(qū)分編碼器、投影器和LLM微調(diào)模塊,以避免模態(tài)沖突。
5.2、Multi-modality Instruction Fine-tuning Models多模態(tài)指令微調(diào)模型
InstructPix2Pix條件擴(kuò)散模型:基于Stable Diffusion+微調(diào)多模態(tài)數(shù)據(jù)集(綜合兩大模型能力【GPT-3、Stable Diffusion】來(lái)生成)
InstructPix2Pix (983M) (Brooks et al., 2022) is a conditional diffusion model trained by fine-tuning Stable Diffusion (983M) (Rombach et al., 2022) on a constructed multi-modal dataset that contains more than 450K text editing instructions and corresponding images before and after the edit. The authors combine the abilities of two large-scale pre-trained models, a language model, GPT-3 (Brown et al., 2020b), and a text-to-image model, Stable Diffusion (Rombach et al., 2022), to generate the training dataset. GPT-3 is fine-tuned to generate text edits based on image prompts, while Stable Diffusion is used to convert the generated text edits into actual image edits. InstructPix2Pix is then trained on this generated dataset using a latent diffusion objective. Figure 5 shows the process of generating the image editing dataset and training the diffusion model on that dataset. The authors compare the proposed method qualitatively with previous works such as SDEdit (Meng et al., 2022) and Text2Live (Bar-Tal et al., 2022), highlighting the ability of the model to follow image editing instructions instead of descriptions of the image or edit layer. The authors also present quantitative comparisons with SDEdit (Meng et al., 2022) using metrics measuring image consistency and edit quality.
InstructPix2Pix(983M)(Brooks等,2022)是一種條件擴(kuò)散模型,通過(guò)在構(gòu)建的多模態(tài)數(shù)據(jù)集上對(duì)Stable Diffusion(983M)(Rombach等,2022)進(jìn)行微調(diào)而訓(xùn)練得到,該數(shù)據(jù)集包含超過(guò)450K個(gè)文本編輯指令和相應(yīng)的編輯前后圖像。作者將兩個(gè)大規(guī)模預(yù)訓(xùn)練模型的能力結(jié)合在一起,即語(yǔ)言模型GPT-3(Brown等,2020b)和文本到圖像模型Stable Diffusion(Rombach等,2022),以生成訓(xùn)練數(shù)據(jù)集。GPT-3被微調(diào)以根據(jù)圖像提示生成文本編輯,而Stable Diffusion則用于將生成的文本編輯轉(zhuǎn)換為實(shí)際圖像編輯。然后,InstructPix2Pix在此生成的數(shù)據(jù)集上使用潛在擴(kuò)散目標(biāo)進(jìn)行訓(xùn)練。圖5展示了生成圖像編輯數(shù)據(jù)集的過(guò)程以及在該數(shù)據(jù)集上訓(xùn)練擴(kuò)散模型的過(guò)程。
作者將所提出的方法與之前的作品(如SDEdit和Text2Live)進(jìn)行了定性比較,強(qiáng)調(diào)該模型能夠按照?qǐng)D像編輯指令進(jìn)行操作,而不是圖像或編輯層的描述。作者還使用衡量圖像一致性和編輯質(zhì)量的指標(biāo)對(duì)其與SDEdit進(jìn)行了定量比較。
LLaVA:基于CLIP視覺(jué)編碼器和LLaMA語(yǔ)言解碼器模型+微調(diào)158K個(gè)獨(dú)特的語(yǔ)言-圖像指令-跟隨樣本的教學(xué)視覺(jué)語(yǔ)言數(shù)據(jù)集(利用GPT-4轉(zhuǎn)換格式)
LLaVA (13B) (Liu et al., 2023b) is a large multimodal model developed by connecting the visual encoder of CLIP (400M) (Radford et al., 2021) with the language decoder LLaMA (7B) (Touvron et al., 2023a). LLaVA is fine-tuned on a generated instructional vision-language dataset consisting of 158K unique language-image instruction-following samples. The data collection process involved creating conversation, detailed description, and complex reasoning prompts. GPT-4 is used to convert image-text pairs into the appropriate instruction-following format for this dataset. Visual features such as captions and bounding boxes were used to encode images. LLaVA yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
LLaVA(13B)(Liu等,2023b)是一個(gè)大型多模態(tài)模型,通過(guò)將CLIP(400M)(Radford等,2021)的視覺(jué)編碼器與LLaMA(7B)(Touvron等,2023a)的語(yǔ)言解碼器相連接而開(kāi)發(fā)。LLaVA通過(guò)生成包含158K個(gè)獨(dú)特的語(yǔ)言-圖像指令-跟隨樣本的教學(xué)視覺(jué)語(yǔ)言數(shù)據(jù)集進(jìn)行微調(diào)。
數(shù)據(jù)收集過(guò)程涉及創(chuàng)建會(huì)話、詳細(xì)描述和復(fù)雜推理提示。使用GPT-4將圖像-文本對(duì)轉(zhuǎn)換為適用于此數(shù)據(jù)集的適當(dāng)?shù)闹噶罡S格式。使用標(biāo)題和邊界框等視覺(jué)特征來(lái)編碼圖像。LLaVA在合成多模態(tài)指令跟隨數(shù)據(jù)集上相對(duì)于GPT-4的得分為85.1%。在Science QA上進(jìn)行微調(diào)時(shí),LLaVA和GPT-4的協(xié)同作用實(shí)現(xiàn)了92.53%的新的最高準(zhǔn)確率。
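下面用一個(gè)極簡(jiǎn)的PyTorch草圖示意LLaVA這類模型“視覺(jué)編碼器+投影層+語(yǔ)言解碼器”的連接思路:把視覺(jué)特征線性映射到LLM的嵌入維度后,作為前綴token與文本嵌入拼接。其中的維度和變量名均為假設(shè)值,并非LLaVA的官方實(shí)現(xiàn)。

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """把凍結(jié)視覺(jué)編碼器輸出的圖像特征投影到LLM的嵌入空間(假設(shè)性的簡(jiǎn)化結(jié)構(gòu))。"""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # LLaVA式的線性投影層

    def forward(self, image_features: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim),來(lái)自凍結(jié)的視覺(jué)編碼器(如CLIP)
        # text_embeds:    (batch, seq_len, llm_dim),來(lái)自LLM自身的詞嵌入層
        visual_tokens = self.proj(image_features)               # 映射到LLM嵌入維度
        return torch.cat([visual_tokens, text_embeds], dim=1)   # 視覺(jué)token作為前綴拼接

# 用法示意:拼接后的嵌入可作為inputs_embeds送入語(yǔ)言解碼器
projector = VisionToLLMProjector()
img_feats = torch.randn(2, 256, 1024)
txt_embeds = torch.randn(2, 32, 4096)
print(projector(img_feats, txt_embeds).shape)  # torch.Size([2, 288, 4096])
```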
Video-LLaMA多模態(tài)框架:由兩個(gè)分支編碼器組成(視覺(jué)-語(yǔ)言VL分支和音頻-語(yǔ)言AL分支+語(yǔ)言解碼器LLaMA)
Video-LLaMA (Zhang et al., 2023b) is a multimodal framework that enhances large language models with the ability to understand both visual and auditory content in videos. The architecture of Video-LLaMA consists of two branch encoders: the Vision-Language (VL) Branch and the Audio-Language (AL) Branch, and a language decoder (Vicuna (7B/13B) (Chiang et al., 2023), LLaMA (7B) (Touvron et al., 2023a), etc.). The VL Branch includes a frozen pre-trained image encoder (the pre-trained vision component of BLIP-2 (Li et al., 2023d), which includes a ViT-G/14 and a pre-trained Q-Former), a position embedding layer, a video Q-Former and a linear layer. The AL Branch includes a pre-trained audio encoder (ImageBind (Girdhar et al., 2023)) and an audio Q-Former. Figure 6 shows the overall architecture of Video-LLaMA with the Vision-Language Branch and Audio-Language Branch. The VL Branch is trained on the Webvid-2M (Bain et al., 2021) video caption dataset with a video-to-text generation task, and fine-tuned on instruction-tuning data from MiniGPT-4 (Zhu et al., 2023), LLaVA (Liu et al., 2023b) and VideoChat (Li et al., 2023e). The AL Branch is trained on video/image instruction-caption data to connect the output of ImageBind to the language decoder. After fine-tuning, Video-LLaMA can perceive and comprehend video content, demonstrating its ability to integrate auditory and visual information, understand static images, recognize common-knowledge concepts, and capture temporal dynamics in videos.
Video-LLaMA(Zhang等,2023b)是一個(gè)多模態(tài)框架,通過(guò)在視頻中理解視覺(jué)和聽(tīng)覺(jué)內(nèi)容來(lái)增強(qiáng)大型語(yǔ)言模型的能力。Video-LLaMA的架構(gòu)由兩個(gè)分支編碼器組成:視覺(jué)-語(yǔ)言(VL)分支和音頻-語(yǔ)言(AL)分支,以及一個(gè)語(yǔ)言解碼器(Vicuna(7B/13B)(Chiang等,2023),LLaMA(7B)(Touvron等,2023a)等)。
VL分支包括一個(gè)凍結(jié)的預(yù)訓(xùn)練圖像編碼器(BLIP-2的預(yù)訓(xùn)練視覺(jué)組件(Li等,2023d),其中包括一個(gè)ViT-G/14和一個(gè)預(yù)訓(xùn)練的Q-Former)、一個(gè)位置嵌入層、一個(gè)視頻Q-Former和一個(gè)線性層。
AL分支包括一個(gè)預(yù)訓(xùn)練的音頻編碼器(ImageBind(Girdhar等,2023))和一個(gè)音頻Q-former。圖6展示了Video-LLaMA的整體架構(gòu),包括視覺(jué)-語(yǔ)言分支和音頻-語(yǔ)言分支。
VL分支在Webvid-2M(Bain等,2021)視頻字幕數(shù)據(jù)集上進(jìn)行訓(xùn)練,進(jìn)行視頻到文本生成任務(wù),并在來(lái)自MiniGPT-4(Zhu等,2023)、LLaVA(Liu等,2023b)和VideoChat(Li等,2023e)的指令微調(diào)數(shù)據(jù)上進(jìn)行微調(diào)。
AL分支在視頻/圖像指令-字幕數(shù)據(jù)上進(jìn)行訓(xùn)練,將ImageBind的輸出連接到語(yǔ)言解碼器。
微調(diào)后,Video-LLaMA能夠感知和理解視頻內(nèi)容,展示了其整合聽(tīng)覺(jué)和視覺(jué)信息、理解靜態(tài)圖像、識(shí)別常識(shí)概念以及捕捉視頻中的時(shí)間動(dòng)態(tài)的能力。
InstructBLIP視覺(jué)-語(yǔ)言指令微調(diào)框架:基于BLIP-2模型(圖像編碼器+LLM+Query Transformer)
InstructBLIP (1.2B) (Dai et al., 2023) is a vision-language instruction tuning framework initialized with a pre-trained BLIP-2 (Li et al., 2023d) model consisting of an image encoder, an LLM (FlanT5 (3B/11B) (Chung et al., 2022) or Vicuna (7B/13B) (Chiang et al., 2023)), and a Query Transformer (Q-Former) to bridge the two. As shown in Figure 7, the Q-Former extracts instruction-aware visual features from the output embeddings of the frozen image encoder, and feeds the visual features as soft prompt input to the frozen LLM. The authors evaluate the proposed InstructBLIP model on a variety of vision-language tasks, including image classification, image captioning, image question answering, and visual reasoning. They use 26 publicly available datasets, dividing them into 13 held-in and 13 held-out datasets for training and evaluation. The authors demonstrate that InstructBLIP achieves state-of-the-art zero-shot performance on a wide range of vision-language tasks. InstructBLIP yields an average relative improvement of 15.0% compared to BLIP-2, and the smallest InstructBLIP (4B) outperforms Flamingo (80B) (Alayrac et al., 2022) on all six shared evaluation datasets with an average relative improvement of 24.8%.
InstructBLIP(1.2B)(Dai等,2023)是一個(gè)視覺(jué)-語(yǔ)言指令微調(diào)框架,其初始化為一個(gè)預(yù)訓(xùn)練的BLIP-2(Li等,2023d)模型,包括圖像編碼器、LLM(FlanT5(3B/11B)(Chung等,2022)或Vicuna(7B/13B)(Chiang等,2023))和一個(gè)Query Transformer(Q-Former)以連接兩者。如圖7所示,Q-Former從凍結(jié)的圖像編碼器的輸出嵌入中提取指令感知的視覺(jué)特征,并將視覺(jué)特征作為軟提示輸入到凍結(jié)的LLM中。
作者在各種視覺(jué)-語(yǔ)言任務(wù)上評(píng)估了所提出的InstructBLIP模型,包括圖像分類、圖像字幕生成、圖像問(wèn)答和視覺(jué)推理。他們使用了26個(gè)公開(kāi)可用的數(shù)據(jù)集,將其分為13個(gè)訓(xùn)練所用(held-in)數(shù)據(jù)集和13個(gè)留出(held-out)數(shù)據(jù)集,分別用于訓(xùn)練和評(píng)估。作者證明InstructBLIP在各種視覺(jué)-語(yǔ)言任務(wù)上實(shí)現(xiàn)了最先進(jìn)的零樣本性能。相較于BLIP-2,InstructBLIP平均相對(duì)改進(jìn)15.0%;最小的InstructBLIP(4B)在六個(gè)共享評(píng)估數(shù)據(jù)集上均優(yōu)于Flamingo(80B)(Alayrac等,2022),平均相對(duì)改進(jìn)為24.8%。
Otter:基于OpenFlamingo模型+只微調(diào)Perceiver重采樣模塊、交叉注意力層和輸入/輸出嵌入
Otter (Li et al., 2023b) is a multi-modal model trained by fine-tuning OpenFlamingo (9B) (Awadalla et al., 2023), with the language and vision encoders frozen and only the Perceiver resampler module, cross-attention layers, and input/output embeddings being fine-tuned. The authors organize diverse multi-modal tasks covering 11 categories and build the multi-modal in-context instruction tuning dataset MIMIC-IT of 2.8M multimodal instruction-response pairs, which consists of image-instruction-answer triplets, where the instruction-answer is tailored to the image. Each data sample also includes context, which contains a series of image-instruction-answer triplets that contextually correlate with the queried triplet. Otter demonstrates the ability to follow user instructions more accurately and provide more detailed descriptions of images compared to OpenFlamingo (Awadalla et al., 2023).
Otter(Li等,2023b)是一種多模態(tài)模型,通過(guò)微調(diào)OpenFlamingo(9B)(Awadalla等,2023)得到,其中語(yǔ)言和視覺(jué)編碼器被凍結(jié),只微調(diào)了Perceiver重采樣模塊、交叉注意力層和輸入/輸出嵌入。作者組織了涵蓋11個(gè)類別的多樣多模態(tài)任務(wù),并構(gòu)建了包含2.8M個(gè)多模態(tài)指令-響應(yīng)對(duì)的多模態(tài)上下文指令微調(diào)數(shù)據(jù)集MIMIC-IT,其樣本為圖像-指令-答案三元組,指令-答案針對(duì)圖像量身定制。每個(gè)數(shù)據(jù)樣本還包括上下文,即一系列與被查詢?nèi)M在上下文上相關(guān)的圖像-指令-答案三元組。相對(duì)于OpenFlamingo(Awadalla等,2023),Otter能夠更準(zhǔn)確地遵循用戶指令,并對(duì)圖像提供更詳細(xì)的描述。
MultiModal-GPT:多模態(tài)指令微調(diào)模型
MultiModal-GPT (Gong et al., 2023) is a multi-modal instruction tuning model that is capable of following diverse instructions, generating detailed captions, counting specific objects, and addressing general inquiries. MultiModal-GPT is trained by fine-tuning OpenFlamingo (9B) (Awadalla et al., 2023) on various created visual instruction data with open datasets, including VQA, Image Captioning, Visual Reasoning, Text OCR, and Visual Dialogue. The experiments demonstrate the proficiency of MultiModal-GPT in maintaining continuous dialogues with humans.
MultiModal-GPT(Gong等,2023)是一種多模態(tài)指令微調(diào)模型,能夠遵循不同的指令,生成詳細(xì)的標(biāo)題,計(jì)數(shù)特定的對(duì)象,并回答一般性問(wèn)題。MultiModal-GPT通過(guò)在包括VQA、圖像字幕生成、視覺(jué)推理、文本OCR和視覺(jué)對(duì)話等的各種創(chuàng)建的視覺(jué)指令數(shù)據(jù)上微調(diào)OpenFlamingo(9B)(Awadalla等,2023)而訓(xùn)練得到。實(shí)驗(yàn)展示了MultiModal-GPT在與人類保持持續(xù)對(duì)話方面的能力。
6、Domain-specific Instruction Finetuning特定領(lǐng)域指令微調(diào)
In this section, we describe instruction tuning in different domains and applications.
在本節(jié)中,我們描述了不同領(lǐng)域和應(yīng)用中的指令微調(diào)。
6.1、Dialogue對(duì)話—InstructDial、LINGUIST模型:每個(gè)任務(wù)實(shí)例{任務(wù)描述、實(shí)例輸入、約束、指令和輸出}+兩個(gè)元任務(wù)(指令選擇任務(wù)+指令二元任務(wù))
InstructDial (Gupta et al., 2022) is an instruction tuning framework designed for dialogue. It contains a collection of 48 dialogue tasks in a consistent text-to-text format created from 59 dialogue datasets. Each task instance includes a task description, instance inputs, constraints, instructions, and output. To ensure adherence to instructions, the framework introduces two meta-tasks: (1) an instruction selection task, where the model selects the instruction corresponding to a given input-output pair; and (2) an instruction binary task, where the model predicts "yes" or "no" depending on whether an instruction leads to a given output from an input. Two base models, T0-3B (Sanh et al., 2021) (the 3B-parameter version of T5 (Lester et al., 2021)) and BART0 (Lin et al., 2022) (406M parameters, based on Bart-large (Lewis et al., 2019)), are fine-tuned on the tasks from InstructDial. InstructDial achieves impressive results on unseen dialogue datasets and tasks, including dialogue evaluation and intent detection. Moreover, it delivers even better results when applied in a few-shot setting.
Intent Classification and Slot Tagging. LINGUIST (Rosenbaum et al., 2022) finetunes AlexaTM 5B (Soltan et al., 2022), a 5-billion-parameter multilingual model, on an instruction dataset for intent classification and slot tagging tasks. Each instruction consists of five blocks: (i) the language of the generated output, (ii) the intent, (iii) the slot types and values to include in the output (e.g., the number 3 in [3, snow] corresponds to the slot type, and snow is the value used for that slot), (iv) a mapping from slot type labels to numbers, and (v) up to 10 examples to instruct the format of the outputs. LINGUIST shows significant improvements over state-of-the-art approaches in a 10-shot novel intent setting using the SNIPS dataset (Coucke et al., 2018). In the zero-shot cross-lingual setting of the mATIS++ dataset (Xu et al., 2020), LINGUIST surpasses a strong baseline of Machine Translation with Slot Alignment across 6 languages while maintaining intent classification performance.
InstructDial(Gupta等,2022)是一個(gè)專為對(duì)話設(shè)計(jì)的指令微調(diào)框架。它包含一個(gè)由59個(gè)對(duì)話數(shù)據(jù)集創(chuàng)建的一致的文本到文本格式的48個(gè)對(duì)話任務(wù)集合。
每個(gè)任務(wù)實(shí)例包括任務(wù)描述、實(shí)例輸入、約束、指令和輸出。為了確保遵循指令,該框架引入了兩個(gè)元任務(wù):(1)指令選擇任務(wù),模型根據(jù)給定的輸入-輸出對(duì)選擇相應(yīng)的指令;
(2)指令二元任務(wù),模型判斷某條指令能否將給定輸入引導(dǎo)至給定輸出,并預(yù)測(cè)“是”或“否”。
兩個(gè)基本模型T0-3B(Sanh等,2021)(T5的3B參數(shù)版本(Lester等,2021))和BART0(Lin等,2022)(基于Bart-large(Lewis等,2019)的406M參數(shù))在來(lái)自InstructDial的任務(wù)上進(jìn)行微調(diào)。InstructDial在看不見(jiàn)的對(duì)話數(shù)據(jù)集和任務(wù)上取得了令人印象深刻的成績(jī),包括對(duì)話評(píng)估和意圖檢測(cè)。此外,當(dāng)應(yīng)用于少樣本設(shè)置時(shí),它甚至可以獲得更好的結(jié)果。
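下面給出這兩個(gè)元任務(wù)的一個(gè)構(gòu)造示意(Python,提示語(yǔ)措辭為假設(shè),并非InstructDial的原始模板),說(shuō)明如何把“指令選擇”和“指令二元判斷”組織成文本到文本樣本:

```python
def instruction_selection_sample(candidates, input_text, output_text, gold_instruction):
    """元任務(wù)1:給定輸入-輸出對(duì),讓模型從候選中選出對(duì)應(yīng)的指令。"""
    prompt = ("Which instruction maps the input to the output?\n"
              f"Input: {input_text}\nOutput: {output_text}\n"
              "Candidates: " + " | ".join(candidates))
    return {"input": prompt, "target": gold_instruction}

def instruction_binary_sample(instruction, input_text, output_text, is_correct):
    """元任務(wù)2:判斷某條指令能否由給定輸入得到給定輸出,輸出yes/no。"""
    prompt = (f"Instruction: {instruction}\nInput: {input_text}\nOutput: {output_text}\n"
              "Does this instruction lead to the given output? Answer yes or no.")
    return {"input": prompt, "target": "yes" if is_correct else "no"}

# 用法示意
sample = instruction_binary_sample(
    instruction="Respond to the user politely.",
    input_text="User: Can you help me book a table?",
    output_text="Sure, I'd be happy to help you book a table!",
    is_correct=True,
)
print(sample["target"])  # yes
```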
意圖分類和槽位標(biāo)記。LINGUIST(Rosenbaum等,2022)在用于意圖分類和槽位標(biāo)記任務(wù)的指令數(shù)據(jù)集上,對(duì)AlexaTM 5B(Soltan等,2022)這一50億參數(shù)的多語(yǔ)言模型進(jìn)行微調(diào)。每條指令由五個(gè)部分組成:
(i)生成輸出所用的語(yǔ)言;
(ii)意圖;
(iii)要包含在輸出中的槽位類型和值(例如,[3, snow]中的數(shù)字3對(duì)應(yīng)槽位類型,snow是該槽位所用的值);
(iv)從槽位類型標(biāo)簽到數(shù)字的映射;
(v)最多10個(gè)用于指示輸出格式的示例。
在使用SNIPS數(shù)據(jù)集(Coucke等,2018)的10樣本新意圖設(shè)置中,LINGUIST相比最先進(jìn)方法取得了顯著提升;在mATIS++數(shù)據(jù)集(Xu等,2020)的零樣本跨語(yǔ)言設(shè)置中,LINGUIST在保持意圖分類性能的同時(shí),在6種語(yǔ)言上超越了“機(jī)器翻譯+槽位對(duì)齊”的強(qiáng)基線。
6.3、Information Extraction信息抽取—InstructUIE:基于FlanT5模型+指令微調(diào)的統(tǒng)一信息抽取(IE)框架+將IE任務(wù)轉(zhuǎn)化為seq2seq格式,每個(gè)任務(wù)實(shí)例四個(gè)屬性{任務(wù)指令、選項(xiàng)、文本、輸出}
InstructUIE (Wang et al., 2023b) is a unified information extraction (IE) framework based on instruction tuning, which transforms IE tasks to the seq2seq format and solves them by fine-tuning 11B FlanT5 (Chung et al., 2022) on the constructed IT dataset. Figure 8 shows the overall architecture of InstructUIE. It introduces IE INSTRUCTIONS, a benchmark of 32 diverse information extraction datasets in a unified text-to-text format with expert-written instructions. Each task instance is delineated by four properties: task instruction, options, text, and output. The task instruction contains information such as the type of information to be extracted, the output structure format, and additional constraints or rules that need to be adhered to during the extraction process. Options refer to the output label constraints of a task. Text refers to the input sentence. Output is the sentence obtained by converting the original tags of the sample (e.g., "entity tag: entity span" for NER). In the supervised setting, InstructUIE performs comparably to BERT (Devlin et al., 2018), and it outperforms the state of the art and GPT-3.5 (Brown et al., 2020a) in zero-shot settings.
InstructUIE(Wang等,2023b)是一個(gè)基于指令微調(diào)的統(tǒng)一信息抽取(IE)框架,它將IE任務(wù)轉(zhuǎn)化為seq2seq格式,并通過(guò)在構(gòu)建的IT數(shù)據(jù)集上微調(diào)11B FlanT5(Chung等,2022)來(lái)解決這些問(wèn)題。
圖8展示了InstructUIE的整體架構(gòu)。它引入了IE INSTRUCTIONS,這是一個(gè)由32個(gè)多樣的信息抽取數(shù)據(jù)集組成的基準(zhǔn),以統(tǒng)一的文本到文本格式呈現(xiàn),其中包含專家編寫(xiě)的指令。
每個(gè)任務(wù)實(shí)例由四個(gè)屬性描述:任務(wù)指令、選項(xiàng)、文本和輸出。
>> 任務(wù)指令包含諸如要提取的信息類型、輸出結(jié)構(gòu)格式以及在提取過(guò)程中需要遵循的附加約束或規(guī)則等信息。
>> 選項(xiàng)是任務(wù)的輸出標(biāo)簽約束。
>> 文本是輸入句子。
>> 輸出是通過(guò)將樣本的原始標(biāo)簽(例如,NER中的"實(shí)體標(biāo)簽:實(shí)體跨度")轉(zhuǎn)換為句子獲得的(實(shí)體標(biāo)簽為槽位標(biāo)簽,實(shí)體跨度為值)。
在監(jiān)督設(shè)置下,InstructUIE的表現(xiàn)與BERT(Devlin等,2018)相當(dāng);在零樣本設(shè)置中,它超越了最先進(jìn)方法以及GPT-3.5(Brown等,2020a)。
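下面用一小段Python示意如何把一條NER樣本組織成上述“任務(wù)指令、選項(xiàng)、文本、輸出”四屬性的seq2seq格式(字段措辭為假設(shè),并非論文原模板):

```python
def build_ner_instance(text, entities, label_set):
    """把一條NER標(biāo)注樣本轉(zhuǎn)成“任務(wù)指令/選項(xiàng)/文本/輸出”四屬性的seq2seq樣本。"""
    task_instruction = ("Please extract all named entities from the given text and "
                        "output them as 'entity type: entity span', separated by '; '.")
    options = ", ".join(label_set)                                   # 輸出標(biāo)簽約束
    output = "; ".join(f"{etype}: {span}" for span, etype in entities)
    source = f"Instruction: {task_instruction}\nOptions: {options}\nText: {text}"
    return {"input": source, "target": output}

sample = build_ner_instance(
    text="Steve Jobs founded Apple in California.",
    entities=[("Steve Jobs", "person"), ("Apple", "organization"), ("California", "location")],
    label_set=["person", "organization", "location"],
)
print(sample["input"])
print(sample["target"])   # person: Steve Jobs; organization: Apple; location: California
```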
6.4、ABSA基于方面的情感分析:基于T5模型
ABSA/Aspect-based Sentiment Analysis基于方面的情感分析
Varia et al. (2022) propose a unified instruction tuning framework for solving the Aspect-based Sentiment Analysis (ABSA) task based on a fine-tuned T5 (220M) (Raffel et al., 2019) model. The framework addresses multiple factorized sub-tasks that involve the four elements of ABSA, namely Aspect Term, Aspect Category, Opinion Term, and Sentiment. It treats these sub-tasks as a combination of five Question Answering (QA) tasks by transforming each sentence in the corpus using instruction templates provided for each task. For instance, one of the instruction templates used is "What are the aspect terms in the text: $TEXT?". The framework showcases substantial improvement (8.29 F1 on average) over the state-of-the-art in few-shot learning scenarios and remains comparable in full fine-tuning scenarios.
Varia等(2022)提出了一個(gè)統(tǒng)一的指令微調(diào)框架,基于微調(diào)的T5(220M)(Raffel等,2019)模型來(lái)解決基于方面的情感分析(ABSA)任務(wù)。該框架處理涉及ABSA四個(gè)要素的多個(gè)分解子任務(wù),即方面術(shù)語(yǔ)、方面類別、意見(jiàn)術(shù)語(yǔ)和情感。它將這些子任務(wù)視為五個(gè)問(wèn)答(QA)任務(wù)的組合,通過(guò)使用為每個(gè)任務(wù)提供的指令模板來(lái)轉(zhuǎn)化語(yǔ)料庫(kù)中的每個(gè)句子。例如,所使用的指令模板之一是"What are the aspect terms in the text: $TEXT?"。該框架在少樣本學(xué)習(xí)場(chǎng)景中相比最先進(jìn)方法取得了顯著提升(平均F1提升8.29),在完全微調(diào)場(chǎng)景中保持了可比的性能。
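下面的Python片段示意這種“指令模板→問(wèn)答子任務(wù)”的轉(zhuǎn)換方式;除文中給出的方面術(shù)語(yǔ)模板外,其余模板文字均為假設(shè):

```python
TEMPLATES = {
    "aspect_term":  "What are the aspect terms in the text: {text}?",                   # 文中給出的模板
    "opinion_term": "What are the opinion terms in the text: {text}?",                  # 假設(shè)模板
    "sentiment":    "What is the sentiment of aspect '{aspect}' in the text: {text}?",  # 假設(shè)模板
}

def to_qa_instances(text, aspects):
    """把一條ABSA句子展開(kāi)成若干問(wèn)答式子任務(wù)輸入。"""
    instances = [TEMPLATES["aspect_term"].format(text=text)]
    instances += [TEMPLATES["sentiment"].format(aspect=a, text=text) for a in aspects]
    return instances

for q in to_qa_instances("The pizza was great but the service was slow.", ["pizza", "service"]):
    print(q)
```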
6.5、Writing寫(xiě)作
Writing-Alpaca-7B輔助寫(xiě)作:基于LLaMa-7B模型+微調(diào)寫(xiě)作指令數(shù)據(jù)集(EDITEVAL基準(zhǔn)的擴(kuò)展),四元組{通用序言、指導(dǎo)任務(wù)完成的指令字段、提供要編輯文本的輸入字段、要求模型填寫(xiě)的響應(yīng)字段}
Zhang et al. (2023d) propose Writing-Alpaca-7B, which fine-tunes LLaMa-7B on a writing instruction dataset to provide writing assistance. The proposed instruction dataset is an extension of the EDITEVAL benchmark based on instructional data, with the Updating task removed and a task for grammaticality introduced. The instruction scheme strictly follows the one in the Stanford Alpaca project, comprising a universal preface, an instruction field to guide task completion, an input field that provides the text to be edited, and a response field that requires models to fill out. Writing-Alpaca-7B improves upon LLaMa’s performance on all writing tasks and outperforms other larger off-the-shelf LLMs.
Zhang等(2023d)提出了Writing-Alpaca-7B,通過(guò)對(duì)寫(xiě)作指令數(shù)據(jù)集進(jìn)行LLaMa-7B的微調(diào),以提供寫(xiě)作輔助。所提出的指令數(shù)據(jù)集是基于指導(dǎo)性數(shù)據(jù)的EDITEVAL基準(zhǔn)的擴(kuò)展,刪除了更新任務(wù)并引入了一個(gè)用于語(yǔ)法的任務(wù)。
指令方案嚴(yán)格遵循斯坦福Alpaca項(xiàng)目中的方案,包括通用序言、用于指導(dǎo)任務(wù)完成的指令字段、提供要編輯的文本的輸入字段和要求模型填寫(xiě)的響應(yīng)字段。Writing-Alpaca-7B在所有寫(xiě)作任務(wù)上均優(yōu)于LLaMa,并優(yōu)于其他更大的現(xiàn)成LLM。
CoEdIT輔助寫(xiě)作:基于FLAN-T5模型+微調(diào)文本編輯的指令數(shù)據(jù)集,兩元組{指令:源,目標(biāo)}
CoEdIT (Raheja et al., 2023) fine-tunes FLAN-T5 (770M parameters, 3B parameters, and 11B parameters) on an instruction dataset for text editing to provide writing assistance. The instruction dataset comprises approximately 82K <instruction: source, target> pairs. As shown in Figure 9, the model takes instructions from the user specifying the characteristics of the desired text, such as "Make the sentence simpler", and outputs the edited text. CoEdIT achieves state-of-the-art performance on several text editing tasks, including grammatical error correction, text simplification, iterative text editing, and three stylistic editing tasks: formality style transfer, neutralization, and paraphrasing. Furthermore, it can generalize well to new, adjacent tasks not seen during fine-tuning.
CoEdIT(Raheja等,2023)對(duì)FLAN-T5(770M參數(shù)、3B參數(shù)和11B參數(shù))在文本編輯的指令數(shù)據(jù)集上進(jìn)行微調(diào),以提供寫(xiě)作輔助。
指令數(shù)據(jù)集包括約82K個(gè)<指令:源,目標(biāo)>對(duì)。
如圖9所示,模型從用戶處獲取指令,指定所需文本的特性,例如"使句子更簡(jiǎn)單",然后輸出編輯后的文本。
CoEdIT在多個(gè)文本編輯任務(wù)上取得了最先進(jìn)的性能,包括語(yǔ)法錯(cuò)誤糾正、文本簡(jiǎn)化、迭代文本編輯以及三個(gè)風(fēng)格編輯任務(wù):正式風(fēng)格轉(zhuǎn)換、中性化和改寫(xiě)。此外,它還可以很好地推廣到新的、相鄰的任務(wù),這些任務(wù)在微調(diào)過(guò)程中未曾見(jiàn)過(guò)。
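下面示意如何構(gòu)造CoEdIT風(fēng)格的<指令: 源, 目標(biāo)>訓(xùn)練樣本(指令文字沿用文中示例"Make the sentence simpler",例句與拼接方式為假設(shè)):

```python
def make_coedit_pair(instruction: str, source: str, target: str) -> dict:
    """輸入為“編輯指令 + 原句”,目標(biāo)為編輯后的句子。"""
    return {"input": f"{instruction}: {source}", "target": target}

pair = make_coedit_pair(
    instruction="Make the sentence simpler",
    source="The committee reached a consensus subsequent to prolonged deliberations.",
    target="The committee agreed after long discussions.",
)
print(pair["input"])
print(pair["target"])
```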
CoPoet協(xié)作的詩(shī)歌寫(xiě)作工具:基于T5模型+微調(diào)詩(shī)歌寫(xiě)作數(shù)據(jù)集,兩元組{指令,詩(shī)行}
CoPoet (Chakrabarty et al., 2022) is a collaborative poetry writing tool that utilizes a large language model (e.g. T5-3B, T5-11B and T0-3B models) trained on a diverse collection of instructions for poetry writing. Each sample in the instruction dataset includes an <instruction, poem_line> pair. There are three major types of instructions: Continuation, Lexical Constraints, and Rhetorical Techniques. The CoPoet is guided by user instructions that specify desired attributes of the poetry, such as writing a sentence about "love" or ending a sentence with "fly." Not only is the system competitive with publicly available LLMs trained on instructions, such as InstructGPT, but it is also capable of satisfying unseen compositional instructions.
CoPoet(Chakrabarty等,2022)是一個(gè)協(xié)作的詩(shī)歌寫(xiě)作工具,利用大型語(yǔ)言模型(如T5-3B、T5-11B和T0-3B模型)在詩(shī)歌寫(xiě)作的各種指導(dǎo)下進(jìn)行訓(xùn)練。指導(dǎo)性數(shù)據(jù)集中的每個(gè)樣本都包括一個(gè)<指令,詩(shī)行>對(duì)。有三種主要類型的指導(dǎo):延續(xù)、詞匯約束和修辭技巧。
CoPoet根據(jù)用戶的指令進(jìn)行創(chuàng)作,由指令指定詩(shī)歌的所需屬性,例如寫(xiě)一個(gè)關(guān)于"愛(ài)"的句子或以"飛"結(jié)尾的句子。該系統(tǒng)不僅可以與InstructGPT等公開(kāi)的經(jīng)指令訓(xùn)練的LLM相競(jìng)爭(zhēng),還能夠滿足未見(jiàn)過(guò)的組合式指令。
6.6、Medical醫(yī)學(xué)
Radiology-GPT針對(duì)放射學(xué)領(lǐng)域:基于Alpaca+微調(diào)放射學(xué)領(lǐng)域知識(shí)數(shù)據(jù)集,兩元組{發(fā)現(xiàn),結(jié)論}
Radiology-GPT (Liu et al., 2023c) is a fine-tuned Alpaca-7B model for radiology, which utilizes an instruction tuning approach on an extensive dataset of radiology domain knowledge. Radiology reports usually include two corresponding sections: "Findings" and "Impression". The "Findings" section contains detailed observations from the radiology images, while the "Impression" section summarizes the interpretations drawn from those observations. Radiology-GPT provides a brief instruction to the "Findings" text: "Derive the impression from findings in the radiology report". The "Impression" text from the same report serves as the target output. In comparison to general language models such as StableLM, Dolly, and LLaMA, Radiology-GPT demonstrates significant versatility in radiological diagnosis, research, and communication.
Radiology-GPT(Liu等,2023c)是一個(gè)針對(duì)放射學(xué)領(lǐng)域的Alpaca-7B模型進(jìn)行微調(diào)的模型,它在廣泛的放射學(xué)領(lǐng)域知識(shí)數(shù)據(jù)集上采用了指令微調(diào)方法。放射學(xué)報(bào)告通常包括兩個(gè)相應(yīng)的部分:"發(fā)現(xiàn)"和"結(jié)論"。"發(fā)現(xiàn)"部分包含來(lái)自放射學(xué)圖像的詳細(xì)觀察,而"結(jié)論"部分總結(jié)了從這些觀察中得出的解釋。Radiology-GPT為"發(fā)現(xiàn)"文本提供了一個(gè)簡(jiǎn)要的指令:"從放射學(xué)報(bào)告的發(fā)現(xiàn)中得出結(jié)論"。同一份報(bào)告中的"結(jié)論"文本被用作目標(biāo)輸出。與StableLM、Dolly和LLaMA等通用語(yǔ)言模型相比,Radiology-GPT在放射學(xué)診斷、研究和交流方面表現(xiàn)出顯著的多樣性。
ChatDoctor:基于LLaMA模型+微調(diào)Alpaca指令數(shù)據(jù)集和HealthCareMagic100k患者-醫(yī)生對(duì)話數(shù)據(jù)集且檢索外部知識(shí)數(shù)據(jù)庫(kù)
ChatDoctor (Li et al., 2023g) is based on the fine-tuned LLaMA-7B model, utilizing the Alpaca instruction dataset and the HealthCareMagic100k patient-doctor dialogue dataset. Prompt templates are designed for retrieving external knowledge databases, such as the Disease Database and Wikipedia, during doctor-patient conversations to obtain more accurate outputs from the model. ChatDoctor significantly improves the model’s ability to comprehend patient needs and provide informed advice. By equipping the model with self-directed information retrieval from reliable online and offline sources, the accuracy of its responses is substantially improved.
ChatDoctor(Li等,2023g)基于經(jīng)過(guò)微調(diào)的LLaMA-7B模型,利用Alpaca指令數(shù)據(jù)集和HealthCareMagic100k患者-醫(yī)生對(duì)話數(shù)據(jù)集。并且在醫(yī)生-患者對(duì)話期間為檢索外部知識(shí)數(shù)據(jù)庫(kù),如疾病數(shù)據(jù)庫(kù)和維基百科檢索,設(shè)計(jì)了提示模板,以從模型中獲取更準(zhǔn)確的輸出。ChatDoctor顯著提高了模型理解患者需求并提供明智建議的能力。通過(guò)為模型配備從可靠的在線和離線來(lái)源自主獲取信息的能力,其回答的準(zhǔn)確性大大提高。
ChatGLM-Med:基于ChatGLM模型+微調(diào)中國(guó)醫(yī)學(xué)指令數(shù)據(jù)集(基于GPT3.5的API和醫(yī)學(xué)知識(shí)圖譜創(chuàng)建問(wèn)題-答案對(duì))
ChatGLM-Med (Haochun Wang, 2023) is fine-tuned on a Chinese medical instruction dataset based on the ChatGLM-6B model. The instruction dataset comprises medically relevant question and answer pairs, created using the GPT-3.5 API and a medical knowledge graph. This model improves the question-answering performance of ChatGLM in the medical field.
ChatGLM-Med(Haochun Wang,2023)在基于ChatGLM-6B模型的中國(guó)醫(yī)學(xué)指令數(shù)據(jù)集上進(jìn)行了微調(diào)。指令數(shù)據(jù)集包括使用GPT3.5 API和醫(yī)學(xué)知識(shí)圖譜創(chuàng)建的與醫(yī)學(xué)相關(guān)的問(wèn)題和答案對(duì)。該模型提高了ChatGLM在醫(yī)學(xué)領(lǐng)域的問(wèn)答性能。
6.7、Arithmetic算術(shù):Goat=基于LLaMA模型+微調(diào)算術(shù)問(wèn)題數(shù)據(jù)集(ChatGPT生成數(shù)百個(gè)指令+自然語(yǔ)言問(wèn)答的形式表達(dá))
Goat (Liu and Low, 2023) is a fine-tuned LLaMA-7B model based on instructions, which aims to solve arithmetic problems. It expresses arithmetic problems in the form of natural language question answering, such as "What is 8914/64?", by generating hundreds of instruction templates using ChatGPT. The model applies various techniques to enhance its adaptability to diverse question formats, such as randomly removing spaces between numbers and symbols in the arithmetic expression and replacing "*" with "x" or "times". The Goat model achieves state-of-the-art performance on the BIG-bench arithmetic subtask. In particular, zero-shot Goat7B matches or exceeds the accuracy achieved by the few-shot PaLM-540B.
Goat(Liu和Low,2023)是一個(gè)基于指令微調(diào)的LLaMA-7B模型,旨在解決算術(shù)問(wèn)題。它通過(guò)使用ChatGPT生成數(shù)百個(gè)指令模板,以自然語(yǔ)言問(wèn)答的形式表達(dá)算術(shù)問(wèn)題,
例如"What is 8914/64?"。該模型應(yīng)用各種技術(shù)增強(qiáng)其適應(yīng)各種問(wèn)題格式的能力,例如隨機(jī)刪除算術(shù)表達(dá)式中數(shù)字和符號(hào)之間的空格,將"*"替換為"x"或"times"等。Goat模型在BIG-bench算術(shù)子任務(wù)上達(dá)到了最先進(jìn)的性能。特別是,零樣本的Goat7B的準(zhǔn)確性達(dá)到或超過(guò)了少樣本的PaLM-540B的準(zhǔn)確性。
6.8、Code代碼:WizardCoder=基于StarCoder模型+Evol-Instruct方法+微調(diào)Code Alpaca數(shù)據(jù)集,3元組{指令、輸入、期望輸出}
WizardCoder (Luo et al., 2023) utilizes StarCoder 15B as the foundation with complex instruction fine-tuning, by adapting the Evol-Instruct method (Xu et al., 2023a) to the domain of code. The training dataset is produced through iterative application of the Evol-Instruct technique on the Code Alpaca dataset, which includes the following attributes for each sample: instruction, input, and expected output. For instance, when the instruction is "Amend the following SQL query to select distinct elements", the input is the SQL query, and the expected output is the generated answer. WizardCoder outperforms all other open-source Code LLMs and even surpasses the largest LLMs, Anthropic’s Claude and Google’s Bard, on HumanEval and HumanEval+.
WizardCoder(Luo等,2023)以StarCoder 15B為基礎(chǔ),采用復(fù)雜指令微調(diào),將Evol-Instruct方法(Xu等,2023a)適配到代碼領(lǐng)域。訓(xùn)練數(shù)據(jù)集通過(guò)在Code Alpaca數(shù)據(jù)集上迭代應(yīng)用Evol-Instruct技術(shù)產(chǎn)生,該數(shù)據(jù)集的每個(gè)樣本包括以下屬性:指令、輸入和期望輸出。
例如,當(dāng)指令為"Amend the following SQL query to select distinct elements"時(shí),輸入為SQL查詢,期望輸出為生成的答案。WizardCoder在HumanEval和HumanEval+上超越了所有其他開(kāi)源代碼LLM,甚至超過(guò)了Anthropic的Claude和Google的Bard等最大型的LLM。
LLMs之Code:SQLCoder的簡(jiǎn)介、安裝、使用方法之詳細(xì)攻略
LLMs之Code:SQLCoder的簡(jiǎn)介、安裝、使用方法之詳細(xì)攻略_一個(gè)處女座的程序猿的博客-CSDN博客
LLMs之Code:Code Llama的簡(jiǎn)介、安裝、使用方法之詳細(xì)攻略
LLMs之Code:Code Llama的簡(jiǎn)介、安裝、使用方法之詳細(xì)攻略_一個(gè)處女座的程序猿的博客-CSDN博客
補(bǔ)充—6.9、法律行業(yè)
LLMs之Law:大語(yǔ)言模型領(lǐng)域行業(yè)場(chǎng)景應(yīng)用之大模型法律行業(yè)的簡(jiǎn)介、主流LLMs(PowerLawGLM/ChatLaw)、經(jīng)典應(yīng)用之詳細(xì)攻略
LLMs之Law:大語(yǔ)言模型領(lǐng)域行業(yè)場(chǎng)景應(yīng)用之大模型法律行業(yè)的簡(jiǎn)介、主流LLMs(PowerLawGLM/ChatLaw)、經(jīng)典應(yīng)用之詳細(xì)攻略_一個(gè)處女座的程序猿的博客-CSDN博客
7、Efficient Tuning Techniques高效微調(diào)技術(shù)
7.0、高效微調(diào)三種方法論:基于添加式(引入額外可訓(xùn)練參數(shù)或模塊,如HINT)、基于規(guī)范化(凍結(jié)某些固有模型參數(shù)同時(shí)指定要調(diào)整的參數(shù),如Delta-tuning)、基于重參數(shù)化(假設(shè)模型自適應(yīng)的低秩性→權(quán)重可重新參數(shù)化為低維子空間,如LoRA/QLoRA/LOMO)
Efficient fine-tuning techniques aim at adapting LLMs to downstream tasks by optimizing a small fraction of parameters in multiple ways, i.e., addition-based, specification-based, and reparameterization-based. Addition-based methods introduce extra trainable parameters or modules not present in the original model. Representative methods include adapter tuning (Houlsby et al., 2019) and prompt-based tuning (Schick and Schütze, 2021). Specification-based methods specify certain inherent model parameters to be tuned while freezing others. For example, BitFit (Zaken et al., 2022) tunes the bias terms of the pre-trained model. Reparameterization-based methods transform model weights into more parameter-efficient forms for tuning. The key hypothesis is that model adaptation is low-rank, so weights can be reparameterized into low-rank factors or a low-dimensional subspace (e.g., LoRA (Hu et al., 2021)). Intrinsic prompt tuning finds a low-dimensional subspace shared by tuning prompts across diverse tasks.
高效微調(diào)技術(shù)旨在通過(guò)多種方式只優(yōu)化一小部分參數(shù),從而將LLM適配到下游任務(wù),主要包括基于添加式、基于規(guī)范化和基于重參數(shù)化三類方法?;谔砑邮降姆椒ㄒ肓嗽谠寄P椭胁淮嬖诘念~外可訓(xùn)練參數(shù)或模塊,代表性方法包括Adapter微調(diào)(Houlsby等,2019)和基于Prompt的微調(diào)(Schick和Schütze,2021)?;谝?guī)范化的方法在凍結(jié)其余參數(shù)的同時(shí),指定某些固有模型參數(shù)進(jìn)行調(diào)整,例如BitFit(Zaken等,2022)只微調(diào)預(yù)訓(xùn)練模型的偏置項(xiàng)?;谥貐?shù)化的方法將模型權(quán)重轉(zhuǎn)換為更加參數(shù)高效的形式進(jìn)行微調(diào),其關(guān)鍵假設(shè)是模型的自適應(yīng)是低秩的,因此權(quán)重可以重新參數(shù)化為低秩因子或低維子空間(例如LoRA(Hu等,2021))。內(nèi)在提示微調(diào)(intrinsic prompt tuning)則尋找在不同任務(wù)的提示微調(diào)之間共享的低維子空間。
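以基于規(guī)范化方法中的BitFit為例,其“只解凍偏置項(xiàng)、凍結(jié)其余參數(shù)”的思路可以用下面的PyTorch片段示意(僅為概念演示,非BitFit官方實(shí)現(xiàn)):

```python
import torch.nn as nn

def apply_bitfit(model: nn.Module):
    """凍結(jié)除bias外的全部參數(shù),返回可訓(xùn)練參數(shù)列表(BitFit思路的概念演示)。"""
    trainable = []
    for name, param in model.named_parameters():
        if name.endswith("bias"):
            param.requires_grad_(True)
            trainable.append(param)
        else:
            param.requires_grad_(False)
    return trainable

# 用法示意:只把bias參數(shù)交給優(yōu)化器
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
bias_params = apply_bitfit(model)
print(sum(p.numel() for p in bias_params), "trainable bias parameters")
```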
7.1、基于重參數(shù)化—LoRA=基于DeepSpeed框架+訓(xùn)練低維度的A和B→可訓(xùn)練參數(shù)比完全微調(diào)少得多(LoRA訓(xùn)練GPT-3可降低到千分之一)
Low-Rank Adaptation (LoRA) (Hu et al., 2021) enables efficient adaptation of LLMs using low-rank updates. LoRA uses DeepSpeed (Rasley et al., 2020) as the training backbone. The key insight of LoRA is that the actual change in LLMs’ weights required for new task adaptation lies in a low-dimensional subspace. Specifically, for a pretrained weight matrix W0, the authors model the adapted weight matrix as W0 + ΔW, where ΔW is a low-rank update. ΔW is parameterized as ΔW = BA, where A and B are much smaller trainable matrices. The rank r of ΔW is chosen to be much smaller than the dimensions of W0. The intuition is that instead of directly training all of W0, the authors train low-dimensional A and B, which indirectly trains W0 in a low-rank subspace of directions that matter for the downstream task. This results in far fewer trainable parameters compared to full fine-tuning. For GPT-3, LoRA reduces the number of trainable parameters by 10,000x and memory usage by 3x compared to full fine-tuning.
低秩適應(yīng)(LoRA)(Hu等,2021)使用低秩更新實(shí)現(xiàn)了LLM的高效適應(yīng)。LoRA使用DeepSpeed(Rasley等,2020)作為訓(xùn)練骨干。LoRA的關(guān)鍵洞察是,用于新任務(wù)適應(yīng)的LLM權(quán)重的實(shí)際變化位于低維子空間中。
具體而言,對(duì)于預(yù)訓(xùn)練權(quán)重矩陣W0,作者將適應(yīng)后的權(quán)重矩陣建模為W0 + ΔW,其中ΔW是低秩更新,參數(shù)化為ΔW = BA,A和B是小得多的可訓(xùn)練矩陣。ΔW的秩r被選擇為遠(yuǎn)小于W0的維度。
直覺(jué)是,作者不是直接訓(xùn)練整個(gè)W0,而是訓(xùn)練低維的A和B,從而在對(duì)下游任務(wù)重要的低秩方向子空間中間接地訓(xùn)練W0。與完全微調(diào)相比,這使得可訓(xùn)練參數(shù)少得多:對(duì)于GPT-3,LoRA將可訓(xùn)練參數(shù)的數(shù)量減少了10000倍,內(nèi)存使用量降低了3倍。
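一個(gè)極簡(jiǎn)的LoRA線性層示意如下(PyTorch,其中r、alpha等超參數(shù)均為假設(shè)值,非論文官方實(shí)現(xiàn)):凍結(jié)預(yù)訓(xùn)練權(quán)重W0,只訓(xùn)練低秩因子A和B,并把BA作為增量疊加到輸出上。

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # W0:預(yù)訓(xùn)練權(quán)重,凍結(jié)不更新
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # 低秩因子:ΔW = B @ A,秩 r 遠(yuǎn)小于輸入/輸出維度
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # B初始化為0,起點(diǎn)等價(jià)于原模型
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x @ W0^T + x @ (B A)^T * scaling
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# 用法示意:只把LoRA參數(shù)交給優(yōu)化器
layer = LoRALinear(4096, 4096, r=8)
optimizer = torch.optim.AdamW([p for p in layer.parameters() if p.requires_grad], lr=1e-4)
```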
7.2、基于添加式—HINT=添加易于微調(diào)的模塊(基于超網(wǎng)絡(luò)數(shù)生成器生成適配器和前綴參數(shù))+插入到骨干模型作為高效的微調(diào)模塊
HINT屬于Addition-based方法。它通過(guò)添加易于微調(diào)的模塊(如適配器和前綴)來(lái)實(shí)現(xiàn)微調(diào),這些模塊沒(méi)有包含在原始模型結(jié)構(gòu)中,屬于添加額外的參數(shù)或模塊來(lái)實(shí)現(xiàn)微調(diào)。
HINT (Ivison et al., 2022) combines the generalization benefits of instruction tuning with efficient on-demand fine-tuning, avoiding repeatedly processing lengthy instructions. The essence of HINT lies in hypernetworks, which generate parameter-efficient modules for LLM adaptation based on natural language instructions and few-shot examples. The adopted hypernetwork converts instructions and few-shot examples into an encoded instruction and generates adapter and prefix parameters using a pretrained text encoder and a cross-attention based parameter generator. Then, the generated adapters and prefixes are inserted into the backbone model as efficient tuning modules. At inference, the hypernetwork performs inference only once per task to generate adapted modules. The benefit is that HINT can incorporate long instructions and additional few-shot examples without increasing compute, unlike regular fine-tuning or input concatenation methods.
HINT(Ivison等,2022)將指令微調(diào)的泛化優(yōu)勢(shì)與高效的按需微調(diào)相結(jié)合,避免重復(fù)處理冗長(zhǎng)的指令。HINT的核心在于超網(wǎng)絡(luò),它基于自然語(yǔ)言指令和少樣本示例為L(zhǎng)LM適應(yīng)生成參數(shù)高效的模塊。采用的超網(wǎng)絡(luò)將指令和少樣本示例轉(zhuǎn)化為編碼指令,并使用預(yù)訓(xùn)練文本編碼器和基于交叉注意力的參數(shù)生成器生成適配器和前綴參數(shù)。然后,生成的適配器和前綴被插入到骨干模型中作為高效的微調(diào)模塊。在推理時(shí),超網(wǎng)絡(luò)僅執(zhí)行一次推理以生成適應(yīng)的模塊。好處是,HINT可以在不增加計(jì)算的情況下融入長(zhǎng)指令和額外的少樣本,不像常規(guī)微調(diào)或輸入連接方法。
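下面是對(duì)“超網(wǎng)絡(luò)按指令生成適配器參數(shù)”這一思路的極簡(jiǎn)示意(PyTorch),其結(jié)構(gòu)和維度均為假設(shè),遠(yuǎn)比HINT原文基于交叉注意力的參數(shù)生成器簡(jiǎn)化:

```python
import torch
import torch.nn as nn

class TinyHyperNetwork(nn.Module):
    """由指令編碼向量生成一組低維適配器權(quán)重(每個(gè)任務(wù)只需生成一次)。"""
    def __init__(self, instr_dim=512, hidden_dim=768, adapter_dim=64):
        super().__init__()
        self.down_gen = nn.Linear(instr_dim, hidden_dim * adapter_dim)  # 生成下投影權(quán)重
        self.up_gen = nn.Linear(instr_dim, adapter_dim * hidden_dim)    # 生成上投影權(quán)重
        self.hidden_dim, self.adapter_dim = hidden_dim, adapter_dim

    def forward(self, instr_embedding: torch.Tensor):
        # instr_embedding: (instr_dim,) 指令與少樣本示例經(jīng)文本編碼器后的向量
        down = self.down_gen(instr_embedding).view(self.adapter_dim, self.hidden_dim)
        up = self.up_gen(instr_embedding).view(self.hidden_dim, self.adapter_dim)
        return down, up

def adapter_forward(h, down, up):
    # 把生成的適配器作為殘差模塊插入骨干模型的隱藏狀態(tài) h: (batch, seq, hidden_dim)
    return h + torch.relu(h @ down.T) @ up.T

hyper = TinyHyperNetwork()
down, up = hyper(torch.randn(512))
h = torch.randn(2, 16, 768)
print(adapter_forward(h, down, up).shape)   # torch.Size([2, 16, 768])
```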
7.3、基于重參數(shù)化—QLoRA=LoRA的量化版+NF4+雙量化DQ+分頁(yè)優(yōu)化器PO
QLORA (Dettmers et al., 2023) includes optimal quantization and memory optimization, aiming at providing efficient and effective LLM fine-tuning. QLORA includes 4-bit NormalFloat (NF4) Quantization, which is a quantization scheme optimized for the typically normal distribution of LLM weights. By quantizing based on the quantiles of a normal distribution, NF4 provides better performance than standard 4-bit integer or float quantization. To further reduce memory, the quantization constants are themselves quantized to 8 bits. This second level of quantization saves an additional 0.37 bits per parameter on average. QLORA leverages NVIDIA’s unified memory feature to page optimizer states to CPU RAM when GPU memory is exceeded, avoiding out-of-memory errors during training. QLORA enables training a 65B parameter LLM on a single 48GB GPU with no degradation compared to full 16-bit finetuning. QLORA works by freezing the 4-bit quantized base LLM, then backpropagating through it into a small set of 16-bit low-rank adapter weights which are learned.
QLORA(Dettmers等,2023)包括最佳量化和內(nèi)存優(yōu)化,旨在提供高效有效的LLM微調(diào)。QLORA包括4位NormalFloat(NF4)量化,這是一種針對(duì)LLM權(quán)重的典型正態(tài)分布優(yōu)化的量化方案。通過(guò)基于正態(tài)分布的分位數(shù)進(jìn)行量化,NF4的性能優(yōu)于標(biāo)準(zhǔn)的4位整數(shù)或浮點(diǎn)數(shù)量化。為了進(jìn)一步減少內(nèi)存,量化常數(shù)本身被量化為8位。這第二層量化平均可節(jié)省每個(gè)參數(shù)0.37位的內(nèi)存。QLORA利用NVIDIA的統(tǒng)一內(nèi)存功能,當(dāng)GPU內(nèi)存超出限制時(shí),將優(yōu)化器狀態(tài)分頁(yè)到CPU RAM中,避免訓(xùn)練期間的內(nèi)存不足。QLORA可以在單個(gè)48GB GPU上訓(xùn)練65B參數(shù)的LLM,與完全16位微調(diào)相比沒(méi)有降級(jí)。QLORA的工作方式是凍結(jié)4位量化的基礎(chǔ)LLM,然后通過(guò)反向傳播將其傳播到一小組16位低秩適配器權(quán)重中進(jìn)行學(xué)習(xí)。
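下面給出按QLoRA思路用Hugging Face transformers + peft + bitsandbytes加載4-bit(NF4、雙重量化)基座模型并疊加LoRA適配器的示意代碼;模型名與各超參數(shù)均為假設(shè),需要安裝相應(yīng)庫(kù)并有GPU環(huán)境,具體API以庫(kù)文檔為準(zhǔn):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                   # 以4-bit加載凍結(jié)的基座權(quán)重
    bnb_4bit_quant_type="nf4",           # NF4:按正態(tài)分布分位數(shù)量化
    bnb_4bit_use_double_quant=True,      # 雙重量化:對(duì)量化常數(shù)再做8-bit量化
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b",   # 假設(shè)的模型名
                                             quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                         lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)   # 只有16-bit的LoRA權(quán)重參與學(xué)習(xí)
model.print_trainable_parameters()
```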
7.4、基于重參數(shù)化—LOMO=降低梯度內(nèi)存需求(融合梯度計(jì)算與參數(shù)更新+實(shí)時(shí)只存儲(chǔ)單個(gè)參數(shù)的梯度)+穩(wěn)定訓(xùn)練(梯度值裁剪+分離梯度范數(shù)計(jì)算+態(tài)損失縮放)+節(jié)省內(nèi)存(激活檢查點(diǎn)+ZeRO優(yōu)化)
LOMO屬于Reparameterization-based方法。LOMO通過(guò)將梯度計(jì)算和參數(shù)更新融合到一個(gè)步驟中,來(lái)避免存儲(chǔ)完整的梯度張量,從而實(shí)現(xiàn)只存儲(chǔ)單個(gè)參數(shù)梯度的能力,從而更高效地進(jìn)行微調(diào)。這屬于使用參數(shù)重參數(shù)化的方法來(lái)實(shí)現(xiàn)更高效的微調(diào)。
LOw-Memory Optimization (LOMO) (Lv et al., 2023) enables full parameter fine-tuning of LLMs using limited computational resources through a fusion of gradient computation and update. The essence is to fuse gradient computation and parameter update into one step during backpropagation, thereby avoiding storage of full gradient tensors. Firstly, theoretical analysis is provided in LOMO on why SGD can work well for fine-tuning large pre-trained models despite its challenges on smaller models. In addition, LOMO updates each parameter tensor immediately after computing its gradient in backpropagation. Storing the gradient of one parameter at a time reduces gradient memory to O(1). LOMO employs gradient value clipping, separate gradient norm computation pass and dynamic loss scaling to stabilize training. The integration of activation checkpointing and ZeRO optimization methods saves memory.
低內(nèi)存優(yōu)化(LOMO)(Lv等,2023)通過(guò)梯度計(jì)算和更新的融合,在有限的計(jì)算資源下實(shí)現(xiàn)LLM的全參數(shù)微調(diào)。其核心是在反向傳播期間將梯度計(jì)算和參數(shù)更新融合為一步,從而避免存儲(chǔ)完整的梯度張量。首先,LOMO在理論上分析了為什么SGD可以在微調(diào)大型預(yù)訓(xùn)練模型時(shí)表現(xiàn)良好,盡管在較小的模型上可能存在挑戰(zhàn)。此外,LOMO在反向傳播中在計(jì)算梯度后立即更新每個(gè)參數(shù)張量。一次只存儲(chǔ)一個(gè)參數(shù)的梯度將梯度內(nèi)存降低到O(1)。LOMO采用梯度值裁剪、單獨(dú)的梯度范數(shù)計(jì)算傳遞和動(dòng)態(tài)損失縮放來(lái)穩(wěn)定訓(xùn)練。激活檢查點(diǎn)和ZeRO優(yōu)化方法的集成可節(jié)省內(nèi)存。
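LOMO“梯度一經(jīng)算出即原地更新并釋放”的核心想法,可以用參數(shù)級(jí)反向鉤子粗略示意如下(省略了梯度裁剪、動(dòng)態(tài)損失縮放等穩(wěn)定化技巧,僅為概念演示,非原文實(shí)現(xiàn)):

```python
import torch
import torch.nn as nn

def attach_fused_sgd(model: nn.Module, lr: float = 1e-3):
    """為每個(gè)可訓(xùn)練參數(shù)注冊(cè)鉤子:其梯度一旦算出就立即做SGD更新,不保留完整梯度張量。"""
    def make_hook(param):
        def hook(grad):
            with torch.no_grad():
                param.add_(grad, alpha=-lr)   # 立即原地更新該參數(shù)
            return torch.zeros_like(grad)     # 返回零梯度,避免累積真實(shí)梯度
        return hook
    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(make_hook(p))

model = nn.Linear(8, 2)
attach_fused_sgd(model, lr=0.1)
x, y = torch.randn(4, 8), torch.randn(4, 2)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()   # 反向傳播過(guò)程中各參數(shù)已被逐個(gè)更新,無(wú)需再調(diào)用optimizer.step()
```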
7.5、基于規(guī)范化—Delta-tuning=優(yōu)化和最優(yōu)控制視角+將微調(diào)限制在低維流形上來(lái)執(zhí)行子空間優(yōu)化+微調(diào)參數(shù)充當(dāng)最優(yōu)控制器+在下游任務(wù)中引導(dǎo)模型行為
Delta-tuning屬于Specification-based方法。Delta-tuning通過(guò)限制微調(diào)在一個(gè)低維子空間上進(jìn)行,來(lái)指定預(yù)訓(xùn)練模型中的某些固有參數(shù)進(jìn)行微調(diào),而凍結(jié)其他參數(shù)。這屬于指定模型參數(shù)子集進(jìn)行微調(diào)的Specification-based方法。
Delta-tuning (Ding et al., 2023b) provides optimization and optimal control perspectives for theoretical analyzation. Intuitively, delta-tuning performs subspace optimization by restricting tuning to a low-dimensional manifold. The tuned parameters act as optimal controllers guiding model behavior on downstream tasks.
Delta-tuning(Ding等,2023b)提供了優(yōu)化和最優(yōu)控制的理論分析視角。直觀地說(shuō),Delta-tuning通過(guò)將調(diào)整限制在低維流形上來(lái)執(zhí)行子空間優(yōu)化。調(diào)整的參數(shù)充當(dāng)引導(dǎo)模型在下游任務(wù)中行為的最優(yōu)控制器。
8、Evaluation, Analysis and Criticism評(píng)估、分析和批評(píng)
8.1、HELM Evaluation:整體評(píng)估+提高LM透明度+關(guān)注三因素(廣泛性+多指標(biāo)性+標(biāo)準(zhǔn)化)
HELM(Liang et al., 2022) is a holistic evaluation of Language Models (LMs) to improve the transparency of language models, providing a more comprehensive understanding of the capabilities, risks, and limitations of language models. Specifically, differing from other evaluation methods, HELM holds that a holistic evaluation of language models should focus on the following three factors:
HELM(Liang等,2022)是對(duì)語(yǔ)言模型(LMs)進(jìn)行整體評(píng)估,旨在提高語(yǔ)言模型的透明度,從而更全面地了解語(yǔ)言模型的能力、風(fēng)險(xiǎn)和限制。與其他評(píng)估方法不同,HELM認(rèn)為對(duì)語(yǔ)言模型進(jìn)行整體評(píng)估應(yīng)關(guān)注以下三個(gè)因素:
(1)、Broad coverage. During the development, language models can be adapted to various NLP tasks (e.g., sequence labeling and question answering); thus, the evaluation of language models needs to be carried out in a wide range of scenarios. To involve all potential scenarios, HELM proposed a top-down taxonomy, which begins by compiling all existing tasks in a major NLP conference (ACL 2022) into a task space and dividing each task into the form of scenarios (e.g., languages) and metrics (e.g., accuracy). Then, when facing a specific task, the taxonomy selects one or more scenarios and metrics in the task space to cover it. By analyzing the structure of each task, HELM clarifies the evaluation content (task scenarios and metrics) and improves the scenario coverage of language models from 17.9% to 96.0%. (A minimal illustrative sketch of such a scenario/metric task space is given after this list.)
(2)、Multi-metric measurement. In order to enable humans to weigh language models from different perspectives, HELM proposes multi-metric measurement. HELM has covered 16 different scenarios and 7 metrics. To ensure the results of intensive multi-metric measurement, HELM measured 98 of 112 possible core scenarios (87.5%).
(3)、Standardization. The increase in the scale and training complexity of language models has seriously hindered humans' understanding of the structure of each language model. To establish a unified understanding of existing language models, HELM benchmarks 30 well-known language models, covering such institutions as Google (UL2 (Tay et al., 2022)), OpenAI (GPT-3 (Brown et al., 2020b)), and EleutherAI (GPT-NeoX (Black et al., 2022)). Interestingly, HELM pointed out that LMs such as T5 (Raffel et al., 2019) and Anthropic-LMv4-s3 (Bai et al., 2022a) had not been directly compared in the initial work, while the results for LLMs such as GPT-3 and YaLM still differed from their corresponding reports after multiple evaluations.
(1)廣泛涵蓋。在開(kāi)發(fā)過(guò)程中,語(yǔ)言模型可以適應(yīng)各種自然語(yǔ)言處理任務(wù)(例如序列標(biāo)注和問(wèn)題回答),因此需要在廣泛的情景下進(jìn)行語(yǔ)言模型的評(píng)估。為了涵蓋所有潛在情景,HELM提出了一種自上而下的分類法,首先將主要的自然語(yǔ)言處理會(huì)議(ACL2022)中的所有現(xiàn)有任務(wù)編譯成任務(wù)空間,并將每個(gè)任務(wù)劃分為情景(例如語(yǔ)言)和指標(biāo)(例如準(zhǔn)確性)的形式。然后在面對(duì)特定任務(wù)時(shí),分類法會(huì)選擇任務(wù)空間中的一個(gè)或多個(gè)情景和指標(biāo)來(lái)涵蓋它。通過(guò)分析每個(gè)任務(wù)的結(jié)構(gòu),HELM明確了評(píng)估內(nèi)容(任務(wù)情景和指標(biāo)),并將語(yǔ)言模型的情景涵蓋范圍從17.9%提高到96.0%。
(2)多指標(biāo)測(cè)量。為了使人類能夠從不同角度權(quán)衡語(yǔ)言模型,HELM提出了多指標(biāo)測(cè)量。HELM涵蓋了16種不同的情景和7個(gè)指標(biāo)。為了確保密集的多指標(biāo)測(cè)量結(jié)果,HELM對(duì)112個(gè)可能的核心情景中的98個(gè)進(jìn)行了測(cè)量(87.5%)。
(3)標(biāo)準(zhǔn)化。語(yǔ)言模型規(guī)模和訓(xùn)練復(fù)雜性的增加嚴(yán)重阻礙了人類對(duì)每個(gè)語(yǔ)言模型結(jié)構(gòu)的理解。為了建立對(duì)現(xiàn)有語(yǔ)言模型的統(tǒng)一理解,HELM對(duì)30個(gè)知名語(yǔ)言模型進(jìn)行了基準(zhǔn)測(cè)試,涵蓋了Google(UL2(Tay等,2022))、OpenAI(GPT-3(Brown等,2020b))和EleutherAI(GPT-NeoX(Black等,2022))等機(jī)構(gòu)。有趣的是,HELM指出,T5(Raffel等,2019)和Anthropic-LMv4-s3(Bai等,2022a)等語(yǔ)言模型在最初的工作中尚未被直接比較,而GPT-3和YaLM等LLMs經(jīng)過(guò)多次評(píng)估后,其結(jié)果仍與對(duì)應(yīng)報(bào)告中的結(jié)果存在差異。
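As referenced in item (1) above, here is a minimal illustrative sketch of a scenario/metric task space; the task names, scenarios, and metrics are hypothetical placeholders rather than HELM's actual inventory.

```python
# Hypothetical miniature task space in the spirit of HELM's top-down taxonomy:
# every task is covered by one or more (scenario, metric) pairs.
TASK_SPACE = {
    "question_answering": {"scenarios": ["english", "multilingual"],
                           "metrics":   ["accuracy", "calibration"]},
    "summarization":      {"scenarios": ["news", "dialogue"],
                           "metrics":   ["rouge", "toxicity"]},
    "sequence_labeling":  {"scenarios": ["english"],
                           "metrics":   ["accuracy"]},
}

def select_coverage(task: str):
    """Select the (scenario, metric) pairs that cover a given task."""
    entry = TASK_SPACE[task]
    return [(s, m) for s in entry["scenarios"] for m in entry["metrics"]]

def scenario_coverage(evaluated_pairs: set) -> float:
    """Fraction of all (task, scenario) pairs in the space that were evaluated."""
    all_pairs = {(task, s) for task, e in TASK_SPACE.items() for s in e["scenarios"]}
    return len(all_pairs & evaluated_pairs) / len(all_pairs)

# Example: evaluating QA in English and summarization on news covers 2 of 5 pairs.
print(scenario_coverage({("question_answering", "english"),
                         ("summarization", "news")}))
```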
8.2、Low-resource Instruction Tuning低資源指令微調(diào):STL需要數(shù)據(jù)量的25%、MTL需要數(shù)據(jù)量的6%
Gupta et al. (2023) attempts to estimate the minimal downstream training data required by IT models to match the SOTA supervised models over various tasks. Gupta et al. (2023) conducted experiments on 119 tasks from Super Natural Instructions (SuperNI) in both single-task learning (STL) and multi-task learning (MTL) settings. The results indicate that in the STL setting, IT models with only 25% of downstream training data outperform the SOTA models on those tasks, while in the MTL setting, just 6% of downstream training data can lead IT models to achieve the SOTA performance. These findings suggest that instruction tuning can effectively assist a model in quickly learning a task even with limited data.
However, due to resource limitations, Gupta et al. (2023) did not conduct experiments on LLMs, like T5-11B. So, to gain a more comprehensive understanding of the IT models, further investigation using larger language models and datasets is necessary.
Gupta等人(2023)試圖估計(jì)IT模型在各種任務(wù)上達(dá)到SOTA監(jiān)督模型水平所需的最少下游訓(xùn)練數(shù)據(jù)量。Gupta等人(2023)在超自然指令(SuperNI)的119個(gè)任務(wù)上進(jìn)行了實(shí)驗(yàn),包括單任務(wù)學(xué)習(xí)(STL)和多任務(wù)學(xué)習(xí)(MTL)設(shè)置。結(jié)果表明,在STL設(shè)置下,僅使用25%下游訓(xùn)練數(shù)據(jù)的IT模型即可在這些任務(wù)上勝過(guò)SOTA模型;而在MTL設(shè)置下,僅使用6%的下游訓(xùn)練數(shù)據(jù)即可使IT模型達(dá)到SOTA性能。這些發(fā)現(xiàn)表明,即使數(shù)據(jù)有限,指令微調(diào)也能有效地幫助模型迅速學(xué)習(xí)任務(wù)。
然而,由于資源限制,Gupta等人(2023)并沒(méi)有對(duì)像T5-11B這樣的LLMs進(jìn)行實(shí)驗(yàn)。因此,為了更全面地了解IT模型,需要進(jìn)一步使用更大的語(yǔ)言模型和數(shù)據(jù)集進(jìn)行調(diào)查。
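A minimal sketch of the data-budget setup described above: subsample a fixed fraction of each task's training set before instruction tuning. The fractions mirror the reported 25% (STL) and 6% (MTL) settings; the helper and field names are illustrative assumptions.

```python
import random

def subsample_training_data(examples, fraction, seed=42):
    """Keep only `fraction` of a task's downstream training examples,
    e.g. 0.25 in the single-task setting or 0.06 in the multi-task setting."""
    rng = random.Random(seed)
    k = max(1, int(len(examples) * fraction))
    return rng.sample(examples, k)

# Illustrative usage with placeholder task lists:
# stl_subset = subsample_training_data(task_examples, fraction=0.25)
# mtl_subset = [ex for task_examples in all_task_examples
#               for ex in subsample_training_data(task_examples, fraction=0.06)]
```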
8.3、Smaller Instruction Dataset更小的指令數(shù)據(jù)集:LIMA(精選1,000個(gè)訓(xùn)練示例)表明可通過(guò)少數(shù)精心策劃的指令進(jìn)行微調(diào)
IT requires a substantial amount of specialized instruction data for training. Zhou et al. (2023) hypothesized that the pre-trained LLM only has to learn the style or format to interact with users and proposed LIMA that achieves strong performance by fine-tuning an LLM on only 1,000 carefully selected training examples.
Specifically, LIMA first manually curates 1,000 demonstrations with high-quality prompts and responses. Then the 1,000 demonstrations are used to fine-tune the pre-trained 65B-parameter LLaMa (Touvron et al., 2023b). By comparison, across more than 300 challenging tasks, LIMA outperforms GPT-davinci003 (Brown et al., 2020b), which was fine-tuned on 5,200 examples by human feedback tuning. Moreover, with only half the amount of demonstrations, LIMA achieves equivalent results to GPT-4 (OpenAI, 2023), Claude (Bai et al., 2022b), and Bard. Above all, LIMA demonstrated that LLMs' powerful knowledge and capabilities can be exposed to users with only a few carefully curated instructions for fine-tuning.
IT需要大量的專門(mén)指令數(shù)據(jù)進(jìn)行訓(xùn)練。Zhou等人(2023)假設(shè)預(yù)訓(xùn)練LLM只需學(xué)習(xí)與用戶互動(dòng)的樣式或格式,并提出了LIMA,通過(guò)僅在1,000個(gè)精選的訓(xùn)練示例上微調(diào)LLM,實(shí)現(xiàn)了強(qiáng)大的性能。
具體而言,LIMA首先手動(dòng)策劃了1,000個(gè)具有高質(zhì)量提示和回復(fù)的演示。然后,這1,000個(gè)演示用于微調(diào)預(yù)訓(xùn)練的65B參數(shù)LLaMa(Touvron等,2023b)。相比之下,在超過(guò)300個(gè)具有挑戰(zhàn)性的任務(wù)中,LIMA在表現(xiàn)上勝過(guò)了通過(guò)人類反饋在5,200個(gè)示例上微調(diào)的GPT-davinci003(Brown等,2020b)。此外,僅用一半數(shù)量的演示,LIMA就可以實(shí)現(xiàn)與GPT-4(OpenAI,2023)、Claude(Bai等,2022b)和Bard相當(dāng)?shù)慕Y(jié)果。總之,LIMA表明,只需少數(shù)精心策劃的指令進(jìn)行微調(diào),就能將LLMs強(qiáng)大的知識(shí)和能力展現(xiàn)給用戶。
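A small sketch of how a curated set of (prompt, response) demonstrations might be formatted into supervised fine-tuning sequences; the template and field names are assumptions, not LIMA's released pipeline.

```python
def format_demonstration(prompt: str, response: str) -> str:
    """Render one curated demonstration as a single training sequence;
    during fine-tuning the loss is typically computed on the response tokens only."""
    return f"### Instruction:\n{prompt}\n\n### Response:\n{response}"

def build_sft_corpus(demonstrations):
    """demonstrations: a small, manually curated list of dicts with 'prompt'
    and 'response' keys (on the order of 1,000 items in LIMA's case)."""
    return [format_demonstration(d["prompt"], d["response"]) for d in demonstrations]

# Example:
# corpus = build_sft_corpus([{"prompt": "Explain instruction tuning briefly.",
#                             "response": "Instruction tuning further trains an LLM ..."}])
```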
8.4、Evaluating Instruction-tuning Datasets評(píng)估指令微調(diào)數(shù)據(jù)集:缺乏開(kāi)放性和主觀性的評(píng)估?
The performance of IT models highly depends on the IT datasets. However, evaluations of these IT datasets from open-ended and subjective aspects are still lacking.
To address this issue, Wang et al. (2023c) performs dataset evaluation by fine-tuning the LLaMa model (Touvron et al., 2023b) on a variety of open IT datasets and measuring the different fine-tuned models through both automatic and human evaluations. An additional model is trained on the combination of IT datasets. For the results, Wang et al. (2023c) showed that there is not a single best IT dataset across all tasks, while manually combining datasets can achieve the best overall performance. Besides, Wang et al. (2023c) pointed out that though IT can bring large benefits on LLMs of all sizes, smaller models and models with a high base quality benefit most from IT. For human evaluations, Wang et al. (2023c) found that a larger model is more likely to gain a higher acceptability score.
IT模型的性能在很大程度上取決于IT數(shù)據(jù)集。然而,這些IT數(shù)據(jù)集在開(kāi)放性和主觀性方面缺乏評(píng)估。
為了解決這個(gè)問(wèn)題,Wang等人(2023c)通過(guò)在各種開(kāi)放IT數(shù)據(jù)集上微調(diào)LLaMa模型(Touvron等,2023b),并通過(guò)自動(dòng)和人工評(píng)估來(lái)測(cè)量不同的微調(diào)模型。還有一個(gè)模型是在IT數(shù)據(jù)集的組合上進(jìn)行訓(xùn)練的。根據(jù)結(jié)果,Wang等人(2023c)表明,并沒(méi)有一個(gè)單一的最佳IT數(shù)據(jù)集適用于所有任務(wù),但通過(guò)手動(dòng)組合數(shù)據(jù)集可以實(shí)現(xiàn)最佳整體性能。此外,Wang等人(2023c)指出,盡管IT在所有規(guī)模的LLMs上都能帶來(lái)很大的好處,但較小的模型和具有高基礎(chǔ)質(zhì)量的模型最能從IT中受益。對(duì)于人類評(píng)估,Wang等人(2023c)發(fā)現(xiàn)較大的模型更有可能獲得更高的可接受性評(píng)分。
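A brief sketch of the "combination of IT datasets" setup mentioned above: tag each example with its source dataset, merge everything, and shuffle into one training mixture. The dataset names and field names are illustrative placeholders.

```python
import random

def combine_instruction_datasets(datasets: dict, seed: int = 0):
    """datasets: mapping from dataset name to a list of
    {'instruction': ..., 'output': ...} examples. Tag each example with its
    source, merge everything, and shuffle into one training mixture."""
    merged = []
    for name, examples in datasets.items():
        for ex in examples:
            merged.append({**ex, "source": name})
    random.Random(seed).shuffle(merged)
    return merged

# combined = combine_instruction_datasets({"self_instruct": ds_a, "flan": ds_b})
```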
8.5、Do IT just learn Pattern Copying?IT是否只是學(xué)習(xí)模式復(fù)制?——有論文指出基于IT的顯著改進(jìn)只是捕獲表面級(jí)別模式而非理解了本質(zhì)
To address the lack of clarity about the specific knowledge that models acquire through instruction tuning, Kung and Peng (2023) delves into the analysis of how models make use of instructions during IT by comparing the tuning when provided with altered instructions versus the original instructions.
Specifically, Kung and Peng (2023) creates simplified task definitions that remove all semantic components, leaving only the output information. In addition, Kung and Peng (2023) also incorporates delusive examples that contain incorrect input-output mapping. Surprisingly, the experiments show that models trained on these simplified task definitions or delusive examples can achieve comparable performance to the ones trained on the original instructions and examples. Moreover, the paper also introduces a baseline for the classification task with zero-shot, which achieves similar performance to IT in low-resource settings.
In summary, according to Kung and Peng (2023), the notable performance improvements observed in current IT models may be attributed to their ability to capture surface-level patterns, such as learning the output format and making guesses, rather than comprehending and learning the specific task.
為了解決關(guān)于模型通過(guò)指令微調(diào)獲取特定知識(shí)的缺乏清晰性的問(wèn)題,Kung和Peng(2023)通過(guò)比較在提供修改后的指令與原始指令時(shí)的微調(diào)情況,深入分析了模型在指令微調(diào)過(guò)程中如何使用指令。
具體而言,Kung和Peng(2023)創(chuàng)建了簡(jiǎn)化的任務(wù)定義,去除了所有語(yǔi)義成分,只留下輸出信息。此外,Kung和Peng(2023)還引入了包含不正確輸入-輸出映射的誤導(dǎo)性示例。令人驚訝的是,實(shí)驗(yàn)表明,在這些簡(jiǎn)化的任務(wù)定義或誤導(dǎo)性示例上訓(xùn)練的模型,可以達(dá)到與在原始指令和示例上訓(xùn)練的模型相當(dāng)?shù)男阅堋4送?#xff0c;該論文還引入了零樣本分類任務(wù)的基線,其在低資源設(shè)置下實(shí)現(xiàn)了與IT相似的性能。
總之,根據(jù)Kung和Peng(2023)的觀點(diǎn),當(dāng)前IT模型中觀察到的顯著性能改進(jìn)可能歸因于其捕捉表面級(jí)別的模式,例如學(xué)習(xí)輸出格式和進(jìn)行猜測(cè),而不是理解和學(xué)習(xí)特定任務(wù)。
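The two ablations described above can be sketched as follows: stripping a task definition down to its output information, and constructing delusive examples by breaking the input-output mapping. The exact construction here is an assumption for illustration, not the authors' code.

```python
import random

def simplify_task_definition(output_labels):
    """Strip all semantic task description and keep only output information,
    e.g. the label space the model must answer from."""
    return "Answer with exactly one of: " + ", ".join(output_labels)

def make_delusive_examples(examples, seed=0):
    """Build examples with an incorrect input-output mapping by pairing each
    input with the output taken from a different (shuffled) example."""
    rng = random.Random(seed)
    outputs = [ex["output"] for ex in examples]
    rng.shuffle(outputs)
    return [{"input": ex["input"], "output": out}
            for ex, out in zip(examples, outputs)]

# simplified = simplify_task_definition(["positive", "negative"])
# delusive   = make_delusive_examples(train_examples)
```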
8.6、Proprietary LLMs Imitation專有LLMs模仿:微調(diào)模型能效仿ChatGPT的表達(dá)風(fēng)格,但不等于提升其通用能力→更應(yīng)注重基模型及指導(dǎo)實(shí)例的質(zhì)量
Gudibande等人(2023)通過(guò)收集ChatGPT在多個(gè)領(lǐng)域的輸出數(shù)據(jù),用于微調(diào)開(kāi)源模型,旨在使開(kāi)源模型在部分領(lǐng)域的能力接近專有模型。他們的實(shí)驗(yàn)顯示,在有模仿數(shù)據(jù)集支持的任務(wù)上,微調(diào)后模型的表現(xiàn)明顯提高,輸出與ChatGPT相似;但在沒(méi)有模仿數(shù)據(jù)集的任務(wù)上,微調(diào)模型沒(méi)有改進(jìn)甚至效果下降。他們指出微調(diào)模型能效仿ChatGPT的表達(dá)風(fēng)格,但不等于提升其通用能力。研究者應(yīng)注重基模型及指導(dǎo)實(shí)例的質(zhì)量,而不是模仿專有模型。
LLMs imitation is an approach that collects outputs from a stronger model, such as a proprietary system like ChatGPT, and uses these outputs to fine-tune an open-source LLM. In this way, an open-source LLM may acquire capabilities competitive with those of a proprietary model.
Gudibande et al. (2023) conducted several experiments to critically analyze the efficacy of model imitation. Specifically, Gudibande et al. (2023) first collected datasets from outputs of ChatGPT over broad tasks. Then these datasets were used to fine-tune a range of models covering sizes from 1.5B to 13B, base models GPT-2 and LLaMA, and data amounts from 0.3M tokens to 150M tokens.
For evaluations, Gudibande et al. (2023) demonstrated that on tasks with supporting imitation datasets, imitation models perform far better than before fine-tuning, and their outputs appear similar to ChatGPT's. On tasks without imitation datasets, however, imitation models show no improvement or even a decline in accuracy.
Thus, Gudibande et al. (2023) pointed out that it is precisely because imitation models are adept at mimicking ChatGPT's style (e.g., being fluent, confident and well-structured) that researchers form an illusion about the general abilities of imitation models. So, Gudibande et al. (2023) suggested that instead of imitating proprietary models, researchers should focus on improving the quality of base models and instruction examples.
LLMs模仿是一種方法,它收集來(lái)自更強(qiáng)大模型(例如ChatGPT等專有系統(tǒng))的輸出,并使用這些輸出對(duì)開(kāi)源LLM進(jìn)行微調(diào)。通過(guò)這種方式,開(kāi)源LLM可以獲得與任何專有模型相當(dāng)?shù)哪芰Α?div style="height:15px;">
Gudibande等人(2023)進(jìn)行了多項(xiàng)實(shí)驗(yàn),以批判性地分析模型模仿的效果。具體而言,Gudibande等人(2023)首先從廣泛的任務(wù)中收集了ChatGPT的輸出數(shù)據(jù)集。然后,這些數(shù)據(jù)集被用于微調(diào)覆蓋從1.5B到13B大小的一系列模型,基礎(chǔ)模型為GPT-2和LLaMA,數(shù)據(jù)量為0.3M到150M個(gè)標(biāo)記。
在評(píng)估方面,Gudibande等人(2023)證明,在有模仿數(shù)據(jù)集支持的任務(wù)上,模仿模型的表現(xiàn)遠(yuǎn)優(yōu)于微調(diào)之前,其輸出與ChatGPT的輸出相似。然而,在沒(méi)有模仿數(shù)據(jù)集的任務(wù)上,模仿模型沒(méi)有提升,甚至在準(zhǔn)確性上有所下降。
因此,Gudibande等人(2023)指出,模仿模型擅長(zhǎng)模仿ChatGPT的風(fēng)格(例如流利、自信和良好結(jié)構(gòu)),這使得研究人員產(chǎn)生了有關(guān)模仿模型的普遍能力的錯(cuò)覺(jué)。因此,Gudibande等人(2023)建議,研究人員不應(yīng)該模仿專有模型,而應(yīng)該專注于提高基礎(chǔ)模型和指令示例的質(zhì)量。
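A hedged sketch of the imitation-data collection step: query a stronger proprietary model for responses to a pool of prompts and store the (instruction, output) pairs used to fine-tune an open model. It assumes the official openai Python client (v1 API); the model name is a placeholder.

```python
from openai import OpenAI  # assumes the official openai client (v1 API) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def collect_imitation_data(prompts, model="gpt-3.5-turbo"):
    """Query a stronger proprietary model and keep its outputs as
    (instruction, output) pairs for fine-tuning an open-source LLM."""
    dataset = []
    for prompt in prompts:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        dataset.append({"instruction": prompt,
                        "output": reply.choices[0].message.content})
    return dataset
```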
9、Conclusion結(jié)論
This work surveys recent advances in the fast-growing field of instruction tuning. We make a systematic review of the literature, including the general methodology of IT, the construction of IT datasets, the training of IT models, and IT's applications to different modalities, domains, and application scenarios. We also review analyses of IT models to discover both their advantages and potential pitfalls. We hope this work will act as a stimulus to motivate further endeavors to address the deficiencies of current IT models.
本文對(duì)迅速發(fā)展的指令微調(diào)領(lǐng)域的最新進(jìn)展進(jìn)行了綜述。我們對(duì)文獻(xiàn)進(jìn)行了系統(tǒng)性的回顧,包括IT的一般方法論、IT數(shù)據(jù)集的構(gòu)建、IT模型的訓(xùn)練,以及IT在不同模態(tài)、領(lǐng)域和應(yīng)用場(chǎng)景中的應(yīng)用。我們還回顧了對(duì)IT模型的分析,以發(fā)現(xiàn)它們的優(yōu)勢(shì)和潛在問(wèn)題。我們希望本文能夠起到拋磚引玉的作用,激勵(lì)更多的努力來(lái)解決當(dāng)前IT模型的不足之處。