Notes:

[1] https://www.jasonwei.net/blog/emergence and https://www.yitay.net/blog/emergence-and-scaling

[2] Wei et al. 2022. Chain of Thought Prompting Elicits Reasoning in Large Language Models

[3] https://lingo.csail.mit.edu/blog/arithmetic_gpt3/

[4] Wei et al. 2022. Emergent Abilities of Large Language Models

[5] As of November 2022, there was still no rigorous evidence that these abilities exist in small models.

[6] As of November 2022, evaluating the GSM8K test set on text-davinci-002 cost $50.

[7] Google does not provide public access to PaLM; OpenAI does not allow researchers in some countries to access GPT-3 and Codex (as of November 2022).

[8] The first version of GPT-3 (May 2020) could not outperform a fine-tuned T5 on many tasks.

[9] Wei et al. 2022. Emergent Abilities of Large Language Models

[10] Cobbe et al. 2021. Training Verifiers to Solve Math Word Problems

[11] GPT-3 has been continuously updated. The latest version, text-davinci-002, is now very different from the original 2020 release.

[12] Wei et al. 2022. Chain of Thought Prompting Elicits Reasoning in Large Language Models

[13] Wang et al. 2022. Self-Consistency Improves Chain of Thought Reasoning in Language Models

[14] Fu et al. 2022. Complexity-Based Prompting for Multi-step Reasoning

[15] There is not yet any work that fairly compares prompting with fine-tuning. Still, when chain-of-thought was proposed, it outperformed fine-tuning even though the comparison between the two may not have been fair.

[16] Chung et al. 2022. Scaling Instruction-Finetuned Language Models

[17] Lewkowycz et al. 2022. Minerva: Solving Quantitative Reasoning Problems with Language Models

[18] Jiang et al. 2022. Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs

[19] Xu et al. 2021. Fusing Context Into Knowledge Graph for Commonsense Question Answering

[20] Khashabi et al. 2020. UnifiedQA: Crossing Format Boundaries With a Single QA System

[21] Yu et al. 2022. Generate rather than Retrieve: Large Language Models are Strong Context Generators

[22] Jung et al. 2022. Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations

[23] Although this knowledge may be outdated or unreliable, the question of which trustworthy knowledge source to choose is beyond the scope of this article.

[24] Si et al. 2022. Prompting GPT-3 to be Reliable

[25] Fu et al. 2022. Complexity-Based Prompting for Multi-step Reasoning

[26] Kaplan et al. 2020. Scaling Laws for Neural Language Models

[27] Brown et al. 2020. Language Models are Few-Shot Learners

[28] Cobbe et al. 2021. Training Verifiers to Solve Math Word Problems

[29] Li and Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation

[30] He et al. 2021. Towards a Unified View of Parameter-Efficient Transfer Learning

[31] Chung et al. 2022. Scaling Instruction-Finetuned Language Models

[32] Suzgun et al. 2022. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

[33] In the two months after this article was published, more models were released, and many of the new ones can also do chain-of-thought, such as UL2 and FlanT5.

[34] Suzgun et al. 2022. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them; Fu et al. 2022. Complexity-Based Prompting for Multi-step Reasoning; Madaan et al. 2022. Language Models of Code are Few-Shot Commonsense Learners

[35] Ouyang et al. 2022. Training Language Models to Follow Instructions with Human Feedback

[36] Chowdhery et al. 2022. PaLM: Scaling Language Modeling with Pathways

[37] Chung et al. 2022. Scaling Instruction-Finetuned Language Models

[38] Chung et al. 2022. Scaling Instruction-Finetuned Language Models; Huang et al. 2022. Large Language Models Can Self-Improve