DPO: Training A Language Model To Satisfy Human Preferences

Paper: "Direct Preference Optimization: Your Language Model is Secretly a Reward Model"
Rafael Rafailov*, Archit Sharma*, Eric Mitchell*, Stefano Ermon, Christopher D. Manning, Chelsea Finn (13 Dec 2023)

[Figure: Policy Preferred by Humans]

Large-scale unsupervised language models are known to solve a wide variety of tasks by drawing on the extensive knowledge acquired during pretraining. These generative models produce responses according to their policy. RLHF (Reinforcement Learning from Human Feedback) ...
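To make the title concrete, here is a minimal sketch (in PyTorch) of the DPO training objective stated in the paper referenced above. The function name and the per-sequence log-probability tensors (policy_chosen_logps, policy_rejected_logps, ref_chosen_logps, ref_rejected_logps) are hypothetical names chosen for illustration, not the paper's code; beta is the usual hyperparameter controlling the implicit KL constraint against a frozen reference model.

    # Minimal sketch of the DPO loss; tensor names are assumptions, not from the paper's code.
    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps: torch.Tensor,
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
        """DPO loss for a batch of (preferred, rejected) completion pairs."""
        # Implicit rewards: beta-scaled log-ratios between the trained policy and the reference model.
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # Bradley-Terry negative log-likelihood of the human preference.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

The point this sketch is meant to convey is the paper's central claim: no explicit reward model or RL loop appears, because the preference likelihood is written directly in terms of the policy's own log-probabilities relative to the reference model.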