From 41d3082c416897092bc924bc341e86b3e49728ee Mon Sep 17 00:00:00 2001
From: Daniel Han
Date: Fri, 25 Jul 2025 17:06:48 -0700
Subject: [PATCH] Add Unsloth to RLHF.md (#21636)

---
 docs/training/rlhf.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/docs/training/rlhf.md b/docs/training/rlhf.md
index 4f75e4e014..f608a630ab 100644
--- a/docs/training/rlhf.md
+++ b/docs/training/rlhf.md
@@ -2,10 +2,14 @@
 
 Reinforcement Learning from Human Feedback (RLHF) is a technique that fine-tunes language models using human-generated preference data to align model outputs with desired behaviors.
 
-vLLM can be used to generate the completions for RLHF. The best way to do this is with libraries like [TRL](https://github.com/huggingface/trl), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) and [verl](https://github.com/volcengine/verl).
+vLLM can be used to generate the completions for RLHF. Some ways to do this include using libraries like [TRL](https://github.com/huggingface/trl), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [verl](https://github.com/volcengine/verl) and [unsloth](https://github.com/unslothai/unsloth).
 
 See the following basic examples to get started if you don't want to use an existing library:
 
 - [Training and inference processes are located on separate GPUs (inspired by OpenRLHF)](../examples/offline_inference/rlhf.md)
 - [Training and inference processes are colocated on the same GPUs using Ray](../examples/offline_inference/rlhf_colocate.md)
 - [Utilities for performing RLHF with vLLM](../examples/offline_inference/rlhf_utils.md)
+
+See the following notebooks showing how to use vLLM for GRPO:
+
+- [Qwen-3 4B GRPO using Unsloth + vLLM](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb)