Compare commits

...

3 Commits

3 changed files with 140 additions and 0 deletions

View File

@@ -33,3 +33,52 @@ different tokens being sampled. Once a different token is sampled, further diver
- For improved stability and reduced variance, use `float32`. Note that this will require more memory.
- If using `bfloat16`, switching to `float16` can also help.
- Using request seeds can aid in achieving more stable generation for temperature > 0, but discrepancies due to precision differences may still occur.
---
> Q: How do you load weights from CPU?
A: vLLM supports loading model weights from CPU via the `pt_load_map_location` parameter. This parameter controls the device onto which PyTorch checkpoints are loaded and is especially useful when:
- You have model weights stored on CPU and want to load them directly
- You need to manage memory usage by loading weights to CPU first
- You want to load from specific device mappings
## Usage Examples
### Command Line Interface
```bash
# Load weights from CPU
vllm serve meta-llama/Llama-2-7b-hf --pt-load-map-location cpu

# Load from a specific device mapping (e.g., CUDA device 1 to device 0)
vllm serve meta-llama/Llama-2-7b-hf --pt-load-map-location '{"cuda:1": "cuda:0"}'
```
### Python API
```python
from vllm import LLM

# Load weights from CPU
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    pt_load_map_location="cpu",
)

# Load with device mapping
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    pt_load_map_location={"cuda:1": "cuda:0"},
)
```
The `pt_load_map_location` parameter accepts the same values as PyTorch's [`torch.load(map_location=...)`](https://pytorch.org/docs/stable/generated/torch.load.html) parameter:
- `"cpu"` - Load all weights to CPU
- `"cuda"` - Load all weights to CUDA (equivalent to `{"": "cuda"}`)
- `{"cuda:1": "cuda:0"}` - Map weights from CUDA device 1 to device 0
- Custom device mappings as needed
Note: This parameter defaults to `"cpu"` and primarily affects PyTorch `.pt`/`.bin` checkpoint files. Regardless of the map location used while the checkpoint is read, weights are moved to the target device after loading, so GPU inference performance is not affected.
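If it helps to see the underlying PyTorch behavior that this parameter mirrors, the following is a minimal sketch of the equivalent `torch.load` calls (the checkpoint filename is hypothetical and only illustrates the mapping semantics):
```python
import torch

# Hypothetical checkpoint file used only to illustrate the mapping semantics.
CHECKPOINT = "pytorch_model.bin"

# Map every tensor onto CPU, matching pt_load_map_location="cpu".
state_dict = torch.load(CHECKPOINT, map_location="cpu")

# Remap tensors saved on cuda:1 onto cuda:0,
# matching pt_load_map_location={"cuda:1": "cuda:0"}.
state_dict = torch.load(CHECKPOINT, map_location={"cuda:1": "cuda:0"})
```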

View File

@@ -78,3 +78,23 @@ Try it yourself with the following arguments:
```bash
--model meta-llama/Llama-2-13b-chat-hf --cpu-offload-gb 10
```
### CPU weight loading
The `cpu_weight_loading.py` example demonstrates how to control the device that model checkpoints are mapped to during loading via the `pt_load_map_location` parameter. This is particularly useful for managing memory and when working with PyTorch checkpoint files.
Try it yourself:
```bash
python examples/offline_inference/basic/cpu_weight_loading.py
```
You can also use this parameter with other scripts that support argument parsing:
```bash
# Load weights from CPU (default behavior)
python examples/offline_inference/basic/chat.py --pt-load-map-location cpu

# Use custom device mapping (example syntax)
python examples/offline_inference/basic/generate.py --pt-load-map-location '{"": "cpu"}'
```
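As a rough illustration of what "support argument parsing" means here, the sketch below assumes a script that builds its CLI from vLLM's `EngineArgs.add_cli_args` helper, which is how `--pt-load-map-location` becomes available on the command line; the exact wiring in the shipped example scripts may differ.
```python
from vllm import LLM, EngineArgs
from vllm.utils import FlexibleArgumentParser


def main() -> None:
    parser = FlexibleArgumentParser()
    # Registers the engine flags, including --pt-load-map-location.
    EngineArgs.add_cli_args(parser)
    args = parser.parse_args()

    # Build EngineArgs from the parsed namespace and hand the relevant
    # settings to LLM (pt_load_map_location is assumed to be an engine arg).
    engine_args = EngineArgs.from_cli_args(args)
    llm = LLM(
        model=engine_args.model,
        pt_load_map_location=engine_args.pt_load_map_location,
    )
    for output in llm.generate("Hello, my name is"):
        print(output.outputs[0].text)


if __name__ == "__main__":
    main()
```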

View File

@@ -0,0 +1,71 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
Example demonstrating how to load model weights from CPU using
pt_load_map_location.

This is useful when:
- You want to explicitly load PyTorch checkpoints from CPU
- You need to manage memory usage during model initialization
- You want to map weights from one device to another

The pt_load_map_location parameter works the same as PyTorch's
torch.load(map_location=...) and defaults to "cpu", which is typically the
most memory-efficient way to load checkpoints.
"""

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "The advantages of loading weights from CPU include",
    "When should you use CPU weight loading?",
    "Memory management in machine learning is important because",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=50)


def main():
    # Example 1: Explicitly load weights from CPU (the default behavior).
    print("=== Example 1: Loading weights from CPU ===")
    llm = LLM(
        model="facebook/opt-125m",
        pt_load_map_location="cpu",  # Explicitly specify CPU loading
    )
    outputs = llm.generate(prompts[:1], sampling_params)
    for output in outputs:
        print(f"Prompt: {output.prompt}")
        print(f"Output: {output.outputs[0].text}")

    # Example 2: Using a device mapping.
    # This shows the dict syntax: {"": "cpu"} maps every tensor to CPU and
    # works on any machine, while a remap such as {"cuda:1": "cuda:0"} is only
    # applicable when multiple CUDA devices are available.
    print("\n=== Example 2: Device mapping example ===")
    try:
        llm_mapped = LLM(
            model="facebook/opt-125m",
            pt_load_map_location={"": "cpu"},  # Alternative syntax for CPU
        )
        outputs = llm_mapped.generate(prompts[1:2], sampling_params)
        for output in outputs:
            print(f"Prompt: {output.prompt}")
            print(f"Output: {output.outputs[0].text}")
    except Exception as e:
        print(f"Device mapping example failed: {e}")

    # Example 3: Default behavior (pt_load_map_location="cpu" is the default).
    print("\n=== Example 3: Default behavior (CPU loading) ===")
    llm_default = LLM(model="facebook/opt-125m")  # Uses CPU loading by default
    outputs = llm_default.generate(prompts[2:3], sampling_params)
    for output in outputs:
        print(f"Prompt: {output.prompt}")
        print(f"Output: {output.outputs[0].text}")


if __name__ == "__main__":
    main()