Compare commits

...

3 Commits

3 changed files with 140 additions and 0 deletions

View File

@@ -33,3 +33,52 @@ different tokens being sampled. Once a different token is sampled, further diver
- For improved stability and reduced variance, use `float32`. Note that this will require more memory.
- If using `bfloat16`, switching to `float16` can also help.
- Using request seeds can aid in achieving more stable generation for temperature > 0, but discrepancies due to precision differences may still occur.
---
> Q: How do you load weights from CPU?
A: vLLM supports loading model weights from CPU via the `pt_load_map_location` parameter. This parameter controls the device onto which PyTorch checkpoints are loaded and is especially useful when:
- You have model weights stored on CPU and want to load them directly
- You need to manage memory usage by loading weights to CPU first
- You want to load from specific device mappings
## Usage Examples
### Command Line Interface
```bash
# Load weights from CPU
vllm serve meta-llama/Llama-2-7b-hf --pt-load-map-location cpu

# Load from a specific device mapping (e.g., CUDA device 1 to device 0)
vllm serve meta-llama/Llama-2-7b-hf --pt-load-map-location '{"cuda:1": "cuda:0"}'
```
### Python API
```python
from vllm import LLM

# Load weights from CPU
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    pt_load_map_location="cpu",
)

# Load with device mapping
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    pt_load_map_location={"cuda:1": "cuda:0"},
)
```
The `pt_load_map_location` parameter accepts the same values as PyTorch's [`torch.load(map_location=...)`](https://pytorch.org/docs/stable/generated/torch.load.html) parameter:
- `"cpu"` - Load all weights to CPU
- `"cuda"` - Load all weights to CUDA (equivalent to `{"": "cuda"}`)
- `{"cuda:1": "cuda:0"}` - Map weights from CUDA device 1 to device 0
- Custom device mappings as needed
Note: This parameter defaults to `"cpu"` and primarily affects PyTorch `.pt`/`.bin` checkpoint files. Regardless of the map location used while the checkpoint is read, weights are moved to the target device after loading, so GPU inference performance is not affected.
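If it helps to see the underlying PyTorch behavior that this parameter mirrors, the following is a minimal sketch of the equivalent `torch.load` calls (the checkpoint filename is hypothetical and only illustrates the mapping semantics):
```python
import torch

# Hypothetical checkpoint file used only to illustrate the mapping semantics.
CHECKPOINT = "pytorch_model.bin"

# Map every tensor onto CPU, matching pt_load_map_location="cpu".
state_dict = torch.load(CHECKPOINT, map_location="cpu")

# Remap tensors saved on cuda:1 onto cuda:0,
# matching pt_load_map_location={"cuda:1": "cuda:0"}.
state_dict = torch.load(CHECKPOINT, map_location={"cuda:1": "cuda:0"})
```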

View File

@@ -78,3 +78,23 @@ Try it yourself with the following arguments:
```bash
--model meta-llama/Llama-2-13b-chat-hf --cpu-offload-gb 10
```
### CPU weight loading
The `cpu_weight_loading.py` example demonstrates how to control the device that model checkpoints are mapped to during loading via the `pt_load_map_location` parameter. This is particularly useful for managing memory and when working with PyTorch checkpoint files.
Try it yourself:
```bash
python examples/offline_inference/basic/cpu_weight_loading.py
```
You can also use this parameter with other scripts that support argument parsing:
```bash
# Load weights from CPU (default behavior)
python examples/offline_inference/basic/chat.py --pt-load-map-location cpu

# Use custom device mapping (example syntax)
python examples/offline_inference/basic/generate.py --pt-load-map-location '{"": "cpu"}'
```
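As a rough illustration of what "support argument parsing" means here, the sketch below assumes a script that builds its CLI from vLLM's `EngineArgs.add_cli_args` helper, which is how `--pt-load-map-location` becomes available on the command line; the exact wiring in the shipped example scripts may differ.
```python
from vllm import LLM, EngineArgs
from vllm.utils import FlexibleArgumentParser


def main() -> None:
    parser = FlexibleArgumentParser()
    # Registers the engine flags, including --pt-load-map-location.
    EngineArgs.add_cli_args(parser)
    args = parser.parse_args()

    # Build EngineArgs from the parsed namespace and hand the relevant
    # settings to LLM (pt_load_map_location is assumed to be an engine arg).
    engine_args = EngineArgs.from_cli_args(args)
    llm = LLM(
        model=engine_args.model,
        pt_load_map_location=engine_args.pt_load_map_location,
    )
    for output in llm.generate("Hello, my name is"):
        print(output.outputs[0].text)


if __name__ == "__main__":
    main()
```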

View File

@@ -0,0 +1,71 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
Example demonstrating how to load model weights from CPU using
pt_load_map_location.

This is useful when:
- You want to explicitly load PyTorch checkpoints from CPU
- You need to manage memory usage during model initialization
- You want to map weights from one device to another

The pt_load_map_location parameter works the same as PyTorch's
torch.load(map_location=...) and defaults to "cpu", which is typically the
most memory-efficient way to load checkpoints.
"""

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "The advantages of loading weights from CPU include",
    "When should you use CPU weight loading?",
    "Memory management in machine learning is important because",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=50)


def main():
    # Example 1: Explicitly load weights from CPU (the default behavior).
    print("=== Example 1: Loading weights from CPU ===")
    llm = LLM(
        model="facebook/opt-125m",
        pt_load_map_location="cpu",  # Explicitly specify CPU loading
    )
    outputs = llm.generate(prompts[:1], sampling_params)
    for output in outputs:
        print(f"Prompt: {output.prompt}")
        print(f"Output: {output.outputs[0].text}")

    # Example 2: Using a device mapping.
    # This shows the dict syntax: {"": "cpu"} maps every tensor to CPU and
    # works on any machine, while a remap such as {"cuda:1": "cuda:0"} is only
    # applicable when multiple CUDA devices are available.
    print("\n=== Example 2: Device mapping example ===")
    try:
        llm_mapped = LLM(
            model="facebook/opt-125m",
            pt_load_map_location={"": "cpu"},  # Alternative syntax for CPU
        )
        outputs = llm_mapped.generate(prompts[1:2], sampling_params)
        for output in outputs:
            print(f"Prompt: {output.prompt}")
            print(f"Output: {output.outputs[0].text}")
    except Exception as e:
        print(f"Device mapping example failed: {e}")

    # Example 3: Default behavior (pt_load_map_location="cpu" is the default).
    print("\n=== Example 3: Default behavior (CPU loading) ===")
    llm_default = LLM(model="facebook/opt-125m")  # Uses CPU loading by default
    outputs = llm_default.generate(prompts[2:3], sampling_params)
    for output in outputs:
        print(f"Prompt: {output.prompt}")
        print(f"Output: {output.outputs[0].text}")


if __name__ == "__main__":
    main()