[Doc] Rename offline inference examples (#11927)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Author: Harry Mellor
Date: 2025-01-10 15:50:29 +00:00
Committed by: GitHub
Parent: 20410b2fda
Commit: 482cdc494e
46 changed files with 46 additions and 46 deletions


@@ -26,7 +26,7 @@ Set the env variable VLLM_RPC_TIMEOUT to a big number before you start the serve
### Offline Inference
-Refer to <gh-file:examples/offline_inference/offline_inference_with_profiler.py> for an example.
+Refer to <gh-file:examples/offline_inference/simple_profiling.py> for an example.
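For reference, an offline profiling run along the lines of the renamed example can be sketched as follows; the `VLLM_TORCH_PROFILER_DIR` directory, the model name, and the prompt are illustrative assumptions, not a verbatim copy of the script:

```python
import os

# Traces are written to this directory; set it before the LLM is constructed.
os.environ["VLLM_TORCH_PROFILER_DIR"] = "./vllm_profile"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

llm.start_profile()
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
llm.stop_profile()

print(outputs[0].outputs[0].text)
```

The resulting traces can then be opened in a trace viewer such as Perfetto.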
### OpenAI Server


@@ -257,4 +257,4 @@ outputs = llm.generate(
print(outputs[0].outputs[0].text)
```
-Full example: <gh-file:examples/offline_inference/offline_inference_structured_outputs.py>
+Full example: <gh-file:examples/offline_inference/structured_outputs.py>
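As a rough sketch of what such a structured-outputs script involves (the model name and choice set below are illustrative assumptions):

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")

# Constrain generation to one of a fixed set of choices.
guided = GuidedDecodingParams(choice=["Positive", "Negative"])
params = SamplingParams(guided_decoding=guided, max_tokens=16)

outputs = llm.generate("Classify this sentiment: vLLM is wonderful!", params)
print(outputs[0].outputs[0].text)
```

`GuidedDecodingParams` also accepts `json`, `regex`, and `grammar` constraints for more elaborate output schemas.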


@@ -95,7 +95,7 @@ $ VLLM_TARGET_DEVICE=cpu python setup.py install
$ sudo apt-get install libtcmalloc-minimal4 # install TCMalloc library
$ find / -name "*libtcmalloc*" # find the dynamic link library path (quote the pattern so the shell does not expand it)
$ export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD # prepend the library to LD_PRELOAD
-$ python examples/offline_inference/offline_inference.py # run vLLM
+$ python examples/offline_inference/basic.py # run vLLM
```
- When using online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserve CPUs 30 and 31 for the framework and use CPUs 0-29 for OpenMP:
@@ -132,7 +132,7 @@ CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ
# On this platform, it is recommended to bind OpenMP threads only to logical CPU cores 0-7 or 8-15
$ export VLLM_CPU_OMP_THREADS_BIND=0-7
-$ python examples/offline_inference/offline_inference.py
+$ python examples/offline_inference/basic.py
```
- If using the vLLM CPU backend on a multi-socket machine with NUMA, be sure to set the CPU cores via `VLLM_CPU_OMP_THREADS_BIND` to avoid cross-NUMA-node memory access.


@@ -40,7 +40,7 @@ For non-CUDA platforms, please refer [here](#installation-index) for specific in
## Offline Batched Inference
-With vLLM installed, you can start generating texts for a list of input prompts (i.e. offline batch inference). See the example script: <gh-file:examples/offline_inference/offline_inference.py>
+With vLLM installed, you can start generating texts for a list of input prompts (i.e. offline batch inference). See the example script: <gh-file:examples/offline_inference/basic.py>
The first line of this example imports the classes {class}`~vllm.LLM` and {class}`~vllm.SamplingParams`:
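For orientation, a minimal script of this shape looks roughly like the sketch below (not a verbatim copy of the renamed `basic.py`; the model and prompts are placeholders):

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")

# All prompts are submitted at once and batched by the engine.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```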


@@ -46,7 +46,7 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
-A code example can be found here: <gh-file:examples/offline_inference/offline_inference.py>
+A code example can be found here: <gh-file:examples/offline_inference/basic.py>
### `LLM.beam_search`
@@ -103,7 +103,7 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
-A code example can be found here: <gh-file:examples/offline_inference/offline_inference_chat.py>
+A code example can be found here: <gh-file:examples/offline_inference/chat.py>
If the model doesn't have a chat template or you want to specify another one,
you can explicitly pass a chat template:
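A minimal sketch of passing an explicit template to `LLM.chat`; the model name and the Jinja template string below are made up for illustration:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a haiku about GPUs."},
]

# A deliberately simple Jinja template; real templates usually ship with the tokenizer.
custom_template = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
    "assistant:"
)

outputs = llm.chat(
    conversation,
    SamplingParams(temperature=0.5, max_tokens=64),
    chat_template=custom_template,
)
print(outputs[0].outputs[0].text)
```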


@@ -88,7 +88,7 @@ embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```
-A code example can be found here: <gh-file:examples/offline_inference/offline_inference_embedding.py>
+A code example can be found here: <gh-file:examples/offline_inference/embedding.py>
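Sketched end to end, an embedding run looks something like this (the checkpoint is just one embedding-capable example; any pooling model supported by vLLM works):

```python
from vllm import LLM

# task="embed" selects the embedding runner for this pooling model.
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")

(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```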
### `LLM.classify`
@@ -103,7 +103,7 @@ probs = output.outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})")
```
-A code example can be found here: <gh-file:examples/offline_inference/offline_inference_classification.py>
+A code example can be found here: <gh-file:examples/offline_inference/classification.py>
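A corresponding sketch for classification; the checkpoint below is only an example of a sequence-classification model:

```python
from vllm import LLM

# task="classify" selects the sequence-classification runner.
llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")

(output,) = llm.classify("Hello, my name is")

probs = output.outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})")
```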
### `LLM.score`
@@ -125,7 +125,7 @@ score = output.outputs.score
print(f"Score: {score}")
```
-A code example can be found here: <gh-file:examples/offline_inference/offline_inference_scoring.py>
+A code example can be found here: <gh-file:examples/offline_inference/scoring.py>
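And a sketch for scoring sentence pairs with a cross-encoder; the reranker checkpoint is an illustrative choice:

```python
from vllm import LLM

# task="score" selects the cross-encoder / reranker runner.
llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")

(output,) = llm.score(
    "What is the capital of France?",
    "The capital of France is Paris.",
)

score = output.outputs.score
print(f"Score: {score}")
```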
## Online Serving


@@ -60,7 +60,7 @@ for o in outputs:
print(generated_text)
```
-Full example: <gh-file:examples/offline_inference/offline_inference_vision_language.py>
+Full example: <gh-file:examples/offline_inference/vision_language.py>
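A minimal single-image sketch; the LLaVA-1.5 checkpoint, its `<image>` placeholder, and the bundled test image are assumptions for illustration:

```python
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# The placeholder token is model specific; "<image>" is LLaVA-1.5's convention.
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
image = ImageAsset("stop_sign").pil_image

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.2, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```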
To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
@@ -91,7 +91,7 @@ for o in outputs:
print(generated_text)
```
-Full example: <gh-file:examples/offline_inference/offline_inference_vision_language_multi_image.py>
+Full example: <gh-file:examples/offline_inference/vision_language_multi_image.py>
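For multiple images, the prompt carries one placeholder per image and the engine must be told to expect more than one image per prompt; the Phi-3.5-vision checkpoint and its numbered placeholders are assumptions for illustration:

```python
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

# limit_mm_per_prompt raises the default limit of one image per prompt.
llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
    limit_mm_per_prompt={"image": 2},
)

prompt = (
    "<|user|>\n<|image_1|>\n<|image_2|>\n"
    "What is the content of each image?<|end|>\n<|assistant|>\n"
)
images = [
    ImageAsset("stop_sign").pil_image,
    ImageAsset("cherry_blossom").pil_image,
]

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": images}},
    SamplingParams(temperature=0.2, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```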
Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
@@ -125,13 +125,13 @@ for o in outputs:
You can pass a list of NumPy arrays directly to the `'video'` field of the multi-modal dictionary
instead of using multi-image input.
-Full example: <gh-file:examples/offline_inference/offline_inference_vision_language.py>
+Full example: <gh-file:examples/offline_inference/vision_language.py>
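A rough sketch of the video path; the dummy frame list stands in for real decoded frames, and the Qwen2-VL placeholder tokens are an assumption taken from that model's prompt format, so verify them against the model's chat template:

```python
import numpy as np
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-VL-2B-Instruct")

# Stand-in clip: 16 RGB frames as uint8 arrays (use real decoded frames in practice).
video_frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(16)]

prompt = (
    "<|im_start|>user\n<|vision_start|><|video_pad|><|vision_end|>"
    "Describe this video.<|im_end|>\n<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"video": video_frames}},
    SamplingParams(temperature=0.2, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```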
### Audio
You can pass a tuple `(array, sampling_rate)` to the `'audio'` field of the multi-modal dictionary.
-Full example: <gh-file:examples/offline_inference/offline_inference_audio_language.py>
+Full example: <gh-file:examples/offline_inference/audio_language.py>
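A minimal audio sketch; the Qwen2-Audio checkpoint, its placeholder tokens, and the bundled test clip are assumptions for illustration:

```python
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct")

# A bundled test clip; any (numpy_array, sampling_rate) tuple works here.
audio, sampling_rate = AudioAsset("winning_call").audio_and_sample_rate

prompt = (
    "<|im_start|>user\nAudio 1: <|audio_bos|><|AUDIO|><|audio_eos|>\n"
    "What is happening in this audio?<|im_end|>\n<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"audio": (audio, sampling_rate)}},
    SamplingParams(temperature=0.2, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```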
### Embedding