[Docs] Fix syntax highlighting of shell commands (#19870)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
@@ -10,7 +10,7 @@ title: Using Docker
vLLM offers an official Docker image for deployment.
The image can be used to run an OpenAI-compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).

-```console
+```bash
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
@@ -22,7 +22,7 @@ docker run --runtime nvidia --gpus all \

This image can also be used with other container engines such as [Podman](https://podman.io/).

-```console
+```bash
podman run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
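Whichever container engine is used, the resulting container runs the OpenAI-compatible server, so it can be smoke-tested once it is up. A minimal sketch, assuming the server's default port 8000 is published on localhost and the model has finished loading:

```bash
# List the models the server is currently serving (OpenAI-compatible endpoint).
curl http://localhost:8000/v1/models
```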
@@ -71,7 +71,7 @@ You can add any other [engine-args][engine-args] you need after the image tag (`

You can build and run vLLM from source via the provided <gh-file:docker/Dockerfile>. To build vLLM:

-```console
+```bash
# optionally specify: --build-arg max_jobs=8 --build-arg nvcc_threads=2
DOCKER_BUILDKIT=1 docker build . \
--target vllm-openai \

@@ -99,7 +99,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--

??? Command

-```console
+```bash
# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
python3 use_existing_torch.py
DOCKER_BUILDKIT=1 docker build . \
@@ -118,7 +118,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--

Run the following command on your host machine to register QEMU user static handlers:

-```console
+```bash
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
```
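With the QEMU handlers registered, a cross-platform build can then be attempted by adding `--platform` to the build command shown earlier. This is only a sketch under assumptions: the image tag is hypothetical, and the remaining flags simply mirror the earlier `docker build` invocation:

```bash
# Hypothetical cross-build for arm64; flags other than --platform follow the earlier example.
DOCKER_BUILDKIT=1 docker build . \
    --platform "linux/arm64" \
    --target vllm-openai \
    -t vllm-openai-arm64   # hypothetical tag
```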
@@ -128,7 +128,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--

To run vLLM with the custom-built Docker image:

-```console
+```bash
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \

@@ -15,7 +15,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac

- Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096
```
@@ -11,7 +11,7 @@ title: AutoGen

- Set up the [AutoGen](https://microsoft.github.io/autogen/0.2/docs/installation/) environment

-```console
+```bash
pip install vllm

# Install AgentChat and OpenAI client from Extensions
@@ -23,7 +23,7 @@ pip install -U "autogen-agentchat" "autogen-ext[openai]"

- Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.2
```
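The module invocation above is equivalent to the `vllm serve` entrypoint used throughout the other guides in this change, so the same server can also be started with the shorter form (same model assumed):

```bash
# Equivalent shorthand for the OpenAI-compatible server launched above.
vllm serve mistralai/Mistral-7B-Instruct-v0.2
```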
@@ -11,14 +11,14 @@ vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebr

To install the Cerebrium client, run:

-```console
+```bash
pip install cerebrium
cerebrium login
```

Next, to create your Cerebrium project, run:

-```console
+```bash
cerebrium init vllm-project
```

@@ -58,7 +58,7 @@ Next, let us add our code to handle inference for the LLM of your choice (`mistr

Then, run the following command to deploy it to the cloud:

-```console
+```bash
cerebrium deploy
```
@@ -15,7 +15,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac

- Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
vllm serve qwen/Qwen1.5-0.5B-Chat
```

@@ -18,13 +18,13 @@ This guide walks you through deploying Dify using a vLLM backend.

- Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
vllm serve Qwen/Qwen1.5-7B-Chat
```

- Start the Dify server with docker compose ([details](https://github.com/langgenius/dify?tab=readme-ov-file#quick-start)):

-```console
+```bash
git clone https://github.com/langgenius/dify.git
cd dify
cd docker
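The hunk above is cut off before the compose invocation itself; in the usual Dify quick-start flow it continues roughly as follows (a sketch only, assuming Docker Compose v2 and Dify's bundled example env file):

```bash
# Hypothetical continuation of the Dify quick start: copy the sample env and start the stack.
cp .env.example .env
docker compose up -d
```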
@@ -11,14 +11,14 @@ vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/),

To install the dstack client, run:

-```console
+```bash
pip install "dstack[all]"
dstack server
```

Next, to configure your dstack project, run:

-```console
+```bash
mkdir -p vllm-dstack
cd vllm-dstack
dstack init
@@ -13,7 +13,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac

- Set up the vLLM and Haystack environment

-```console
+```bash
pip install vllm haystack-ai
```

@@ -21,7 +21,7 @@ pip install vllm haystack-ai

- Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
vllm serve mistralai/Mistral-7B-Instruct-v0.1
```
@@ -22,7 +22,7 @@ Before you begin, ensure that you have the following:

To install the chart with the release name `test-vllm`:

-```console
+```bash
helm upgrade --install --create-namespace \
    --namespace=ns-vllm test-vllm . -f values.yaml \
    --set secrets.s3endpoint=$ACCESS_POINT \
    --set secrets.s3bucketname=$BUCKET \
    --set secrets.s3accesskeyid=$ACCESS_KEY \
    --set secrets.s3accesskey=$SECRET_KEY
```
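Once the chart is installed, a quick way to confirm the release came up is to query Helm and the namespace it created. A minimal sketch, assuming the release and namespace names used above:

```bash
# Show the release status and the pods it created.
helm status test-vllm --namespace ns-vllm
kubectl get pods --namespace ns-vllm
```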
@@ -30,7 +30,7 @@ helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f val

To uninstall the `test-vllm` deployment:

-```console
+```bash
helm uninstall test-vllm --namespace=ns-vllm
```

@@ -18,7 +18,7 @@ And LiteLLM supports all models on VLLM.

- Set up the vLLM and litellm environment

-```console
+```bash
pip install vllm litellm
```

@@ -28,7 +28,7 @@ pip install vllm litellm

- Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
vllm serve qwen/Qwen1.5-0.5B-Chat
```
@@ -56,7 +56,7 @@ vllm serve qwen/Qwen1.5-0.5B-Chat

- Start the vLLM server with the supported embedding model, e.g.

-```console
+```bash
vllm serve BAAI/bge-base-en-v1.5
```
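For reference, the embedding server exposes the OpenAI-compatible `/v1/embeddings` route, so it can be checked directly. A minimal sketch, assuming the server above runs on the default port 8000:

```bash
# Request a single embedding from the model served above.
curl http://localhost:8000/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{"model": "BAAI/bge-base-en-v1.5", "input": "Hello, world!"}'
```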
@@ -7,13 +7,13 @@ title: Open WebUI

2. Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
vllm serve qwen/Qwen1.5-0.5B-Chat
```

1. Start the [Open WebUI](https://github.com/open-webui/open-webui) docker container (replace the vllm serve host and vllm serve port):

-```console
+```bash
docker run -d -p 3000:8080 \
--name open-webui \
-v open-webui:/app/backend/data \

@@ -15,7 +15,7 @@ Here are the integrations:

- Set up the vLLM and langchain environment

-```console
+```bash
pip install -U vllm \
langchain_milvus langchain_openai \
langchain_community beautifulsoup4 \

@@ -26,14 +26,14 @@ pip install -U vllm \

- Start the vLLM server with the supported embedding model, e.g.

-```console
+```bash
# Start embedding service (port 8000)
vllm serve ssmits/Qwen2-7B-Instruct-embed-base
```

- Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
# Start chat service (port 8001)
vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
```
@@ -52,7 +52,7 @@ python retrieval_augmented_generation_with_langchain.py

- Set up the vLLM and llamaindex environment

-```console
+```bash
pip install vllm \
llama-index llama-index-readers-web \
llama-index-llms-openai-like \

@@ -64,14 +64,14 @@ pip install vllm \

- Start the vLLM server with the supported embedding model, e.g.

-```console
+```bash
# Start embedding service (port 8000)
vllm serve ssmits/Qwen2-7B-Instruct-embed-base
```

- Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
# Start chat service (port 8001)
vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
```
@@ -15,7 +15,7 @@ vLLM can be **run and scaled to multiple service replicas on clouds and Kubernet
- Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)).
- Check that `sky check` shows clouds or Kubernetes are enabled.

-```console
+```bash
pip install skypilot-nightly
sky check
```

@@ -71,7 +71,7 @@ See the vLLM SkyPilot YAML for serving, [serving.yaml](https://github.com/skypil

Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):

-```console
+```bash
HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN
```
@@ -83,7 +83,7 @@ Check the output of the command. There will be a shareable gradio link (like the

**Optional**: Serve the 70B model instead of the default 8B and use more GPUs:

-```console
+```bash
HF_TOKEN="your-huggingface-token" \
sky launch serving.yaml \
--gpus A100:8 \

@@ -159,7 +159,7 @@ SkyPilot can scale up the service to multiple service replicas with built-in aut

Start serving the Llama-3 8B model on multiple replicas:

-```console
+```bash
HF_TOKEN="your-huggingface-token" \
sky serve up -n vllm serving.yaml \
--env HF_TOKEN
@@ -167,7 +167,7 @@ HF_TOKEN="your-huggingface-token" \

Wait until the service is ready:

-```console
+```bash
watch -n10 sky serve status vllm
```
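Once the status shows the service as ready, the served endpoint can be queried directly. A small sketch, reusing the `sky serve status --endpoint` query that appears later in this doc:

```bash
# Resolve the service endpoint and list the models it serves.
ENDPOINT=$(sky serve status --endpoint vllm)
curl http://$ENDPOINT/v1/models
```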
@@ -271,13 +271,13 @@ This will scale the service up to when the QPS exceeds 2 for each replica.

To update the service with the new config:

-```console
+```bash
HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN
```

To stop the service:

-```console
+```bash
sky serve down vllm
```

@@ -317,7 +317,7 @@ It is also possible to access the Llama-3 service with a separate GUI frontend,

1. Start the chat web UI:

-```console
+```bash
sky launch \
-c gui ./gui.yaml \
--env ENDPOINT=$(sky serve status --endpoint vllm)
@@ -15,13 +15,13 @@ It can be quickly integrated with vLLM as a backend API server, enabling powerfu

- Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
vllm serve qwen/Qwen1.5-0.5B-Chat
```

- Install streamlit and openai:

-```console
+```bash
pip install streamlit openai
```

@@ -29,7 +29,7 @@ pip install streamlit openai

- Start the Streamlit web UI and start chatting:

-```console
+```bash
streamlit run streamlit_openai_chatbot_webserver.py

# or specify the VLLM_API_BASE or VLLM_API_KEY
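The comment in the hunk above is cut off before showing how those variables are passed; one plausible way, shown only as a sketch (the URL assumes the vLLM server from the first step on its default port):

```bash
# Point the chatbot at a specific vLLM endpoint via the environment variable named in the comment above.
VLLM_API_BASE="http://localhost:8000/v1" \
streamlit run streamlit_openai_chatbot_webserver.py
```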
@@ -7,7 +7,7 @@ vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-sta

To install Llama Stack, run

-```console
+```bash
pip install llama-stack -q
```

@@ -115,7 +115,7 @@ Next, start the vLLM server as a Kubernetes Deployment and Service:

We can verify that the vLLM server has started successfully via the logs (this might take a couple of minutes to download the model):

-```console
+```bash
kubectl logs -l app.kubernetes.io/name=vllm
...
INFO: Started server process [1]
@@ -358,14 +358,14 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Apply the deployment and service configurations using `kubectl apply -f <filename>`:

-```console
+```bash
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
```

To test the deployment, run the following `curl` command:

-```console
+```bash
curl http://mistral-7b.default.svc.cluster.local/v1/completions \
-H "Content-Type: application/json" \
-d '{
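The JSON body of that request is cut off by the hunk boundary; filled out, such a call would look roughly like the following sketch (the model name is hypothetical and must match the model the Deployment actually serves):

```bash
# Hypothetical completed request; adjust "model" to the model configured in deployment.yaml.
curl http://mistral-7b.default.svc.cluster.local/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
```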
@@ -11,13 +11,13 @@ This document shows how to launch multiple vLLM serving containers and use Nginx

This guide assumes that you have just cloned the vLLM project and you're currently in the vllm root directory.

-```console
+```bash
export vllm_root=`pwd`
```

Create a file named `Dockerfile.nginx`:

-```console
+```dockerfile
FROM nginx:latest
RUN rm /etc/nginx/conf.d/default.conf
EXPOSE 80

@@ -26,7 +26,7 @@ CMD ["nginx", "-g", "daemon off;"]

Build the container:

-```console
+```bash
docker build . -f Dockerfile.nginx --tag nginx-lb
```
@@ -60,14 +60,14 @@ Create a file named `nginx_conf/nginx.conf`. Note that you can add as many serve

## Build vLLM Container

-```console
+```bash
cd $vllm_root
docker build -f docker/Dockerfile . --tag vllm
```

If you are behind a proxy, you can pass the proxy settings to the docker build command as shown below:

-```console
+```bash
cd $vllm_root
docker build \
-f docker/Dockerfile . \
@@ -80,7 +80,7 @@ docker build \

## Create Docker Network

-```console
+```bash
docker network create vllm_nginx
```

@@ -129,7 +129,7 @@ Notes:

## Launch Nginx

-```console
+```bash
docker run \
-itd \
-p 8000:80 \
@@ -142,7 +142,7 @@ docker run \

## Verify That vLLM Servers Are Ready

-```console
+```bash
docker logs vllm0 | grep Uvicorn
docker logs vllm1 | grep Uvicorn
```
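Since the Nginx container above publishes port 8000 on the host, requests routed through the load balancer can also be checked end to end. A minimal sketch, assuming both backends have finished loading their model:

```bash
# A request through the Nginx load balancer; it should be proxied to one of the vLLM containers.
curl http://localhost:8000/v1/models
```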