Deploying and scaling up with SkyPilot

vLLM can be run and scaled to multiple service replicas on clouds and Kubernetes with SkyPilot, an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3 and Mixtral, can be found in the SkyPilot AI gallery.
Prerequisites
- Go to the HuggingFace model page and request access to the model meta-llama/Meta-Llama-3-8B-Instruct.
- Check that you have installed SkyPilot (docs).
- Check that sky check shows clouds or Kubernetes are enabled.

pip install skypilot-nightly
sky check
Run on a single instance
See the vLLM SkyPilot YAML for serving, serving.yaml.
resources:
    accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
    use_spot: True
    disk_size: 512  # Ensure model checkpoints can fit.
    disk_tier: best
    ports: 8081  # Expose to internet traffic.

envs:
    MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
    HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

setup: |
    conda create -n vllm python=3.10 -y
    conda activate vllm

    pip install vllm==0.4.0.post1
    # Install Gradio for web UI.
    pip install gradio openai
    pip install flash-attn==2.5.7

run: |
    conda activate vllm
    echo 'Starting vllm api server...'
    python -u -m vllm.entrypoints.openai.api_server \
        --port 8081 \
        --model $MODEL_NAME \
        --trust-remote-code \
        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
        2>&1 | tee api_server.log &

    echo 'Waiting for vllm api server to start...'
    while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done

    echo 'Starting gradio server...'
    git clone https://github.com/vllm-project/vllm.git || true
    python vllm/examples/gradio_openai_chatbot_webserver.py \
        -m $MODEL_NAME \
        --port 8811 \
        --model-url http://localhost:8081/v1 \
        --stop-token-ids 128009,128001
Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, …):
HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN
Check the output of the command. It will contain a shareable Gradio link (like the last line of the following). Open it in your browser to use the Llama-3 model for text completion.
(task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live
Optional: Serve the 70B model instead of the default 8B and use more GPUs:
HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
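As a rough sanity check on the GPU choices above, the following sketch estimates the memory needed for the model weights alone. It assumes 16-bit weights (2 bytes per parameter) and nominal parameter counts of 8 and 70 billion. It ignores the KV cache, activations, and framework overhead, so real requirements are higher. The function name and the per-GPU memory figures are illustrative assumptions, not values taken from SkyPilot or vLLM.

```python
def weight_memory_gb(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params_billion * 1e9 * bytes_per_param / 1e9


llama3_8b = weight_memory_gb(8)    # 8B parameters at 2 bytes each
llama3_70b = weight_memory_gb(70)  # 70B parameters at 2 bytes each

# Nominal per-GPU memory in GB (assumed figures for illustration).
L4_GB = 24
A100_GB = 40

print(f"8B weights:  {llama3_8b:.0f} GB vs. one L4 with {L4_GB} GB")
print(f"70B weights: {llama3_70b:.0f} GB vs. 8x A100 with {8 * A100_GB} GB in total")
```

Under these assumptions the 8B weights fit on a single 24 GB GPU, while the 70B weights have to be sharded across several GPUs, which is what --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE in the run section does.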
Scale up to multiple replicas
SkyPilot can scale up the service to multiple service replicas with built-in autoscaling, load-balancing and fault-tolerance. To do so, add a service section to the YAML file.
service:
    replicas: 2
    # An actual request for readiness probe.
    readiness_probe:
        path: /v1/chat/completions
        post_data:
            model: $MODEL_NAME
            messages:
                - role: user
                  content: Hello! What is your name?
            max_tokens: 1
Full recipe YAML:
service:
    replicas: 2
    # An actual request for readiness probe.
    readiness_probe:
        path: /v1/chat/completions
        post_data:
            model: $MODEL_NAME
            messages:
                - role: user
                  content: Hello! What is your name?
            max_tokens: 1

resources:
    accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
    use_spot: True
    disk_size: 512  # Ensure model checkpoints can fit.
    disk_tier: best
    ports: 8081  # Expose to internet traffic.

envs:
    MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
    HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

setup: |
    conda create -n vllm python=3.10 -y
    conda activate vllm

    pip install vllm==0.4.0.post1
    # Install Gradio for web UI.
    pip install gradio openai
    pip install flash-attn==2.5.7

run: |
    conda activate vllm
    echo 'Starting vllm api server...'
    python -u -m vllm.entrypoints.openai.api_server \
        --port 8081 \
        --model $MODEL_NAME \
        --trust-remote-code \
        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
        2>&1 | tee api_server.log &

    echo 'Waiting for vllm api server to start...'
    while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done

    echo 'Starting gradio server...'
    git clone https://github.com/vllm-project/vllm.git || true
    python vllm/examples/gradio_openai_chatbot_webserver.py \
        -m $MODEL_NAME \
        --port 8811 \
        --model-url http://localhost:8081/v1 \
        --stop-token-ids 128009,128001
Start serving the Llama-3 8B model on multiple replicas:
HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN
Wait until the service is ready:
watch -n10 sky serve status vllm
Example outputs:
Services
NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
vllm  1        35s     READY   2/2       xx.yy.zz.100:30001

Service Replicas
SERVICE_NAME  ID  VERSION  IP            LAUNCHED     RESOURCES          STATUS  REGION
vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP({'L4': 1})  READY   us-east4
vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP({'L4': 1})  READY   us-east4
Once the service is READY, it exposes a single endpoint through which you can access it:
ENDPOINT=$(sky serve status --endpoint 8081 vllm)
curl -L http://$ENDPOINT/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "Who are you?"
            }
        ],
        "stop_token_ids": [128009, 128001]
    }'
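The same request can be issued from Python. The sketch below uses only the standard library and mirrors the curl payload. The helper name build_chat_request is hypothetical, and the script only contacts the service when an ENDPOINT environment variable is set. Note that urllib may not re-send a POST across every kind of redirect the way curl -L does, so treat this as a starting point rather than a drop-in replacement.

```python
import json
import os
import urllib.request


def build_chat_request(endpoint: str, model: str, user_message: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "stop_token_ids": [128009, 128001],
    }
    return urllib.request.Request(
        url=f"http://{endpoint}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Only contact the service when ENDPOINT is set, e.g.
#   export ENDPOINT=$(sky serve status --endpoint 8081 vllm)
endpoint = os.environ.get("ENDPOINT")
if endpoint:
    request = build_chat_request(
        endpoint, "meta-llama/Meta-Llama-3-8B-Instruct", "Who are you?"
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        reply = json.load(response)
    # The OpenAI-compatible response carries the text in choices[0].message.content.
    print(reply["choices"][0]["message"]["content"])
```

Keeping request construction separate from sending makes the payload easy to inspect before any traffic reaches the replicas.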
To enable autoscaling, you can specify additional configs in the service section:
service:
    replica_policy:
        min_replicas: 0
        max_replicas: 3
        target_qps_per_replica: 2
This will scale the service up when the QPS exceeds 2 for each replica.
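To make the arithmetic concrete, here is a simplified model of QPS-based autoscaling: the desired replica count is the observed QPS divided by target_qps_per_replica, rounded up and clamped to the range between min_replicas and max_replicas. This is an illustrative assumption, not SkyPilot's actual autoscaler, which may also apply delays or smoothing before it scales. The function name desired_replicas is hypothetical.

```python
import math


def desired_replicas(observed_qps: float, target_qps_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Simplified model: enough replicas that each handles at most the target QPS."""
    needed = math.ceil(observed_qps / target_qps_per_replica)
    # Clamp to the bounds given in replica_policy.
    return max(min_replicas, min(max_replicas, needed))


# Values from the replica_policy above: min 0, max 3, target 2 QPS per replica.
for qps in (0, 1.5, 3, 5, 20):
    replicas = desired_replicas(qps, target_qps_per_replica=2,
                                min_replicas=0, max_replicas=3)
    print(f"{qps} QPS -> {replicas} replica(s)")
```

In this model, traffic of 3 QPS calls for two replicas, and anything above 6 QPS is capped at max_replicas, so the per-replica load then exceeds the target.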
Optional: Connect a GUI to the endpoint
It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI are load-balanced across replicas.
Full GUI YAML:
envs:
    MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct
    ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm.

resources:
    cpus: 2

setup: |
    conda activate vllm
    if [ $? -ne 0 ]; then
        conda create -n vllm python=3.10 -y
        conda activate vllm
    fi

    # Install Gradio for web UI.
    pip install gradio openai

run: |
    conda activate vllm
    export PATH=$PATH:/sbin
    WORKER_IP=$(hostname -I | cut -d' ' -f1)
    CONTROLLER_PORT=21001
    WORKER_PORT=21002

    echo 'Starting gradio server...'
    git clone https://github.com/vllm-project/vllm.git || true
    python vllm/examples/gradio_openai_chatbot_webserver.py \
        -m $MODEL_NAME \
        --port 8811 \
        --model-url http://$ENDPOINT/v1 \
        --stop-token-ids 128009,128001 | tee ~/gradio.log
- Start the chat web UI:
sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
- Then, we can access the GUI at the returned gradio link:
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live