Deploying and scaling up with SkyPilot


vLLM can be run and scaled to multiple service replicas on clouds and Kubernetes with SkyPilot, an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc., can be found in the SkyPilot AI gallery.

Prerequisites

  • Go to the HuggingFace model page and request access to the model meta-llama/Meta-Llama-3-8B-Instruct (your access token will be passed to SkyPilot later; see the example after this list).

  • Check that you have installed SkyPilot (docs).

  • Check that sky check shows clouds or Kubernetes are enabled.

    pip install skypilot-nightly
    sky check
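
Optionally, export your Hugging Face access token in your shell so that the launch commands below can pick it up via --env HF_TOKEN; alternatively, you can paste the token into the YAML directly (a small convenience step, not required):

    # Export the token requested in the first prerequisite so SkyPilot can forward it.
    export HF_TOKEN="your-huggingface-token"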

Run on a single instance

See the vLLM SkyPilot YAML for serving, serving.yaml.

    resources:
      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}  # We can use cheaper accelerators for 8B model.
      use_spot: True
      disk_size: 512  # Ensure model checkpoints can fit.
      disk_tier: best
      ports: 8081  # Expose to internet traffic.
    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm
      pip install vllm==0.4.0.post1
      # Install Gradio for web UI.
      pip install gradio openai
      pip install flash-attn==2.5.7
    run: |
      conda activate vllm
      echo 'Starting vllm api server...'
      python -u -m vllm.entrypoints.openai.api_server \
        --port 8081 \
        --model $MODEL_NAME \
        --trust-remote-code \
        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
        2>&1 | tee api_server.log &
      echo 'Waiting for vllm api server to start...'
      while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
      echo 'Starting gradio server...'
      git clone https://github.com/vllm-project/vllm.git || true
      python vllm/examples/gradio_openai_chatbot_webserver.py \
        -m $MODEL_NAME \
        --port 8811 \
        --model-url http://localhost:8081/v1 \
        --stop-token-ids 128009,128001

Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, …):

    HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN

Check the output of the command. There will be a shareable gradio link (like the last line of the following). Open it in your browser to use the Llama model for text completion.

    (task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live
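
You can also query the OpenAI-compatible API server directly. A minimal sketch, assuming the cluster was launched with a name (e.g. sky launch -c vllm-llama3 serving.yaml --env HF_TOKEN) so its IP can be looked up; the cluster name vllm-llama3 here is only illustrative:

    # Look up the head node IP of the (hypothetically named) cluster.
    IP=$(sky status --ip vllm-llama3)
    # List the models served on the exposed port of the OpenAI-compatible server.
    curl http://$IP:8081/v1/models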

Optional: Serve the 70B model instead of the default 8B and use more GPUs:

    HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
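
When you are finished experimenting, stop or tear down the cluster so it does not keep accruing cost. A short sketch, again assuming a named cluster (otherwise, find the autogenerated name with sky status):

    sky status              # list running clusters and their names
    sky stop vllm-llama3    # stop the cluster but keep its disk
    sky down vllm-llama3    # or terminate the cluster entirely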

Scale up to multiple replicas

SkyPilot can scale the service up to multiple replicas with built-in autoscaling, load balancing, and fault tolerance. To do so, add a service section to the YAML file.

    service:
      replicas: 2
      # An actual request for readiness probe.
      readiness_probe:
        path: /v1/chat/completions
        post_data:
          model: $MODEL_NAME
          messages:
            - role: user
              content: Hello! What is your name?
          max_tokens: 1
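
For intuition, the readiness probe above makes SkyPilot issue a real chat-completion request against each replica before marking it READY. It is roughly equivalent to the sketch below; the actual probe is sent by the serve controller, and the replica address is filled in automatically:

    # Roughly what the readiness probe sends to each replica (replica address is illustrative).
    curl http://<replica-ip>:8081/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct",
           "messages": [{"role": "user", "content": "Hello! What is your name?"}],
           "max_tokens": 1}'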

The full recipe YAML is shown below:

    service:
      replicas: 2
      # An actual request for readiness probe.
      readiness_probe:
        path: /v1/chat/completions
        post_data:
          model: $MODEL_NAME
          messages:
            - role: user
              content: Hello! What is your name?
          max_tokens: 1
    resources:
      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}  # We can use cheaper accelerators for 8B model.
      use_spot: True
      disk_size: 512  # Ensure model checkpoints can fit.
      disk_tier: best
      ports: 8081  # Expose to internet traffic.
    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm
      pip install vllm==0.4.0.post1
      # Install Gradio for web UI.
      pip install gradio openai
      pip install flash-attn==2.5.7
    run: |
      conda activate vllm
      echo 'Starting vllm api server...'
      python -u -m vllm.entrypoints.openai.api_server \
        --port 8081 \
        --model $MODEL_NAME \
        --trust-remote-code \
        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
        2>&1 | tee api_server.log &
      echo 'Waiting for vllm api server to start...'
      while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
      echo 'Starting gradio server...'
      git clone https://github.com/vllm-project/vllm.git || true
      python vllm/examples/gradio_openai_chatbot_webserver.py \
        -m $MODEL_NAME \
        --port 8811 \
        --model-url http://localhost:8081/v1 \
        --stop-token-ids 128009,128001

Start serving the Llama-3 8B model on multiple replicas:

    HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN

Wait until the service is ready:

    watch -n10 sky serve status vllm

Example outputs:

    Services
    NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
    vllm  1        35s     READY   2/2       xx.yy.zz.100:30001

    Service Replicas
    SERVICE_NAME  ID  VERSION  IP            LAUNCHED     RESOURCES          STATUS  REGION
    vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP({'L4': 1})  READY   us-east4
    vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP({'L4': 1})  READY   us-east4
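
If a replica stays in a non-READY state (e.g. PROVISIONING or STARTING) for a long time, you can inspect it. A sketch using SkyPilot's serve commands:

    sky serve status vllm   # show the service and per-replica status
    sky serve logs vllm 1   # stream the logs of replica 1 to diagnose startup issues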

After the service is READY, you can find a single endpoint for the service and access it as follows:

    ENDPOINT=$(sky serve status --endpoint 8081 vllm)
    curl -L http://$ENDPOINT/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [
          {
            "role": "system",
            "content": "You are a helpful assistant."
          },
          {
            "role": "user",
            "content": "Who are you?"
          }
        ],
        "stop_token_ids": [128009, 128001]
      }'

To enable autoscaling, you can specify additional configs in the service section:

    service:
      replica_policy:
        min_replicas: 0
        max_replicas: 3
        target_qps_per_replica: 2

This will scale the service up when the QPS exceeds 2 per replica, up to 3 replicas, and scale it back down when traffic drops.
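
After editing the service section, you can roll the change out to the running service instead of redeploying from scratch. A sketch, assuming HF_TOKEN is exported as before:

    # Apply the updated serving.yaml (including the new replica_policy) to the running service.
    sky serve update vllm serving.yaml --env HF_TOKEN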

Optional: Connect a GUI to the endpoint

It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI are load-balanced across replicas.

The full GUI YAML (gui.yaml) is shown below:

    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct
      ENDPOINT: x.x.x.x:3031  # Address of the API server running vllm.
    resources:
      cpus: 2
    setup: |
      conda activate vllm
      if [ $? -ne 0 ]; then
        conda create -n vllm python=3.10 -y
        conda activate vllm
      fi
      # Install Gradio for web UI.
      pip install gradio openai
    run: |
      conda activate vllm
      export PATH=$PATH:/sbin
      WORKER_IP=$(hostname -I | cut -d' ' -f1)
      CONTROLLER_PORT=21001
      WORKER_PORT=21002
      echo 'Starting gradio server...'
      git clone https://github.com/vllm-project/vllm.git || true
      python vllm/examples/gradio_openai_chatbot_webserver.py \
        -m $MODEL_NAME \
        --port 8811 \
        --model-url http://$ENDPOINT/v1 \
        --stop-token-ids 128009,128001 | tee ~/gradio.log
  1. Start the chat web UI:

         sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)

  2. Then, we can access the GUI at the returned gradio link:

         | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
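
When you are done, clean everything up so no resources are left running. A short sketch:

    sky down gui          # terminate the GUI cluster launched above
    sky serve down vllm   # tear down the service and all of its replicas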