Deploying with dstack

vLLM can be run on a cloud-based GPU machine with dstack, an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, a gateway, and GPU quotas in your cloud environment.

To install the dstack client and start the dstack server, run:

    $ pip install "dstack[all]"
    $ dstack server

Next, to configure your dstack project, run:

    $ mkdir -p vllm-dstack
    $ cd vllm-dstack
    $ dstack init

Next, to provision a VM instance with the LLM of your choice (NousResearch/Llama-2-7b-chat-hf in this example), create the following serve.dstack.yml file for the dstack Service:

    type: service
    python: "3.11"
    env:
      - MODEL=NousResearch/Llama-2-7b-chat-hf
    port: 8000
    resources:
      gpu: 24GB
    commands:
      - pip install vllm
      - vllm serve $MODEL --port 8000
    model:
      format: openai
      type: chat
      name: NousResearch/Llama-2-7b-chat-hf
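
The model ID appears twice in the configuration above: once in `env` and once under `model.name`. When trying out different models, it may be convenient to generate the file from a single value so the two stay in agreement. The following is a minimal sketch using only the Python standard library; `render_service_config` and `SERVICE_TEMPLATE` are hypothetical names for illustration and are not part of dstack or vLLM.

```python
from string import Template

# Template mirroring the serve.dstack.yml shown above.
# "$$MODEL" is an escaped dollar sign, so the output keeps the literal "$MODEL"
# that the shell expands inside the service container.
SERVICE_TEMPLATE = Template("""\
type: service
python: "3.11"
env:
  - MODEL=$model
port: 8000
resources:
  gpu: $gpu
commands:
  - pip install vllm
  - vllm serve $$MODEL --port 8000
model:
  format: openai
  type: chat
  name: $model
""")


def render_service_config(model: str, gpu: str = "24GB") -> str:
    """Return the service configuration text for one model ID (hypothetical helper)."""
    return SERVICE_TEMPLATE.substitute(model=model, gpu=gpu)


if __name__ == "__main__":
    # Print the configuration; redirect the output to serve.dstack.yml to use it.
    print(render_service_config("NousResearch/Llama-2-7b-chat-hf"))
```

Keeping the model ID in one place avoids a mismatch between the model that vLLM serves and the name that clients are expected to request.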

Then, run the following CLI command to provision the service:

    $ dstack run . -f serve.dstack.yml
    Getting run plan...
     Configuration  serve.dstack.yml
     Project        deep-diver-main
     User           deep-diver
     Min resources  2..xCPU, 8GB.., 1xGPU (24GB)
     Max price      -
     Max duration   -
     Spot policy    auto
     Retry policy   no

     #  BACKEND  REGION       INSTANCE       RESOURCES                               SPOT  PRICE
     1  gcp      us-central1  g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
     2  gcp      us-east1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
     3  gcp      us-west1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
        ...
     Shown 3 of 193 offers, $5.876 max

    Continue? [y/n]: y
    Submitting run...
    Launching spicy-treefrog-1 (pulling)
    spicy-treefrog-1 provisioning completed (running)
    Service is published at ...

Once provisioning completes, you can interact with the model using the OpenAI SDK:

    from openai import OpenAI

    client = OpenAI(
        base_url="https://gateway.<gateway domain>",
        api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
    )

    completion = client.chat.completions.create(
        model="NousResearch/Llama-2-7b-chat-hf",
        messages=[
            {
                "role": "user",
                "content": "Compose a poem that explains the concept of recursion in programming.",
            }
        ]
    )

    print(completion.choices[0].message.content)
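
To make clearer what the SDK call above sends over the wire, the sketch below builds the equivalent HTTP request with only the Python standard library. It assumes the same conventions the OpenAI SDK follows: the request is a POST to `<base_url>/chat/completions` with the token in a Bearer `Authorization` header. `build_chat_request` is a hypothetical helper, and `gateway.example.com` and `my-token` are placeholders, not real endpoints or credentials.

```python
import json
import urllib.request


def build_chat_request(base_url: str, token: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request.

    Hypothetical helper for illustration; not part of dstack, vLLM, or the OpenAI SDK.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; substitute your own gateway domain and dstack token.
    request = build_chat_request(
        "https://gateway.example.com",
        "my-token",
        "NousResearch/Llama-2-7b-chat-hf",
        "Compose a poem that explains the concept of recursion in programming.",
    )
    print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; in an OpenAI-style response the generated text is found under `choices[0].message.content` of the returned JSON, which is the same field the SDK example prints.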

Note

dstack automatically handles authentication on the gateway using dstack’s tokens. If you don’t want to configure a gateway, you can provision a dstack Task instead of a Service; note that a Task is intended for development purposes only. For more hands-on material on serving vLLM with dstack, check out this repository.