Deploying with Cerebrium

vLLM can be run on a cloud-based GPU machine with Cerebrium, a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI-based applications.

To install the Cerebrium client and log in, run:

    $ pip install cerebrium
    $ cerebrium login

Next, create your Cerebrium project by running:

    $ cerebrium init vllm-project

Next, to install the required packages, add the following to your cerebrium.toml:

    [cerebrium.deployment]
    docker_base_image_url = "nvidia/cuda:12.1.1-runtime-ubuntu22.04"

    [cerebrium.dependencies.pip]
    vllm = "latest"

Next, add the code that handles inference for the LLM of your choice (mistralai/Mistral-7B-Instruct-v0.1 in this example). Add the following to your main.py:

    from vllm import LLM, SamplingParams

    # Load the model once at import time, not on every call to run().
    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")


    def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
        sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
        outputs = llm.generate(prompts, sampling_params)

        # Collect each prompt together with its first generated completion.
        results = []
        for output in outputs:
            prompt = output.prompt
            generated_text = output.outputs[0].text
            results.append({"prompt": prompt, "generated_text": generated_text})

        return {"results": results}
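The result-shaping loop in `run` can be checked without a GPU or a model download. The following is a minimal sketch, not part of the Cerebrium or vLLM APIs: `shape_results`, `StubCompletion`, and `StubRequestOutput` are hypothetical names, and the stubs model only the two attributes (`prompt` and `outputs[0].text`) that `run` reads.

```python
from dataclasses import dataclass


@dataclass
class StubCompletion:
    """Hypothetical stand-in for one completion; only `text` is modeled."""
    text: str


@dataclass
class StubRequestOutput:
    """Hypothetical stand-in for one request output; only the fields `run` reads."""
    prompt: str
    outputs: list


def shape_results(outputs) -> dict:
    """Same loop as in `run`: pair each prompt with its first completion."""
    results = []
    for output in outputs:
        results.append(
            {"prompt": output.prompt, "generated_text": output.outputs[0].text}
        )
    return {"results": results}


stub_outputs = [
    StubRequestOutput("The capital of France is", [StubCompletion(" Paris.\n")])
]
shaped = shape_results(stub_outputs)
```

Factoring the loop out this way is only one option; it keeps the part that needs a GPU (`llm.generate`) separate from the part that can be tested anywhere.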

Then, run the following command to deploy it to the cloud:

    $ cerebrium deploy

If the deployment succeeds, Cerebrium returns a curl command that you can use to run inference against the endpoint. Remember to end the URL with the name of the function you are calling (in our case, /run):

    curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
      -H 'Content-Type: application/json' \
      -H 'Authorization: <JWT TOKEN>' \
      --data '{
        "prompts": [
          "Hello, my name is",
          "The president of the United States is",
          "The capital of France is",
          "The future of AI is"
        ]
      }'
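The same request can be made from Python. Below is a minimal sketch using only the standard library; `build_request` and `call_endpoint` are hypothetical helper names, and `ENDPOINT` and `TOKEN` are the placeholders from the curl command, to be replaced with the values printed by `cerebrium deploy`.

```python
import json
import urllib.request

# Placeholders copied from the curl command; substitute your own values.
ENDPOINT = "https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run"
TOKEN = "<JWT TOKEN>"


def build_request(endpoint: str, token: str, prompts: list) -> urllib.request.Request:
    """Build the same POST request as the curl command, without sending it."""
    body = json.dumps({"prompts": prompts}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json", "Authorization": token},
        method="POST",
    )


def call_endpoint(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request(
    ENDPOINT, TOKEN, ["Hello, my name is", "The capital of France is"]
)
# call_endpoint(request) would perform the network call. It is not invoked
# here because the URL and token above are placeholders.
```

Since `run` also accepts `temperature` and `top_p`, those could presumably be added as extra keys in the JSON body alongside `prompts`; that is an assumption based on how `prompts` is passed, not something confirmed here.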

You should get a response like:

    {
      "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
      "result": {
        "results": [
          {
            "prompt": "Hello, my name is",
            "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
          },
          {
            "prompt": "The president of the United States is",
            "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
          },
          {
            "prompt": "The capital of France is",
            "generated_text": " Paris.\n"
          },
          {
            "prompt": "The future of AI is",
            "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
          }
        ]
      },
      "run_time_ms": 152.53663063049316
    }
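Once the response body is decoded, the generated text sits under the outer `result` key, inside the dictionary that `run` returned. The following is a minimal sketch of pulling it out; `extract_generations` is a hypothetical helper, and the sample payload is a shortened copy of the response above.

```python
def extract_generations(response: dict) -> dict:
    """Map each prompt to its generated text from a decoded response body."""
    payload = response.get("result", {})
    # The inner key mirrors the dict returned by `run`; accept either spelling.
    items = payload.get("results") or payload.get("result") or []
    return {item["prompt"]: item["generated_text"] for item in items}


sample = {
    "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
    "result": {
        "results": [
            {"prompt": "The capital of France is", "generated_text": " Paris.\n"}
        ]
    },
    "run_time_ms": 152.53663063049316,
}

generations = extract_generations(sample)
print(generations["The capital of France is"].strip())  # prints: Paris.
```

Note that the generated text keeps its leading space and trailing newline, so callers may want to strip it before display.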

You now have an autoscaling endpoint where you only pay for the compute you use!