Distributed Inference and Serving

vLLM supports distributed tensor-parallel inference and serving. Currently, we support Megatron-LM’s tensor parallel algorithm. We manage the distributed runtime with Ray. To run distributed inference, install Ray with:

  $ pip install ray

To run multi-GPU inference with the LLM class, set the tensor_parallel_size argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:

  from vllm import LLM
  llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
  output = llm.generate("San Francisco is a")

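The call returns one result per prompt, and the generated text can be read from each result's completions. A minimal sketch of inspecting the output of the example above (the sampling settings here are illustrative, not defaults):

  from vllm import LLM, SamplingParams

  llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
  sampling = SamplingParams(temperature=0.8, max_tokens=64)

  # generate returns one result per prompt; each result holds the prompt
  # and its generated completions.
  for result in llm.generate(["San Francisco is a"], sampling):
      print(result.prompt, result.outputs[0].text)
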
To run multi-GPU serving, pass in the --tensor-parallel-size argument when starting the server. For example, to run the API server on 4 GPUs:

  $ python -m vllm.entrypoints.api_server \
        --model facebook/opt-13b \
        --tensor-parallel-size 4

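Once the server is up, you can send it requests from a client. A small sketch, assuming the server is reachable at its default address (localhost:8000) and exposes the demo /generate endpoint:

  import requests

  # Assumes the API server started above listens on localhost:8000 and
  # accepts a prompt plus sampling parameters on /generate.
  response = requests.post(
      "http://localhost:8000/generate",
      json={"prompt": "San Francisco is a", "max_tokens": 64, "temperature": 0.8},
  )
  print(response.json())
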
To scale vLLM beyond a single machine, start a Ray runtime via CLI before running vLLM:

  $ # On head node
  $ ray start --head

  $ # On worker nodes
  $ ray start --address=<ray-head-address>

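Before launching vLLM, you can check from the head node that all worker GPUs have joined the cluster. A short sketch using Ray's resource view (optional; not required by vLLM):

  import ray

  # Connect to the Ray runtime started above; "auto" resolves to the
  # local head node's address.
  ray.init(address="auto")

  # Should report the total number of GPUs across all machines.
  print(ray.cluster_resources().get("GPU", 0))
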
After that, you can run inference and serving on multiple machines by launching the vLLM process on the head node and setting tensor_parallel_size to the total number of GPUs across all machines.
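
For example, with two machines of four GPUs each, launch on the head node with tensor_parallel_size set to 8; Ray places the tensor-parallel workers across both machines (the model and GPU counts below are illustrative):

  from vllm import LLM

  # Two machines with 4 GPUs each -> tensor_parallel_size=8 in total.
  # Run this on the head node; Ray schedules workers on the worker nodes too.
  llm = LLM("facebook/opt-13b", tensor_parallel_size=8)
  output = llm.generate("San Francisco is a")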