Speculative decoding in vLLM

Warning

Please note that speculative decoding in vLLM is not yet optimized and may not reduce inter-token latency for every prompt dataset or set of sampling parameters. The work to optimize it is ongoing and can be followed in this issue.

This document shows how to use speculative decoding with vLLM. Speculative decoding is a technique that improves inter-token latency in memory-bound LLM inference.

Speculating with a draft model

The following code configures vLLM to use speculative decoding with a draft model, speculating 5 tokens at a time.

    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="facebook/opt-6.7b",
        tensor_parallel_size=1,
        # Smaller draft model that proposes tokens for the main model to verify.
        speculative_model="facebook/opt-125m",
        # Number of tokens proposed per speculation step.
        num_speculative_tokens=5,
        use_v2_block_manager=True,
    )
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
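To make "speculating 5 tokens at a time" more concrete, here is a minimal, framework-free sketch of one draft-then-verify step. It is an illustration only, not vLLM's implementation: the names `speculative_step`, `toy_target`, and `toy_draft` are hypothetical, the "models" are plain functions over token IDs, and acceptance is greedy (exact match). With sampling parameters such as `temperature=0.8`, real systems typically use a rejection-sampling acceptance rule instead.

```python
from typing import Callable, List

# A "model" here is any function mapping a token sequence to its next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int) -> List[int]:
    """Run one draft-then-verify step and return the newly accepted tokens."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(tokens + proposal))

    # 2. The target model checks each proposed position. A real engine would
    #    score all k positions in one batched forward pass; this loop is
    #    sequential only for readability.
    accepted: List[int] = []
    for guess in proposal:
        expected = target(tokens + accepted)
        if guess != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(guess)

    # 3. Every guess matched, so the target's next token is a bonus.
    accepted.append(target(tokens + accepted))
    return accepted


def toy_target(tokens: List[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except after a 3."""
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


partial = speculative_step(toy_target, toy_draft, [0], k=5)
full = speculative_step(toy_target, toy_draft, [4], k=5)
print(partial, full)
```

In this toy setup the output always equals what the target model alone would have produced. The benefit comes from accepting several tokens per target step when the draft guesses well, and shrinks when it guesses poorly.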

Speculating by matching n-grams in the prompt

The following code configures vLLM to use speculative decoding where proposals are generated by matching n-grams in the prompt. For more information, see this thread.

    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="facebook/opt-6.7b",
        tensor_parallel_size=1,
        # "[ngram]" selects prompt n-gram matching instead of a draft model.
        speculative_model="[ngram]",
        # Number of tokens proposed per speculation step.
        num_speculative_tokens=5,
        # Maximum n-gram size used when matching against the prompt.
        ngram_prompt_lookup_max=4,
        use_v2_block_manager=True,
    )
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
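The idea behind n-gram proposals can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not vLLM's actual code: `ngram_propose` is a hypothetical helper, `max_n` plays a role analogous to `ngram_prompt_lookup_max`, and `num_speculative` is analogous to `num_speculative_tokens`. The sketch looks for an earlier occurrence of the most recent tokens and proposes whatever followed that occurrence.

```python
from typing import List


def ngram_propose(tokens: List[int], max_n: int,
                  num_speculative: int) -> List[int]:
    """Propose tokens by finding an earlier occurrence of the current suffix."""
    # Try the longest suffix first, since longer matches are more specific.
    for n in range(min(max_n, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan earlier windows, most recent first, skipping the suffix itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                # start + n <= len(tokens) - 1, so at least one token follows.
                return tokens[start + n:start + n + num_speculative]
    return []  # no match: nothing to speculate in this step


# The suffix [1, 2, 3] also appears at the start, followed by 4, 5, 1, 2, 3.
proposal = ngram_propose([1, 2, 3, 4, 5, 1, 2, 3], max_n=4, num_speculative=5)
print(proposal)
```

Proposals of this kind cost no extra model forward pass, which is why the approach tends to suit workloads where the output repeats spans of the prompt, such as summarization or code editing.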

Resources for vLLM contributors