Serving with Langchain
vLLM is also available via Langchain.
To install langchain, run
$ pip install langchain langchain_community -q
To run inference on a single GPU or on multiple GPUs, use the VLLM class from langchain_community:
from langchain_community.llms import VLLM

llm = VLLM(
    model="mosaicml/mpt-7b",
    trust_remote_code=True,  # mandatory for hf models
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    temperature=0.8,
    # tensor_parallel_size=...  # for distributed inference
)

# invoke() sends a single prompt and returns the generated text
print(llm.invoke("What is the capital of France?"))
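The commented-out tensor_parallel_size argument above is the switch for multi-GPU inference. The following is a minimal sketch of one way to set it only when more than one GPU is requested. The helper names build_vllm_kwargs and make_llm, and the num_gpus parameter, are hypothetical and not part of vLLM or Langchain; the sampling values are copied from the example above.

```python
from typing import Any, Dict


def build_vllm_kwargs(model: str, num_gpus: int = 1) -> Dict[str, Any]:
    """Collect keyword arguments for langchain_community.llms.VLLM.

    Hypothetical helper for illustration; num_gpus is an assumed parameter name.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    kwargs: Dict[str, Any] = {
        "model": model,
        "trust_remote_code": True,  # mandatory for hf models
        "max_new_tokens": 128,
        "top_k": 10,
        "top_p": 0.95,
        "temperature": 0.8,
    }
    if num_gpus > 1:
        # Only pass tensor_parallel_size for distributed inference.
        kwargs["tensor_parallel_size"] = num_gpus
    return kwargs


def make_llm(model: str, num_gpus: int = 1):
    """Build the Langchain VLLM wrapper (hypothetical helper)."""
    # Imported lazily so the kwargs helper can be used without vLLM installed.
    from langchain_community.llms import VLLM

    return VLLM(**build_vllm_kwargs(model, num_gpus))
```

Under these assumptions, make_llm("mosaicml/mpt-7b", num_gpus=4) would build the same model as above with tensor_parallel_size=4, while the default num_gpus=1 leaves the argument unset.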
Please refer to this Tutorial for more details.