vLLM v0.7.2 Documentation

Getting Started

Source: vLLM, 2025-02-09 13:17:40
  • Installation
  • Quickstart
  • Examples
  • Troubleshooting
  • Frequently Asked Questions
This content is copyrighted by vLLM or its affiliates. To follow or sponsor the content or its related open-source projects, please visit vLLM.
  • Getting Started
    • Installation
      • GPU
      • CPU
      • Other AI accelerators
    • Quickstart
    • Examples
      • Offline Inference
        • AQLM Example
        • Arctic
        • Audio Language
        • Basic
        • Basic With Model Default Sampling
        • Chat
        • Chat With Tools
        • Classification
        • CLI
        • CPU Offload
        • Distributed
        • Embedding
        • Encoder Decoder
        • Florence2 Inference
        • GGUF Inference
        • LLM Engine Example
        • LoRA With Quantization Inference
        • MLPSpeculator
        • MultiLoRA Inference
        • Neuron
        • Neuron INT8 Quantization
        • Offline Inference with the OpenAI Batch file format
        • Pixtral
        • Prefix Caching
        • Profiling
        • vLLM TPU Profiling
        • Rlhf
        • Save Sharded State
        • Scoring
        • Simple Profiling
        • Structured Outputs
        • Torchrun Example
        • TPU
        • Vision Language
        • Vision Language Embedding
        • Vision Language Multi Image
        • Whisper
      • Online Serving
        • API Client
        • Helm Charts
        • Cohere Rerank Client
        • Disaggregated Prefill
        • Gradio OpenAI Chatbot Webserver
        • Gradio Webserver
        • Jinaai Rerank Client
        • OpenAI Chat Completion Client
        • OpenAI Chat Completion Client For Multimodal
        • OpenAI Chat Completion Client With Tools
        • OpenAI Chat Completion Structured Outputs
        • OpenAI Chat Completion With Reasoning
        • OpenAI Chat Completion With Reasoning Streaming
        • OpenAI Chat Embedding Client For Multimodal
        • OpenAI Completion Client
        • OpenAI Cross Encoder Score
        • OpenAI Embedding Client
        • OpenAI Pooling Client
        • Setup OpenTelemetry POC
        • Prometheus and Grafana
        • Run Cluster
        • Sagemaker-Entrypoint
      • Other
        • Logging Configuration
        • Tensorize vLLM Model
    • Troubleshooting
    • Frequently Asked Questions
  • Models
    • Generative Models
    • Pooling Models
    • List of Supported Models
    • Built-in Extensions
      • Loading models with Run:ai Model Streamer
      • Loading models with CoreWeave’s Tensorizer
  • Features
    • Quantization
      • Supported Hardware
      • AutoAWQ
      • BitsAndBytes
      • GGUF
      • INT4 W4A16
      • INT8 W8A8
      • FP8 W8A8
      • Quantized KV Cache
    • LoRA Adapters
    • Tool Calling
    • Reasoning Outputs
    • Structured Outputs
    • Automatic Prefix Caching
    • Disaggregated Prefilling (experimental)
    • Speculative Decoding
    • Compatibility Matrix
  • Inference and Serving
    • Offline Inference
    • OpenAI-Compatible Server
    • Multimodal Inputs
    • Distributed Inference and Serving
    • Production Metrics
    • Engine Arguments
    • Environment Variables
    • Usage Stats Collection
    • External Integrations
      • LangChain
      • LlamaIndex
  • Deployment
    • Using Docker
    • Using Kubernetes
    • Using Nginx
    • Using other frameworks
      • BentoML
      • Cerebrium
      • dstack
      • Helm
      • LWS
      • Modal
      • SkyPilot
      • NVIDIA Triton
    • External Integrations
      • KServe
      • KubeAI
      • Llama Stack
  • Performance
    • Optimization and Tuning
    • Benchmark Suites
  • Design Documents
    • Architecture Overview
    • Integration with HuggingFace
    • vLLM’s Plugin System
    • vLLM Paged Attention
    • Multi-Modal Data Processing
    • Automatic Prefix Caching
    • Python Multiprocessing
  • V1 Design Documents
    • Automatic Prefix Caching
  • Developer Guide
    • Contributing to vLLM
    • Profiling vLLM
    • Dockerfile
    • Adding a New Model
      • Implementing a Basic Model
      • Registering a Model to vLLM
      • Writing Unit Tests
      • Multi-Modal Support
    • Vulnerability Management
  • API Reference
    • Offline Inference
      • LLM Class
      • LLM Inputs
    • vLLM Engine
      • LLMEngine
      • AsyncLLMEngine
    • Inference Parameters
    • Multi-Modality
      • Input Definitions
      • Data Parsing
      • Data Processing
      • Memory Profiling
      • Registry
    • Model Development
      • Base Model Interfaces
      • Optional Interfaces
      • Model Adapters
  • Community
    • vLLM Blog
    • vLLM Meetups
    • Sponsors