vLLM
An open-source library for high-throughput LLM inference and serving.
vLLM focuses on scalable serving, with features such as PagedAttention and CUDA support, as reflected in releases such as v0.20.0. It competes with Ollama, but prioritizes efficiency under high load rather than ease of local deployment. Its decelerating growth suggests that the inference runtime landscape is maturing.
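PagedAttention is generally described as storing the attention key-value cache in fixed-size blocks, much like pages of virtual memory, so that sequences share one pool and take a new block only when their last one fills up. The toy sketch below illustrates that bookkeeping idea only. It is not vLLM code: `ToyBlockAllocator`, `BLOCK_SIZE`, and the sequence ids are hypothetical names chosen for illustration, and real block sizes and eviction policies will differ.

```python
# Toy illustration of paged KV-cache bookkeeping (an assumption-laden sketch,
# not vLLM's implementation).
BLOCK_SIZE = 4  # tokens per block; hypothetical value for illustration


class ToyBlockAllocator:
    """Hands out fixed-size cache blocks from a shared free pool."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * BLOCK_SIZE:  # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free cache blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


allocator = ToyBlockAllocator(num_blocks=8)
for _ in range(5):
    allocator.append_token("a")  # 5 tokens span two blocks
for _ in range(2):
    allocator.append_token("b")  # 2 tokens fit in one block
```

Because blocks are claimed on demand and returned on release, memory is not reserved up front for each sequence's maximum length, which is the property usually credited for higher batch sizes under load.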
Category: project · Also: vllm-project/vllm · Mentioned in 2 Cortex outputs