
vLLM-Pruna FAQ

Why is vLLM so popular for optimizing LLMs?
vLLM has become one of the most widely adopted inference engines because it delivers strong performance out of the box. When you load a model from Hugging Face into vLLM, it automatically applies two key improvements:

  • A custom, inference-optimized implementation of the Transformers architecture

  • A compilation feature that is enabled by default (with the option to opt out)
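
As a quick, non-authoritative illustration of that out-of-the-box path, the sketch below loads a Hugging Face model through vLLM's offline inference API; the model name and sampling settings are arbitrary examples, not Pruna recommendations.

    # Minimal vLLM sketch: load a Hugging Face model and generate text.
    # vLLM applies its optimized model implementation and compilation automatically.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # illustrative model choice
    params = SamplingParams(temperature=0.7, max_tokens=128)

    outputs = llm.generate(["Explain paged attention in one sentence."], params)
    print(outputs[0].outputs[0].text)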


What happens when I combine Pruna with vLLM?

Pruna adds an additional speed-up on top of vLLM’s optimizations:

  • +20% with Pruna Open-Source

  • +50% with Pruna Pro

This acceleration is independent of the model size: whether you’re running a 1B-parameter model or a 70B one, you should expect a measurable speed-up.

Importantly, this benefit is also independent of vLLM’s optional serving features (paged attention, continuous batching, chunked prefill, etc.), meaning any extra serving optimizations you enable in vLLM will stack with Pruna’s acceleration.

For a smooth start, Pruna provides ready-to-use notebooks and tutorials. You can refer to the official Pruna documentation for technical details.
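
As a rough, hedged sketch of what a Pruna optimization pass looks like in code (the quantizer name, model ID, and exact SmashConfig keys are assumptions; the notebooks and documentation above have the authoritative versions, including the vLLM serving step):

    # Hedged sketch: optimize a Hugging Face LLM with Pruna before serving it.
    # "hqq" and the model ID below are illustrative placeholders.
    from pruna import SmashConfig, smash
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

    smash_config = SmashConfig()
    smash_config["quantizer"] = "hqq"
    smashed_model = smash(model=model, smash_config=smash_config)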


Why not just use vLLM quantizers directly?

Our quantizers differ from vLLM’s implementation: we guarantee that quantization provides a speed-up, not just smaller weights. On top of that, Pruna's quantizers offer broader stability and compatibility: where vLLM quantizers may fail, Pruna keeps running.

Examples:

  • If you try HQQ on Llama-3-8B, vLLM throws an error due to the model size and kernel. With Pruna, it runs without issue.

  • If you want to use bitsandbytes, vLLM locks you to batch_size = 1. With Pruna, switching to another quantizer takes only 3 lines of code (see the sketch below).
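
To make the "3 lines" concrete, here is a hedged continuation of the earlier sketch; the quantizer name strings are assumptions, so check the Pruna documentation for the values your version supports.

    # Switching quantizers in Pruna: only the configuration changes.
    # Assumes `model`, `smash`, and `SmashConfig` from the earlier sketch;
    # quantizer names are illustrative placeholders.
    smash_config = SmashConfig()
    smash_config["quantizer"] = "hqq"  # e.g. instead of a bitsandbytes-based quantizer
    smashed_model = smash(model=model, smash_config=smash_config)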


Which quantizer should I use?
We recommend HIGGS as the default quantizer: it gives the best balance between throughput and latency.

For more advanced setups, you can use a dispatcher to dynamically route requests between copies of the same model that are optimized in different ways, depending on the use case (a sketch follows the note below):

  • HIGGS → when throughput is critical

  • HQQ → when latency is critical

(Note: This dispatcher concept isn’t specific to Pruna or vLLM; it’s a general serving pattern.)
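
A minimal, purely conceptual sketch of such a dispatcher (not a Pruna or vLLM API; the request format and "priority" field are assumptions for illustration):

    # Conceptual dispatcher: return whichever copy of the model was optimized
    # for the current request. Not a Pruna or vLLM API.
    def dispatch(request: dict, throughput_model, latency_model):
        # throughput_model: e.g. a HIGGS-quantized copy for batch/throughput jobs
        # latency_model:    e.g. an HQQ-quantized copy for interactive traffic
        if request.get("priority") == "latency":
            return latency_model
        return throughput_model

    # Hypothetical usage:
    # chosen = dispatch({"prompt": "...", "priority": "latency"}, higgs_model, hqq_model)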


Anything else to know?
Yes:

  • We’re continuously exploring new optimization options, so expect more updates soon.

  • Unlike vLLM, Pruna supports DiffusionLLMs, delivering 3–5× speed-ups compared to base models. You can try these on Replicate today, and we’re happy to help with configuration.