
SmolLM2-135M-Instruct

Regarding latency, the 4-bit AWQ config offers the fastest inference at 62.2ms sync latency, roughly a 2× speedup over the base (129.6ms) and significantly lower than most of the other configs.

Regarding memory savings, 2-bit and 4-bit GPTQ configs reduce inference memory from 772MB to just 94MB, an 8.2× reduction, while staying close to baseline quality.

In terms of emissions, the AWQ 4-bit config also leads with the lowest CO₂ footprint at 0.000043, cutting emissions by ~55% compared to the base (0.000095).
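To sanity-check figures like these on your own hardware, the sketch below times a synchronized generation call, reads peak GPU memory, and records emissions with codecarbon. The model id, prompt, generation length, and the choice of codecarbon are assumptions for illustration, not the harness behind the numbers above.

import time

import torch
from codecarbon import EmissionsTracker
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model id and prompt, for illustration only.
MODEL_ID = "HuggingFaceTB/SmolLM2-135M-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).to("cuda").eval()
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")

# Warm-up so one-off kernel and cache setup is not counted in the timing.
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=32)

torch.cuda.reset_peak_memory_stats()
tracker = EmissionsTracker()
tracker.start()

# "Sync" latency: wait for all queued GPU work before and after reading the clock.
torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=32)
torch.cuda.synchronize()
latency_ms = (time.perf_counter() - start) * 1000

emissions = tracker.stop()  # kg CO2-eq, as reported by codecarbon
peak_mb = torch.cuda.max_memory_allocated() / 1024**2  # resident weights + activations

print(f"sync latency: {latency_ms:.1f} ms")
print(f"peak inference memory: {peak_mb:.0f} MB")
print(f"emissions: {emissions:.6f} kg CO2-eq")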

Try it on your setup:

smash_config["quantizer"] = "quanto"
smash_config["device"] = "cpu"
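For a fuller picture, here is a minimal end-to-end sketch built around those two lines, assuming the pruna package's SmashConfig/smash interface, the Hugging Face model id below, and that the smashed model can be used for generation like the original; check the Pruna documentation for the exact API of your version.

from pruna import SmashConfig, smash
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model id, for illustration.
MODEL_ID = "HuggingFaceTB/SmolLM2-135M-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# The config from the snippet above: quantize with quanto and target CPU.
smash_config = SmashConfig()
smash_config["quantizer"] = "quanto"
smash_config["device"] = "cpu"

# Compress the model with the chosen config.
smashed_model = smash(model=model, smash_config=smash_config)

# Use the smashed model as a drop-in replacement for the original.
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = smashed_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The AWQ and GPTQ configs discussed above would be selected by changing the quantizer value, subject to the options your installed Pruna version exposes.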

Read the complete benchmark: https://www.pruna.ai/blog/smollm2-smaller-faster