Manual vs. Automated Smashing

When compressing a model with Pruna, you can manually configure everything or let the Optimization Agent do the heavy lifting. Here's how the two approaches compare:

Manual Smashing

In manual mode, you define your compression setup by specifying which algorithms to use and how to tune their hyperparameters. This approach gives you complete control, but also requires deep knowledge of:

How each compression technique works
How to set optimal hyperparameters for your use case

This method best suits expert ML engineers who want fine-grained control over their optimization process.


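from pruna import SmashConfig, smash  # assumed import; with Pruna Pro, import these from pruna_pro

smash_config = SmashConfig()
smash_config["cacher"] = "flux_caching"  # select a single caching algorithm
smashed_model = smash(model=base_model, smash_config=smash_config)  # base_model: your already-loaded model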

⚠️ In this example, a single cacher might seem sufficient, but you could miss combinations of techniques that significantly improve performance when used together.

Automated Smashing


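import torch
from diffusers import FluxPipeline
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.task import Task
from pruna_pro import OptimizationAgent  # assumed import path; check your Pruna Pro version

# Load the base model (bfloat16 weights are an assumed loading option)
loading_kwargs = {"torch_dtype": torch.bfloat16}
base_model = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", **loading_kwargs)

# Data used to evaluate each candidate configuration (dataset choice is an assumption)
datamodule = PrunaDataModule.from_string("LAION256")

# Define the task: the metrics of interest for this use case
task = Task(
    [
        "psnr",  # quality evaluation
        "elapsed_time",  # speed evaluation
    ],
    datamodule=datamodule,
)

# Start the optimization agent and run 10 search trials
optimization_agent = OptimizationAgent(model=base_model, task=task)
smashed_model = optimization_agent.probabilistic_search(n_trials=10)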

Compression gets complex fast: the configuration space grows exponentially with the number of algorithms and hyperparameters, and the best setup depends on:

Your model architecture
The target hardware
Your optimization objectives (e.g., quality, latency, memory)
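To see why the space explodes, here is a minimal Python sketch that counts configurations. The algorithm groups and option names are hypothetical placeholders, not Pruna's actual catalog. Each group can also be skipped (None), so the total is the product of the option counts per group, which grows exponentially with the number of groups.

```python
from itertools import product
from math import prod

# Hypothetical search space: group names and options are placeholders,
# not Pruna's real algorithm list. None means "don't use this group".
search_space = {
    "cacher": [None, "cacher_a", "cacher_b"],
    "quantizer": [None, "quant_8bit", "quant_4bit"],
    "compiler": [None, "compiler_a"],
    "pruner": [None, "pruner_a", "pruner_b"],
}


def space_size(space: dict) -> int:
    """Count configurations: one choice per group, so multiply the option counts."""
    return prod(len(options) for options in space.values())


def enumerate_configs(space: dict):
    """Yield every configuration as a {group: choice} dict."""
    groups = list(space)
    for choices in product(*(space[g] for g in groups)):
        yield dict(zip(groups, choices))


print(space_size(search_space))  # 3 * 3 * 2 * 3 = 54

# With k groups of 3 options each, the count is 3**k: exponential in k.
for k in (2, 5, 10):
    uniform = {f"group_{i}": [None, "a", "b"] for i in range(k)}
    print(k, space_size(uniform))
```

Every hyperparameter you expose multiplies the count again, which is why exhaustive search stops being practical very quickly.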

That’s where the Optimization Agent comes in. It automates the process by:

Detecting compatible compression techniques for your model and hardware
Choosing the best combinations of algorithms
Tuning hyperparameters based on your objectives
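The three steps can be illustrated with a toy search loop. This is a minimal sketch assuming plain random search, not the algorithm behind `probabilistic_search`. The search space, the `cache_interval` hyperparameter, the compatibility rule in `is_compatible`, and the scores returned by `evaluate` are all hypothetical stand-ins for what the agent derives from your model, hardware, and task metrics.

```python
import random

# Hypothetical search space: algorithm groups plus one tunable hyperparameter.
# Names and values are placeholders, not Pruna's real options.
SEARCH_SPACE = {
    "cacher": [None, "cacher_a"],
    "cache_interval": [2, 4],  # hyperparameter, only meaningful when a cacher is used
    "quantizer": [None, "quant_8bit", "quant_4bit"],
    "compiler": [None, "compiler_a"],
}


def is_compatible(config: dict) -> bool:
    """Hypothetical rule: compiler_a cannot be combined with quant_4bit."""
    return not (config["compiler"] == "compiler_a" and config["quantizer"] == "quant_4bit")


def evaluate(config: dict) -> tuple[float, float]:
    """Toy stand-in for running the task metrics; returns (quality, latency_s).

    The numbers are invented purely to drive the example.
    """
    quality, latency = 30.0, 10.0
    if config["cacher"] is not None:
        quality -= 0.5 * config["cache_interval"]
        latency /= config["cache_interval"]
    if config["quantizer"] == "quant_8bit":
        quality, latency = quality - 0.5, latency * 0.8
    elif config["quantizer"] == "quant_4bit":
        quality, latency = quality - 2.0, latency * 0.7
    if config["compiler"] is not None:
        latency *= 0.9
    return quality, latency


def random_search(space: dict, n_trials: int, seed: int = 0, max_attempts: int = 10_000) -> list[dict]:
    """Sample distinct, compatible configurations at random and score each one."""
    rng = random.Random(seed)
    seen, results = set(), []
    for _ in range(max_attempts):
        if len(results) == n_trials:
            break
        config = {group: rng.choice(options) for group, options in space.items()}
        if config["cacher"] is None:
            config["cache_interval"] = None  # hyperparameter unused without a cacher
        key = tuple(config.items())
        if key in seen or not is_compatible(config):
            continue  # skip repeats and combinations that cannot run together
        seen.add(key)
        quality, latency = evaluate(config)
        results.append({"config": config, "quality": quality, "latency": latency})
    return results


for trial in random_search(SEARCH_SPACE, n_trials=10):
    print(trial)
```

Random search is the simplest baseline; a smarter sampler would use the scores of earlier trials to decide which configurations to try next.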

All you need to do is:

Load your base model
Define the task (which includes your evaluation metrics)
Launch the optimization

Reviewing the Results

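Once the optimization is complete, you can visualize the results in a Pareto front plot, which shows the trade-offs between the objectives defined in your task, such as speed, quality, and memory. Each point on the front is a configuration that no other configuration beats on every objective at once. Pick the one that best fits your needs. Voilà!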
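Computing a Pareto front is straightforward. The minimal sketch below assumes two objectives, a quality score where higher is better (like PSNR) and a latency where lower is better (like elapsed time), and uses invented trial results. It keeps every trial that no other trial beats on both objectives, then picks the highest-quality trial under a hypothetical latency budget.

```python
def dominates(a: dict, b: dict) -> bool:
    """True if a is at least as good as b on both objectives and strictly better on one."""
    at_least_as_good = a["quality"] >= b["quality"] and a["latency"] <= b["latency"]
    strictly_better = a["quality"] > b["quality"] or a["latency"] < b["latency"]
    return at_least_as_good and strictly_better


def pareto_front(trials: list[dict]) -> list[dict]:
    """Keep the trials that no other trial dominates."""
    return [t for t in trials if not any(dominates(other, t) for other in trials)]


# Invented trial results, for illustration only.
trials = [
    {"name": "base", "quality": 30.0, "latency": 10.0},
    {"name": "cache", "quality": 29.0, "latency": 6.0},
    {"name": "cache+quant", "quality": 28.5, "latency": 4.8},
    {"name": "quant4", "quality": 28.0, "latency": 7.0},  # dominated by "cache"
]

front = pareto_front(trials)
print([t["name"] for t in front])  # ['base', 'cache', 'cache+quant']

# Pick the best quality that fits a hypothetical 6-second latency budget.
budget = 6.0
choice = max((t for t in front if t["latency"] <= budget), key=lambda t: t["quality"])
print(choice["name"])  # cache
```

Restricting the choice to the front guarantees you never select a configuration that is both slower and worse than another one you already have.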
