Wanda++: Pruning Large Language Models via Regional Gradients

Yifan Yang◊†, Kai Zhen♣†, Bhavana Ganesh, Aram Galstyan, Goeric Huybrechts, Markus Müller, Jonas M. Kübler, Rupak Vignesh Swaminathan, Athanasios Mouchtaris, Sravan Babu Bodapati, Nathan Susanj, Zheng Zhang, Jack FitzGerald, Abhishek Kumar
◊University of California, Santa Barbara   ♣Amazon AGI   †Equal Contributions

Wanda++ can be applied after post-training architectural changes (e.g., pruning, dense-to-MoE) to quickly mitigate degradation before costly recovery training.

Abstract

Large Language Model (LLM) pruning seeks to remove unimportant weights for inference speedup with minimal accuracy impact. However, existing methods often suffer from accuracy degradation without full-model sparsity-aware fine-tuning. This paper presents Wanda++, a novel pruning framework that outperforms the state-of-the-art methods by utilizing decoder-block-level regional gradients. Specifically, Wanda++ improves the pruning score with regional gradients for the first time and proposes an efficient regional optimization method to minimize pruning-induced discrepancies between dense and sparse decoder outputs. Notably, Wanda++ improves perplexity by up to 32% over Wanda on the language modeling task and generalizes effectively to downstream tasks. Moreover, despite updating weights with regional optimization, Wanda++ remains orthogonal to sparsity-aware fine-tuning, further reducing perplexity to a great extent when combined with LoRA. Our approach is lightweight, pruning a 7B LLaMA model in under 10 minutes on a single H100 GPU.

Method Overview

Wanda++ Pipeline

In this paper, we demonstrate how Wanda++ reduces the performance degradation caused by model pruning through the use of decoder-block-level regional gradients. Wanda++ prunes the model by iteratively applying our regional gradient score (RGS) and a regional optimization (RO) method to each decoder block, as sketched below.
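Concretely, the RGS step assigns each weight an importance that mixes Wanda's activation-norm term with the magnitude of a gradient computed at the decoder-block ("regional") level. The PyTorch sketch below is illustrative only: the balance factor alpha and the exact way the two terms are combined are assumptions for this sketch, not the paper's precise formulation.

import torch

def regional_gradient_score(weight: torch.Tensor,
                            act_norm: torch.Tensor,
                            regional_grad: torch.Tensor,
                            alpha: float = 1.0) -> torch.Tensor:
    """Illustrative sketch of an RGS-style per-weight importance score.

    weight:        (out_features, in_features) weight of a linear layer
    act_norm:      (in_features,) l2 norm of each input feature over the
                   calibration batch, as in Wanda
    regional_grad: gradient of the decoder-block (regional) loss w.r.t.
                   `weight`, accumulated over a few calibration samples
    alpha:         hypothetical balance factor (an assumption of this sketch)
    """
    # Wanda scores weights as |W| * ||X||_2; the sketch additionally injects
    # the regional gradient magnitude so that weights the block's output is
    # sensitive to receive higher importance.
    return weight.abs() * (alpha * regional_grad.abs() + act_norm.unsqueeze(0))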

Wanda++ Pipeline Diagram
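The RO step can be pictured as a short, block-local optimization that nudges the surviving weights so the pruned block reproduces the dense block's output on a handful of calibration samples. Below is a minimal sketch, assuming the block maps hidden states to hidden states and using placeholder optimizer settings (not the paper's exact configuration).

import torch
import torch.nn.functional as F

def regional_optimization(sparse_block: torch.nn.Module,
                          dense_block: torch.nn.Module,
                          calib_inputs: list,
                          masks: dict,
                          steps: int = 50,
                          lr: float = 1e-6) -> None:
    """Sketch of regional optimization (RO).

    sparse_block: the pruned decoder block, updated in place
    dense_block:  a frozen copy of the same block taken *before* pruning
    calib_inputs: a few hidden-state tensors entering this block
    masks:        parameter name -> boolean keep-mask from the pruning step
    steps, lr:    placeholder settings for illustration only
    """
    for p in dense_block.parameters():
        p.requires_grad_(False)

    # Optimizer choice is an assumption of this sketch.
    opt = torch.optim.Adam(sparse_block.parameters(), lr=lr)
    for step in range(steps):
        x = calib_inputs[step % len(calib_inputs)]
        with torch.no_grad():
            target = dense_block(x)   # dense block output (for HF decoder
                                      # layers, pass the usual attention args
                                      # and take outputs[0])
        loss = F.mse_loss(sparse_block(x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Re-apply the sparsity masks so pruned weights stay exactly zero.
        with torch.no_grad():
            for name, param in sparse_block.named_parameters():
                if name in masks:
                    param.mul_(masks[name])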

Wikitext Perplexity and Downstream Task Performance

Method | Sparsity | LLaMA-1 7B | LLaMA-1 13B | LLaMA-1 30B | LLaMA-1 65B | OpenLLaMA 3B | OpenLLaMA 7B | LLaMA-3.1 70B | LLaMA-3.1 8B
Baseline | - | 5.68 | 5.09 | 4.77 | 3.56 | 7.27 | 6.49 | 4.30 | 6.39
SparseGPT* | 0.5 | 7.22 | 6.21 | 5.31 | 4.57 | 10.41 | 8.57 | - | -
Wanda* | 0.5 | 7.26 | 6.15 | 5.24 | 4.57 | 12.37 | 9.15 | 5.25 | 9.99
GBLM | 0.5 | 7.15 | 6.11 | 5.18 | - | 10.75 | 8.49 | - | 9.90
Wanda++ RO | 0.5 | 7.07 | 6.08 | 5.12 | 4.43 | 9.86 | 8.27 | 5.14 | 9.34
Wanda++ RGS | 0.5 | 7.18 | 6.12 | 5.15 | 4.48 | 10.78 | 8.50 | 5.19 | 9.92
Wanda++ | 0.5 | 7.02 (-3%) | 6.00 (-2%) | 5.10 (-3%) | 4.43 (-3%) | 9.25 (-25%) | 7.82 (-15%) | 5.11 (-3%) | 9.22 (-7%)
SparseGPT* | 2:4 | 11.00 | 9.11 | 7.16 | 6.28 | 15.91 | 11.62 | - | -
Wanda* | 2:4 | 11.53 | 9.58 | 6.90 | 6.25 | 28.04 | 15.35 | 6.47 | 24.83
GBLM | 2:4 | 11.33 | 9.16 | 6.87 | - | 24.75 | 13.19 | - | 24.34
Wanda++ RO | 2:4 | 10.78 | 7.89 | 6.51 | 5.86 | 19.41 | 11.69 | 6.37 | 19.43
Wanda++ RGS | 2:4 | 11.46 | 9.44 | 6.93 | 6.23 | 24.77 | 13.27 | 6.40 | 24.54
Wanda++ | 2:4 | 9.43 (-19%) | 7.75 (-20%) | 6.39 (-7%) | 5.59 (-11%) | 19.03 (-32%) | 11.30 (-26%) | 6.35 (-2%) | 18.32 (-26%)
SparseGPT* | 4:8 | 8.61 | 7.40 | 6.17 | 5.38 | 12.20 | 9.79 | - | -
Wanda* | 4:8 | 8.57 | 7.40 | 5.97 | 5.30 | 16.83 | 11.38 | 5.73 | 14.63
GBLM | 4:8 | 8.48 | 7.26 | 5.89 | - | 14.86 | 10.38 | - | 14.29
Wanda++ RO | 4:8 | 8.34 | 7.18 | 5.73 | 5.11 | 13.10 | 9.52 | 5.67 | 12.88
Wanda++ RGS | 4:8 | 8.58 | 7.33 | 5.90 | 5.17 | 14.92 | 10.42 | 5.70 | 14.32
Wanda++ | 4:8 | 7.88 (-8%) | 6.75 (-9%) | 5.65 (-5%) | 5.07 (-4%) | 12.54 (-25%) | 9.42 (-17%) | 5.65 (-1%) | 12.55 (-14%)

Table 1: Wikitext perplexity comparison on the LLaMA-1, OpenLLaMA, and LLaMA-3.1 model families. * indicates results taken from previous papers. - means results are not available due to OOM or source-code limitations. Numbers in parentheses give the relative change versus Wanda; bold numbers in the paper highlight ≥5% relative improvements over Wanda.
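The 2:4 and 4:8 rows use N:M semi-structured sparsity: within every group of M consecutive weights along the input dimension, only N remain nonzero (the 2:4 pattern is the one accelerated by NVIDIA sparse tensor cores). Below is a small sketch of turning any per-weight score (Wanda, RGS, etc.) into such a mask; it is an illustration of the pattern, not the paper's implementation.

import torch

def n_m_mask(score: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n highest-scoring weights in every group of m along the
    input dimension (n=2, m=4 for 2:4 sparsity; n=4, m=8 for 4:8)."""
    out_f, in_f = score.shape
    assert in_f % m == 0, "input dimension must be divisible by m"
    groups = score.reshape(out_f, in_f // m, m)
    keep = groups.topk(n, dim=-1).indices           # indices to keep per group
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep, True)
    return mask.reshape(out_f, in_f)

# Usage: weight.data *= n_m_mask(score, 2, 4)  # apply a 2:4 mask in place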

Method | WiC | MRPC | HellaSwag | ARC-Easy | ARC-Challenge | WinoGrande | BoolQ | RTE | MMLU
Baseline | 49.84 | 69.12 | 56.96 | 75.29 | 41.80 | 70.00 | 75.02 | 66.43 | 35.10
Wanda | 48.75 | 46.81 | 41.66 | 59.34 | 27.47 | 61.96 | 69.60 | 49.82 | 25.85
GBLM | 49.32 | 65.31 | 41.80 | 61.43 | 30.45 | 63.24 | 71.20 | 57.43 | 26.34
Wanda++ RGS | 49.37 (1%) | 64.46 (38%) | 41.43 (-1%) | 62.42 (5%) | 31.06 (13%) | 62.83 (1%) | 67.95 (-2%) | 58.48 (17%) | 26.40 (-2%)
Wanda++ | 50.00 (2%) | 68.38 (46%) | 45.31 (8%) | 63.72 (7%) | 29.27 (6%) | 65.04 (4%) | 67.80 (-2%) | 62.09 (24%) | 27.52 (6%)

Table 2: Zero-shot accuracy (%) of LLaMA-1 7B under 2:4 sparsity. Numbers in parentheses give the relative change versus Wanda; bold values in the paper indicate the best performance or a ≥5% relative improvement over Wanda.
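For reference, zero-shot accuracies like those in Table 2 are typically obtained with the EleutherAI lm-evaluation-harness. The snippet below is a hedged sketch assuming the v0.4-style API and a hypothetical checkpoint path; it is not necessarily the paper's exact evaluation setup.

from lm_eval import evaluator  # EleutherAI lm-evaluation-harness (v0.4-style API assumed)

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/pruned_llama_7b,dtype=float16",  # hypothetical path
    tasks=["wic", "mrpc", "hellaswag", "arc_easy", "arc_challenge",
           "winogrande", "boolq", "rte", "mmlu"],
    num_fewshot=0,   # zero-shot, as in Table 2
    batch_size=8,
)
print(results["results"])  # per-task accuracy dictionary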

Combined with Recovery Training

We show that Wanda++ can be combined with further recovery training to improve the performance of the pruned model. Here, we use LoRA to recover the performance of the pruned model and show that Wanda++ with LoRA performs better than Wanda with LoRA. Other post-training methods, such as DPO or GRPO, can also be combined for recovery training.

Method | Dense | Pruned | After LoRA Fine-tuning
Wanda | 5.68 | 11.59 | 8.23 (-29%)
Wanda++ | 5.68 | 9.43 | 6.88 (-27%)

Table 3: Perplexity comparison on Wikitext with LoRA. All experiments are conducted on LLaMA-7B with 2:4 sparsity.
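As a rough sketch of this recipe (not the paper's exact setup), one can attach LoRA adapters to the pruned checkpoint with Hugging Face PEFT and fine-tune only the adapters, leaving the sparse base weights frozen. The rank, target modules, and checkpoint path below are illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the pruned checkpoint (weights already contain the 2:4 zeros).
model = AutoModelForCausalLM.from_pretrained(
    "path/to/wanda_plus_plus_pruned_llama_7b",   # hypothetical path
    torch_dtype=torch.bfloat16,
)

# Attach LoRA adapters; the frozen base weights keep their zeros during
# recovery training, and only the low-rank adapters are updated.
lora_cfg = LoraConfig(
    r=16,                          # illustrative rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# ... fine-tune on a language-modeling corpus with the usual HF Trainer ...
# Note: merging the adapters back into the base weights (merge_and_unload)
# would densify them; keep the adapters separate or re-apply the pruning
# masks if the 2:4 pattern must be preserved at inference time.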

BibTeX

@article{yang2025wanda++,
  title={Wanda++: Pruning large language models via regional gradients},
  author={Yang, Yifan and Zhen, Kai and Ganesh, Bhavana and Galstyan, Aram and Huybrechts, Goeric and M{\"u}ller, Markus and K{\"u}bler, Jonas M and Swaminathan, Rupak Vignesh and Mouchtaris, Athanasios and Bodapati, Sravan Babu and others},
  journal={arXiv preprint arXiv:2503.04992},
  year={2025}
}