Tutorial: Sparsity in Large Language Models: The New Odyssey (Part 2 of 4)
Shiwei Liu, Olga Saukh, Zhangyang (Atlas) Wang, Arijit Ukil, and Angshul Majumdar
This tutorial will provide a comprehensive overview of recent breakthroughs in sparsity for the emerging area of large language models (LLMs), showcasing progress and posing challenges, and will endeavor to provide insights into how sparsity can improve both the affordability and the understanding of LLMs. The outline of this tutorial is fourfold: (1) a thorough overview and categorization of sparse neural networks; (2) the latest progress in LLM compression via sparsity; (3) the caveats of sparsity in LLMs; and finally (4) the benefits of sparsity beyond model efficiency.
The detailed outline is given below:
Tutorial Introduction. Presenter: Zhangyang (Atlas) Wang.
Part 1: Overview of sparse neural networks. Presenter: Shiwei Liu.
We will first provide a brief overview and categorization of existing works on sparse neural networks. Sparsity is one of the most classical concepts in machine learning, and its original goal in neural networks was to reduce inference costs. However, the research focus has shifted significantly from post-training sparsity to prior-training sparsity over the past few years, due to the latter's promise of end-to-end resource savings from training to inference. Researchers have tackled many interlinked concepts such as pruning [13], the Lottery Ticket Hypothesis [14], Sparse Training [15,16], Pruning at Initialization [17], and Mixture of Experts [18]. Because this shift of interest is so recent, the relationships among different sparse algorithms in terms of their scopes, assumptions, and approaches are highly intricate and sometimes ambiguous. Providing a comprehensive and precise categorization of these approaches is timely for this newly shaped research community.
Part 2: Scaling up sparsity to LLMs: latest progress. Presenter: Shiwei Liu.
In the context of gigantic LLMs, sparsity is becoming even more appealing as a way to accelerate both training and inference. We will showcase existing attempts that address sparse LLMs, encompassing weight sparsity, activation sparsity, and memory sparsity. For example, SparseGPT [8] and Essential Sparsity [9] shed light on prominent weight sparsity in LLMs, while the unveiling of "Lazy Neuron" [13] and "Heavy Hitter Oracle" [10] exemplifies activation sparsity and token sparsity. Specifically, Essential Sparsity reveals a consistent pattern across various settings: 30%-50% of the weights of LLMs can be removed by naive one-shot magnitude pruning without any significant drop in performance. Ultimately, these observations suggest that sparsity is also an emerging property in the context of LLMs, with great potential to improve the affordability of LLMs.
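To make "one-shot magnitude pruning" concrete, the following is a minimal sketch in plain Python. It assumes a single weight matrix stored as a list of lists and a hypothetical helper name `magnitude_prune`; it is an illustration of the general idea, not the procedure or code used in the cited works, which operate on full LLM checkpoints.

```python
def magnitude_prune(weights, sparsity):
    """One-shot magnitude pruning of a 2-D weight matrix (list of lists).

    Zeroes the `sparsity` fraction of entries with the smallest absolute
    value and returns (pruned_weights, mask). No retraining is performed.
    `magnitude_prune` is a hypothetical helper, not an API from any library.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    rows, cols = len(weights), len(weights[0])

    # Rank every position by the magnitude of its weight, smallest first.
    positions = [(i, j) for i in range(rows) for j in range(cols)]
    positions.sort(key=lambda ij: abs(weights[ij[0]][ij[1]]))

    # Number of weights to remove; ranking by position handles ties exactly.
    num_pruned = int(sparsity * len(positions))

    mask = [[1] * cols for _ in range(rows)]
    for i, j in positions[:num_pruned]:
        mask[i][j] = 0

    pruned = [
        [w if m else 0.0 for w, m in zip(w_row, m_row)]
        for w_row, m_row in zip(weights, mask)
    ]
    return pruned, mask


# Toy example: remove 50% of a 2x3 matrix.
toy = [[0.9, -0.05, 0.3], [-0.7, 0.01, 0.2]]
pruned, mask = magnitude_prune(toy, sparsity=0.5)
print(pruned)  # [[0.9, 0.0, 0.3], [-0.7, 0.0, 0.0]]
```

The returned mask can be reused later, for example to keep pruned weights at zero during any subsequent fine-tuning.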
Coffee Break.
Part 3: The caveat of sparsity in LLMs: What tasks are we talking about? Presenter: Zhangyang (Atlas) Wang.
While sparsity has demonstrated its success in LLMs, the evaluations commonly used in the sparse LLM literature are often restricted to simple datasets such as GLUE, SQuAD, WikiText-2, and PTB, and/or to simple one-turn questions or instructions. Such (over-)simplified evaluations may camouflage some unexpected predicaments of sparse LLMs. To depict the full picture of sparse LLMs, we highlight two recent works, SMC-Bench [11] and the "Junk DNA Hypothesis", that unveil the failures of (magnitude-based) pruned LLMs on harder language tasks, indicating a strong correlation between a model's "prunability" and the difficulty of its target downstream task.
Part 4: Sparsity beyond efficiency. Presenter: Olga Saukh.
In addition to efficiency, sparsity has been found to boost many other performance aspects such as robustness, uncertainty quantification, data efficiency, multitasking and task transferability, and interoperability [19]. We will mainly focus on the recent progress in understanding the relation between sparsity and robustness. The research literature spans multiple subfields, including empirical and theoretical analysis of adversarial robustness [20], regularization against overfitting, and noisy label resilience for sparse neural networks. By outlining these different aspects, we aim to offer a deep dive into how network sparsity affects the multi-faceted utility of neural networks in different scenarios.
Part 5: Demonstration and Hands-on Experience. Presenter: Shiwei Liu.
The Expo consists of three main components. First, an implementation tutorial will be presented on a typical laptop, offering step-by-step guidance on building and training sparse neural networks from scratch. Second, a demo will showcase how to prune LLaMA-7B on a single A6000 GPU. Third, we will create and maintain a user-friendly open-source implementation for sparse LLMs, ensuring that participants have ongoing resources at their disposal. To encourage ongoing engagement and learning, we will make all content and materials readily accessible through the tutorial websites.
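As a rough illustration of what "training a sparse network from scratch" can mean in its simplest form, the sketch below trains a single linear layer with a fixed binary mask, so that masked weights stay at zero throughout training (static sparse training). The function name `train_sparse_linear`, the toy data, and the hyperparameters are all assumptions made for this example; this is not the tutorial's actual hands-on material.

```python
import random


def train_sparse_linear(xs, ys, mask, lr=0.1, epochs=500):
    """Fit y ~ w . x by full-batch gradient descent under a fixed sparsity mask.

    Weights start at zero and only positions where mask[j] == 1 are ever
    updated, so the layer remains sparse from the first step to the last.
    Hypothetical helper for illustration only.
    """
    w = [0.0] * len(mask)
    n = len(xs)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / n  # gradient of mean squared error
        # Masked update: gradients of pruned weights are discarded.
        w = [wi - lr * g * m for wi, g, m in zip(w, grad, mask)]
    return w


# Toy data generated by a sparse ground-truth weight vector [2, 0, -1, 0].
rng = random.Random(0)
xs = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(50)]
ys = [2.0 * x[0] - 1.0 * x[2] for x in xs]

w = train_sparse_linear(xs, ys, mask=[1, 0, 1, 0])
print([round(wi, 3) for wi in w])
```

Here the mask is chosen by hand to match the data; sparse training methods differ mainly in how the mask is selected and whether it is updated during training.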