“Large language models” (or LLMs for short) are powerful because they’ve been trained on vast amounts of data and, with their billions of model parameters, they perform remarkably well on a remarkably broad range of natural-language tasks. The best-known LLM, GPT-3, has 175 billion model parameters. If you’d like to learn all the key info about GPT-3 and its capabilities, check out episode #559 with Melanie Subbiah, one of the first authors of the GPT-3 paper.
Today’s episode isn’t specifically about GPT-3, however. It’s about how massive these large language models are and how we can prune them to compress them, which gives us a number of advantages, including:
Increasing real-time inference speed in production
Decreasing the model size in memory storage
Decreasing compute costs
And, through a common machine learning concept called regularization, potentially even improving the generalization of the model to real-world data that are structurally different from the data the model was trained on
While there are even bigger models today, and while the forthcoming GPT-4 is rumored to be several orders of magnitude larger, GPT-3, with its 175 billion model parameters, remains the most popular large language model and serves as a solid exemplar of LLMs in general. For context, 175 billion model parameters take up about 320GB of memory. Making production inferences with just one copy of GPT-3 therefore requires about five state-of-the-art Nvidia A100 GPUs, which have 80GB of memory each. Since each of these GPUs costs about $15,000, running GPT-3 in production requires some $75,000 worth of GPUs alone.
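The arithmetic behind those figures is easy to check. The sketch below assumes 16-bit (2-byte) weights, the precision typically used for inference:

```python
import math

# Back-of-the-envelope memory math for GPT-3's 175 billion parameters,
# assuming 16-bit (2-byte) weights for inference.
params = 175e9
bytes_per_param = 2                               # fp16 / bf16
total_gib = params * bytes_per_param / 1024**3
print(f"Weights alone: ~{total_gib:.0f} GiB")     # ~326 GiB, i.e. roughly 320GB

a100_memory_gib = 80                              # Nvidia A100 80GB
gpus_needed = math.ceil(total_gib / a100_memory_gib)
print(f"A100 80GB GPUs needed: {gpus_needed}")    # 5

cost = gpus_needed * 15_000                       # ~$15k per GPU
print(f"GPU cost: ${cost:,}")                     # $75,000
```

Note that this counts only the weights; activations, key-value caches, and serving overhead push the real footprint higher still.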
Clearly, GPT-3’s remarkable capabilities come with a hefty price tag. Thankfully, an exciting new paper from researchers at IST Austria on a parameter-pruning technique called SparseGPT indicates that 100 billion parameters, more than half of GPT-3’s full 175-billion-parameter complement, can be removed without adversely impacting GPT-3’s accuracy. This is a massive improvement over comparable methodologies for pruning LLMs at the scale of GPT-3. Specifically, the previous leading approach, magnitude pruning, can remove only about 10% of GPT-3’s parameters before accuracy begins to take a hit.
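Magnitude pruning, the baseline mentioned above, is simple: zero out the weights with the smallest absolute values, on the theory that they contribute least to the model’s outputs. A minimal NumPy sketch (the function name and setup are illustrative, not from any particular library):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the fraction `sparsity` of weights with the smallest magnitudes."""
    k = int(weights.size * sparsity)              # number of weights to remove
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value serves as the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))                       # toy 16-weight layer
w_pruned = magnitude_prune(w, sparsity=0.5)
print(np.count_nonzero(w_pruned))                 # half the entries survive
```

Because it never looks at the model’s inputs or outputs, this heuristic degrades quickly at high sparsity on very large models, which is exactly the gap SparseGPT addresses.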
Countless different pruning methodologies exist. Some of these techniques are applied before model training, some are applied post-training, while others — historically the best-performing — are iterative and applied throughout model training. SparseGPT is noteworthy not only because it can remove more than half of GPT-3’s model parameters without impacting accuracy, but also because it’s easier to apply than those historically best-performing iterative approaches. SparseGPT is applied just once, post-training, which is why its creators refer to it as a convenient “one-shot” pruning approach.
The details of how SparseGPT works are mathematically involved and laid out in the paper, but the general idea is that pruning is carried out layer by layer. Deep learning models like large language models contain many layers of artificial neurons; with layer-by-layer pruning, each of these layers is pruned separately and the final model is then stitched back together from the compressed layers. The complexity of the approach lies in the mathematics of doing this in such a way that the outputs produced by the full-size model are preserved despite the internal structure of the network being changed so drastically.
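The flavor of that layer-wise, output-preserving idea can be sketched in a few lines. This toy version selects which weights to keep by magnitude and then re-fits the survivors by least squares so the layer’s outputs on some calibration inputs are matched as closely as possible; the real SparseGPT algorithm uses a more sophisticated, approximate second-order update, so treat this purely as an illustration:

```python
import numpy as np

def prune_layer(W: np.ndarray, X: np.ndarray, sparsity: float) -> np.ndarray:
    """Prune one layer's weight matrix W (outputs x inputs), then re-fit the
    surviving weights so the layer's outputs W @ X on calibration inputs X
    are preserved. A simplified stand-in for SparseGPT's per-layer step."""
    W_hat = np.zeros_like(W)
    k = int(W.shape[1] * (1 - sparsity))          # weights kept per neuron
    for i, row in enumerate(W):
        keep = np.argsort(np.abs(row))[-k:]       # keep largest-magnitude inputs
        # Least-squares refit: reproduce the original outputs row @ X
        # using only the kept input connections.
        coeffs, *_ = np.linalg.lstsq(X[keep].T, row @ X, rcond=None)
        W_hat[i, keep] = coeffs
    return W_hat

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))                      # one toy layer's weights
X = rng.normal(size=(16, 64))                     # calibration activations
W_hat = prune_layer(W, X, sparsity=0.5)
err = np.linalg.norm(W @ X - W_hat @ X) / np.linalg.norm(W @ X)
print(f"Half the weights removed, relative output error: {err:.3f}")
```

The refit step is what distinguishes this family of methods from plain magnitude pruning: the remaining weights compensate for the removed ones, which is why much higher sparsity is achievable before accuracy suffers.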
Being able to halve the size of large language models while retaining accuracy is clearly exciting news with positive commercial and environmental implications, given how widely these models are used today to power myriad natural-language processing applications. Perhaps the most exciting news of all, then, is that the SparseGPT authors reckon that, combined with fine-tuning mechanisms and iterative pruning during training, their one-shot post-training approach could reduce model size by up to 90% without adversely impacting accuracy. In dollar terms, that means we could run GPT-3 in production with about $7,500 worth of GPUs instead of $75,000 worth.
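To put that 90% figure in memory terms, the same back-of-the-envelope arithmetic as before (again assuming 16-bit weights) gives:

```python
sparsity = 0.90                        # the authors' projected upper bound
params_kept = 175e9 * (1 - sparsity)   # 17.5 billion surviving parameters
gib = params_kept * 2 / 1024**3        # 16-bit (2-byte) weights
print(f"~{gib:.0f} GiB of weights")    # ~33 GiB of weights
```

In whole-GPU terms, roughly 33 GiB of weights would fit on a single 80GB A100, an order-of-magnitude drop in serving hardware consistent with the dollar figures above.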
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.