Many recent episodes have focused on open-source Large Language Models that you can download and fine-tune to particular use cases, depending on your needs or your users’ needs. I’ve particularly been highlighting LLMs with 7 billion to 13 billion model parameters because models of this size can typically be run on a single consumer GPU, making them relatively manageable and affordable both to train and to run in production.
To approach the capabilities of the top commercial LLMs like OpenAI’s GPT-4 and Anthropic’s Claude on a broad range of tasks, however, you may need to use a much larger open-source LLM. So wouldn’t it be nice if you could compress these larger LLMs to fit them on a single consumer GPU? Such compression would enable you to:
Decrease training costs
Decrease model size for storage
Increase inference speed
And, in some cases, compression can even act as a regularizer, improving the model’s generalizability to data it hasn’t encountered before.
The problem is that, historically, compressing a model has led to lower accuracy. That changed earlier this month with a paper published by international collaborators from both academia and industry revealing their SpQR approach (a “Sparse-Quantized Representation”) for near-lossless LLM weight compression. The authors demonstrate running a 33B-parameter LLM on a single 24 GB GPU with a 15% speedup in inference and, critically, no reduction in model accuracy. To do this, they leveraged a widely used approach called quantization.
Model quantization reduces the memory and computational requirements of a deep learning model by representing model parameters and activations with lower-precision values, such as integers or fixed-point numbers in place of higher-precision floating-point numbers. This both reduces memory usage and speeds up computation.
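To make that concrete, here is a minimal PyTorch sketch of basic round-to-nearest quantization, the general idea rather than the SpQR method itself: it maps a floating-point weight matrix onto a small integer grid, dequantizes it again and measures how much precision was lost.

```python
# Minimal sketch of round-to-nearest weight quantization (not SpQR itself):
# map float32 weights onto a small integer grid, then dequantize and
# measure how much precision was lost.
import torch

def quantize(weights: torch.Tensor, bits: int = 4):
    """Symmetric uniform quantization of a weight tensor to `bits` bits."""
    levels = 2 ** (bits - 1) - 1                  # e.g. 7 for signed 4-bit
    scale = weights.abs().max() / levels          # one scale for the whole tensor
    q = torch.clamp(torch.round(weights / scale), -levels, levels)
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Map the integer grid back to floating point."""
    return q.float() * scale

if __name__ == "__main__":
    w = torch.randn(4096, 4096)                   # a stand-in weight matrix
    q, scale = quantize(w, bits=4)
    w_hat = dequantize(q, scale)
    print("mean absolute error:", (w - w_hat).abs().mean().item())
```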
SpQR is the first quantization method that can reach the compression ratios of other quantization methods (up to 4x) while being near-lossless. There are four steps to this new SpQR algorithm:
Iterate through the layers of the model and quantize the weights by converting them to a lower-bit representation.
For each layer, measure the inputs and outputs of the quantized model and compare them with those of the uncompressed model.
Identify the weights whose quantization results in an outsized impact on layer output behavior. These weights are considered outliers.
Finally, most of the weights (typically more than 99%) are converted to a low-bitwidth representation, while the outliers identified in the previous step are extracted separately and kept in their higher-precision representation.
The rationale behind this process is that, in most cases, the outlier weights, despite making up fewer than 1% of all weights, are responsible for over 75% of the overall error introduced by quantization. Since these weights lead to high, irreducible error, SpQR simply keeps them intact. And because they typically account for fewer than 1% of the model’s parameters, retaining them has a negligible impact on the compression ratio while avoiding any noticeable reduction in the model’s accuracy.
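If it helps to see the shape of that procedure, here is a simplified PyTorch sketch, not the authors’ implementation, of quantizing one layer’s weights while extracting the most sensitive roughly 1% of them as full-precision outliers. The sensitivity measure below is just a crude proxy based on each weight’s quantization error and the magnitude of the input feature it multiplies.

```python
# A simplified sketch of the outlier idea behind SpQR (not the authors' code):
# quantize a layer's weights, estimate how much each weight's quantization
# perturbs the layer's output on sample inputs, then keep the most sensitive
# ~1% of weights in full precision and quantize only the rest.
import torch

def split_outliers(weights, inputs, bits=3, outlier_fraction=0.01):
    levels = 2 ** (bits - 1) - 1
    scale = weights.abs().max() / levels
    w_q = torch.clamp(torch.round(weights / scale), -levels, levels) * scale

    # Sensitivity proxy: quantization error on each weight, scaled by the
    # average magnitude of the input feature that weight multiplies.
    input_scale = inputs.abs().mean(dim=0)             # one value per input feature
    sensitivity = (weights - w_q).abs() * input_scale  # same shape as weights

    k = int(outlier_fraction * weights.numel())
    threshold = sensitivity.flatten().topk(k).values.min()
    outlier_mask = sensitivity >= threshold

    # Outliers stay in full precision (stored sparsely); the rest are quantized.
    w_mixed = torch.where(outlier_mask, weights, w_q)
    return w_mixed, outlier_mask

if __name__ == "__main__":
    w = torch.randn(1024, 1024)            # [out_features, in_features]
    x = torch.randn(256, 1024)             # sample layer inputs
    w_mixed, mask = split_outliers(w, x)
    print("fraction kept in full precision:", mask.float().mean().item())
```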
Finally, if you’re not just interested in compressing your model for deployment to production but also in fine-tuning a big open-source LLM, say a 33B-parameter or larger model, you’ll also want to check out QLoRA. It builds on the parameter-efficient low-rank adaptation (LoRA) that I introduced back in Episode #674, but now also incorporates quantization so that you can fine-tune open-source 33B- or even 65B-parameter models on a single 48 GB GPU. The QLoRA authors made a big splash a few weeks ago when they claimed this enabled their Guanaco family of models to reach 99.3% of ChatGPT’s performance on the Vicuña benchmark that I covered back in Episode #672. The QLoRA approach is already integrated with Hugging Face’s PEFT and Transformers libraries; refer to their GitHub repo for all the information, including access to their new Guanaco model family, which comes in four sizes: 7B, 13B, 33B and 65B. The Guanaco family was fine-tuned starting from Meta’s LLaMA models so, as detailed back in Episode #670, they can’t be used for commercial purposes, but you can now apply QLoRA to a commercial-use model like Dolly 2.0 and fine-tune it to your desired use case.
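To give a sense of what that looks like in practice, here is a minimal QLoRA-style setup using the Hugging Face Transformers and PEFT integration. The model ID, target module names and hyperparameters below are illustrative assumptions rather than values prescribed by the QLoRA paper, so adjust them for the architecture you actually fine-tune.

```python
# A minimal QLoRA-style fine-tuning setup via Hugging Face Transformers + PEFT.
# The model ID, target modules and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "databricks/dolly-v2-12b"   # a commercial-use base model (assumption)

# Load the frozen base model in 4-bit NF4 precision, the quantization QLoRA uses.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Attach small trainable LoRA adapters on top of the frozen 4-bit weights.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # GPT-NeoX-style module name (assumption)
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

From here you would pass the model to your usual training loop or the Transformers Trainer with your instruction-tuning dataset; only the small LoRA adapter weights are updated, which is what keeps the memory footprint within a single GPU.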
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.