
[Paper] DecDEC: A Systems Approach to Advancing Low-Bit LLM Quantization

This post is a summary of the paper DecDEC: A Systems Approach to Advancing Low-Bit LLM Quantization.

Abstract

Quantization of Large Language Models (LLMs) has recently gained popularity, particularly for on-device settings with limited hardware resources.

  • Problem:
    • While efficient, quantization inevitably degrades model quality, especially in aggressive low-bit settings such as 3-bit and 4-bit precision.
  • Purpose of DecDEC:
    • To improve the quality of low-bit LLMs while preserving the key benefits of quantization: GPU memory savings and latency reduction.

1. Introduction

  • Traditional Problem & Solution:
    • As Large Language Models grow in size, their memory requirements and inference latency increase, limiting their use cases.
    • Quantization is a promising solution for reducing LLM deployment cost by lowering model precision; it addresses both memory limitations and inference latency.
  • Remaining Problems:
    • Quantization often leads to model quality degradation due to the inevitable loss of information.
    • This is especially true for low-bit settings, such as 3-bit and 4-bit quantization.

Key research question:
Given a quantized LLM configured with the best possible effort under the memory budget, is there a way to recover the quality loss caused by quantization?

DecDEC (Decoding with Dynamic Error Compensation) is an inference scheme for quantized LLMs that dynamically identifies salient channels and compensates for quantization errors in these channels in real time.


Key Idea of DecDEC:

  1. Store the residuals ($\boldsymbol{R} = \boldsymbol{W} - \boldsymbol{\hat{W}}$) of the quantized weight matrices in CPU memory.
  2. Fetch only the parts of the residuals that correspond to the dynamically identified salient channels.
  3. Dynamic error compensation is performed concurrently with inference by an optimized GPU kernel (see the sketch below).
    → ensures that all additional operations are seamlessly integrated into the existing workflow, minimizing inference slowdown.
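Below is a minimal PyTorch sketch of this flow for a single linear layer, assuming the dequantized weights already reside on the GPU. The name `decdec_linear` and the fixed top-`k` budget are illustrative assumptions, not the paper's actual kernel, which overlaps the residual transfer with the GPU computation.

```python
# Minimal sketch of the DecDEC idea (illustrative only; the paper's
# implementation uses an optimized GPU kernel overlapped with inference).
import torch

def decdec_linear(x, W_hat, R_cpu, k=32):
    """Quantized linear layer with dynamic error compensation.

    x      : (in_features,) activation vector on GPU
    W_hat  : (out_features, in_features) dequantized weights on GPU
    R_cpu  : (out_features, in_features) residuals W - W_hat, kept in CPU memory
    k      : number of salient input channels to compensate
    """
    # 1. Dynamically identify salient channels from activation magnitudes.
    salient = torch.topk(x.abs(), k).indices                 # (k,)

    # 2. Fetch only the residual columns of those channels over PCIe.
    R_cols = R_cpu[:, salient.cpu()].to(x.device)            # (out, k)

    # 3. Base quantized matmul plus the sparse error-compensation term.
    return W_hat @ x + R_cols @ x[salient]
```

In the real system the PCIe fetch and the matrix multiplication run concurrently; the sketch performs them sequentially for clarity.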

2. Background

2.1. LLM Inference


  • LLM inference involves two phases: (1) prefill and (2) decode.
  • The decode phase is particularly memory-bound, as only one token is processed at a time.
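As a rough illustration (synthetic shapes, not the paper's code), the sketch below contrasts the batched prefill matmul with the one-token-at-a-time decode matmuls that must reload the full weight matrix at every step:

```python
# Illustrative sketch of why the decode phase is memory-bound.
import torch

W = torch.randn(4096, 4096)       # one weight matrix of the model
prompt = torch.randn(128, 4096)   # 128 prompt tokens

# Prefill: all prompt tokens go through one large, compute-bound matmul.
h = prompt @ W.T                  # (128, 4096)

# Decode: tokens are generated one at a time, so every step reads the
# full weight matrix just to multiply a single vector -- memory-bound.
token = torch.randn(1, 4096)
for _ in range(16):               # generate 16 tokens
    token = token @ W.T           # (1, 4096); cost dominated by weight loads
```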

2.2. LLM Quantization

  • Main types of Quantization for LLMs:
    1. Weight-activation quantization
      • Quantize both weights and activations.
    2. Weight-only quantization
      • Quantized weights are loaded from memory.
      • Dequantized on-the-fly to full precision before being multiplied with the full-precision activations (a minimal sketch appears at the end of this section).
| Quantization | Used in | Effects |
| --- | --- | --- |
| Weight-activation | Datacenter settings | Efficient use of the low-precision arithmetic units available on modern GPUs |
| Weight-only | On-device inference | Only reduces memory traffic, but this is sufficient to speed up on-device inference |
  • Sub-categories of Weight-only quantization:
    1. QAT (Quantization-Aware-Training)
      • Retraining to reduce quantization error.
      • Its cost makes it impractical for many end-users.
    2. PTQ (Post-Training-Quantization)
      • Doesn’t require retraining.
      • Preferred method for on-device LLM inference.

This paper focuses on weight-only PTQ.
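For concreteness, here is a minimal sketch of the weight-only scheme described in Section 2.2: weights are stored in low precision and dequantized on the fly before the matmul. The simple symmetric per-output-channel quantizer and the helper names are assumptions for illustration; actual PTQ methods use more sophisticated quantizers.

```python
# Sketch of weight-only quantization and on-the-fly dequantization.
import torch

def quantize_weight_only(W, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True) / qmax          # per-output-channel scale
    W_q = torch.clamp(torch.round(W / scale), -qmax - 1, qmax).to(torch.int8)
    return W_q, scale

def weight_only_linear(x, W_q, scale):
    # Dequantize on the fly to full precision, then multiply with the
    # full-precision activations.
    W_hat = W_q.to(x.dtype) * scale
    return W_hat @ x

W = torch.randn(256, 256)
W_q, scale = quantize_weight_only(W)
x = torch.randn(256)
y = weight_only_linear(x, W_q, scale)
```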

3. Augmenting Quantized LLM with CPU Memory

3.1. Concept

  • Goal: To leverage CPU memory to improve quantized LLM quality without additional GPU memory costs.


  • Basic Mechanism:
    • DecDEC targets desktop or laptop platforms, where the GPU is connected to the CPU via a PCIe interconnect.
    • Quantized weight parameters ($\boldsymbol{\hat{W}}$) and activations ($\boldsymbol{x}$) are kept in GPU memory (as in conventional inference systems).
    • $\boldsymbol{R} (= \boldsymbol{W} - \boldsymbol{\hat{W}})$: the residual between the original full-precision weights and the quantized weights, stored in CPU memory.
    • Due to the limited bandwidth of PCIe:
      • Only a small subset of the residuals can be fetched, in a selective manner.
      • Each linear operation changes from $\boldsymbol{\hat{W}x}$ to $(\boldsymbol{\hat{W}}+\boldsymbol{R}\odot \boldsymbol{M})\boldsymbol{x}$.
      • $\boldsymbol{M}$ is a binary mask that sparsifies $\boldsymbol{R}$ (see the sketch after this list).
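A toy sketch (illustrative shapes and names) showing that the masked form $(\boldsymbol{\hat{W}}+\boldsymbol{R}\odot \boldsymbol{M})\boldsymbol{x}$ with a column mask $\boldsymbol{M}$ is equivalent to adding only the fetched residual columns of the selected channels:

```python
# Masked formulation (W_hat + R*M) x vs. fetching only k residual columns.
import torch

out_f, in_f, k = 8, 16, 4
W_hat = torch.randn(out_f, in_f)
R = torch.randn(out_f, in_f) * 0.01      # residual W - W_hat (small)
x = torch.randn(in_f)

# M is a binary mask at input-channel granularity.
salient = torch.topk(x.abs(), k).indices
M = torch.zeros(out_f, in_f)
M[:, salient] = 1.0

y_masked = (W_hat + R * M) @ x
# Equivalent to fetching only the k residual columns over PCIe:
y_fetch = W_hat @ x + R[:, salient] @ x[salient]
assert torch.allclose(y_masked, y_fetch, atol=1e-5)
```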

3.2. Opportunity: Not All Residuals Are Equally Important

[Figure 3]

  • When certain activation values are noticeably large
    → even small quantization errors in the corresponding weight channels can be multiplied and amplified, as shown in Figure 3
    → resulting in considerable perturbations in the output.

salient channels: Channels whose input activations have outlier magnitudes such that small weight-quantization errors in those channels are multiplied and amplified, causing large output perturbations.
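To make the amplification explicit, the output error of a linear layer decomposes over input channels (using $\boldsymbol{R} = \boldsymbol{W} - \boldsymbol{\hat{W}}$ as above):

$$
\boldsymbol{W}\boldsymbol{x} - \boldsymbol{\hat{W}}\boldsymbol{x} = \boldsymbol{R}\boldsymbol{x} = \sum_{j} \boldsymbol{R}_{:,j}\, x_j
$$

so a channel $j$ with an outlier magnitude $|x_j|$ scales up even a small residual column $\boldsymbol{R}_{:,j}$, which is what makes it salient.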

  • Constructing M:
    • $\boldsymbol{M}$ is constructed at input-channel granularity, based on the magnitude of the input activations.
    • Satisfies two key conditions:
      1. select impactful portions of the residuals
      2. maintain a structured form

[Figure 4]

  • Figure 4
    • Key Point: Prioritizing salient channels based on activation magnitude is effective.
    • Error: Mean squared error between the computation result with FP16 weights ($\boldsymbol{Wx}$) and quantized weights ($\boldsymbol{\hat{W}x}$)
    • x-axis: Cumulative number of input channels replaced with their FP16 counterparts.
  • Analysis:
    • The error is reduced by sequentially replacing the input channels of the quantized weight ($\boldsymbol{\hat{W}}$) with their corresponding FP16 values ($\boldsymbol{W}$).
    • Quantization error curves (see the sketch below):
      • Solid red/blue (sorted): rapid drop when channels are compensated in descending order of their activation magnitudes.
      • Dotted (random): significantly slower reduction when channels are compensated in random order.
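The sketch below mimics the kind of analysis behind Figure 4 on synthetic data (random weights and injected activation outliers; not the paper's models or measurements), comparing the sorted and random replacement orders:

```python
# Toy reproduction of the Figure 4 analysis: replace quantized input channels
# with their FP16 counterparts and track the output MSE.
import torch

torch.manual_seed(0)
out_f, in_f = 512, 512
W = torch.randn(out_f, in_f)
W_hat = W + torch.randn(out_f, in_f) * 0.02      # stand-in for quantized weights
x = torch.randn(in_f)
x[torch.randint(in_f, (8,))] *= 20.0             # inject a few activation outliers

def mse_after_replacing(order):
    W_mix = W_hat.clone()
    errors = []
    for j in order:
        W_mix[:, j] = W[:, j]                    # restore this channel to FP16
        errors.append(((W @ x - W_mix @ x) ** 2).mean().item())
    return errors

sorted_order = torch.argsort(x.abs(), descending=True)   # by activation magnitude
random_order = torch.randperm(in_f)

err_sorted = mse_after_replacing(sorted_order)
err_random = mse_after_replacing(random_order)
print(err_sorted[:8])   # drops quickly: outlier channels are fixed first
print(err_random[:8])   # drops much more slowly
```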

3.3. Challenge: Dynamic Nature of Activation Outliers

4. DecDEC Design

5. Evaluation
