
[DynDNNs] Research Progress: Porting Gemmini::tiled_matmul_auto()


Abstract

This post examines Gemmini’s high-level API function tiled_matmul_auto(), which implements systolic-array-based tiled matrix multiplication with automatic tile-size computation.
We analyze its input parameters and hardware constraints—such as supported data types and 16-byte alignment requirements—to ensure correct invocation within the llama.cpp inference engine.

Introduction

Gemmini is an open-source, full-stack DNN accelerator generator that produces ASIC designs featuring a parameterizable systolic array, banked scratchpad memory, and DMA subsystems for efficient on-chip data movement.
Its tiled_matmul_auto() function automates the selection of tiling factors to maximize PE utilization, but requires arguments that respect Gemmini’s element type support and alignment restrictions.
When porting this function into llama.cpp—a lightweight LLM inference engine handling dynamic tensor shapes—we must first identify the exact argument conditions and fallback behaviors (e.g., CPU emulation) enforced by tiled_matmul_auto() under Gemmini’s hardware constraints.

2. Organization

  1. Parameters of tiled_matmul_auto() — Detailed analysis of each argument and its valid range.
  2. Running Gemmini from llama.cpp (QEMU-Simulated) — Verifying argument handling by initializing and invoking tiled_matmul_auto() within the llama.cpp test harness. Progress is tracked in ggml-gemmini.
  3. Conclusion — Summarizing the identified preconditions and fallback pathways.

3. Sections

3.1. Parameters of tiled_matmul_auto()

The prototype of tiled_matmul_auto() is:

```c
_STATIC void tiled_matmul_auto(size_t dim_I, size_t dim_J, size_t dim_K,
        const elem_t* A, const elem_t* B,
        const void * D, void * C,
        size_t stride_A, size_t stride_B, size_t stride_D, size_t stride_C,
        scale_t A_scale_factor, scale_t B_scale_factor, scale_acc_t D_scale_factor,
        int act, acc_scale_t scale, acc_scale_t bert_scale,
        bool repeating_bias,
        bool transpose_A, bool transpose_B,
        bool full_C, bool low_D,
        uint8_t weightA,
        enum tiled_matmul_type_t tiled_matmul_type)
```

gemmini-rocc-tests/include/gemmini.h

tiled_matmul_auto() is in turn invoked by the convenience wrapper tiled_matmul_nn_auto():

```c
static void tiled_matmul_nn_auto(size_t dim_I, size_t dim_J, size_t dim_K,
        const elem_t A[dim_I][dim_K], const elem_t B[dim_K][dim_J],
        const void * D, elem_t C[dim_I][dim_J],
        int act, acc_scale_t scale, bool repeating_bias,
        enum tiled_matmul_type_t tiled_matmul_type,
        bool check, char * layer_name)
```

gemmini-rocc-tests/include/gemmini_nn.h

tiled_matmul_auto() performs the GEMM (General Matrix Multiplication) operation C = A × B + D. When neither A nor B is transposed, the logical matrix shapes are:

  • A: I × K
  • B: K × J
  • C: I × J
  • D: 1 × J bias when repeating_bias == true, otherwise I × J

All stride_* parameters are given in elements (not bytes); Gemmini converts them internally to byte offsets.
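For dense row-major buffers, those element-unit strides are simply each matrix's column count. A minimal sketch of this convention (the `GemmStrides` helper is hypothetical, not part of Gemmini):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: for a dense row-major GEMM C = A × B + D with
// A: I×K, B: K×J, C/D: I×J, the element-unit row stride of each buffer
// is just its logical column count.
struct GemmStrides { size_t A, B, C, D; };

inline GemmStrides dense_strides(size_t /*I*/, size_t J, size_t K) {
    // stride_A = K (A has K columns); B, C, and D all have J columns.
    return { K, J, J, J };
}
```

When the buffers are padded (see Section 3.2), the strides must instead reflect the padded row lengths, not the logical dimensions.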

Finally, the annotated signature of tiled_matmul_auto() is:

```c
static void tiled_matmul_auto(
    size_t               dim_I,            // Number of rows in A and C
    size_t               dim_J,            // Number of columns in B and C
    size_t               dim_K,            // Shared dimension (A's cols, B's rows)
    const elem_t*        A,                // Input matrix A (8-bit elements)
    const elem_t*        B,                // Input matrix B (8-bit elements)
    const void*          D,                // Bias matrix D (accumulator type)
    void*                C,                // Output matrix C
    size_t               stride_A,         // Row stride of A (in elements)
    size_t               stride_B,         // Row stride of B
    size_t               stride_D,         // Row stride of D
    size_t               stride_C,         // Row stride of C
    scale_t              A_scale_factor,   // Quantization scale for A
    scale_t              B_scale_factor,   // Quantization scale for B
    scale_acc_t          D_scale_factor,   // Pre-accumulation scale for D
    int                  act,              // Activation function ID
    acc_scale_t          scale,            // Post-activation scale
    acc_scale_t          bert_scale,       // Additional scale for BERT/IGELU
    bool                 repeating_bias,   // True if D is broadcast per row
    bool                 transpose_A,      // True to transpose A
    bool                 transpose_B,      // True to transpose B
    bool                 full_C,           // True to store C in 32-bit
    bool                 low_D,            // True to treat D as 8-bit
    uint8_t              weightA,          // Experimental flag
    tiled_matmul_type_t  tiled_matmul_type // Execution mode: OS / WS / CPU
);
```

In summary, each parameter defines one aspect of the tiled GEMM:

  • dim_I, dim_J, dim_K: Logical matrix dimensions
  • A, B, D, C: Buffer pointers for input/output
  • stride_*: Row strides (in elements) for each buffer
  • *_scale_factor: Quantization and accumulation scales
  • act, scale, bert_scale: Activation and post-processing parameters
  • Flags (repeating_bias, transpose_*, full_C, low_D, weightA): Control data layout, precision, and experimental modes
  • tiled_matmul_type: Selects dataflow mode or CPU fallback
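As a sanity check on the shape and bias conventions above, the core arithmetic of C = A × B + D can be sketched on the CPU. This is a hedged reference sketch, not Gemmini code: scaling, activation, and transposition are omitted, and `matmul_ref` is a hypothetical name. elem_t is int8_t with a 32-bit accumulator, matching Gemmini's default 8-bit configuration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using elem_t = int8_t;

// Hypothetical CPU reference for C = A × B + D (no scaling/activation).
// With repeating_bias, D is a single 1×J row broadcast to every row of C;
// otherwise D is a full I×J matrix.
std::vector<int32_t> matmul_ref(size_t I, size_t J, size_t K,
                                const elem_t* A, const elem_t* B,
                                const int32_t* D, bool repeating_bias) {
    std::vector<int32_t> C(I * J);
    for (size_t i = 0; i < I; ++i)
        for (size_t j = 0; j < J; ++j) {
            int32_t acc = D ? D[(repeating_bias ? 0 : i) * J + j] : 0;
            for (size_t k = 0; k < K; ++k)
                acc += int32_t(A[i * K + k]) * int32_t(B[k * J + j]);
            C[i * J + j] = acc;
        }
    return C;
}
```

This mirrors only the dimension semantics (dim_I, dim_J, dim_K, repeating_bias) that the hardware call must respect; the accelerator additionally applies the quantization scales and activation listed above.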

3.2. Running Gemmini from llama.cpp (QEMU-Simulated)

3.2.1. Background

GGML tensors embed both logical shape and in-memory stride in two small arrays:

| field | meaning (for 2-D tensors) |
|-------|---------------------------|
| ne[0] | number of columns K |
| ne[1] | number of rows I |
| nb[0] | byte stride between adjacent columns (sizeof(elem) for dense) |
| nb[1] | byte stride between adjacent rows (nb[0] * ne[0] if contiguous) |

An I × K matrix is therefore stored as ne = [K, I].
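The ne/nb convention can be illustrated with a minimal sketch (this is not GGML's actual struct; `Shape2D`, `dense_2d`, and `offset_2d` are hypothetical names for illustration only):

```cpp
#include <cassert>
#include <cstddef>

// Illustration of GGML's shape/stride convention for a dense 2-D tensor
// of elem_size-byte elements: ne = [K, I] (columns first), nb in bytes.
struct Shape2D { size_t ne[2]; size_t nb[2]; };

Shape2D dense_2d(size_t I, size_t K, size_t elem_size) {
    Shape2D t;
    t.ne[0] = K;               // columns
    t.ne[1] = I;               // rows
    t.nb[0] = elem_size;       // stride between adjacent columns
    t.nb[1] = elem_size * K;   // stride between adjacent rows (contiguous)
    return t;
}

// Byte offset of logical element (row i, col k).
size_t offset_2d(const Shape2D& t, size_t i, size_t k) {
    return i * t.nb[1] + k * t.nb[0];
}
```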
For GEMM we need:

\[ C_{I \times J} = A_{I \times K} \times B_{K \times J} \;(+\, D) \]

In GGML, the token embedding vector B is already stored in transposed form: a single token is held as a row vector 1 × K, i.e. ne = [K, 1].
Gemmini, however, must read a column vector K × 1. Because the data start out as 1 × K, they are mathematically transposed with respect to the required $K \times J$ shape, and must be converted (or flagged for hardware transpose) before the matmul.

Two ways to meet Gemmini’s expectation:

  • A. Pre-transpose in ggml: allocate a new tensor, copy B as K × 1, and pad and cast there.
  • B. Use transpose_B = true: keep the original 1 × K and let Gemmini flip it.

| option | pros / cons |
|--------|-------------|
| A | + straight rows/cols in memory; − extra copy on CPU |
| B | + no CPU copy; − Gemmini still reads the 1 × K row into a 16 × K padded buffer, so the 16 B row stride must still be honoured |

We adopt Option A, implemented by:

```cpp
ggml_gemmini_tensor<int8_t> tB(tmp_ctx, src1, ".i8",
                               /*transpose=*/true);
```

This helper also casts the data from float32 to int8 for Gemmini.

Inside it we:

  1. Pad the column count:
    padded_cols = align_up(K, 16 / sizeof(elem_t));
  2. Allocate a 16-byte-aligned buffer.
  3. Copy & cast B element-wise as K × 1, zero-filling the extra columns.
  4. Overwrite the stride so it reflects the padded row length.
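The steps above might be sketched as follows. This is a simplified illustration under stated assumptions: `pad_cast_f32_to_i8` is a hypothetical name, std::vector is used in place of a truly 16-byte-aligned allocation, and the cast is naive (real code would apply a quantization scale and clamp):

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <vector>

using elem_t = int8_t;

// Round x up to the nearest multiple of a.
size_t align_up(size_t x, size_t a) { return (x + a - 1) / a * a; }

// Hypothetical sketch of steps 1–3: pad the column count to a 16-byte
// multiple, allocate a zero-filled buffer, and copy/cast element-wise.
std::vector<elem_t> pad_cast_f32_to_i8(const float* src, size_t rows, size_t cols,
                                       size_t* padded_cols_out) {
    size_t padded_cols = align_up(cols, 16 / sizeof(elem_t));  // step 1
    std::vector<elem_t> dst(rows * padded_cols, 0);            // steps 2–3: zero fill
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            dst[r * padded_cols + c] = (elem_t)src[r * cols + c];  // naive cast
    *padded_cols_out = padded_cols;  // step 4: caller overwrites the stride with this
    return dst;
}
```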

Gemmini’s 16-byte DMA (Direct Memory Access) requirement:

  • MVIN/MVOUT move one scratchpad row per DMA burst.
  • The AXI port width is 128 bits (16 bytes); addresses and lengths must be multiples of 16 B.
  • If a row is smaller than 16 B or its stride is not 16-byte aligned, the hardware rounds the address down and transfers garbage beyond the logical row.

Thus every row must satisfy:

\[ \texttt{row\_bytes} = \texttt{padded\_cols} \times \texttt{elem\_size}, \qquad \texttt{row\_bytes} \equiv 0 \pmod{16} \]

| data type | elem_size | required padded_cols |
|-----------|-----------|----------------------|
| int8 | 1 B | multiple of 16 elements (16, 32, 48, …) |
| int32 | 4 B | multiple of 4 elements (4, 8, 12, …) |

Padding with align_up(cols, 16 / elem_size) ensures the rule above, so row_bytes also becomes a 16-byte multiple automatically.
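A quick check of this rule for both element widths in the table (`align_up_to` and `row_bytes` are hypothetical helper names mirroring the formula above):

```cpp
#include <cassert>
#include <cstddef>

// Round x up to the nearest multiple of a.
size_t align_up_to(size_t x, size_t a) { return (x + a - 1) / a * a; }

// row_bytes = padded_cols * elem_size, with padded_cols chosen per the rule.
size_t row_bytes(size_t cols, size_t elem_size) {
    size_t padded_cols = align_up_to(cols, 16 / elem_size);
    return padded_cols * elem_size;
}
```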

```cpp
tiled_matmul_auto(
    I, J, K,
    tA.data(), tB.data(),        // A, B
    /* D = */ nullptr,
    tC.data(),
    stride_A, stride_B,          // element-unit strides
    ...,
    /* transpose_A = */ false,
    /* transpose_B = */ false);  // already pre-transposed
```

Key point: Every row now starts on a 16-byte boundary and its length is a 16-byte multiple; Gemmini streams perfect 128-bit bursts, avoiding read-modify-write penalties.

3.2.2. Implementation

This post is licensed under CC BY 4.0 by the author.