[DynDNNs] Research Progress: Porting Gemmini::tiled_matmul_auto()
Abstract
This post examines Gemmini’s high-level API function tiled_matmul_auto(), which implements systolic-array-based tiled matrix multiplication with automatic tile-size computation.
We analyze its input parameters and hardware constraints—such as supported data types and 16-byte alignment requirements—to ensure correct invocation within the llama.cpp inference engine.
Introduction
Gemmini is an open-source, full-stack DNN accelerator generator that produces ASIC designs featuring a parameterizable systolic array, banked scratchpad memory, and DMA subsystems for efficient on-chip data movement.
Its tiled_matmul_auto() function automates the selection of tiling factors to maximize PE utilization, but requires arguments that respect Gemmini’s element type support and alignment restrictions.
When porting this function into llama.cpp—a lightweight LLM inference engine handling dynamic tensor shapes—we must first identify the exact argument conditions and fallback behaviors (e.g., CPU emulation) enforced by tiled_matmul_auto() under Gemmini’s hardware constraints.
2. Organization
- Parameters of tiled_matmul_auto() — Detailed analysis of each argument and its valid range.
- Running Gemmini from llama.cpp (QEMU-Simulated) — Verifying argument handling by initializing and invoking tiled_matmul_auto() within the llama.cpp test harness. The progress is tracked on ggml-gemmini.
- Conclusion — Summarizing the identified preconditions and fallback pathways.
3. Sections
3.1. Parameters of tiled_matmul_auto()
The prototype of tiled_matmul_auto() is:
```c
_STATIC void tiled_matmul_auto(size_t dim_I, size_t dim_J, size_t dim_K,
        const elem_t* A, const elem_t* B,
        const void * D, void * C,
        size_t stride_A, size_t stride_B, size_t stride_D, size_t stride_C,
        scale_t A_scale_factor, scale_t B_scale_factor, scale_acc_t D_scale_factor,
        int act, acc_scale_t scale, acc_scale_t bert_scale,
        bool repeating_bias,
        bool transpose_A, bool transpose_B,
        bool full_C, bool low_D,
        uint8_t weightA,
        enum tiled_matmul_type_t tiled_matmul_type)
```
gemmini-rocc-tests/include/gemmini.h
And tiled_matmul_auto() is invoked by tiled_matmul_nn_auto():
```c
static void tiled_matmul_nn_auto(size_t dim_I, size_t dim_J, size_t dim_K,
        const elem_t A[dim_I][dim_K], const elem_t B[dim_K][dim_J],
        const void * D, elem_t C[dim_I][dim_J],
        int act, acc_scale_t scale, bool repeating_bias,
        enum tiled_matmul_type_t tiled_matmul_type,
        bool check, char * layer_name)
```
gemmini-rocc-tests/include/gemmini_nn.h
tiled_matmul_auto() performs the GEMM (General Matrix-Matrix Multiplication) operation C = A × B + D. When neither A nor B is transposed, the logical matrix shapes are:
- A: I × K
- B: K × J
- C: I × J
- D: 1 × J bias when repeating_bias == true, otherwise I × J
All stride_* parameters are given in elements (not bytes); Gemmini converts them internally to byte offsets.
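As a quick illustration, here is a sketch of the stride setup for the common dense, row-major, non-transposed case (an illustrative assumption, not the only valid layout):

```cpp
#include <cstddef>

// Illustrative stride setup for dense, non-transposed matrices:
// A is I x K, B is K x J, C is I x J, D has J columns.
struct GemmStrides { size_t A, B, C, D; };

static GemmStrides dense_strides(size_t dim_J, size_t dim_K) {
    return { /*A=*/dim_K, /*B=*/dim_J, /*C=*/dim_J, /*D=*/dim_J };
}
```

If A were instead a sub-view of a wider parent matrix with W columns per row, stride_A would be W rather than dim_K.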
Finally, the annotated signature of tiled_matmul_auto(), with each parameter's role, is:
```c
static void tiled_matmul_auto(
    size_t dim_I,                 // Number of rows in A and C
    size_t dim_J,                 // Number of columns in B and C
    size_t dim_K,                 // Shared dimension (A's cols, B's rows)
    const elem_t* A,              // Input matrix A (8-bit elements)
    const elem_t* B,              // Input matrix B (8-bit elements)
    const void* D,                // Bias matrix D (accumulator type)
    void* C,                      // Output matrix C
    size_t stride_A,              // Row stride of A (in elements)
    size_t stride_B,              // Row stride of B
    size_t stride_D,              // Row stride of D
    size_t stride_C,              // Row stride of C
    scale_t A_scale_factor,       // Quantization scale for A
    scale_t B_scale_factor,       // Quantization scale for B
    scale_acc_t D_scale_factor,   // Pre-accumulation scale for D
    int act,                      // Activation function ID
    acc_scale_t scale,            // Post-activation scale
    acc_scale_t bert_scale,       // Additional scale for BERT/IGELU
    bool repeating_bias,          // True if D is broadcast per row
    bool transpose_A,             // True to transpose A
    bool transpose_B,             // True to transpose B
    bool full_C,                  // True to store C in 32-bit
    bool low_D,                   // True to treat D as 8-bit
    uint8_t weightA,              // Experimental flag
    tiled_matmul_type_t tiled_matmul_type  // Execution mode: OS / WS / CPU
);
```
In summary, each parameter defines one aspect of the tiled GEMM:
- dim_I, dim_J, dim_K: Logical matrix dimensions
- A, B, D, C: Buffer pointers for input/output
- stride_*: Row strides (in elements) for each buffer
- *_scale_factor: Quantization and accumulation scales
- act, scale, bert_scale: Activation and post-processing parameters
- Flags (repeating_bias, transpose_*, full_C, low_D, weightA): Control data layout, precision, and experimental modes
- tiled_matmul_type: Selects dataflow mode or CPU fallback
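To make these groups concrete, here is a minimal, hedged invocation sketch for a small dense int8 GEMM with no bias. The NO_ACTIVATION constant and the WS (weight-stationary) enumerator are the values I would expect from gemmini.h, and weightA = 0 is an assumption, so treat the exact constants as illustrative:

```cpp
#include "gemmini.h"   // from gemmini-rocc-tests/include (path assumed)

// Hypothetical 16x16x16 int8 GEMM, C = A * B with no bias.
// 16 int8 elements per row = 16 bytes, so each row is one DMA beat
// (see the alignment discussion in Section 3.2).
static elem_t A[16][16], B[16][16], C[16][16];

void small_matmul(void) {
    tiled_matmul_auto(16, 16, 16,                    // dim_I, dim_J, dim_K
                      (elem_t*)A, (elem_t*)B,        // inputs
                      /*D=*/NULL, (void*)C,          // no bias, int8 output
                      /*stride_A=*/16, /*stride_B=*/16,
                      /*stride_D=*/16, /*stride_C=*/16,
                      /*A_scale_factor=*/1, /*B_scale_factor=*/1, /*D_scale_factor=*/1,
                      /*act=*/NO_ACTIVATION, /*scale=*/1, /*bert_scale=*/0,
                      /*repeating_bias=*/false,
                      /*transpose_A=*/false, /*transpose_B=*/false,
                      /*full_C=*/false, /*low_D=*/false,
                      /*weightA=*/0,                 // experimental flag, assumed 0
                      WS);                           // weight-stationary dataflow
}
```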
3.2. Running Gemmini from llama.cpp (QEMU-Simulated)
3.2.1. Background
GGML tensors embed both logical shape and in-memory stride in two small arrays:
| field | meaning (for 2-D tensors) |
|---|---|
| ne[0] | number of columns K |
| ne[1] | number of rows I |
| nb[0] | byte stride between adjacent columns (sizeof(elem) for dense) |
| nb[1] | byte stride between adjacent rows (nb[0] * ne[0] if contiguous) |
An I × K matrix is therefore stored as ne = [K, I].
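A small sketch of how these fields translate into Gemmini's element-unit strides, assuming a dense (non-quantized) 2-D tensor; ggml_type_size() is GGML's per-element size helper:

```cpp
#include "ggml.h"

// Sketch: recover an element-unit row stride from a dense 2-D GGML tensor
// t that stores an I x K matrix (ne[0] = K columns, ne[1] = I rows).
static size_t gemmini_row_stride(const struct ggml_tensor * t) {
    // nb[1] is the byte distance between rows; divide by the element size
    // to obtain the element-unit stride that tiled_matmul_auto() expects.
    return t->nb[1] / ggml_type_size(t->type);
}
```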
For the GEMM, Gemmini needs B laid out as K × J, which here is a column-vector K × 1 since a single token gives J = 1.
In GGML, however, the token embedding vector B is already stored in a transposed form: a single token is held as a row-vector 1 × K, i.e. ne = [K, 1].
Because the data start out as 1 × K, they are mathematically transposed with respect to the required $K \times J$ shape, and must be converted (or flagged for hardware transpose) before the matmul.
Two ways to meet Gemmini's expectation:
- A. Pre-transpose in ggml: Allocate a new tensor, copy B as K × 1, pad and cast there.
- B. Use transpose_B = true: Keep the original 1 × K; let Gemmini flip it.
| option | pros / cons |
|---|---|
| A | + Straight rows/cols in memory − Extra copy on CPU |
| B | + No CPU copy − Gemmini still reads the 1 × K row through a 16 × K padded buffer, so a 16-byte row stride must be honoured |
We adopt Option A, implemented by:
```cpp
ggml_gemmini_tensor<int8_t> tB(tmp_ctx, src1, ".i8",
                               /*transpose=*/true);
```
It also casts the data from float32 to int8 for Gemmini. Inside the helper we:
- Pad the column count: padded_cols = align_up(K, 16 / sizeof(elem_t));
- Allocate a 16-byte aligned buffer
- Copy & cast B element-wise as K × 1, zero-filling the extra columns
- Overwrite the stride with the padded row length (see the sketch below)
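A simplified, hand-written sketch of those steps follows. The align_up helper, the allocation, and the naive float32 → int8 cast are illustrative only; the real ggml_gemmini_tensor helper additionally handles the transpose and proper quantization:

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Round a column count up so each row fills whole 16-byte DMA beats.
static size_t align_up(size_t x, size_t a) { return (x + a - 1) / a * a; }

// Copy a rows x cols float32 matrix into a freshly allocated int8 buffer
// whose rows are padded (and zero-filled) to a 16-byte multiple.
static int8_t * copy_cast_pad(const float * src, size_t rows, size_t cols,
                              size_t * out_stride) {
    const size_t padded_cols = align_up(cols, 16 / sizeof(int8_t));
    int8_t * dst = (int8_t *) aligned_alloc(16, rows * padded_cols);  // 16-byte base
    memset(dst, 0, rows * padded_cols);                               // zero the padding
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            dst[r * padded_cols + c] = (int8_t) src[r * cols + c];    // naive cast
    *out_stride = padded_cols;   // element-unit row stride to pass to Gemmini
    return dst;
}
```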
Gemmini’s 16-byte DMA (Direct Memory Access) requirement:
- MVIN/MVOUT move one scratchpad row per DMA burst.
- AXI port width is 128 bit (16 byte); address and length must be multiples of 16 B.
- If a row is smaller than 16 B or the stride is not 16-byte aligned, the hardware rounds the address down and transfers garbage beyond the logical row.
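As a defensive measure (not part of the Gemmini API), one can assert these conditions before issuing the matmul:

```cpp
#include <cassert>
#include <cstdint>

// Check the DMA constraints described above: a 16-byte aligned base address
// and a row length that is a whole number of 16-byte beats.
static void check_dma_layout(const void * base, size_t row_bytes) {
    assert((uintptr_t) base % 16 == 0 && "buffer base not 16-byte aligned");
    assert(row_bytes % 16 == 0        && "row length not a multiple of 16 bytes");
}
```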
Thus every row must satisfy:
\[ \texttt{row\_bytes} = \texttt{padded\_cols} \times \texttt{elem\_size}, \qquad \texttt{row\_bytes} \equiv 0 \pmod{16} \]
| data type | elem_size | required padded_cols |
|---|---|---|
| int8 | 1 B | multiple of 16 elements (16, 32, 48 …) |
| int32 | 4 B | multiple of 4 elements (4, 8, 12 …) |
Padding with align_up(cols, 16 / elem_size) ensures the rule above, so row_bytes also becomes a 16-byte multiple automatically.
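For example, with int8 elements a logical row of K = 11 columns pads to padded_cols = align_up(11, 16) = 16, giving row_bytes = 16 B; with int32 elements a row of 5 columns pads to align_up(5, 4) = 8, giving row_bytes = 32 B.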
```cpp
tiled_matmul_auto(
    I, J, K,
    tA.data(), tB.data(),        // A, B
    /* D */ nullptr,
    tC.data(),
    stride_A, stride_B,          // element-unit strides
    ...,
    /* transpose_A = false */,
    /* transpose_B = false */);  // already pre-transposed
```
Key point: Every row now starts on a 16-byte boundary and its length is a 16-byte multiple; Gemmini streams perfect 128-bit bursts, avoiding read-modify-write penalties.