int-llm: a pure-integer LLM experiment in C

Community Article Published June 3, 2026

This is a short write-up for int-llm, a personal experiment in running small neural-network code with integer arithmetic instead of floating point.

The repository is here:

https://github.com/nmicic/int-llm

The project tries a simple constraint:

Keep the model compute path in fixed-point integer arithmetic, and move human-readable interpretation to the boundary.

The repo currently contains:

a tiny character-level GPT that trains and samples in Q16.48 fixed-point integer math
a CPU Llama-family inference path that can run TinyLlama-1.1B through a Q16.48 integer compute core
a native .mgw integer weight format for reloading converted Q16.48 weights
a standalone integer math library, fp_math.h
a determinism/regression gate
a parked CUDA/GPU branch with results that did not support the original hope

This is not a production runtime. It is slow, intentionally simple, and mostly useful as a reproducible experiment.

One boundary is worth stating clearly: the normal Hugging Face load path starts from safetensors weights and converts them into Q16.48 at load time. Floating point also exists in reference checks, display/profiling code, and other boundary paths. The narrower experiment is whether the model compute path can stay in fixed-point integer arithmetic, and whether the converted native path can reload the model as integer weights.

Why build this?

Most LLM work starts from floating point: FP32, BF16, FP16, FP8, or quantized formats derived from those. I wanted to test a different question:

How far can a small LLM stack go if the compute core is integer-only?

The design rule was:

Machine-native representation stays in the core; human-readable interpretation happens only at the boundary.

For the integer paths, that means:

no float or double in the compute core
no libm
Q16.48 fixed-point values stored in int64_t
__int128 intermediates for multiply/divide
explicit integer implementations for math functions used by the model

The project is less about being fast and more about making the behavior inspectable and reproducible.

The fixed-point format

The core format is Q16.48:

[63]    [62 .. 48]      [47 .. 0]
sign    integer part    fractional part

That gives:

range around ±32768
resolution of 2^-48, about 3.55e-15
enough fractional precision to preserve normal float16/bfloat16 model weights inside the Q16.48 range
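
To make the layout concrete, here is a minimal sketch of encoding and decoding Q16.48 values. The names (q48_t, Q48_ONE, q48_from_int and so on) are hypothetical and chosen for illustration; fp_math.h may name and structure these differently. The conversion to double is the kind of helper that belongs only at the display boundary.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; not taken from fp_math.h. */
typedef int64_t q48_t;                         /* Q16.48 stored in int64_t */

#define Q48_FRAC_BITS 48
#define Q48_ONE ((q48_t)1 << Q48_FRAC_BITS)    /* 1.0 in Q16.48 */

/* Encode a small integer. The 16-bit signed integer part limits the range. */
static q48_t q48_from_int(int32_t n)
{
    assert(n >= -32768 && n <= 32767);
    return (q48_t)n * Q48_ONE;
}

/* Integer part, rounded toward negative infinity.
 * Relies on arithmetic right shift of negative values (gcc and clang do this). */
static int64_t q48_floor(q48_t x)
{
    return x >> Q48_FRAC_BITS;
}

/* Raw 48-bit fractional field, always in [0, 2^48). */
static uint64_t q48_frac(q48_t x)
{
    return (uint64_t)x & (((uint64_t)1 << Q48_FRAC_BITS) - 1);
}

/* Boundary-only helper: human-readable value for printing or profiling. */
static double q48_to_double(q48_t x)
{
    return (double)x / (double)Q48_ONE;
}
```

In this sketch 1.5 is the raw integer Q48_ONE + Q48_ONE / 2, and the smallest positive step is the raw integer 1, which is 2^-48.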

The main math file is fp_math.h. It implements the pieces needed by the toy GPT and Llama path:

multiply/divide with 128-bit intermediates
square root and inverse square root
exp/log
sin/cos via CORDIC
sigmoid and SiLU
deterministic PRNG and sampling helpers
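
As an illustration of the first two items, multiply, divide, and square root can be sketched with 128-bit intermediates as follows. This is a sketch under assumptions, not the code in fp_math.h: the function names are hypothetical, results truncate rather than round, and overflow of the final result is not handled.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t q48_t;                         /* hypothetical type name */
#define Q48_FRAC_BITS 48
#define Q48_ONE ((q48_t)1 << Q48_FRAC_BITS)

/* a * b: take the full 128-bit product, then drop 48 fractional bits.
 * Assumes arithmetic right shift for negative values (gcc and clang). */
static q48_t q48_mul(q48_t a, q48_t b)
{
    __int128 p = (__int128)a * (__int128)b;
    return (q48_t)(p >> Q48_FRAC_BITS);
}

/* a / b: pre-scale the numerator by 2^48 so the quotient keeps 48 fractional bits. */
static q48_t q48_div(q48_t a, q48_t b)
{
    assert(b != 0);
    __int128 n = (__int128)a * (__int128)Q48_ONE;
    return (q48_t)(n / b);
}

/* sqrt(x): bit-by-bit integer square root of x * 2^48, so the result is again Q16.48. */
static q48_t q48_sqrt(q48_t x)
{
    assert(x >= 0);
    unsigned __int128 n = (unsigned __int128)x << Q48_FRAC_BITS;  /* < 2^111 */
    unsigned __int128 root = 0;
    unsigned __int128 bit = (unsigned __int128)1 << 110;          /* top power of four */

    while (bit > n)
        bit >>= 2;
    while (bit != 0) {
        if (n >= root + bit) {
            n -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return (q48_t)root;
}
```

The square root here uses the classic shift-and-subtract method because it is easy to verify by hand; the library may well use a different algorithm.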

There is also a visual gallery showing the integer-math seed experiments behind the library: e, pi, sqrt(2), and Euler's identity evaluated as fixed-point approximations.

What runs?

The repo has two main runnable paths.

1. Tiny character GPT

This is inspired by Andrej Karpathy's MicroGPT/makemore-style examples. The repo does not ship a dataset; it downloads the public makemore names dataset into input.txt:

make input gpt_int
./gpt_int

That trains a very small character GPT and samples names. The point is not quality; the point is that the training and sampling path runs through integer fixed-point math.
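
For a sense of what integer-only sampling can look like, here is a minimal sketch: a xorshift64 generator plus a cumulative-sum draw over Q16.48 probabilities. The generator choice and the function names are assumptions for illustration; the PRNG and sampling helpers in the repo may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Marsaglia xorshift64: deterministic for a given nonzero seed. */
static uint64_t xorshift64(uint64_t *state)
{
    assert(*state != 0);               /* zero is a fixed point of xorshift */
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Draw an index from probabilities given in Q16.48 that sum to about 1.0.
 * The top 48 bits of the random word form a uniform value in [0, 1.0). */
static size_t sample_index(const int64_t *prob, size_t n, uint64_t *state)
{
    assert(n > 0);
    int64_t r = (int64_t)(xorshift64(state) >> 16);   /* in [0, 2^48) */
    int64_t cum = 0;
    for (size_t i = 0; i < n; i++) {
        cum += prob[i];
        if (r < cum)
            return i;
    }
    return n - 1;                      /* absorbs rounding slack in the sum */
}
```

Because every step is integer arithmetic on a fixed seed, the same seed gives the same sequence of sampled indices on any platform.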

There is also a float32 baseline:

make gpt_float
./gpt_float

2. Integer TinyLlama inference

The larger path is llama_int.c, a CPU-only Llama-family inference engine. With a Hugging Face TinyLlama directory:

huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --local-dir models/tinyllama
make llama_int
./llama_int models/tinyllama --generate --prompt "What is the capital of France?" --max-new-tokens 16

Weights are loaded from safetensors and converted into Q16.48 at load time. The engine supports streaming one layer at a time, or caching layers in memory.
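
One way a load-time conversion could work is to decode the stored bit pattern with shifts alone. The sketch below assumes a bfloat16 source tensor and is not necessarily how llama_int.c does it; the function name and the saturation policy are my own choices for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert one bfloat16 bit pattern to Q16.48
 * using only integer operations.
 * bfloat16 layout: 1 sign bit, 8 exponent bits (bias 127), 7 mantissa bits. */
static int64_t bf16_bits_to_q48(uint16_t bits)
{
    int      neg = (bits >> 15) & 1;
    int      e   = (bits >> 7) & 0xFF;
    uint64_t m   = bits & 0x7F;
    int64_t  mag;

    /* Policy assumed here: a NaN weight means a corrupt checkpoint. */
    assert(!(e == 0xFF && m != 0));

    if (e == 0) {
        mag = 0;                       /* zero and subnormals are far below 2^-48 */
    } else if (e == 0xFF) {
        mag = INT64_MAX;               /* infinity saturates */
    } else {
        uint64_t sig = 0x80 | m;       /* 8-bit significand with implicit 1 */
        int shift = e - 86;            /* value * 2^48 == sig * 2^(e - 127 - 7 + 48) */
        if (shift > 55)
            mag = INT64_MAX;           /* magnitude of 32768 or more: saturate */
        else if (shift >= 0)
            mag = (int64_t)(sig << shift);
        else if (shift > -8)
            mag = (int64_t)(sig >> -shift);   /* tiny value: low bits truncated */
        else
            mag = 0;
    }
    return neg ? -mag : mag;
}
```

In this sketch any bfloat16 value whose magnitude lies in [2^-41, 32768) converts exactly, because its eight significand bits all land inside the 48 fractional and 15 integer bits.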

There is also a native integer weight format:

./llama_int models/tinyllama --export-native models/tinyllama.mgw
./llama_int models/tinyllama.mgw --native --generate --prompt "What is the capital of France?" --max-new-tokens 16

The .mgw file stores the model as Q16.48 integer weights. In that mode, the original floating-point weights do not reappear at load time.

What is actually proven?

The repo tries to be explicit about the boundary between evidence and speculation.

The main positive results:

TinyLlama-1.1B runs end-to-end through the integer CPU inference path.
The integer path matches the float reference token-for-token on the included greedy verification gate: 80/80 tokens across four benchmark prompts.
The fixed-point math library passes 335 unit/invariant tests.
The determinism gate hashes raw integer outputs and matches the committed golden hash on the tested x86_64/gcc and arm64/clang environments.
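
The idea behind hashing raw integer outputs can be sketched as below. The article does not say which hash the gate uses, so FNV-1a and the function names here are assumptions. Bytes are fed in a fixed little-endian order so the hash does not depend on host endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a 64-bit over the raw Q16.48 outputs, least significant byte first. */
static uint64_t hash_q48_outputs(const int64_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    uint64_t h = 0xcbf29ce484222325ULL;        /* FNV-1a 64-bit offset basis */
    for (size_t i = 0; i < n; i++) {
        uint64_t u = (uint64_t)v[i];
        for (int b = 0; b < 8; b++) {
            h ^= (u >> (8 * b)) & 0xFF;
            h *= 0x100000001b3ULL;             /* FNV 64-bit prime */
        }
    }
    return h;
}

/* The gate passes only if every output bit matches the committed run. */
static int determinism_gate(const int64_t *v, size_t n, uint64_t golden)
{
    return hash_q48_outputs(v, n) == golden;
}
```

Hashing the raw int64_t values rather than printed decimals means a single differing low bit anywhere in the output fails the gate.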
The tiny GPT demo trains and samples with integer arithmetic after downloading the small public names dataset.

The main non-claims:

This is not fast.
This is not a production inference engine.
This does not prove chat-quality parity beyond the included verification prompts.
This does not prove that integer arithmetic is always better than floating point.

That last point matters.

The GPU branch: useful negative evidence

The repo also contains a gpu/ directory. It is not part of the main build. It is included as a lab notebook because the negative results are useful.

The GPU experiments asked a separate question:

If the CPU integer path works, should the GPU path also be integer?

The answer, for this hardware and this model, was mostly no.

The FP16 GPU path worked very well:

full TinyLlama decode
80/80 token match against the CPU integer oracle
hundreds of tokens/sec on RTX-class hardware

But FP16 is floating point, so it does not support the integer-only thesis. It is a useful correctness sanity check, not part of the fixed-point integer path.

The integer or lower-precision GPU paths were less convincing:

INT8 tensor-core path with simple per-tensor quantization: 29/80 token match and slower than FP16 in this decode setup.
FP8 E4M3: faster than FP16 overall, but only 36/80 token match.
INT64/Q16.48-style GPU kernels: exactness is possible in principle, but no tensor-core support means the path is not competitive with FP16 tensor cores.
A fair CUDA-core-only comparison of FP32 vs INT32 fixed-point showed parity, not an integer speed advantage. At M=1 decode (one token per step, so each matmul reduces to a matrix-vector product), both were bandwidth/latency-bound.
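
To show what "simple per-tensor quantization" in the INT8 item means, here is a CPU-side sketch in plain C. It is not the kernel from the gpu/ directory: the function name is hypothetical, and it quantizes from Q16.48 input purely so the example stays in integer arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: symmetric per-tensor INT8 quantization.
 * One scale, max|w| / 127, is shared by the whole tensor.
 * Returns max|w| in Q16.48; a weight is recovered as q * max_abs / 127. */
static int64_t quantize_per_tensor_i8(const int64_t *w, size_t n, int8_t *q)
{
    int64_t max_abs = 0;
    for (size_t i = 0; i < n; i++) {
        assert(w[i] != INT64_MIN);             /* negation below must not overflow */
        int64_t a = w[i] < 0 ? -w[i] : w[i];
        if (a > max_abs)
            max_abs = a;
    }
    for (size_t i = 0; i < n; i++) {
        if (max_abs == 0) {
            q[i] = 0;
            continue;
        }
        __int128 num  = (__int128)w[i] * 127;
        __int128 half = max_abs / 2;
        num += (num < 0) ? -half : half;       /* round to nearest */
        q[i] = (int8_t)(num / max_abs);        /* always within [-127, 127] */
    }
    return max_abs;
}
```

With the weights 1.0, -0.5, 0.001, and 0, the third weight quantizes to 0 because the shared step is 1/127, roughly 0.008. A single large value in a tensor sets the step for every other weight, which is one plausible reason a per-tensor scheme can lose token agreement.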

That result changed how I think about the experiment. Integer math is not automatically faster. Hardware support matters. On NVIDIA GPUs, FP16/FP8/INT8 tensor cores are the fast path; exact wider integer math is not.

So the GPU folder stays parked: useful evidence, but not part of the integer core.

Why publish it anyway?

Because the repo keeps the experiment small enough to inspect:

a small integer training demo
a larger integer inference demo
a fixed-point math library
a reproducibility gate
a documented GPU branch that did not support the original hope

For me, the useful part is not a claim like "integer LLMs beat floating point." The useful part is narrower:

A Llama-family checkpoint can run through a fixed-point integer CPU inference core and match a float reference on a small token-level gate.

That seemed worth writing down.

What I would improve next

The next clean experiment would be a separate lower-precision SIMD branch: reduce the fixed-point format from Q16.48 to something like Q16.16 or Q8.24, then compare integer SIMD against float SIMD on the same CPU. That would test whether fixed-point arithmetic can be competitive on ordinary CPU hardware while keeping deterministic behavior. I would treat that as follow-up work, not as part of this first release.
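
As a rough picture of what the narrower format would look like, here is a scalar Q16.16 sketch with 64-bit intermediates. Nothing like this exists in the repo as far as this write-up describes; the names are hypothetical, and a real SIMD branch would vectorize the inner loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t q16_t;                         /* hypothetical Q16.16 type */
#define Q16_FRAC_BITS 16
#define Q16_ONE ((q16_t)1 << Q16_FRAC_BITS)    /* 1.0 in Q16.16 */

/* a * b with a 64-bit intermediate instead of __int128.
 * Assumes arithmetic right shift for negative values (gcc and clang). */
static q16_t q16_mul(q16_t a, q16_t b)
{
    int64_t p = ((int64_t)a * (int64_t)b) >> Q16_FRAC_BITS;
    assert(p >= INT32_MIN && p <= INT32_MAX);
    return (q16_t)p;
}

/* Dot product: accumulate full-width products, shift once at the end. */
static q16_t q16_dot(const q16_t *a, const q16_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int64_t)a[i] * (int64_t)b[i];
    acc >>= Q16_FRAC_BITS;
    assert(acc >= INT32_MIN && acc <= INT32_MAX);
    return (q16_t)acc;
}
```

The trade is visible in the constants: the range stays around ±32768, but the resolution drops from 2^-48 to 2^-16, about 1.5e-5, so whether token-level agreement survives would have to be measured.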

Closing

This project is a fun experiment, not a product. It started from a simple question: can I rebuild enough of a tiny LLM stack that the core arithmetic is integer-only?

Within the included checks, it can.

The more honest answer is:

integer-only CPU inference works
reproducibility is the strongest property
GPU acceleration prefers the formats the hardware was built for
some negative results are as useful as the positive ones

If you want to inspect or reproduce it, the repo is here:

https://github.com/nmicic/int-llm

Start with:

make input
make all
make regression
./gpt_int

Then read:
