int-llm: a pure-integer LLM experiment in C
The repository is here:
https://github.com/nmicic/int-llm
The project tries a simple constraint:
Keep the model compute path in fixed-point integer arithmetic, and move human-readable interpretation to the boundary.
The repo currently contains:
- a tiny character-level GPT that trains and samples in Q16.48 fixed-point integer math
- a CPU Llama-family inference path that can run TinyLlama-1.1B through a Q16.48 integer compute core
- a native .mgw integer weight format for reloading converted Q16.48 weights
- a standalone integer math library, fp_math.h
- a determinism/regression gate
- a parked CUDA/GPU branch with results that did not support the original hope
This is not a production runtime. It is slow, intentionally simple, and mostly useful as a reproducible experiment.
One boundary is worth stating clearly: the normal Hugging Face load path starts from safetensors weights and converts them into Q16.48 at load time. Floating point also exists in reference checks, display/profiling code, and other boundary paths. The narrower experiment is whether the model compute path can stay in fixed-point integer arithmetic, and whether the converted native path can reload the model as integer weights.
Why build this?
Most LLM work starts from floating point: FP32, BF16, FP16, FP8, or quantized formats around those. I wanted to test a different question:
How far can a small LLM stack go if the compute core is integer-only?
The design rule was:
Machine-native representation stays in the core; human-readable interpretation happens only at the boundary.
For the integer paths, that means:
- no float or double in the compute core
- no libm
- Q16.48 fixed-point values stored in int64_t
- __int128 intermediates for multiply/divide
- explicit integer implementations for math functions used by the model
The project is less about being fast and more about making the behavior inspectable and reproducible.
The fixed-point format
The core format is Q16.48:
[63]    [62 .. 48]      [47 .. 0]
sign    integer part    fractional part
That gives:
- range around ±32768
- resolution of 2^-48, about 3.55e-15
- enough fractional precision to preserve normal float16/bfloat16 model weights inside the Q16.48 range
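To make the format concrete, here is a minimal sketch of Q16.48 multiply and divide with 128-bit intermediates. It is not copied from fp_math.h: the names q16_48, Q_ONE, q_from_int, q_mul and q_div are hypothetical, and the rounding behavior shown (floor for multiply, truncation for divide) is an assumption rather than the library's documented choice.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t q16_48;                      /* hypothetical type name */

#define Q_FRAC_BITS 48
#define Q_ONE ((q16_48)1 << Q_FRAC_BITS)     /* 1.0 in Q16.48 */

/* Convert a small integer to Q16.48; caller keeps n in [-32768, 32767]. */
static inline q16_48 q_from_int(int32_t n) {
    return (q16_48)n * Q_ONE;
}

/* Multiply: the 128-bit product carries 96 fractional bits,
 * so 48 of them are shifted back out. Right-shifting a negative
 * value is implementation-defined in C; gcc and clang perform an
 * arithmetic shift, which rounds toward negative infinity. */
static inline q16_48 q_mul(q16_48 a, q16_48 b) {
    __int128 p = (__int128)a * (__int128)b;
    return (q16_48)(p >> Q_FRAC_BITS);
}

/* Divide: pre-scale the dividend by 2^48 so the quotient is again
 * in Q16.48. Multiplication is used instead of a left shift because
 * left-shifting a negative value is undefined behavior in C. */
static inline q16_48 q_div(q16_48 a, q16_48 b) {
    assert(b != 0);
    __int128 n = (__int128)a * ((__int128)1 << Q_FRAC_BITS);
    return (q16_48)(n / b);                  /* truncates toward zero */
}
```

Neither helper checks for overflow of the final result; a real library would have to decide whether to saturate, wrap, or trap when a product leaves the ±32768 range.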
The main math file is fp_math.h. It implements the pieces needed by the toy GPT and Llama path:
- multiply/divide with 128-bit intermediates
- square root and inverse square root
- exp/log
- sin/cos via CORDIC
- sigmoid and SiLU
- deterministic PRNG and sampling helpers
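As an illustration of what "explicit integer implementation" can mean, here is a sketch of a Q16.48 square root built on a digit-by-digit integer square root. This is one textbook technique, not necessarily the one fp_math.h uses; the name q_sqrt is hypothetical. The idea is that if x encodes X * 2^48, then sqrt(X) * 2^48 equals the integer square root of x * 2^48.

```c
#include <assert.h>
#include <stdint.h>

/* Square root of a non-negative Q16.48 value, rounded down.
 * Hypothetical helper: computes floor(sqrt(x << 48)) using only
 * shifts, adds and compares on a 128-bit unsigned integer. */
static inline int64_t q_sqrt(int64_t x) {
    assert(x >= 0);
    unsigned __int128 n = (unsigned __int128)x << 48;   /* n < 2^111 */
    unsigned __int128 res = 0;
    unsigned __int128 bit = (unsigned __int128)1 << 110; /* largest power of 4 that can fit */

    while (bit > n) {
        bit >>= 2;                 /* start at the highest power of 4 <= n */
    }
    while (bit != 0) {
        if (n >= res + bit) {      /* the current result bit is 1 */
            n -= res + bit;
            res = (res >> 1) + bit;
        } else {                   /* the current result bit is 0 */
            res >>= 1;
        }
        bit >>= 2;
    }
    return (int64_t)res;
}
```

Because every step is integer arithmetic, the result is bit-identical on any compiler that supports the 128-bit type, which is the property the rest of the project relies on.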
There is also a visual gallery showing the integer-math seed experiments behind the library: e, pi, sqrt(2), and Euler's identity evaluated as fixed-point approximations.
What runs?
The repo has two main runnable paths.
1. Tiny character GPT
This is inspired by Andrej Karpathy's MicroGPT/makemore-style examples. The repo does not ship a dataset; it downloads the public makemore names dataset into input.txt:
make input gpt_int
./gpt_int
That trains a very small character GPT and samples names. The point is not quality; the point is that the training and sampling path runs through integer fixed-point math.
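Sampling is one place where floating point often sneaks in, so here is a sketch of how a token could be drawn using integers only. It assumes the probabilities are already available as non-negative Q16.48 values; the xorshift generator and the names xorshift64 and sample_index are illustrative choices, not the repo's actual PRNG or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Marsaglia xorshift64 step. The state must be seeded non-zero. */
static inline uint64_t xorshift64(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Pick index i with probability weights[i] / sum(weights).
 * Weights are assumed non-negative with a sum that fits in 64 bits. */
static inline size_t sample_index(const int64_t *weights, size_t n,
                                  uint64_t *state) {
    assert(n > 0);
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += (uint64_t)weights[i];
    }
    assert(total > 0);

    /* The modulo introduces a tiny bias; acceptable for a sketch. */
    uint64_t r = xorshift64(state) % total;

    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (uint64_t)weights[i];
        if (r < acc) {
            return i;              /* r landed inside bucket i */
        }
    }
    return n - 1;                  /* not reached when total > 0 */
}
```

With a fixed seed, the sequence of sampled indices is the same on every run and every platform, since nothing in the path depends on floating-point rounding.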
There is also a float32 baseline:
make gpt_float
./gpt_float
2. Integer TinyLlama inference
The larger path is llama_int.c, a CPU-only Llama-family inference engine. With a Hugging Face TinyLlama directory:
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --local-dir models/tinyllama
make llama_int
./llama_int models/tinyllama --generate --prompt "What is the capital of France?" --max-new-tokens 16
Weights are loaded from safetensors and converted into Q16.48 at load time. The engine supports streaming one layer at a time, or caching layers in memory.
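The load-time conversion can itself be done with bit manipulation. The sketch below assumes the tensor is stored as bfloat16 (1 sign bit, 8 exponent bits with bias 127, 7 mantissa bits) and shows one way to produce Q16.48 without touching a float. The function names, the saturation policy for out-of-range values, and truncation of bits below 2^-48 are all assumptions; the repo's loader may do this differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert one bfloat16 bit pattern to Q16.48 using integer ops only. */
static inline int64_t bf16_to_q16_48(uint16_t bits) {
    int negative = (bits >> 15) & 1;
    int exp = (bits >> 7) & 0xFF;
    int64_t mant = bits & 0x7F;
    int64_t mag;

    if (exp == 0) {
        mag = 0;                   /* zero and subnormals are far below 2^-48 */
    } else if (exp > 141) {
        mag = INT64_MAX;           /* |x| >= 2^15, inf, NaN: saturate (assumption) */
    } else {
        mant |= 0x80;              /* restore the implicit leading 1 */
        /* value = mant * 2^(exp - 127 - 7); scaling by 2^48 gives
         * a net shift of exp - 86. */
        int shift = exp - 86;
        if (shift >= 0) {
            mag = mant << shift;
        } else if (shift > -8) {
            mag = mant >> -shift;  /* low bits are truncated */
        } else {
            mag = 0;               /* smaller than one Q16.48 step */
        }
    }
    return negative ? -mag : mag;
}

/* Convert a whole tensor of n bfloat16 values. */
static inline void convert_bf16_tensor(const uint16_t *src, int64_t *dst,
                                       size_t n) {
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = bf16_to_q16_48(src[i]);
    }
}
```

For example, the bfloat16 pattern 0x3F80 (1.0) maps to 1 << 48, and 0xC000 (-2.0) maps to -(1 << 49).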
There is also a native integer weight format:
./llama_int models/tinyllama --export-native models/tinyllama.mgw
./llama_int models/tinyllama.mgw --native --generate --prompt "What is the capital of France?" --max-new-tokens 16
The .mgw file stores the model as Q16.48 integer weights. In that mode, the original floating-point weights do not reappear at load time.
What is actually proven?
The repo tries to be explicit about the boundary between evidence and speculation.
The main positive results:
- TinyLlama-1.1B runs end-to-end through the integer CPU inference path.
- The integer path matches the float reference token-for-token on the included greedy verification gate: 80/80 tokens across four benchmark prompts.
- The fixed-point math library passes 335 unit/invariant tests.
- The determinism gate hashes raw integer outputs and matches the committed golden hash on the tested x86_64/gcc and arm64/clang environments.
- The tiny GPT demo trains and samples with integer arithmetic after downloading the small public names dataset.
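The determinism gate mentioned above hashes raw integer outputs. A minimal sketch of that idea, using 64-bit FNV-1a as a stand-in hash (the repo's actual hash function and the name hash_q_values are not taken from the source), looks like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hash an array of raw Q16.48 values with 64-bit FNV-1a.
 * Each value is fed byte by byte in a fixed little-endian order,
 * so the hash does not depend on the host's byte order. */
static inline uint64_t hash_q_values(const int64_t *vals, size_t n) {
    assert(vals != NULL || n == 0);
    uint64_t h = 0xcbf29ce484222325ULL;        /* FNV-1a offset basis */
    for (size_t i = 0; i < n; i++) {
        uint64_t u = (uint64_t)vals[i];
        for (int b = 0; b < 8; b++) {
            h ^= (u >> (8 * b)) & 0xFF;        /* next byte, low to high */
            h *= 0x100000001b3ULL;             /* FNV-1a 64-bit prime */
        }
    }
    return h;
}
```

A regression run would compute this over the model's integer outputs and compare it with a committed golden value. Hashing the integers directly, rather than printed decimals, keeps formatting and float rounding out of the comparison.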
The main non-claims:
- This is not fast.
- This is not a production inference engine.
- This does not prove chat-quality parity beyond the included verification prompts.
- This does not prove that integer arithmetic is always better than floating point.
That last point matters.
The GPU branch: useful negative evidence
The repo also contains a gpu/ directory. It is not part of the main build. It is included as a lab notebook because the negative results are useful.
The GPU experiments asked a separate question:
If the CPU integer path works, should the GPU path also be integer?
The answer, for this hardware and this model, was mostly no.
The FP16 GPU path worked very well:
- full TinyLlama decode
- 80/80 token match against the CPU integer oracle
- hundreds of tokens/sec on RTX-class hardware
But FP16 is floating point, so it does not support the integer-only thesis. It is a useful correctness sanity check, not part of the fixed-point integer path.
The integer or lower-precision GPU paths were less convincing:
- INT8 tensor-core path with simple per-tensor quantization: 29/80 token match and slower than FP16 in this decode setup.
- FP8 E4M3: faster than FP16 overall, but only 36/80 token match.
- INT64/Q16.48-style GPU kernels: exactness is possible in principle, but no tensor-core support means the path is not competitive with FP16 tensor cores.
- A fair CUDA-core-only comparison of FP32 vs INT32 fixed-point showed parity, not an integer speed advantage. At M=1 decode, both were bandwidth/latency-bound.
That result changed how I think about the experiment. Integer math is not automatically faster. Hardware support matters. On NVIDIA GPUs, FP16/FP8/INT8 tensor cores are the fast path; exact wider integer math is not.
So the GPU folder stays parked: useful evidence, but not part of the integer core.
Why publish it anyway?
Because the repo keeps the experiment small enough to inspect:
- a small integer training demo
- a larger integer inference demo
- a fixed-point math library
- a reproducibility gate
- a documented GPU branch that did not support the original hope
For me, the useful part is not a claim like "integer LLMs beat floating point." The useful part is narrower:
A Llama-family checkpoint can run through a fixed-point integer CPU inference core and match a float reference on a small token-level gate.
That seemed worth writing down.
What I would improve next
The next clean experiment would be a separate lower-precision SIMD branch: reduce the fixed-point format from Q16.48 to something like Q16.16 or Q8.24, then compare integer SIMD against float SIMD on the same CPU. That would test whether fixed-point arithmetic can be competitive on ordinary CPU hardware while keeping deterministic behavior. I would treat that as follow-up work, not as part of this first release.
Closing
This project is a fun experiment, not a product. It started from a simple question: can I rebuild enough of a tiny LLM stack so the core arithmetic is integer-only?
Within the included checks, it can.
The more honest answer is:
- integer-only CPU inference works
- reproducibility is the strongest property
- GPU acceleration prefers the formats the hardware was built for
- some negative results are as useful as the positive ones
If you want to inspect or reproduce it, the repo is here:
https://github.com/nmicic/int-llm
Start with:
make input
make all
make regression
./gpt_int
Then read: