int-llm: a pure-integer LLM experiment in C

Community Article Published June 3, 2026

This is a short write-up for int-llm, a personal experiment in running small neural-network code with integer arithmetic instead of floating point.

The repository is here:

https://github.com/nmicic/int-llm

The project tries a simple constraint:

Keep the model compute path in fixed-point integer arithmetic, and move human-readable interpretation to the boundary.

The repo currently contains:

  • a tiny character-level GPT that trains and samples in Q16.48 fixed-point integer math
  • a CPU Llama-family inference path that can run TinyLlama-1.1B through a Q16.48 integer compute core
  • a native .mgw integer weight format for reloading converted Q16.48 weights
  • a standalone integer math library, fp_math.h
  • a determinism/regression gate
  • a parked CUDA/GPU branch with results that did not support the original hope

This is not a production runtime. It is slow, intentionally simple, and mostly useful as a reproducible experiment.

One boundary is worth stating clearly: the normal Hugging Face load path starts from safetensors weights and converts them into Q16.48 at load time. Floating point also exists in reference checks, display/profiling code, and other boundary paths. The narrower experiment is whether the model compute path can stay in fixed-point integer arithmetic, and whether the converted native path can reload the model as integer weights.

Why build this?

Most LLM work starts from floating point: FP32, BF16, FP16, FP8, or quantized formats around those. I wanted to test a different question:

How far can a small LLM stack go if the compute core is integer-only?

The design rule was:

Machine-native representation stays in the core; human-readable interpretation happens only at the boundary.

For the integer paths, that means:

  • no float or double in the compute core
  • no libm
  • Q16.48 fixed-point values stored in int64_t
  • __int128 intermediates for multiply/divide
  • explicit integer implementations for math functions used by the model
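As a minimal sketch of what "Q16.48 in int64_t with __int128 intermediates" can look like, the following shows a multiply and a divide. The names q16_48, Q_ONE, q_mul, and q_div are illustrative assumptions, not the identifiers used in fp_math.h, and the rounding behavior shown here may differ from the library's.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t q16_48;                 /* hypothetical type name */
#define Q_FRAC_BITS 48
#define Q_ONE ((q16_48)1 << Q_FRAC_BITS)

/* Multiply: the raw product carries 96 fractional bits, so 48 are shifted out.
 * Right-shifting a negative value is implementation-defined in C; gcc and
 * clang use an arithmetic shift, which rounds toward negative infinity. */
static q16_48 q_mul(q16_48 a, q16_48 b) {
    __int128 wide = (__int128)a * (__int128)b;
    return (q16_48)(wide >> Q_FRAC_BITS);
}

/* Divide: scale the numerator by 2^48 first so the quotient keeps
 * 48 fractional bits. C integer division truncates toward zero. */
static q16_48 q_div(q16_48 a, q16_48 b) {
    assert(b != 0);
    __int128 wide = (__int128)a * ((__int128)1 << Q_FRAC_BITS);
    return (q16_48)(wide / b);
}
```

With these definitions, q_mul(3 * Q_ONE / 2, 2 * Q_ONE) yields 3 * Q_ONE, that is, 1.5 times 2.0 equals 3.0 with no floating-point operation involved. Neither helper checks for results outside the Q16.48 range.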

The project is less about being fast and more about making the behavior inspectable and reproducible.

The fixed-point format

The core format is Q16.48:

[63]    [62 .. 48]      [47 .. 0]
sign    integer part    fractional part

That gives:

  • range of [-32768, 32768)
  • resolution of 2^-48, about 3.55e-15
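  • enough fractional precision to hold in-range float16 values exactly (their finest step is 2^-24) and bfloat16 weights of ordinary magnitude; bfloat16 values smaller than about 2^-41 in magnitude would lose low bits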
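To make the "interpretation at the boundary" idea concrete, here is a small sketch that renders a raw Q16.48 value as decimal text using only integer operations. The function name q_format and its signature are assumptions for illustration; the repo's own display code may work differently (the article notes that floating point exists in display paths).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define Q_FRAC_BITS 48
#define Q_FRAC_MASK ((UINT64_C(1) << Q_FRAC_BITS) - 1)

/* Writes a truncated decimal rendering of a raw Q16.48 value into buf.
 * Returns the number of characters written, excluding the terminator. */
static int q_format(int64_t raw, int digits, char *buf, size_t size) {
    /* Worst case: sign + 5 integer digits + '.' + digits + NUL. */
    assert(buf != NULL && digits >= 0 && size >= (size_t)digits + 8);
    /* Magnitude via unsigned arithmetic so INT64_MIN is handled too. */
    uint64_t mag = raw < 0 ? UINT64_C(0) - (uint64_t)raw : (uint64_t)raw;
    uint64_t frac = mag & Q_FRAC_MASK;
    int n = snprintf(buf, size, "%s%llu.", raw < 0 ? "-" : "",
                     (unsigned long long)(mag >> Q_FRAC_BITS));
    for (int i = 0; i < digits; i++) {
        frac *= 10;                                  /* stays below 2^52 */
        buf[n++] = (char)('0' + (frac >> Q_FRAC_BITS));
        frac &= Q_FRAC_MASK;
    }
    buf[n] = '\0';
    return n;
}
```

Called with a raw value of 1 and 15 digits, this produces 0.000000000000003, which is the truncated form of the 3.55e-15 resolution quoted above.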

The main math file is fp_math.h. It implements the pieces needed by the toy GPT and Llama path:

  • multiply/divide with 128-bit intermediates
  • square root and inverse square root
  • exp/log
  • sin/cos via CORDIC
  • sigmoid and SiLU
  • deterministic PRNG and sampling helpers
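One of these pieces, the square root, can be sketched with a classic bit-by-bit integer square root on a 128-bit intermediate. This is an assumed approach chosen for clarity; the article does not say which algorithm fp_math.h uses, and the name q_sqrt is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define Q_FRAC_BITS 48

/* Square root of a non-negative raw Q16.48 value, rounded down.
 * If x = X / 2^48, then sqrt(x) * 2^48 = sqrt(X * 2^48), so the raw
 * result is the integer square root of the widened value X << 48. */
static int64_t q_sqrt(int64_t x) {
    assert(x >= 0);
    unsigned __int128 n = (unsigned __int128)x << Q_FRAC_BITS; /* < 2^111 */
    unsigned __int128 root = 0;
    unsigned __int128 bit = (unsigned __int128)1 << 110;  /* a power of four */
    while (bit > n) bit >>= 2;        /* largest power of four <= n, or 0 */
    while (bit != 0) {
        if (n >= root + bit) {
            n -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return (int64_t)root;             /* always below 2^56 */
}
```

The loop runs a fixed pattern of shifts, compares, and subtractions, so the result depends only on the input bits, which is the property a determinism gate relies on.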

There is also a visual gallery showing the integer-math seed experiments behind the library: e, pi, sqrt(2), and Euler's identity evaluated as fixed-point approximations.

What runs?

The repo has two main runnable paths.

1. Tiny character GPT

This is inspired by Andrej Karpathy's MicroGPT/makemore-style examples. The repo does not ship a dataset; it downloads the public makemore names dataset into input.txt:

make input gpt_int
./gpt_int

That trains a very small character GPT and samples names. The point is not quality; the point is that the training and sampling path runs through integer fixed-point math.

There is also a float32 baseline:

make gpt_float
./gpt_float

2. Integer TinyLlama inference

The larger path is llama_int.c, a CPU-only Llama-family inference engine. With a Hugging Face TinyLlama directory:

huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --local-dir models/tinyllama
make llama_int
./llama_int models/tinyllama --generate --prompt "What is the capital of France?" --max-new-tokens 16

Weights are loaded from safetensors and converted into Q16.48 at load time. The engine supports streaming one layer at a time, or caching layers in memory.
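The load-time conversion can itself be done without floating-point arithmetic by decoding the bit fields of each weight. The sketch below assumes the source tensor is bfloat16 (1 sign bit, 8 exponent bits, 7 mantissa bits); float16 or float32 tensors would need different field widths. The function name and the saturation policy are assumptions, not taken from llama_int.c.

```c
#include <assert.h>
#include <stdint.h>

/* Converts one bfloat16 bit pattern to a raw Q16.48 value using shifts only.
 * A normal bfloat16 value is (128 + mantissa) * 2^(e - 127 - 7), so the raw
 * Q16.48 value is that significand shifted by (e - 134) + 48 = e - 86. */
static int64_t bf16_to_q16_48(uint16_t bits) {
    int negative = bits >> 15;
    int e = (bits >> 7) & 0xFF;
    int64_t sig = bits & 0x7F;
    int64_t raw;
    assert(e != 0xFF);               /* inf/NaN are rejected in this sketch */
    if (e == 0) {
        raw = 0;                     /* zero and subnormals: far below 2^-48 */
    } else if (e > 141) {
        raw = INT64_MAX;             /* |value| >= 32768: saturate */
    } else {
        sig |= 0x80;                 /* restore the implicit leading 1 */
        int shift = e - 86;
        if (shift >= 0)      raw = sig << shift;     /* exact */
        else if (shift > -8) raw = sig >> -shift;    /* low bits truncated */
        else                 raw = 0;                /* below resolution */
    }
    return negative ? -raw : raw;
}
```

For example, the pattern 0x3F80 (1.0 in bfloat16) maps to 1 << 48, and 0xC020 (-2.5) maps to -(5 << 47).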

There is also a native integer weight format:

./llama_int models/tinyllama --export-native models/tinyllama.mgw
./llama_int models/tinyllama.mgw --native --generate --prompt "What is the capital of France?" --max-new-tokens 16

The .mgw file stores the model as Q16.48 integer weights. In that mode, the original floating-point weights do not reappear at load time.

What is actually proven?

The repo tries to be explicit about the boundary between evidence and speculation.

The main positive results:

  • TinyLlama-1.1B runs end-to-end through the integer CPU inference path.
  • The integer path matches the float reference token-for-token on the included greedy verification gate: 80/80 tokens across four benchmark prompts.
  • The fixed-point math library passes 335 unit/invariant tests.
  • The determinism gate hashes raw integer outputs and matches the committed golden hash on the tested x86_64/gcc and arm64/clang environments.
  • The tiny GPT demo trains and samples with integer arithmetic after downloading the small public names dataset.
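A determinism gate of this kind can be sketched as a hash over the raw integer outputs, compared against a stored value. The version below uses 64-bit FNV-1a and feeds each int64_t in a fixed little-endian byte order so the result does not depend on host endianness. The choice of FNV-1a and the function name are assumptions for illustration; the article does not state which hash the repo uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a over raw integer outputs, one byte at a time,
 * least significant byte first. */
static uint64_t hash_raw_outputs(const int64_t *values, size_t count) {
    assert(values != NULL || count == 0);
    uint64_t h = UINT64_C(0xcbf29ce484222325);       /* FNV offset basis */
    for (size_t i = 0; i < count; i++) {
        uint64_t v = (uint64_t)values[i];
        for (int b = 0; b < 8; b++) {
            h ^= (v >> (8 * b)) & 0xFF;
            h *= UINT64_C(0x100000001b3);            /* FNV prime */
        }
    }
    return h;
}
```

Because the inputs are integers rather than formatted floats, two platforms agree on the hash exactly when they agree on every output bit.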

The main non-claims:

  • This is not fast.
  • This is not a production inference engine.
  • This does not prove chat-quality parity beyond the included verification prompts.
  • This does not prove that integer arithmetic is always better than floating point.

That last point matters.

The GPU branch: useful negative evidence

The repo also contains a gpu/ directory. It is not part of the main build. It is included as a lab notebook because the negative results are useful.

The GPU experiments asked a separate question:

If the CPU integer path works, should the GPU path also be integer?

The answer, for this hardware and this model, was mostly no.

The FP16 GPU path worked very well:

  • full TinyLlama decode
  • 80/80 token match against the CPU integer oracle
  • hundreds of tokens/sec on RTX-class hardware

But FP16 is floating point, so it does not support the integer-only thesis. It is a useful correctness sanity check, not part of the fixed-point integer path.

The integer or lower-precision GPU paths were less convincing:

  • INT8 tensor-core path with simple per-tensor quantization: 29/80 token match and slower than FP16 in this decode setup.
  • FP8 E4M3: faster than FP16 overall, but only 36/80 token match.
  • INT64/Q16.48-style GPU kernels: exactness is possible in principle, but no tensor-core support means the path is not competitive with FP16 tensor cores.
  • A fair CUDA-core-only comparison of FP32 vs INT32 fixed-point showed parity, not an integer speed advantage. At M=1 decode, both were bandwidth/latency-bound.
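For readers unfamiliar with "simple per-tensor quantization", the sketch below shows the general technique on raw Q16.48 values: one scale for the whole tensor, derived from the largest magnitude. It is a CPU illustration with assumed names, not the code in gpu/, and it is not offered as a diagnosis of the 29/80 result. It only shows one known weakness of the scheme: a single large value forces small values to round to zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Symmetric per-tensor quantization of raw Q16.48 values to int8.
 * Returns max |x| as a raw Q16.48 value; dequantize as q * max_abs / 127. */
static int64_t quantize_per_tensor(const int64_t *x, size_t n, int8_t *out) {
    assert(x != NULL && out != NULL);
    __int128 max_abs = 0;
    for (size_t i = 0; i < n; i++) {
        __int128 mag = x[i] < 0 ? -(__int128)x[i] : (__int128)x[i];
        if (mag > max_abs) max_abs = mag;
    }
    for (size_t i = 0; i < n; i++) {
        if (max_abs == 0) { out[i] = 0; continue; }
        __int128 num = (__int128)x[i] * 127;
        /* Add half the divisor so truncating division rounds to nearest. */
        num += (num >= 0) ? max_abs / 2 : -(max_abs / 2);
        out[i] = (int8_t)(num / max_abs);
    }
    return max_abs > INT64_MAX ? INT64_MAX : (int64_t)max_abs;
}
```

With inputs 100.0, 0.25, -0.25, -100.0 the quantized values are 127, 0, 0, -127, so both small entries vanish. With inputs 0.5, 0.25, -0.25 the same code gives 127, 64, -64. Per-channel or per-group scales are the usual refinement.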

That result changed how I think about the experiment. Integer math is not automatically faster. Hardware support matters. On NVIDIA GPUs, FP16/FP8/INT8 tensor cores are the fast path; exact wider integer math is not.

So the GPU folder stays parked: useful evidence, but not part of the integer core.

Why publish it anyway?

Because the repo keeps the experiment small enough to inspect:

  • a small integer training demo
  • a larger integer inference demo
  • a fixed-point math library
  • a reproducibility gate
  • a documented GPU branch that did not support the original hope

For me, the useful part is not a claim like "integer LLMs beat floating point." The useful part is narrower:

A Llama-family checkpoint can run through a fixed-point integer CPU inference core and match a float reference on a small token-level gate.

That seemed worth writing down.

What I would improve next

The next clean experiment would be a separate lower-precision SIMD branch: reduce the fixed-point format from Q16.48 to something like Q16.16 or Q8.24, then compare integer SIMD against float SIMD on the same CPU. That would test whether fixed-point arithmetic can be competitive on ordinary CPU hardware while keeping deterministic behavior. I would treat that as follow-up work, not as part of this first release.
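As a rough sketch of what that narrowing step might involve, the following converts a raw Q16.48 value to a hypothetical Q16.16 format in int32_t and multiplies two such values with a 64-bit intermediate. The type and function names are assumptions; nothing like this is described as existing in the repo yet.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t q16_16;               /* hypothetical narrower format */

/* Drops 32 fractional bits with round-half-up, saturating at the top.
 * Relies on arithmetic right shift of negative values (gcc, clang). */
static q16_16 q48_to_q16(int64_t raw) {
    const int64_t half = INT64_C(1) << 31;
    if (raw > INT64_MAX - half) return INT32_MAX;   /* rounding would overflow */
    return (q16_16)((raw + half) >> 32);
}

/* Q16.16 multiply: a 64-bit intermediate is enough, with no __int128.
 * The caller must keep the mathematical result inside the Q16.16 range. */
static q16_16 q16_mul(q16_16 a, q16_16 b) {
    int64_t r = ((int64_t)a * (int64_t)b) >> 16;
    assert(r >= INT32_MIN && r <= INT32_MAX);
    return (q16_16)r;
}
```

The appeal for SIMD is that 32-bit lanes with 64-bit products map onto ordinary vector instructions, whereas the Q16.48 path needs 128-bit intermediates. The cost is resolution: 2^-16 instead of 2^-48, so token-level agreement with the float reference would have to be re-measured.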

Closing

This project is a fun experiment, not a product. It started from a simple question: can I rebuild enough of a tiny LLM stack so the core arithmetic is integer-only?

Within the included checks, it can.

The more honest answer is:

  • integer-only CPU inference works
  • reproducibility is the strongest property
  • GPU acceleration prefers the formats the hardware was built for
  • some negative results are as useful as the positive ones

If you want to inspect or reproduce it, the repo is here:

https://github.com/nmicic/int-llm

Start with:

make input
make all
make regression
./gpt_int

Then read:
