Curious, reproducible fact: I trained a GPT-like decoder-only Transformer in which the entire input embedding table is frozen and reduced to a 16-D binary token-ID code (0/1). This is NOT 16-bit quantization.
Key details:
- vocab_size = 65536, n_embed = 16 (2^16 = 65536 unique IDs)
- deterministic expansion 16 → d_model=1024 via repeat_interleave (scale=64); see the sketch after this list
- full embedding table is published (embeddings.txt) for auditability
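A minimal sketch of how such a setup could be wired in PyTorch, assuming the parameters above (vocab_size=65536, n_embed=16, d_model=1024, scale=64) and a little-endian bit order for the token-ID codes. This is an illustration, not the repo's exact code; the authoritative table is the published embeddings.txt, and the verification script linked below checks it.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 65536            # 2**16 unique token IDs
N_EMBED = 16                  # one bit per embedding dimension
D_MODEL = 1024
SCALE = D_MODEL // N_EMBED    # 64

# Row i of the table is the 16-bit binary representation of token ID i (0/1 values).
# Bit order here is an assumption (LSB first); the released embeddings.txt is definitive.
ids = torch.arange(VOCAB_SIZE, dtype=torch.long)
bits = ((ids.unsqueeze(1) >> torch.arange(N_EMBED)) & 1).float()   # shape (65536, 16)

# Frozen embedding table: weights are the binary ID codes and are never trained.
embedding = nn.Embedding(VOCAB_SIZE, N_EMBED)
embedding.weight.data.copy_(bits)
embedding.weight.requires_grad = False

def embed_tokens(token_ids: torch.Tensor) -> torch.Tensor:
    """Look up 16-D binary codes, then repeat each dimension 64x -> (..., 1024)."""
    codes = embedding(token_ids)                       # (..., 16)
    return codes.repeat_interleave(SCALE, dim=-1)      # (..., 1024)

# Example: a (2, 3) batch of token IDs becomes a (2, 3, 1024) tensor of 0s and 1s.
x = embed_tokens(torch.tensor([[0, 1, 65535], [42, 7, 1000]]))
print(x.shape)  # torch.Size([2, 3, 1024])
```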
Repro note + verification script:
https://huggingface.co/blog/Bochkov/emergent-semantics-beyond-token-embeddings
Model repo:
Bochkov/emergent-semantics-model-16-bit-269m
License: Apache-2.0