Mythos-194M

A decoder-only language model built from scratch — LLaMA-compatible weights.

Production release. Full pre-training run.

Model Summary

Mythos is a LLaMA-style autoregressive transformer implemented from first principles in pure PyTorch: it does not subclass anything from `transformers` and does not use PyTorch's built-in `nn.Transformer` modules. Every component (attention, rotary embeddings, SwiGLU, RMSNorm, the training loop, the BPE tokenizer, the data pipeline, the KV-cache inference engine) is hand-written in the reference repository.

This release packages the weights in the LlamaForCausalLM format so that the model is natively usable via the standard transformers, vLLM, TGI, and llama.cpp toolchains — no custom code or trust_remote_code required.


Developed by	Boris Graudt
Model type	Decoder-only causal transformer
Language	English
License	MIT
Compatible with	🤗 `transformers`, vLLM, TGI, llama.cpp, Ollama
Reference implementation	github.com/borisgraudt/mythos

Architecture

Component	Choice	Value
Parameters	—	194 M
Hidden layers	Pre-norm decoder blocks	24
Hidden size	`d_model`	768
Intermediate size	SwiGLU hidden	2048
Attention heads	Multi-head	12
Key / value heads	Grouped-Query Attention	4
Head dim	`d_model / n_heads`	64
Positional encoding	Rotary (RoPE)	θ = 10,000
Normalization	RMSNorm (pre-norm)	ε = 1e-05
Activation	SwiGLU	—
Tied embeddings	Embedding ↔ LM head	✅
Vocabulary	ByteLevel BPE	31,021
Context length	Max sequence	2,048
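
The attention settings in the table can be checked with a little arithmetic. The sketch below derives the head dimension, the number of query heads that share each key/value head, and the resulting KV-cache footprint at full context. The function name `kv_cache_bytes` and the 2-byte element size (a 16-bit cache) are assumptions made for illustration; they are not taken from the reference repository.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Values copied from the architecture table.
d_model, n_heads, n_kv_heads, n_layers, max_len = 768, 12, 4, 24, 2048

head_dim = d_model // n_heads        # per-head width
group_size = n_heads // n_kv_heads   # query heads sharing one KV head

gqa = kv_cache_bytes(n_layers, n_kv_heads, head_dim, max_len)
# Hypothetical comparison: the same model with one KV head per query head.
mha = kv_cache_bytes(n_layers, n_heads, head_dim, max_len)

print(f"head_dim={head_dim}, group_size={group_size}")
print(f"KV cache at {max_len} tokens: {gqa / 2**20:.0f} MiB (GQA) vs {mha / 2**20:.0f} MiB (MHA)")
```

With these values a full 2,048-token sequence needs 48 MiB of cache instead of 144 MiB. A float32 cache (`bytes_per_elem=4`) doubles both figures.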

Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bgraudt/mythos"
tokenizer = AutoTokenizer.from_pretrained(model_id)
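# device_map="auto" requires the `accelerate` package to be installed.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Base model: give it text to continue rather than an instruction.
inputs = tokenizer("The history of artificial intelligence begins with", return_tensors="pt").to(model.device)
# temperature and top_p only take effect because do_sample=True.
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.8, top_p=0.9, do_sample=True)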
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Serving with vLLM

pip install vllm
python -m vllm.entrypoints.openai.api_server --model bgraudt/mythos
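
Once the server is running, it can be queried over its OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server is listening on vLLM's default address, `http://localhost:8000`. The helper names `build_request` and `complete` are made up for this example, and the sampling values simply mirror the Quickstart.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM's default host and port


def build_request(prompt, max_tokens=128, temperature=0.8, top_p=0.9):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": "bgraudt/mythos",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the prompt to the running server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("The history of artificial intelligence begins with")` returns the continuation as a string. The plain completions endpoint is used here because this is a base model without instruction tuning.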

Serving with llama.cpp

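# Download the weights into ./mythos, then convert to GGUF (one-time)
huggingface-cli download bgraudt/mythos --local-dir mythos
python llama.cpp/convert_hf_to_gguf.py mythos --outfile mythos-f16.gguf --outtype f16
./llama-cli -m mythos-f16.gguf -p "Hello"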

Training

Data

Corpus: mixed web + code (details in the GitHub repo)
Tokenizer: ByteLevel BPE trained from scratch, vocab size 31,021
Training context: 512 tokens (shorter than the 2,048-token maximum listed under Architecture)
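
For readers unfamiliar with byte-level BPE, the core training loop is short: start from the 256 possible byte values and repeatedly merge the most frequent adjacent pair into a new token. The toy sketch below illustrates that idea only. It is not the tokenizer from the reference repository, it omits pre-tokenization and special tokens, and the names `merge_pair` and `train_bpe` are hypothetical.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # base vocabulary: the 256 byte values
    merges = {}                       # (left_id, right_id) -> new token id
    for new_id in range(256, 256 + num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break                     # no pair repeats, so nothing to merge
        merges[pair] = new_id
        ids = merge_pair(ids, pair, new_id)
    return merges, ids


merges, ids = train_bpe("aaabdaaabac", num_merges=3)
print(len(merges), "merges learned;", len(ids), "tokens remain")
```

Each merge adds one entry to the vocabulary, so a real tokenizer of this kind reaches its target size by running the loop over a large corpus until the vocabulary is full.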

Hyperparameters


Steps	16,000
Optimizer	AdamW (β₁=0.9, β₂=0.95, wd=0.1)
LR schedule	Cosine decay, 2,000-step warmup
Peak learning rate	3 × 10⁻⁴
Precision	bfloat16 mixed
Hardware	A100 40 GB
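
The schedule in the table can be sketched as a single function of the step index. The version below is a minimal illustration, not the training loop from the reference repository: it assumes the warmup is linear from zero and that the cosine decay ends at `min_lr=0.0`, neither of which the table specifies. The name `lr_at` is hypothetical.

```python
import math


def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=16000, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Assumption: warmup ramps linearly from 0 up to the peak.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    # Assumption: decay ends at min_lr, which the table does not state.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1000, 2000, 9000, 16000):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the rate reaches its peak of 3 × 10⁻⁴ at step 2,000, is back to half the peak at step 9,000 (the midpoint of the decay phase), and reaches `min_lr` at step 16,000.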

Limitations and Intended Use

Base model only — no instruction tuning, no RLHF, no safety alignment.
English-only; non-English performance is poor.
May reproduce biases and factual errors from the training distribution.
Not suitable for medical, legal, financial, or other high-stakes applications.

Citation

@software{graudt2026mythos,
  author  = {Graudt, Boris},
  title   = {Mythos: A Decoder-Only Language Model Built From Scratch},
  year    = {2026},
  url     = {https://github.com/borisgraudt/mythos},
  license = {MIT}
}

Acknowledgements

Architecture inspired by LLaMA (Touvron et al., 2023) and Mistral 7B (Jiang et al., 2023). Data pipeline follows the FineWeb methodology (Penedo et al., 2024).
