Mythos-194M

A decoder-only language model built from scratch — LLaMA-compatible weights.

Production release. Full pre-training run.

Model Summary

Mythos is a LLaMA-style autoregressive transformer implemented from first principles in pure PyTorch: it does not subclass anything from transformers and does not use PyTorch's built-in nn.Transformer modules. Every component (attention, rotary embeddings, SwiGLU, RMSNorm, the training loop, the BPE tokenizer, the data pipeline, and the KV-cache inference engine) is hand-written in the reference repository.
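As an illustration of how small these hand-written components can be, here is a minimal sketch of RMSNorm. It is not the repository's code: it is written in plain Python rather than PyTorch to stay dependency-free, the function name is hypothetical, and the default eps simply mirrors the ε listed in the Architecture section.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Scale a vector by the reciprocal of its root mean square, then apply a learned gain."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]

hidden = [3.0, -4.0, 0.0, 1.0]   # illustrative activations, not real model values
gain = [1.0] * len(hidden)       # an all-ones gain leaves the normalized values unscaled
normed = rms_norm(hidden, gain)
print([round(v, 4) for v in normed])
```

Unlike LayerNorm, this normalization subtracts no mean and adds no bias; only the per-channel gain is learned.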

This release packages the weights in the LlamaForCausalLM format, so the model loads directly in transformers, vLLM, and TGI, and in llama.cpp after a one-time GGUF conversion. No custom code or trust_remote_code is required.

Developed by: Boris Graudt
Model type: Decoder-only causal transformer
Language: English
License: MIT
Compatible with: 🤗 transformers, vLLM, TGI, llama.cpp, Ollama
Reference implementation: github.com/borisgraudt/mythos

Architecture

Component | Choice | Value
Parameters | | 194 M
Hidden layers | Pre-norm decoder blocks | 24
Hidden size | d_model | 768
Intermediate size | SwiGLU hidden | 2048
Attention heads | Multi-head | 12
Key / value heads | Grouped-Query Attention | 4
Head dim | d_model / n_heads | 64
Positional encoding | Rotary (RoPE) | θ = 10,000
Normalization | RMSNorm (pre-norm) | ε = 1e-05
Activation | SwiGLU |
Tied embeddings | Embedding ↔ LM head |
Vocabulary | ByteLevel BPE | 31,021
Context length | Max sequence | 2,048
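The attention settings above determine how much memory the KV cache needs at inference time. The sketch below works through that arithmetic with the table's values. The 2 bytes per element assumes a bfloat16 or float16 cache, and the function name is illustrative rather than taken from the reference repository.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Values from the Architecture table.
N_LAYERS, N_HEADS, N_KV_HEADS, HEAD_DIM, MAX_SEQ = 24, 12, 4, 64, 2048

# With grouped-query attention, several query heads share one key/value head.
queries_per_kv_head = N_HEADS // N_KV_HEADS
gqa = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, MAX_SEQ)
# Hypothetical comparison: the same model with one KV head per query head.
mha = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, MAX_SEQ)

print(f"query heads per KV head: {queries_per_kv_head}")
print(f"GQA cache at {MAX_SEQ} tokens: {gqa / 2**20:.0f} MiB")
print(f"MHA cache at {MAX_SEQ} tokens: {mha / 2**20:.0f} MiB")
```

Because the cache scales with the number of key/value heads rather than query heads, sharing each KV head across three query heads cuts the per-sequence cache to a third of the multi-head figure.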

Quickstart

# device_map="auto" requires the accelerate package.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bgraudt/mythos"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Base model: give it text to continue, not an instruction.
inputs = tokenizer("The history of artificial intelligence begins with", return_tensors="pt").to(model.device)
# Sample up to 128 new tokens with temperature and nucleus (top-p) sampling.
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.8, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Serving with vLLM

pip install vllm
python -m vllm.entrypoints.openai.api_server --model bgraudt/mythos
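
Once the server is running, it can be queried over its OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and targets the plain completions route, since this is a base model rather than a chat model. It assumes the server's default address of localhost:8000; the helper names, prompt, and sampling values are illustrative.

```python
import json
import urllib.request

def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="bgraudt/mythos", max_tokens=64):
    """Build (but do not send) a POST request for the /v1/completions route."""
    payload = {
        "model": model,          # must match the name passed to --model
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.8,
        "top_p": 0.9,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt, **kwargs):
    """Send the request to a running server and return the generated text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server up, calling complete("The history of artificial intelligence begins with") returns the sampled continuation as a string.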

Serving with llama.cpp

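# Download the checkpoint into ./mythos, then convert it to GGUF (one-time)
huggingface-cli download bgraudt/mythos --local-dir mythos
python llama.cpp/convert_hf_to_gguf.py mythos --outfile mythos-f16.gguf --outtype f16
./llama-cli -m mythos-f16.gguf -p "Hello"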

Training

Data

  • Corpus: mixed web + code (details in the GitHub repo)
  • Tokenizer: ByteLevel BPE trained from scratch, vocab size 31,021
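  • Training context: 512 tokens (shorter than the 2,048-token maximum listed under Architecture, so positions beyond 512 were not seen during pre-training)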

Hyperparameters

Steps: 16,000
Optimizer: AdamW (β₁=0.9, β₂=0.95, wd=0.1)
LR schedule: Cosine decay, 2,000-step warmup
Peak learning rate: 3 × 10⁻⁴
Precision: bfloat16 mixed
Hardware: A100 40 GB
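
The learning-rate schedule listed above can be sketched as a single function of the step count. This is a minimal sketch under stated assumptions: the warmup is taken to be linear from near zero, and the cosine decays to a floor of min_lr = 0, neither of which the model card specifies. The function name is hypothetical.

```python
import math

def learning_rate(step, peak_lr=3e-4, warmup_steps=2_000,
                  total_steps=16_000, min_lr=0.0):
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < warmup_steps:
        # Linear warmup (assumed shape): reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

for step in (0, 1_999, 2_000, 9_000, 16_000):
    print(f"step {step:>6}: lr = {learning_rate(step):.2e}")
```

Under these assumptions the rate peaks at the end of warmup, falls to half the peak midway through the decay phase, and reaches the floor at the final step.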

Limitations and Intended Use

  • Base model only — no instruction tuning, no RLHF, no safety alignment.

  • English-only; non-English performance is poor.

  • May reproduce biases and factual errors from the training distribution.

  • Not suitable for medical, legal, financial, or other high-stakes applications.

Citation

@software{graudt2026mythos,
  author  = {Graudt, Boris},
  title   = {Mythos: A Decoder-Only Language Model Built From Scratch},
  year    = {2026},
  url     = {https://github.com/borisgraudt/mythos},
  license = {MIT}
}

Acknowledgements

Architecture inspired by LLaMA (Touvron et al., 2023) and Mistral 7B (Jiang et al., 2023). Data pipeline follows the FineWeb methodology (Penedo et al., 2024).
