Atem-3B

Ancient logic. Modern intelligence.

The 3B foundation model of the Atem series — direct reasoning at scale.


Overview

Atem-3B is the first release in the 3B branch of the Atem model series — a Stage 1 supervised fine-tune on Qwen2.5-3B-Instruct across approximately 120,000 training examples spanning mathematics, code, reasoning, and general instruction following.

Where the 1.5B Atem line demonstrated that a small model could be meaningfully improved through careful data curation, Atem-3B applies the same methodology at twice the parameter count. The 3B base provides a stronger foundation — particularly for mathematical reasoning and structured generation — while the training corpus prioritises quality and diversity over volume.

Design philosophy: Think tags were stripped from all training data during preprocessing. Atem-3B is a direct-answer model — it does not produce <think> traces. The reasoning capacity of the 3B base is channelled into producing well-structured, considered responses rather than visible chain-of-thought. A CoT variant is planned for Stage 2.


The Atem Series

1.5B Series

| Model | Stage | Capability |
|---|---|---|
| Atem v1 | Stage 1: SFT | Fast, direct reasoning |
| Atem-Wisdom | Stage 2: CoT | Explicit thinking traces |
| Atem-Pharaoh (planned) | Stage 3: DPO/IPO | Preference-aligned reasoning |

3B Series

| Model | Stage | Capability |
|---|---|---|
| Atem-3B | Stage 1: SFT | Direct reasoning at 3B scale |
| Atem-3B-Wisdom (planned) | Stage 2: CoT | Explicit thinking traces |

Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-3B-Instruct |
| Training method | LoRA SFT, Stage 1 (think tags stripped) |
| LoRA config | r=32, alpha=64, dropout=0.05 |
| Parameters | ~3.09B |
| Trainable parameters | 59,867,136 (1.90%) |
| Training records | 120,043 (after token length filtering) |
| Epochs | 1 |
| Final val loss | 0.8384 |
| Hardware | NVIDIA A100-SXM4-80GB |
| Max sequence length | 4,096 tokens |
| Precision | bfloat16 |
| License | Apache 2.0 |
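
The trainable-parameter figure can be reproduced from the LoRA rank and the layer shapes of the base model. The sketch below assumes the adapter targets all seven linear projections in every decoder layer (q, k, v, o, gate, up, down), which the card does not state. The dimensions are those of the published Qwen2.5-3B configuration.

```python
# Sketch: count LoRA parameters for r=32 on Qwen2.5-3B-shaped layers.
# Assumption (not stated in the card): LoRA is applied to all seven
# linear projections of every decoder layer.

RANK = 32
NUM_LAYERS = 36
HIDDEN = 2048          # hidden_size
INTERMEDIATE = 11008   # intermediate_size of the MLP
KV_DIM = 2 * 128       # num_key_value_heads * head_dim

# (in_features, out_features) for each targeted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # LoRA adds A (rank x in_features) and B (out_features x rank).
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 59,867,136
```

The result matches the reported trainable-parameter count, which is consistent with the all-linear-layers assumption.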

Output Format

Atem-3B produces direct, structured responses. Think tags were stripped from all training data during preprocessing — the model was trained exclusively on clean outputs with no chain-of-thought traces.

[Direct response — reasoned, structured, no <think> tags]

This is a deliberate Stage 1 design choice. A chain-of-thought variant exposing explicit reasoning traces is planned as Stage 2.


Training Data

Stage 1 training used approximately 120,000 examples drawn from eleven sources. All reasoning traces (<think>...</think> blocks) were stripped prior to training. Records shorter than 20 characters after stripping were excluded.

| Dataset | Count | Focus |
|---|---|---|
| Modotte/CodeX-2M-Thinking | 40,000 | Code (think tags stripped) |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned | 23,000 | General reasoning (English filtered) |
| open-r1/OpenThoughts-114k-math | 10,000 | Mathematics (correct only) |
| flytech/python-codes-25k | 10,000 | Python code |
| FreedomIntelligence/medical-o1-reasoning-SFT | 10,000 | Medical reasoning |
| tuanha1305/DeepSeek-R1-Distill | 9,000 | Reasoning distillation |
| EphAsad/QWENMillenium-SF | 5,000 | General instruction |
| EphAsad/MistralMillenium-SF | 5,000 | General instruction |
| WithinUsAI/MiniMax_M2.7_Distilled_5k | 5,000 | Mixed reasoning |
| Jackrong/Claude-opus-4.7-TraceInversion-5000x | 4,761 | Inverted reasoning |
| EphAsad/Phi4Millennium-SF | 2,932 | General instruction |
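
As a quick consistency check, the per-dataset counts can be summed and compared with the 120,043 training records reported in Model Details. This sketch assumes the listed counts are sample sizes before token length filtering, which the card implies but does not state outright.

```python
# Per-dataset example counts as listed in the table above.
counts = {
    "Modotte/CodeX-2M-Thinking": 40_000,
    "Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned": 23_000,
    "open-r1/OpenThoughts-114k-math": 10_000,
    "flytech/python-codes-25k": 10_000,
    "FreedomIntelligence/medical-o1-reasoning-SFT": 10_000,
    "tuanha1305/DeepSeek-R1-Distill": 9_000,
    "EphAsad/QWENMillenium-SF": 5_000,
    "EphAsad/MistralMillenium-SF": 5_000,
    "WithinUsAI/MiniMax_M2.7_Distilled_5k": 5_000,
    "Jackrong/Claude-opus-4.7-TraceInversion-5000x": 4_761,
    "EphAsad/Phi4Millennium-SF": 2_932,
}

listed_total = sum(counts.values())
reported_after_filtering = 120_043  # "Training records" in Model Details

# Assumption: the difference is what the filtering steps removed.
dropped = listed_total - reported_after_filtering

print(listed_total)                     # 124693
print(dropped)                          # 4650
print(f"{dropped / listed_total:.1%}")  # 3.7%
```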

Chinese-language records from Kimi K2.5 were filtered using an ASCII character ratio threshold before inclusion. OpenThoughts-114k-math was filtered to correct == True examples only.
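
The preprocessing steps described here (stripping think blocks, dropping very short records, filtering by ASCII ratio) could look roughly like the sketch below. It is not the author's pipeline: the function names are hypothetical, and the 0.9 ASCII threshold is a placeholder because the card does not give the value used.

```python
import re

# Matches a complete <think>...</think> block, including embedded newlines.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

MIN_CHARS = 20               # from the card: shorter records are excluded
ASCII_RATIO_THRESHOLD = 0.9  # hypothetical value; the card gives no threshold


def strip_think(text: str) -> str:
    """Remove reasoning traces and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


def ascii_ratio(text: str) -> float:
    """Fraction of characters that are ASCII (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isascii() for ch in text) / len(text)


def keep_record(response: str, english_only: bool = False) -> bool:
    """Apply the length filter and, optionally, the ASCII-ratio filter."""
    cleaned = strip_think(response)
    if len(cleaned) < MIN_CHARS:
        return False
    if english_only and ascii_ratio(cleaned) < ASCII_RATIO_THRESHOLD:
        return False
    return True
```

In this sketch `english_only=True` would be set only for the Kimi K2.5 records, mirroring the description above.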

Loss curve:

| Step | Train Loss | Val Loss |
|---|---|---|
| 200 | 0.9236 | 0.9011 |
| 400 | 0.9200 | 0.8796 |
| 600 | 0.8591 | 0.8685 |
| 800 | 0.8837 | 0.8585 |
| 1000 | 0.8455 | 0.8507 |
| 1200 | 0.8359 | 0.8453 |
| 1400 | 0.8240 | 0.8413 |
| 1600 | 0.8626 | 0.8391 |
| 1800 | 0.8940 | 0.8384 |
| 1876 (final) | 0.8702 | 0.8384 |

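Validation loss decreases at every checkpoint, from 0.9011 at step 200 to 0.8384 at step 1800, then holds at 0.8384 through the final step. It never rises, so there is no sign of overfitting within the single epoch. Training loss is noisier, as expected for per-step values.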
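
The claim about the validation curve can be checked directly from the logged values. The second half of the sketch is an inference rather than a stated fact: an effective batch size of 64 is assumed because it reproduces the 1,876 steps for 120,043 records in one epoch.

```python
import math

steps = [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 1876]
val_loss = [0.9011, 0.8796, 0.8685, 0.8585, 0.8507,
            0.8453, 0.8413, 0.8391, 0.8384, 0.8384]

# Validation loss never rises between consecutive checkpoints.
non_increasing = all(b <= a for a, b in zip(val_loss, val_loss[1:]))
total_drop = round(val_loss[0] - val_loss[-1], 4)

print(non_increasing)  # True
print(total_drop)      # 0.0627

# Assumption: effective batch size of 64 (not stated in the card).
ASSUMED_EFFECTIVE_BATCH = 64
steps_per_epoch = math.ceil(120_043 / ASSUMED_EFFECTIVE_BATCH)
print(steps_per_epoch)  # 1876
```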


Evaluation

Benchmark Results

Both models were evaluated with lm-evaluation-harness through its Python API under identical conditions: 4-bit quantisation with torch.float16, on the same A100-SXM4-80GB. ARC-Challenge and HellaSwag report zero-shot normalised accuracy; GSM8K is 5-shot.
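
A setup along these lines can be sketched with the harness's `simple_evaluate` entry point. The exact arguments used for the reported numbers are not given in the card, so the `model_args` string, the batch size and the helper names below are illustrative assumptions.

```python
# Sketch of the evaluation setup described above (lm-evaluation-harness).
# Assumptions: model_args, batch size and helper names are illustrative.


def build_model_args(repo_id: str) -> str:
    # 4-bit loading (bitsandbytes) with float16, as described in the card.
    return f"pretrained={repo_id},load_in_4bit=True,dtype=float16"


# (tasks, num_fewshot): zero-shot for ARC-Challenge and HellaSwag, 5-shot for GSM8K.
EVAL_RUNS = [
    (["arc_challenge", "hellaswag"], 0),
    (["gsm8k"], 5),
]


def evaluate(repo_id: str) -> dict:
    # Imported lazily so the configuration can be inspected without the harness.
    import lm_eval

    results = {}
    for tasks, shots in EVAL_RUNS:
        out = lm_eval.simple_evaluate(
            model="hf",
            model_args=build_model_args(repo_id),
            tasks=tasks,
            num_fewshot=shots,
            batch_size=8,
        )
        results.update(out["results"])
    return results
```

Calling `evaluate("Qwen/Qwen2.5-3B-Instruct")` and `evaluate("EphAsad/Atem-3B")` on the same machine would give per-task metric dictionaries to compare.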

| Task | Base (3B) | Atem-3B | Delta |
|---|---|---|---|
| ARC-Challenge | 48.1% | 48.0% | -0.1% |
| GSM8K (strict-match) | 2.1% | 37.1% | +35.0% |
| GSM8K (flexible-extract) | 62.4% | 64.7% | +2.3% ✓ |
| HellaSwag | 73.5% | 70.4% | -3.0% ⚠ |

Note on GSM8K: lm_eval's strict-match filter uses a #### number regex that only fires when the model produces that exact token sequence. The base Qwen2.5-3B-Instruct solves problems correctly but formats answers conversationally, yielding 2.1% strict-match against a 62.4% flexible-extract — the latter being the accurate measure of base model mathematical capability. Atem-3B's training on math distillation datasets reinforced structured answer termination, producing 37.1% strict-match. The meaningful comparison is flexible-extract: 62.4% → 64.7% (+2.3%) — a genuine but modest improvement. The strict-match delta is a formatting artefact, not a 35-point gain in mathematical reasoning ability.
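
The gap between the two GSM8K filters is easy to see on a toy example. The patterns below are a simplified approximation of the harness's strict-match and flexible-extract filters, written for illustration only; the sample answers are invented.

```python
import re
from typing import Optional

# Simplified approximations of the two GSM8K answer filters (assumption:
# the real harness patterns and post-processing differ in detail).
STRICT = re.compile(r"#### (-?[0-9.,]+)")
FLEXIBLE = re.compile(r"-?[$0-9.,]{2,}|-?[0-9]+")


def strict_match(text: str) -> Optional[str]:
    # Fires only when the answer is introduced by the literal "#### " marker.
    m = STRICT.search(text)
    return m.group(1) if m else None


def flexible_extract(text: str) -> Optional[str]:
    # Takes the last number-like span anywhere in the response.
    found = FLEXIBLE.findall(text)
    return found[-1] if found else None


conversational = "She sells 16 - 3 - 4 = 9 eggs, so she makes 9 * 2 = 18 dollars a day"
structured = conversational + ".\n#### 18"

print(strict_match(conversational))      # None
print(flexible_extract(conversational))  # 18
print(strict_match(structured))          # 18
```

Both responses contain the correct answer, but only the second one scores under strict-match, which is why the strict-match delta mostly reflects formatting.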

Note on HellaSwag: The -3.0% regression is a common pattern when fine-tuning instruct models on structured reasoning and task-completion data. HellaSwag tests commonsense sentence completion in a multiple-choice format; training on problem-solving corpora shifts the model's distribution away from the casual, predictive register that HellaSwag measures. This is a known trade-off, not an indicator of general capability loss.


Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "EphAsad/Atem-3B"

# Load the tokenizer and the bfloat16 weights; device_map places layers automatically.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "Explain the difference between a process and a thread."
    }
]

# Apply the chat template; return_dict=True also returns the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# Sample a response.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(
    output[0][prompt_length:],
    skip_special_tokens=True
)
print(response)

Unsloth (faster inference)

from unsloth import FastLanguageModel
import torch

# Load the model in 4-bit with Unsloth's patched loader.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-3B",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
# Switch the model to Unsloth's inference mode.
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "Write a Python function to find all prime numbers up to n."
    }
]

# Apply the chat template; return_dict=True also returns the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to("cuda")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[1]
print(tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True))

Ollama

# Recommended — best speed/quality balance
ollama run hf.co/EphAsad/Atem-3B:Q4_K_M

# Higher quality
ollama run hf.co/EphAsad/Atem-3B:Q5_K_M

# Near-lossless
ollama run hf.co/EphAsad/Atem-3B:Q8_0

llama.cpp

llama-server -hf EphAsad/Atem-3B:Q4_K_M

Available Files

| File | Size | Description |
|---|---|---|
| model-00001-of-00002.safetensors + model-00002-of-00002.safetensors | ~6.2 GB | Full bfloat16 weights |
| Atem-3b.Q4_K_M.gguf | ~1.93 GB | 4-bit, recommended |
| Atem-3b.Q5_K_M.gguf | ~2.22 GB | 5-bit |
| Atem-3b.Q8_0.gguf | ~3.29 GB | 8-bit, near-lossless |

System Prompt

Atem-3B's identity is baked into the chat template and activates without an explicit system message. To override manually:

You are Atem, a precise and analytical reasoning assistant. You approach
every problem methodically — identifying core concepts, reasoning step by
step, and arriving at well-supported conclusions. You show your thinking
clearly and are thorough, direct, and intellectually honest.
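
One way to supply a system prompt by hand is to prepend a system message to the conversation before applying the chat template. This is a minimal sketch: `build_messages` is a hypothetical helper, and it assumes the bundled chat template honours an explicit system message the way standard Qwen2.5 templates do.

```python
from typing import Dict, List

# Prompt text copied from the card; replace it to change the persona.
SYSTEM_PROMPT = (
    "You are Atem, a precise and analytical reasoning assistant. You approach "
    "every problem methodically — identifying core concepts, reasoning step by "
    "step, and arriving at well-supported conclusions. You show your thinking "
    "clearly and are thorough, direct, and intellectually honest."
)


def build_messages(user_content: str,
                   system_prompt: str = SYSTEM_PROMPT) -> List[Dict[str, str]]:
    # Assumption: the chat template uses this system message when present.
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


messages = build_messages("Explain the difference between a process and a thread.")
print(messages[0]["role"])  # system
```

The resulting `messages` list can be passed to `tokenizer.apply_chat_template` exactly as in the Transformers example.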

Roadmap

| Stage | Status | Description |
|---|---|---|
| Stage 1: SFT | ✅ Complete | Atem-3B (this model) |
| Stage 2: CoT SFT | 🔄 Planned | Atem-3B-Wisdom: chain-of-thought traces |
| Stage 3: DPO/IPO | 🔄 Planned | Atem-3B-Pharaoh: preference-aligned reasoning |

Citation

@misc{atem_3b_2026,
  author       = {Asad, Zain},
  title        = {Atem-3B: A 3B Direct-Reasoning Model via Stage 1 SFT},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/EphAsad/Atem-3B}},
}

License

Released under the Apache 2.0 License, consistent with the base model (Qwen2.5-3B-Instruct).


Built independently by EphAsad
