TinyModel Mixtral 2M Top-3 MoE (tinyllama4gpt2m) HF Validation Suite

This repository provides an ultra-lightweight Mixtral-format Mixture-of-Experts model scaled down to a 2M-class total parameter footprint, trained from scratch on the TinyStories dataset with a GPT-style tokenizer.

🎯 Primary Validation Objective: GPT-Style Tokenizer Verification (Core Purpose)

The foundational purpose of this suite is to isolate and verify the exact behavior of a GPT-style byte-level BPE tokenizer (configured with add_prefix_space=True) on text from the TinyStories dataset. Testing tokenizer compliance on large models introduces unnecessary complexity. At 2M parameters, developers can confirm that their text-to-token encoding, token-to-text decoding, and prefix-space fusion rules match the reference implementation, and any mismatch is visible immediately.

If your custom inference backend handles prefix spaces, boundary word fusions, or byte-level fallbacks incorrectly, the token IDs emitted here will immediately drift from the PyTorch reference baseline, isolating tokenization anomalies before tensor computations even begin.
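
The prefix-space and byte-level behavior described above can be illustrated with a minimal sketch of the GPT-2 byte-to-character convention. This is not the tokenizer shipped in hf/tokenizer.json: the helper names `to_byte_level` and `from_byte_level` are hypothetical, and no BPE merges are applied. It only shows how a leading space is added and how each space byte becomes the visible symbol `Ġ` before merging starts.

```python
def bytes_to_unicode():
    """GPT-2 style table mapping every byte value to a printable character."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Remaining bytes (controls, space, ...) are shifted to code points >= 256.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text, add_prefix_space=True):
    """Hypothetical helper: apply the prefix space, then map UTF-8 bytes to symbols."""
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(symbols):
    """Hypothetical helper: invert the mapping to recover the text."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")


print(to_byte_level("Once upon"))                          # ĠOnceĠupon
print(to_byte_level("Once upon", add_prefix_space=False))  # OnceĠupon
```

A backend that omits the prefix space would start from `Once` instead of `ĠOnce`, which generally selects different vocabulary entries and therefore different token IDs from the very first position.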

In addition to tokenizer verification, this asset is calibrated to a 1,024 token context window utilizing Llama 3 RoPE Scaling (4.0x factor over a 256 base window), providing a comprehensive test bed for both text processing and advanced position embedding calculations.
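
As a rough illustration of the Llama 3 RoPE scaling mentioned above, the sketch below computes scaled inverse frequencies in plain Python. It follows the commonly used llama3 rule (high-frequency dimensions unchanged, low-frequency dimensions divided by the factor, a linear blend in between). Two inputs are assumptions rather than values stated in this card: `head_dim=24` is derived as hidden_size / num_attention_heads (96 / 4), and `rope_theta=10000.0` is a placeholder; read the real value from hf/config.json.

```python
import math


def llama3_inv_freq(head_dim, rope_theta, factor, low_freq_factor,
                    high_freq_factor, original_max_position_embeddings):
    """Return the scaled inverse frequency for each rotary dimension pair."""
    low_freq_wavelen = original_max_position_embeddings / low_freq_factor
    high_freq_wavelen = original_max_position_embeddings / high_freq_factor
    scaled = []
    for i in range(0, head_dim, 2):
        inv_freq = 1.0 / (rope_theta ** (i / head_dim))
        wavelen = 2.0 * math.pi / inv_freq
        if wavelen < high_freq_wavelen:
            # High-frequency band: left untouched.
            scaled.append(inv_freq)
        elif wavelen > low_freq_wavelen:
            # Low-frequency band: fully scaled down by the factor.
            scaled.append(inv_freq / factor)
        else:
            # Medium band: interpolate between scaled and unscaled values.
            smooth = (original_max_position_embeddings / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            scaled.append((1.0 - smooth) * inv_freq / factor + smooth * inv_freq)
    return scaled


# rope_theta is an assumed placeholder; the other values come from this card.
freqs = llama3_inv_freq(
    head_dim=24, rope_theta=10000.0, factor=4.0, low_freq_factor=1.0,
    high_freq_factor=4.0, original_max_position_embeddings=256,
)
for index, value in enumerate(freqs):
    print(index, value)
```

With a 256-token original window, the band boundaries sit at wavelengths of 64 and 256 positions, so a backend that skips the medium-band interpolation will diverge only on the few dimension pairs that fall between them.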

📂 Repository Structure & File Descriptions

Hugging Face Native Format (`./hf/`)

Unquantized components formatted for direct instantiation inside the PyTorch transformers library ecosystem or compatible proprietary model parsers:

hf/model.safetensors: Raw unquantized matrix parameters containing all 5 expert sub-networks alongside the master router tensor (Gate) and GQA projection layers.
hf/config.json: Architectural specifications built around MixtralConfig criteria, explicitly enforcing num_attention_heads: 4, num_key_value_heads: 2, max_position_embeddings: 1024, and the llama3 type rope_scaling parameters.
hf/generation_config.json: Standard generation defaults for greedy search boundaries.
hf/tokenizer.json: The core Byte-level BPE tokenizer layout (configured with GPT-style add_prefix_space=True) containing vocabulary indices, pre-tokenization rules, and the merges map.
hf/tokenizer.model: A structural dummy file provided exclusively to maintain complete Llama/Mixtral asset footprint compatibility with legacy reference loaders.
hf/tokenizer_config.json: Metadata that selects the tokenizer class so that prefix spacing and automatic <s> (BOS) injection are handled correctly on the execution backend.

📂 Purpose & Design Philosophy (Verification Targets)

This checkpoint is engineered strictly as a deterministic validation test asset for computing platforms and custom inference environments.

Due to the compact vocabulary layout (4,000 tokens) and highly localized layer structure, it provides an ideal environment to isolate and profile specific compute structures:

GPT-Style Tokenization Mechanics: Validates that word-boundary space handling (add_prefix_space=True) and byte-level fallback merging match the GPT-2/Llama ecosystem exactly on TinyStories text. This isolates subtoken layout anomalies before text data reaches the embedding layers.
Llama 3 RoPE Scaling Verification: Validates multi-band frequency adjustments (factor=4.0, low_freq_factor=1.0, high_freq_factor=4.0, original_max_position_embeddings=256). This verifies whether the custom inference engine correctly partitions the rotary dimensions into high, medium, and low frequency bands and scales each band accurately across the expanded 1,024-token sequence.
GQA Routing & Index Mapping: Verifying the group indexing logic where 4 query heads resolve to 2 distinct key/value head pairs, exposing stride offsets and boundary errors in attention loops.
Non-Standard Expert Routing (Anomalous Expert Counts): Explicitly tests the engine's capability to handle an unconventional and asymmetric Mixture-of-Experts (MoE) configuration: exactly 5 total local experts with 3 active experts selected per token (num_local_experts=5, num_experts_per_tok=3). This "strange" parameter ratio forces the runtime router to distribute weights and rank probabilities across an odd, non-power-of-two matrix layout, immediately exposing alignment or allocation bugs in top-k routing logic.
Dynamic Routing Isolation: Validating Top-3 gating allocation vectors and tracking row-index distribution matrices inside custom execution topologies.
Scatter/Gather Verification: Profiling the memory dispatch loops that split token matrices into independent expert segments and synthesize them back into the main residual stream.
Bit-Exact Logit Verification: Confirming that independent execution backends match the exact mathematical outputs, causal attention masks, and logits produced by the PyTorch reference runtime.
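
The Top-3 routing and scatter/gather targets above can be sketched in plain Python. The example assumes the usual Mixtral-style convention (softmax over all expert logits, keep the top-k, renormalize the kept weights to sum to 1); confirm this against the reference runtime before relying on it. The function names, the scalar "hidden states", and the toy experts are all hypothetical and exist only to make the control flow visible.

```python
import math


def route_top_k(router_logits, top_k):
    """Return [(expert_index, weight), ...] for one token."""
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Ties resolve to the lower expert index here; a real backend may differ.
    chosen = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:top_k]
    norm = sum(probs[e] for e in chosen)
    return [(e, probs[e] / norm) for e in chosen]


def dispatch(token_routes, num_experts):
    """Scatter step: group (token_index, weight) pairs by expert."""
    buckets = [[] for _ in range(num_experts)]
    for token_index, routes in enumerate(token_routes):
        for expert_index, weight in routes:
            buckets[expert_index].append((token_index, weight))
    return buckets


def moe_forward(hidden, router_logits, experts, top_k):
    """Gather step: run each expert on its tokens and sum the weighted outputs."""
    routes = [route_top_k(logits, top_k) for logits in router_logits]
    buckets = dispatch(routes, len(experts))
    output = [0.0] * len(hidden)
    for expert_index, bucket in enumerate(buckets):
        for token_index, weight in bucket:
            output[token_index] += weight * experts[expert_index](hidden[token_index])
    return output


# Toy setup: 5 experts, 3 active per token, scalar hidden states.
toy_experts = [lambda x, scale=scale: scale * x for scale in (1.0, 2.0, 3.0, 4.0, 5.0)]
toy_logits = [[2.0, 1.0, 0.5, -1.0, 0.0], [0.0, 0.0, 3.0, 1.0, 2.0]]
print(route_top_k(toy_logits[0], top_k=3))
print(moe_forward([1.0, 1.0], toy_logits, toy_experts, top_k=3))
```

Because 5 and 3 are not powers of two, the per-expert buckets produced by `dispatch` have uneven sizes, which is the situation the routing targets in this list are meant to exercise.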

📂 Usage Examples

Loading Hugging Face Formats via Python

Because the configuration follows the standard Transformers schema, the Auto classes can load the model directly when pointed at the Hugging Face repository and its subfolder.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Target repository and subfolder configuration
repo_id = "shibatch/tinyllama4gpt2m"
subfolder = "hf"

print("Loading MoE GQA configuration and GPT-style tokenizer layers from Hugging Face...")
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

prompt = "Once upon"
# Tokenize using the loaded GPT-style configuration
inputs = tokenizer(prompt, return_tensors="pt").to(device)

print("Running inference loop (Validating GPT-style Tokenizer, Top-3 routing, GQA, and Llama3 RoPE Scaling)...")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("\n--- Inference Test Result ---")
print("Prompt   :", prompt)
print("Generated:", generated_text)
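
To compare a custom backend against the reference run above, something like the following sketch may help. It is only an illustration: `first_divergence` and `max_abs_diff` are hypothetical helper names, and the ID and logit lists below are made-up placeholders standing in for values exported from the reference run and from the backend under test.

```python
def first_divergence(reference_ids, candidate_ids):
    """Return the first position where two token ID sequences differ, else None."""
    for position, (ref, cand) in enumerate(zip(reference_ids, candidate_ids)):
        if ref != cand:
            return position
    if len(reference_ids) != len(candidate_ids):
        # One sequence is a strict prefix of the other.
        return min(len(reference_ids), len(candidate_ids))
    return None


def max_abs_diff(reference_logits, candidate_logits):
    """Return the largest absolute element-wise difference between two logit tables."""
    worst = 0.0
    for ref_row, cand_row in zip(reference_logits, candidate_logits):
        for ref, cand in zip(ref_row, cand_row):
            worst = max(worst, abs(ref - cand))
    return worst


# Placeholder values, not real outputs of this model.
reference_ids = [1, 412, 87, 930]
candidate_ids = [1, 412, 90, 930]
print(first_divergence(reference_ids, candidate_ids))  # 2
print(max_abs_diff([[0.5, 1.0]], [[0.5, 1.25]]))       # 0.25
```

Checking token IDs first separates tokenizer mismatches from numerical drift: if the prompt IDs already differ, the logits are not worth comparing yet.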

📂 Model Specifications

Architecture: Mixtral (MixtralForCausalLM)
Dataset: TinyStories
Total Parameters (num_local_experts = 5): 2M class footprint
Active Parameters (num_experts_per_tok = 3): 1.18M active during dispatch
Vocabulary Size (vocab_size): 4,000 (Byte-level BPE with strict GPT-style add_prefix_space=True configuration)
Hidden Size (hidden_size): 96
Number of Hidden Layers (num_hidden_layers): 2
Number of Attention Heads (num_attention_heads / num_key_value_heads): 4 / 2 (Grouped-Query Attention layout)
Individual Expert Internal Dimension (intermediate_size): 192 (SwiGLU structure)
Max Position Embeddings (max_position_embeddings): 1,024
RoPE Scaling (rope_scaling): {"type": "llama3", "factor": 4.0, "low_freq_factor": 1.0, "high_freq_factor": 4.0, "original_max_position_embeddings": 256}
RMS Norm Epsilon (rms_norm_eps): 1e-5
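
For the 4 / 2 Grouped-Query Attention layout listed above, the head index mapping and projection widths can be sketched as follows. This assumes the common convention in which consecutive query heads share one key/value head, and that head_dim equals hidden_size / num_attention_heads; the function name is hypothetical.

```python
def kv_head_for_query_head(q_head, num_attention_heads, num_key_value_heads):
    """Map a query head index to the key/value head it reads from."""
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("query heads must divide evenly into key/value groups")
    group_size = num_attention_heads // num_key_value_heads
    return q_head // group_size


hidden_size = 96
num_attention_heads = 4
num_key_value_heads = 2

head_dim = hidden_size // num_attention_heads  # assumed: no explicit head_dim override
q_width = num_attention_heads * head_dim       # output width of the query projection
kv_width = num_key_value_heads * head_dim      # output width of each key/value projection

mapping = [
    kv_head_for_query_head(h, num_attention_heads, num_key_value_heads)
    for h in range(num_attention_heads)
]
print(head_dim, q_width, kv_width)  # 24 96 48
print(mapping)                      # [0, 0, 1, 1]
```

An interleaved mapping such as [0, 1, 0, 1] is a typical stride mistake, and it produces different attention outputs even though all tensor shapes still match.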

📂 License

License: MIT License. You are completely free to duplicate, modify, distribute, and utilize these assets across any commercial, personal, or educational environments.
