MicroMixer-1-1M-TinyStories

Micro Language Model
Attention-Free • MLP-Only • Byte-Level

📋 Overview

MicroMixer-1-1M is the largest model in the series, with ~1M parameters (967,584). It can generate sentences such as "Once upon a time there was a little girl named Lily" with reasonable fluency. It also supports the longest sequence length in the series: 256 tokens, where each token is one byte.

🏗️ Architecture

graph TD
    A[Byte Input] --> B[Token Embedding]
    B --> C[RoPE Position Encoding]
    C --> D[ImprovedMixerLayer ×3]
    D --> E[LayerNorm]
    E --> F[LM Head]
    F --> G[Byte Output]
    
    style A fill:#007BFF,color:#fff
    style G fill:#00D620,color:#fff
    style D fill:#AE00FF,color:#fff

Model Configuration

Parameter	Value
Total Parameters	`967,584`
Hidden Dimension	`224`
Channel MLP Dimension	`576`
Number of Layers	`3`
Max Sequence Length	`256`
Vocabulary Size	`256` (Byte-level)
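
A vocabulary of 256 means every token id is one raw byte. As a rough illustration, a byte-level tokenizer can be sketched in a few lines. `SimpleByteTokenizer` is a hypothetical stand-in written for this example; it is not the repository's `ByteTokenizer`, which may treat special tokens or invalid bytes differently.

```python
class SimpleByteTokenizer:
    """Hypothetical byte-level tokenizer: one token id per UTF-8 byte."""

    vocab_size = 256

    def encode(self, text: str) -> list[int]:
        # Each UTF-8 byte (0..255) becomes one token id.
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        # errors="replace" avoids a crash if generation stops mid-character.
        return bytes(ids).decode("utf-8", errors="replace")


tok = SimpleByteTokenizer()
ids = tok.encode("Lily")
print(ids)              # [76, 105, 108, 121]
print(tok.decode(ids))  # Lily
```

Because non-ASCII characters take several bytes each, the 256-token limit corresponds to 256 bytes of text, not 256 characters.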

Core Components

┌─────────────────────────────────────────────┐
│           ImprovedMixerLayer                 │
│  ┌─────────────────────────────────────┐    │
│  │  LayerNorm → HyperMixing → Residual │    │ ← Token Mixing
│  ├─────────────────────────────────────┤    │
│  │  LayerNorm → MlpBlock → Residual    │    │ ← Channel Mixing
│  └─────────────────────────────────────┘    │
└─────────────────────────────────────────────┘

1️⃣ RoPE (Rotary Position Embedding)

Encodes positions via rotation transformations
Enables length extrapolation beyond training sequences
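
The rotation idea can be sketched in plain Python. This is the textbook RoPE formulation, assuming interleaved dimension pairs and a base of 10000; how MicroMixer pairs dimensions, and where it applies the rotation in an attention-free stack, is not stated here, so treat the function below as illustrative only.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle position * base**(-i/d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# The same offset (3) at different absolute positions gives the same score.
print(dot(rope(q, 5), rope(k, 2)))
print(dot(rope(q, 105), rope(k, 102)))
```

The two printed values agree up to floating-point error, which is the relative-position property that motivates using rotations instead of learned absolute position vectors.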

2️⃣ HyperMixing (Token Mixing)

Compresses past context via cumulative average pooling
Hypernetwork generates adaptive weights
Token mixing in O(S) time for sequence length S, without attention
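
The cumulative average pooling step can be sketched as a running mean over past positions. This is a minimal sketch of the pooling only: `causal_cumulative_mean` is a hypothetical helper, and the hypernetwork that turns the pooled summary into mixing weights is not shown because its design is not described here.

```python
def causal_cumulative_mean(tokens):
    """tokens: list of S feature vectors (lists of floats).

    Returns S vectors where output[t] is the mean of tokens[0..t].
    A running sum keeps the total cost linear in S.
    """
    if not tokens:
        return []
    running = [0.0] * len(tokens[0])
    out = []
    for t, vec in enumerate(tokens, start=1):
        running = [r + v for r, v in zip(running, vec)]
        out.append([r / t for r in running])
    return out


seq = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
print(causal_cumulative_mean(seq))
# [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
```

Each position only sees itself and earlier positions, so the summary stays causal, and a single pass over the sequence replaces the S×S comparison that attention would need.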

3️⃣ MlpBlock (Channel Mixing)

Non-linear transformation of feature dimensions
Structure: Linear → GELU → Linear
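
The Linear → GELU → Linear structure can be written out for a single token's feature vector. This is a sketch in plain Python with tiny stand-in dimensions (4 and 8 instead of 224 and 576); the helper names, the use of biases, and the exact (erf-based) GELU variant are assumptions, not details taken from the repository.

```python
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """weight is laid out as [out][in]; returns weight @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp_block(x, w1, b1, w2, b2):
    """Linear -> GELU -> Linear over the feature (channel) dimension."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
hidden_dim, channel_mlp_dim = 4, 8  # tiny stand-ins for 224 and 576
w1, b1 = random_matrix(channel_mlp_dim, hidden_dim, rng), [0.0] * channel_mlp_dim
w2, b2 = random_matrix(hidden_dim, channel_mlp_dim, rng), [0.0] * hidden_dim

y = mlp_block([1.0, -2.0, 0.5, 0.0], w1, b1, w2, b2)
print(len(y))  # 4
```

The block expands each token to the wider channel dimension, applies the non-linearity, and projects back, so the output has the same width as the input and can be added to the residual stream.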

📈 Key Differences from 500K

Metric	500K	1M	Change
Parameters	557,328	967,584	1.7x
Hidden Dim	176	224	1.3x
Channel MLP	384	576	1.5x
Sequence Length	128	256	2x
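
The Change column is each 1M value divided by the corresponding 500K value, rounded to one decimal place. A short check using the numbers from the table:

```python
sizes_500k = {"Parameters": 557_328, "Hidden Dim": 176, "Channel MLP": 384, "Sequence Length": 128}
sizes_1m = {"Parameters": 967_584, "Hidden Dim": 224, "Channel MLP": 576, "Sequence Length": 256}

ratios = {name: sizes_1m[name] / sizes_500k[name] for name in sizes_500k}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
# Parameters: 1.7x
# Hidden Dim: 1.3x
# Channel MLP: 1.5x
# Sequence Length: 2.0x
```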

🎯 Generation Examples

Prompt: "Once upon a time"
Output: "Once upon a time there was a little girl named Lily. She loved t..."

Prompt: "Hello"
Output: "Hellog ann he grit litle girls. She love loved in the"

📊 Model Comparison

Model	Parameters	Hidden	Seq Len	Quality
100K	136,908	84	64	⭐
300K	331,680	128	128	⭐⭐
500K	557,328	176	128	⭐⭐⭐
1M	967,584	224	256	⭐⭐⭐⭐

⚠️ Limitations

Limitation	Description
Grammatical Errors	Fully grammatical sentences are still difficult to produce
Name Instability	Same name varies: "Lily", "Limmy", "Amby"
Short Prompt Issues	Short prompts such as "Hello" or "The weather is" produce near-random output
Overfitting	Overfits to specific TinyStories phrases

📊 Training Data

Dataset: TinyStories

A dataset of simple children's stories
The model learns basic grammar and vocabulary from it
The data contains many recurring patterns such as "Once upon a time" and "little girl/boy"

🔧 Usage

# Clone the repository first so that the `src` package is importable:
# git clone https://github.com/llaa33219/MicroMixer-1.git
# cd MicroMixer-1

import torch
from huggingface_hub import hf_hub_download
from src.model import MicroMixerV2, MicroMixerV2Config
from src.tokenizer import ByteTokenizer

# These values must match the Model Configuration table for the 1M checkpoint.
config = MicroMixerV2Config(
    max_seq_len=256,
    hidden_dim=224,
    channel_mlp_dim=576,
    num_layers=3,
    use_hyper=True,
)

# Build the model, then load the pretrained weights from the Hugging Face Hub.
model = MicroMixerV2(config)
weights_path = hf_hub_download("llaa33219/MicroMixer-1-1M-TinyStories", "model.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

# Encode the prompt as bytes and add a batch dimension.
tokenizer = ByteTokenizer()
input_ids = torch.tensor([tokenizer.encode("Once upon a time")])

# Sample up to 64 new bytes without tracking gradients.
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=64, temperature=0.8, top_k=40)

print(tokenizer.decode(output[0].tolist()))
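
The `temperature` and `top_k` arguments in the `generate` call above follow the usual sampling conventions. What a single sampling step typically does with them can be sketched in plain Python. `sample_top_k` is a hypothetical helper written for this example and is not the repository's implementation, which may differ in details.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Sample one index after temperature scaling and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    k = min(top_k, len(logits))
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = random.Random(0)
samples = [sample_top_k(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(1000)]
print(set(samples))  # only indices 0 and 1 can appear with top_k=2
```

A lower temperature sharpens the distribution toward the most likely byte, while a smaller `top_k` removes unlikely bytes entirely. Both tend to reduce garbled output at the cost of more repetitive text.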

Part of the MicroMixer-1 research project
