
MicroMixer-1-1M-TinyStories


Micro Language Model
Attention-Free • MLP-Only • Byte-Level

GitHub


📋 Overview

MicroMixer-1-1M is the largest model in the series, with roughly 1M parameters (967,584). It can generate sentences such as "Once upon a time there was a little girl named Lily" with reasonable fluency. It also supports the longest sequence length in the series: 256 tokens, which for this byte-level model means 256 bytes.


🏗️ Architecture

graph TD
    A[Byte Input] --> B[Token Embedding]
    B --> C[RoPE Position Encoding]
    C --> D[ImprovedMixerLayer ×3]
    D --> E[LayerNorm]
    E --> F[LM Head]
    F --> G[Byte Output]
    
    style A fill:#007BFF,color:#fff
    style G fill:#00D620,color:#fff
    style D fill:#AE00FF,color:#fff

Model Configuration

| Parameter | Value |
|---|---|
| Total Parameters | 967,584 |
| Hidden Dimension | 224 |
| Channel MLP Dimension | 576 |
| Number of Layers | 3 |
| Max Sequence Length | 256 |
| Vocabulary Size | 256 (Byte-level) |
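
Byte-level means every UTF-8 byte is one token, so a 256-entry vocabulary covers any input text and the 256-token limit is effectively a 256-byte limit. The sketch below illustrates the idea in plain Python. It assumes the repository's `ByteTokenizer` behaves roughly like this; `encode_bytes` and `decode_bytes` are hypothetical names, not the project's API.

```python
def encode_bytes(text):
    """Map text to a list of token ids in the range 0..255 (one per UTF-8 byte)."""
    return list(text.encode("utf-8"))


def decode_bytes(ids):
    """Map token ids back to text; invalid byte sequences become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")


ids = encode_bytes("Once upon a time")
print(len(ids), ids[:4])   # ASCII text costs one token per character
print(decode_bytes(ids))
```

Non-ASCII characters take two to four bytes each, so they use up the sequence budget faster than plain English text.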

Core Components

┌─────────────────────────────────────────────┐
│           ImprovedMixerLayer                │
│  ┌─────────────────────────────────────┐    │
│  │  LayerNorm → HyperMixing → Residual │    │ ← Token Mixing
│  ├─────────────────────────────────────┤    │
│  │  LayerNorm → MlpBlock → Residual    │    │ ← Channel Mixing
│  └─────────────────────────────────────┘    │
└─────────────────────────────────────────────┘

1️⃣ RoPE (Rotary Position Embedding)

  • Encodes positions via rotation transformations
  • Enables length extrapolation beyond training sequences
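
To make "rotation transformations" concrete, here is a minimal plain-Python sketch of the standard RoPE rotation applied to a single vector. It assumes the common formulation (consecutive channel pairs, base 10000); the card does not state which pairing convention or base the repository uses, and `apply_rope` is a hypothetical name.

```python
import math


def apply_rope(x, position, base=10000.0):
    """Rotate each channel pair (x[2i], x[2i+1]) by position * base**(-2i/d)."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of channels")
    out = []
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        # Standard 2-D rotation of the pair (a, b) by angle theta.
        out.extend([a * c - b * s, a * s + b * c])
    return out


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]
# Dot products depend only on the position offset (here 2), not on absolute positions.
print(dot(apply_rope(q, 5), apply_rope(k, 3)))
print(dot(apply_rope(q, 2), apply_rope(k, 0)))
```

Because each pair is only rotated, vector norms are unchanged and position enters purely as a phase.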

2️⃣ HyperMixing (Token Mixing)

  • Compresses past context via cumulative average pooling
  • Hypernetwork generates adaptive weights
  • Token mixing in O(S) time for sequence length S, without attention
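
The card does not spell out the HyperMixing formula. The sketch below shows only the part that is stated, causal cumulative average pooling computed in one pass over the sequence, and pairs it with a deliberately simple, hypothetical gate standing in for the hypernetwork. `toy_hyper_mix` and `w_gate` are illustrative assumptions, not the repository's implementation.

```python
import math


def causal_cumulative_mean(xs):
    """xs: list of S vectors. Row t of the result is the mean of xs[0..t].

    A running sum makes this a single pass, i.e. O(S) in sequence length.
    """
    running = [0.0] * len(xs[0])
    out = []
    for t, x in enumerate(xs, start=1):
        running = [r + v for r, v in zip(running, x)]
        out.append([r / t for r in running])
    return out


def toy_hyper_mix(xs, w_gate):
    """ASSUMPTION: a per-channel sigmoid gate computed from the pooled context
    blends the context with the current token. The real hypernetwork may differ."""
    mixed = []
    for x, ctx in zip(xs, causal_cumulative_mean(xs)):
        gates = [1.0 / (1.0 + math.exp(-w * c)) for w, c in zip(w_gate, ctx)]
        mixed.append([g * c + (1.0 - g) * v for g, c, v in zip(gates, ctx, x)])
    return mixed


tokens = [[1.0], [3.0], [5.0]]
print(causal_cumulative_mean(tokens))
print(toy_hyper_mix(tokens, w_gate=[0.0]))
```

Each output position only sees itself and earlier positions, so the mixing stays causal, which next-byte prediction requires.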

3️⃣ MlpBlock (Channel Mixing)

  • Non-linear transformation of feature dimensions
  • Structure: Linear → GELU → Linear
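
The Linear → GELU → Linear structure can be sketched in plain Python for a single token vector. This assumes the exact (erf-based) GELU and linear layers with bias; the card does not say whether the repository uses the tanh approximation or omits biases, and the function names are hypothetical.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """weight has one row per output channel; returns weight @ x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp_block(x, w1, b1, w2, b2):
    """Expand to the channel-MLP width, apply GELU, project back."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


# Toy sizes: hidden dim 2, channel-MLP dim 3 (the 1M model uses 224 and 576).
w1, b1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]
w2, b2 = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]], [0.0, 0.0]
print(mlp_block([1.0, -1.0], w1, b1, w2, b2))
```

The block acts on each position independently, which is why it mixes channels but not tokens.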

📈 Key Differences from 500K

| Metric | 500K | 1M | Change |
|---|---|---|---|
| Parameters | 557,328 | 967,584 | 1.7x |
| Hidden Dim | 176 | 224 | 1.3x |
| Channel MLP | 384 | 576 | 1.5x |
| Sequence Length | 128 | 256 | 2x |

🎯 Generation Examples

Prompt: "Once upon a time"
Output: "Once upon a time there was a little girl named Lily. She loved t..."

Prompt: "Hello"
Output: "Hellog ann he grit litle girls. She love loved in the"

📊 Model Comparison

| Model | Parameters | Hidden | Seq Len | Quality |
|---|---|---|---|---|
| 100K | 136,908 | 84 | 64 | ⭐ |
| 300K | 331,680 | 128 | 128 | ⭐⭐ |
| 500K | 557,328 | 176 | 128 | ⭐⭐⭐ |
| 1M | 967,584 | 224 | 256 | ⭐⭐⭐⭐ |

⚠️ Limitations

| Limitation | Description |
|---|---|
| Grammatical Errors | Fully grammatical sentences are still difficult to produce |
| Name Instability | The same name drifts between spellings: "Lily", "Limmy", "Amby" |
| Short Prompt Issues | Prompts such as "Hello" or "The weather is" produce near-random output |
| Overfitting | The model overfits to specific TinyStories phrases |

📊 Training Data

Dataset: TinyStories

  • Simple children's stories dataset
  • Learns basic grammar and vocabulary
  • Contains many patterns like "Once upon a time", "little girl/boy"

🔧 Usage

# The model and tokenizer classes are imported from the project repository.
# Clone it first and run this from the repository root so that `src` is importable:
#   git clone https://github.com/llaa33219/MicroMixer-1.git
#   cd MicroMixer-1

import torch
from huggingface_hub import hf_hub_download
from src.model import MicroMixerV2, MicroMixerV2Config
from src.tokenizer import ByteTokenizer

# Configuration matching the Model Configuration table above.
config = MicroMixerV2Config(
    max_seq_len=256,
    hidden_dim=224,
    channel_mlp_dim=576,
    num_layers=3,
    use_hyper=True,
)

# Build the model, then load the pretrained weights from the Hugging Face Hub.
model = MicroMixerV2(config)
weights_path = hf_hub_download("llaa33219/MicroMixer-1-1M-TinyStories", "model.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

# Encode the prompt and add a batch dimension.
tokenizer = ByteTokenizer()
input_ids = torch.tensor([tokenizer.encode("Once upon a time")])

# Sample up to 64 new tokens without tracking gradients.
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=64, temperature=0.8, top_k=40)

print(tokenizer.decode(output[0].tolist()))
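
For readers unfamiliar with the `temperature` and `top_k` arguments, the sketch below shows what they conventionally mean: keep the `top_k` highest-scoring candidates, divide their logits by the temperature, and sample from the resulting softmax. It is a plain-Python illustration with a hypothetical helper name, `sample_top_k`; the repository's `generate` method may implement sampling differently.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Pick one index from `logits` using top-k filtering and temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the k highest-scoring candidates.
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the survivors; lower temperature sharpens the distribution.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [0.1, 2.0, -1.0, 1.5]
print(sample_top_k(toy_logits, top_k=1))  # top_k=1 is greedy decoding
print(sample_top_k(toy_logits, temperature=0.8, top_k=2, rng=random.Random(0)))
```

With a 256-entry byte vocabulary, `top_k=40` discards the 216 least likely bytes at every step, which can help keep a small model from emitting stray bytes.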



Part of the MicroMixer-1 research project
