MicroMixer-2 Logo

MicroMixer-2-100K-discord-dialogues

Parameters Architecture Dataset

Micro Language Model
Attention-Free β€’ MLP-Only β€’ Byte-Level β€’ Conversational

GitHub


πŸ“‹ Overview

MicroMixer-2-100K-discord-dialogues is a ~125K parameter MLP-Mixer language model trained on Discord conversation data. This is the smallest model variant in the MicroMixer-2 family, designed for rapid experimentation and testing with 3 mixer layers and compact dimensions.


πŸ—οΈ Architecture

graph TD
    A[Byte Input] --> B[Token Embedding]
    B --> C[RoPE Position Encoding]
    C --> D[MicroMixerLayer Γ—3]
    D --> E[LayerNorm]
    E --> F[LM Head]
    F --> G[Byte Output]
    
    style A fill:#007BFF,color:#fff
    style G fill:#00D620,color:#fff
    style D fill:#AE00FF,color:#fff

Model Configuration

Parameter Value
Total Parameters124,764
Hidden Dimension84
Hyper Hidden Dimension48
Channel MLP Dimension128
Number of Layers3
Max Sequence Length64
Vocabulary Size256 (Byte-level)
DropPath Rate0.0
Label Smoothing0.0

Core Components

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           MicroMixerLayer                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚  LayerNorm β†’ HyperMixing β†’ Residual β”‚    β”‚ ← Token Mixing
β”‚  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€    β”‚
β”‚  β”‚  LayerNorm β†’ MlpBlock β†’ Residual    β”‚    β”‚ ← Channel Mixing
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

1️⃣ RoPE (Rotary Position Embedding)

  • Encodes positions via rotation transformations
  • Enables length extrapolation beyond training sequences

2️⃣ HyperMixing (Token Mixing)

  • Compresses past context via cumulative average pooling
  • Hypernetwork generates adaptive weights
  • O(S) complexity token mixing without attention

3️⃣ MlpBlock (Channel Mixing)

  • Non-linear transformation of feature dimensions
  • Structure: Linear β†’ GELU β†’ Linear

🎯 Generation Examples

[Prompt] User: Hello
[Output] Assistant:ing when make what of your thing acke in conost im

[Prompt] User: How are you?
[Output] Assistant:
Aser: Your you got the loo
Assistant: Me I molin

[Prompt] User: What is your name?
[Output] Assistant:i mout good ind the do peale ponthon me, por isout

πŸ“Š Training Results

Metric Value
Train Loss 1.6519
Train PPL 5.22
Val Loss 1.5910
Val PPL 4.91
Epoch 3
Global Steps 140

πŸ“Š Training Data

Dataset: Discord-Dialogues

  • Discord conversation samples
  • Converted from ChatML to User/Assistant format
  • Multi-turn conversational data
  • Sequence length: 64 tokens

πŸ”§ Usage

import torch
from huggingface_hub import hf_hub_download
from src.model import MicroMixer, MicroMixerConfig
from src.tokenizer import ByteTokenizer

# Clone the repository first:
# git clone https://github.com/llaa33219/MicroMixer-2.git
# cd MicroMixer-2

config = MicroMixerConfig(
    max_seq_len=64,
    hidden_dim=84,
    hyper_hidden_dim=48,
    channel_mlp_dim=128,
    num_layers=3,
)

model = MicroMixer(config)
weights_path = hf_hub_download("llaa33219/MicroMixer-2-100K-discord-dialogues", "model.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

tokenizer = ByteTokenizer()
input_ids = torch.tensor([tokenizer.encode("User: Hello
Assistant:")])

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=64, temperature=0.7, top_k=40)

print(tokenizer.decode(output[0].tolist()))


⚠️ Limitations

Limitation Description
Very Small Model Only ~125K parameters
Very Limited Context Max 64 tokens sequence length
Minimal Capacity Insufficient for coherent language generation
Research Use Only Primarily for architecture testing

GitHub

Part of the MicroMixer-2 research project

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Dataset used to train llaa33219/MicroMixer-2-100K-discord-dialogues

Collection including llaa33219/MicroMixer-2-100K-discord-dialogues