GPT-X2-125M CIx Long-Context

Model Summary

This model is a custom-code derivative of AxiomicLabs/GPT-X2-125M, adapted for experimental long-context causal language modeling and architecture research.

The repository includes a Hugging Face Transformers-compatible GPT-X2 implementation with optional Symplectic Metric-RoPE Governor support, plus training utilities built around CIxOpt. CIxOpt is a heterogeneous optimizer that routes different parameter types (large projection matrices, sensitive normalization parameters, and optional governor modules) to different update rules.

The model is intended as a research checkpoint for compact long-context generation, positional encoding experiments, optimizer testing, and continued fine-tuning.

Base Model

  • Base model: AxiomicLabs/GPT-X2-125M
  • Model family: GPT-X2
  • Task: Causal language modeling / text generation
  • Language: English
  • Library: Hugging Face Transformers with custom code
  • License: Apache 2.0, unless otherwise restricted by upstream model or dataset terms

Architecture

This implementation uses a compact decoder-only GPT-X2-style architecture.

Default configuration:

```text
model_type: gptx2
vocab_size: 32768
hidden_size: 576
num_hidden_layers: 30
num_attention_heads: 9
num_key_value_heads: 3
head_dim: 64
intermediate_size: 1536
max_position_embeddings: 32768
rope_theta: 100000.0
rms_norm_eps: 1e-6
tie_word_embeddings: true
```
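
As a sanity check on the configuration above, the following sketch estimates the parameter count and the key/value cache footprint. It assumes bias-free linear layers, a three-matrix SwiGLU MLP, two RMSNorm weights per block plus one final norm, and a cache stored in bf16; none of these details are confirmed by the repository, and the function names are illustrative.

```python
# Values copied from the default configuration listed above.
CONFIG = {
    "vocab_size": 32768,
    "hidden_size": 576,
    "num_hidden_layers": 30,
    "num_attention_heads": 9,
    "num_key_value_heads": 3,
    "head_dim": 64,
    "intermediate_size": 1536,
}


def count_parameters(cfg):
    """Estimate parameters, assuming no biases and tied embeddings."""
    hidden = cfg["hidden_size"]
    q_dim = cfg["num_attention_heads"] * cfg["head_dim"]
    kv_dim = cfg["num_key_value_heads"] * cfg["head_dim"]
    # Query and output projections use q_dim; key and value use the smaller kv_dim.
    attention = hidden * q_dim + 2 * hidden * kv_dim + q_dim * hidden
    # Assumed SwiGLU layout: gate, up, and down projections.
    mlp = 3 * hidden * cfg["intermediate_size"]
    norms = 2 * hidden  # assumed: one RMSNorm before attention, one before the MLP
    per_layer = attention + mlp + norms
    embeddings = cfg["vocab_size"] * hidden  # counted once because weights are tied
    final_norm = hidden
    return embeddings + cfg["num_hidden_layers"] * per_layer + final_norm


def kv_cache_bytes(cfg, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    per_token = (
        2  # one key tensor and one value tensor
        * cfg["num_hidden_layers"]
        * cfg["num_key_value_heads"]
        * cfg["head_dim"]
    )
    return per_token * seq_len * bytes_per_value


if __name__ == "__main__":
    print(count_parameters(CONFIG))
    print(kv_cache_bytes(CONFIG, 32768) / 1024**2, "MiB")
```

Under these assumptions the estimate is 125,077,824 parameters, consistent with the 125M in the model name, and a full 32,768-token cache takes 720 MiB in bf16. With 9 key/value heads instead of 3, the cache would be three times larger, which is the practical benefit of grouped-query attention at long context.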

Core architecture features:

  • Decoder-only causal language model
  • 30 transformer blocks
  • 576 hidden size
  • 9 query attention heads
  • 3 key/value heads
  • Grouped-query attention
  • Rotary position embeddings
  • Optional YaRN-style RoPE scaling support
  • RMSNorm
  • SwiGLU MLP layers
  • Tied input and output embeddings
  • Dynamic cache support for generation
  • Left-padding-aware position ID handling
  • Safe causal language modeling loss behavior when labels are masked
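
The left-padding item in the list above can be illustrated with a small sketch. It shows the common approach of deriving position IDs from the attention mask so that the first real token of every row gets position 0 regardless of how much padding precedes it. This is a plain-Python illustration under that assumption, not the repository's implementation, and `left_pad_position_ids` is a hypothetical name.

```python
def left_pad_position_ids(attention_mask):
    """Build position IDs from a 0/1 attention mask.

    attention_mask: list of rows, 1 for real tokens and 0 for padding.
    Real tokens are numbered 0, 1, 2, ... in order; padded slots get a
    placeholder of 0, which is harmless because attention ignores them.
    """
    position_ids = []
    for row in attention_mask:
        next_position = 0
        ids = []
        for keep in row:
            if keep:
                ids.append(next_position)
                next_position += 1
            else:
                ids.append(0)  # placeholder for a masked slot
        position_ids.append(ids)
    return position_ids


if __name__ == "__main__":
    mask = [
        [0, 0, 1, 1, 1],  # two left-padding tokens
        [1, 1, 1, 1, 1],  # no padding
    ]
    for row in left_pad_position_ids(mask):
        print(row)
```

Without this correction, a left-padded prompt would start at a nonzero rotary position, so the same text could be encoded differently depending on the batch it was padded into.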

Symplectic Metric-RoPE Governor

This checkpoint's codebase includes an optional experimental mechanism called the Symplectic Metric-RoPE Governor.

The governor is designed to test whether rotary position encoding can be modulated through a learned phase-space control layer while preserving identity behavior at initialization.

When enabled, the governor adds:

  • Global Hamiltonian-style clock state
  • Per-layer local symplectic clock state
  • Metric projection modules
  • Mass and spin deltas for rotary frequency modulation
  • Context-aware clock updates
  • Beam-search-safe clock-state reordering
  • Clock diagnostics
  • Optional clock regularization terms

The metric projection layers are designed to initialize safely: with the governor projections zeroed, the model starts from standard RoPE behavior and only deviates from it as the projections are learned.
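
The identity-at-initialization property can be sketched in a few lines. The example below is a deliberately simplified stand-in: it assumes the governor contributes a multiplicative correction `1 + delta` to each rotary inverse frequency, with `delta` produced by a linear projection. That functional form, and the names `governed_inv_freq`, `features`, and `weight`, are assumptions for illustration; the actual governor involves clock states and metric modules that are not modeled here.

```python
def rope_inv_freq(head_dim=64, theta=100000.0):
    """Standard RoPE inverse frequencies: theta ** (-2i / head_dim)."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def governed_inv_freq(inv_freq, features, weight):
    """Apply a hypothetical learned correction to each frequency.

    weight has one row per frequency and one column per feature.
    delta = weight @ features, and each frequency is scaled by 1 + delta.
    """
    adjusted = []
    for freq, weight_row in zip(inv_freq, weight):
        delta = sum(w * x for w, x in zip(weight_row, features))
        adjusted.append(freq * (1.0 + delta))
    return adjusted


if __name__ == "__main__":
    base = rope_inv_freq()
    features = [0.3, -1.2, 0.8]  # arbitrary illustrative input
    zero_weight = [[0.0] * len(features) for _ in base]
    # With zeroed projections the correction vanishes and RoPE is unchanged.
    print(governed_inv_freq(base, features, zero_weight) == base)
```

The design point is that a zero-initialized projection makes the governed model numerically identical to the base model at step 0, so any change in behavior can be attributed to what the governor learns.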

Relevant configuration fields include:

```text
use_symplectic_rope
symplectic_global_k_dim
symplectic_local_k_dim
symplectic_global_dt
symplectic_local_dt
symplectic_n_steps
symplectic_context_scale
symplectic_momentum_inject_scale
metric_grad_scale
metric_mass_global_scale
metric_spin_global_scale
metric_mass_local_scale
metric_spin_local_scale
metric_radial_base
clock_reg_coeff
clock_metric_reg_coeff
clock_state_reg_coeff
clock_smooth_reg_coeff
return_clock_diagnostics
```
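
Because these field names are long and easy to mistype, a small guard can catch errors before they reach the model config. The helper below is a hypothetical convenience, not part of the repository; it only checks names against the list above and makes no claim about valid values or types.

```python
# Field names copied from the list above.
GOVERNOR_FIELDS = frozenset({
    "use_symplectic_rope",
    "symplectic_global_k_dim",
    "symplectic_local_k_dim",
    "symplectic_global_dt",
    "symplectic_local_dt",
    "symplectic_n_steps",
    "symplectic_context_scale",
    "symplectic_momentum_inject_scale",
    "metric_grad_scale",
    "metric_mass_global_scale",
    "metric_spin_global_scale",
    "metric_mass_local_scale",
    "metric_spin_local_scale",
    "metric_radial_base",
    "clock_reg_coeff",
    "clock_metric_reg_coeff",
    "clock_state_reg_coeff",
    "clock_smooth_reg_coeff",
    "return_clock_diagnostics",
})


def governor_overrides(**overrides):
    """Return the overrides unchanged, rejecting any undocumented name."""
    unknown = sorted(set(overrides) - GOVERNOR_FIELDS)
    if unknown:
        raise KeyError(f"not a documented governor field: {unknown}")
    return dict(overrides)


if __name__ == "__main__":
    # Assumes the flag is a boolean, which its name suggests but the card does not state.
    print(governor_overrides(use_symplectic_rope=True))
```

The resulting dictionary could then be passed as keyword overrides when loading the config, for example `AutoConfig.from_pretrained(model_id, trust_remote_code=True, **overrides)`, since Transformers applies keyword arguments that match config attributes as overrides.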

CIxOpt Optimizer

Training and experimentation were designed around CIxOpt, a custom heterogeneous optimizer.

CIxOpt supports:

  • AdamW-style adaptive updates
  • Lion-style sign momentum
  • AdaMax routing
  • ASGD-style averaging
  • Optional low-rank projected momentum
  • Native foreach vectorization where dimensions allow
  • Gradient centralization
  • Decoupled weight decay
  • Discrepancy-aware caution filtering for sign updates
  • Activation-aware decay hooks
  • fp32 optimizer state for fp16/bf16 parameter safety
  • Parameter-name registration for architecture-aware routing
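
To make the sign-momentum and caution-filtering items concrete, here is a minimal scalar sketch of a Lion-style update with decoupled weight decay. The optional `cautious` flag zeroes any update whose direction disagrees with the current gradient; this is one plausible reading of "discrepancy-aware caution filtering" and is an assumption, not CIxOpt's documented behavior. The function names and default hyperparameters are illustrative.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, momentum, lr=1e-4, beta1=0.9, beta2=0.99,
              weight_decay=0.0, cautious=False):
    """One Lion-style step over flat lists of floats.

    Returns the updated parameters and momentum as new lists.
    """
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # The update direction is the sign of an interpolation of momentum and gradient.
        direction = sign(beta1 * m + (1.0 - beta1) * g)
        # Assumed caution rule: drop the update when it opposes the gradient sign.
        if cautious and direction * g <= 0:
            direction = 0
        # Decoupled weight decay shrinks the parameter independently of the gradient.
        new_params.append(p * (1.0 - lr * weight_decay) - lr * direction)
        # Momentum is refreshed after the direction has been computed.
        new_momentum.append(beta2 * m + (1.0 - beta2) * g)
    return new_params, new_momentum


if __name__ == "__main__":
    params, momentum = lion_step([1.0, -2.0], [0.5, -0.1], [0.0, 0.0], lr=0.1)
    print(params, momentum)
```

Every coordinate moves by exactly `lr` in magnitude, which is why sign-based rules are often paired with a smaller learning rate than AdamW and why a per-parameter-type routing scheme may prefer them only for large matrices.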

The optimizer can route different parts of the model differently. For example:

```text
large projection matrices        -> sign-momentum style updates
normalization / sensitive params -> AdamW-style updates
embedding / lm-head surfaces     -> AdaMax-compatible routing
governor / clock parameters      -> precise adaptive update path
```

This makes the model useful for studying optimizer behavior in compact architectures where not every parameter type should be treated the same way.
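
The routing table above can be sketched as a name-based classifier. The substrings matched here (`governor`, `clock`, `embed`, `lm_head`, `norm`) and the route labels are assumptions chosen for illustration; CIxOpt's real registration API and matching rules may differ.

```python
def route_parameter(name, ndim):
    """Pick an update rule for a parameter from its name and rank.

    name: dotted parameter name, as produced by named_parameters().
    ndim: number of tensor dimensions.
    """
    lowered = name.lower()
    if "governor" in lowered or "clock" in lowered:
        return "adaptive_precise"
    if "embed" in lowered or "lm_head" in lowered:
        return "adamax"
    # Norm weights and other vectors are treated as sensitive parameters.
    if "norm" in lowered or ndim < 2:
        return "adamw"
    # Remaining matrices are the large projections.
    return "sign_momentum"


def group_by_route(named_ranks):
    """Group parameter names by route; named_ranks yields (name, ndim) pairs."""
    groups = {}
    for name, ndim in named_ranks:
        groups.setdefault(route_parameter(name, ndim), []).append(name)
    return groups


if __name__ == "__main__":
    # Hypothetical parameter names in the usual Transformers naming style.
    example = [
        ("model.embed_tokens.weight", 2),
        ("model.layers.0.self_attn.q_proj.weight", 2),
        ("model.layers.0.input_layernorm.weight", 1),
        ("model.rope_governor.global_clock.weight", 2),
    ]
    for route, names in group_by_route(example).items():
        print(route, names)
```

With a real model, the `(name, ndim)` pairs could come from `((n, p.ndim) for n, p in model.named_parameters())`, and each group would be handed to the optimizer as its own parameter group.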

Intended Use

This model is intended for:

  • Compact causal language modeling research
  • Long-context generation experiments
  • RoPE / YaRN / positional encoding studies
  • Symplectic Metric-RoPE experiments
  • CIxOpt optimizer testing
  • Continuation training experiments
  • Architecture ablations
  • Instruction-style fine-tuning experiments
  • Lightweight local text-generation prototypes

Out-of-Scope Use

This is an experimental research checkpoint. It should not be used as a sole authority or autonomous decision-maker in high-stakes settings.

Do not rely on this model alone for:

  • Medical advice
  • Legal conclusions
  • Financial decisions
  • Emergency response
  • Personnel screening
  • Critical infrastructure operations
  • Surveillance targeting
  • Autonomous cyber operations
  • Any setting requiring verified factual accuracy

Limitations

Known or expected limitations:

  • May hallucinate facts, dates, citations, or technical details
  • May inherit limitations from AxiomicLabs/GPT-X2-125M
  • May be sensitive to prompt formatting
  • Long-context support does not guarantee accurate long-range reasoning
  • Symplectic Metric-RoPE behavior is experimental
  • CIxOpt-based training may produce behavior different from AdamW-trained baselines
  • Safety behavior has not been fully evaluated
  • Benchmark results are not yet included
  • Generated outputs should be reviewed before use

Installation

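Install the core dependencies. The accelerate package is required by the device_map="auto" argument used in the loading examples:

```bash
pip install torch transformers safetensors accelerate
```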

Because this model uses custom architecture code, pass the following argument to every from_pretrained call:

```python
trust_remote_code=True
```

Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Fall back to EOS for padding if the tokenizer defines no pad token.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompt = "Explain why stable positional encoding matters for long-context language models."
inputs = tokenizer(
    prompt,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.05,
        pad_token_id=tokenizer.eos_token_id,
    )

# The decoded text includes the prompt followed by the continuation.
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Chat-Style Usage

If the tokenizer includes a chat template:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": "What is Symplectic Metric-RoPE and why might it help long-context modeling?",
    }
]

# Render the conversation with the tokenizer's chat template and tokenize it.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=384,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.05,
        pad_token_id=tokenizer.eos_token_id,
    )

# Drop the prompt tokens so that only the newly generated reply is decoded.
generated = output[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```

Suggested Generation Settings

Balanced exploratory generation:

```python
generation_config = {
    "max_new_tokens": 384,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "repetition_penalty": 1.05,
}
```

More deterministic generation:

```python
generation_config = {
    "max_new_tokens": 384,
    "do_sample": False,
}
```
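
For readers unfamiliar with the two sampling knobs in the exploratory settings, the sketch below shows what temperature and top_p do to a next-token distribution. It is a plain-Python illustration of temperature scaling followed by nucleus (top-p) filtering, not the Transformers implementation, and `nucleus_candidates` is a hypothetical name.

```python
import math


def nucleus_candidates(logits, temperature=0.7, top_p=0.95):
    """Return {token_index: probability} for tokens that survive top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive when sampling")
    # Temperatures below 1 sharpen the distribution; above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(weights)
    probs = [w / total for w in weights]
    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


if __name__ == "__main__":
    logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
    print(sorted(nucleus_candidates(logits, temperature=1.0, top_p=0.9)))
    print(sorted(nucleus_candidates(logits, temperature=0.5, top_p=0.9)))
```

Lowering the temperature concentrates probability on the top tokens, so fewer candidates are needed to reach the same top_p mass. Setting do_sample to False skips this step entirely and always takes the most likely token.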

Training Notes

This model is trained as a causal language model.

Recommended training setup:

```text
loss: causal language modeling loss
padding labels: -100
optimizer: CIxOpt or AdamW-compatible optimizer
gradient clipping: recommended
use_cache during training: false
mixed precision: bf16 preferred where supported
```

When training with padded batches, labels should mask padding tokens:

```python
labels = input_ids.clone()
labels[attention_mask == 0] = -100
```
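
Masking has an edge case: if every label in a batch is -100, a mean over zero tokens is undefined. The sketch below shows a shifted causal LM loss that skips ignored labels and returns 0.0 when nothing is left to score. Returning zero in that case is an assumption about what "safe" loss behavior means here, and the function is a plain-Python illustration rather than the repository's code.

```python
import math

IGNORE_INDEX = -100


def safe_causal_lm_loss(logits, labels):
    """Mean next-token cross-entropy over unmasked labels.

    logits: one list of vocabulary scores per position.
    labels: one target token ID per position, or IGNORE_INDEX to skip.
    """
    total, count = 0.0, 0
    # Causal shift: the scores at position t predict the label at position t + 1.
    for scores, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue
        peak = max(scores)
        log_normalizer = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_normalizer - scores[target]
        count += 1
    # Guard against dividing by zero when every label is masked.
    return total / count if count else 0.0


if __name__ == "__main__":
    uniform = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
    print(safe_causal_lm_loss(uniform, [0, 1, 0]))
    print(safe_causal_lm_loss(uniform, [IGNORE_INDEX] * 3))
```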

For chat-style supervised fine-tuning, assistant-only label masking is recommended when possible.
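
Assistant-only masking can be sketched as follows, assuming the token spans of the assistant turns are already known. How those spans are located depends on the chat template and is not shown; `mask_non_assistant` and the half-open `(start, end)` span format are illustrative choices, not part of the repository.

```python
IGNORE_INDEX = -100


def mask_non_assistant(input_ids, assistant_spans):
    """Return labels that keep only tokens inside assistant spans.

    input_ids: list of token IDs for one conversation.
    assistant_spans: list of (start, end) index pairs, end exclusive.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        if not 0 <= start <= end <= len(input_ids):
            raise ValueError(f"span ({start}, {end}) is out of range")
        # Copy the original IDs back in for positions the model should learn.
        labels[start:end] = input_ids[start:end]
    return labels


if __name__ == "__main__":
    # Hypothetical IDs: three prompt tokens followed by three reply tokens.
    print(mask_non_assistant([11, 12, 13, 14, 15, 16], [(3, 6)]))
```

With this masking, system and user tokens still provide context through attention but contribute nothing to the loss, so the model is trained only to produce the assistant's side of the conversation.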

Evaluation

Formal benchmark results have not yet been added.

Recommended evaluations:

  • Held-out perplexity
  • Short-context and long-context generation checks
  • IFEval-style instruction following
  • Small reasoning suites
  • Repetition and degeneration testing
  • Side-by-side comparison against AxiomicLabs/GPT-X2-125M
  • Long-context retrieval and recall probes
  • Governor-on vs governor-off ablations, if applicable
  • CIxOpt vs AdamW optimizer comparisons
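
Two of the checks above reduce to short formulas. The sketch below computes perplexity from per-token log-probabilities and a simple repetition score based on the fraction of repeated n-grams. The function names are illustrative, and the repetition score is one common heuristic rather than a metric prescribed by this card.

```python
import math


def perplexity(token_log_probs):
    """Perplexity is the exponential of the mean negative log-likelihood."""
    if not token_log_probs:
        raise ValueError("at least one token log-probability is required")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def repeated_ngram_fraction(tokens, n=3):
    """Fraction of n-grams that duplicate an earlier n-gram in the sequence.

    0.0 means no repetition; values near 1.0 indicate degenerate looping.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every held-out token.
    print(perplexity([math.log(0.25)] * 8))
    print(repeated_ngram_fraction([1, 2, 1, 2, 1, 2], n=2))
```

For the side-by-side comparison against the base model, both numbers should be computed on the same held-out text and the same prompts, since perplexity is only comparable between models that share a tokenizer.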

Safety and Responsible Use

This model may produce plausible but incorrect outputs. Users should independently verify important claims.

Before deployment, evaluate for:

  • Hallucination rate
  • Bias and toxicity
  • Prompt injection sensitivity
  • Refusal behavior
  • Domain-specific factuality
  • Robustness under long-context prompting
  • Failure modes introduced by custom positional encoding experiments

Citation

Base model:

```bibtex
@misc{axiomiclabs_gptx2_125m,
  title        = {GPT-X2-125M},
  author       = {AxiomicLabs},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/AxiomicLabs/GPT-X2-125M}}
}
```

If referencing this derivative checkpoint, cite the model repository and base model together.

Author / Maintainer

Fine-tuning, custom architecture work, and optimizer experimentation by: Convergent Intelligence LLC

Research areas include AI systems, mathematical frameworks, intelligence analysis, optimizer design, and efficient language-model adaptation.

Disclaimer

This checkpoint is provided for research and experimentation. It is not a verified expert system. Outputs require human review, especially in factual, technical, legal, medical, financial, operational, or safety-critical settings.
