NOTE

This version is not fully trained. It was trained on only a fraction of WikiText-103, so it will be highly unreliable overall, but the proof of concept is present and a fully trained version will follow.

On a slice of the WikiText-103 validation set (roughly 3,000 articles), it reaches 92% accuracy relative to ModernBERT and 71% relative to Longformer. When fully saturated with the necessary information, this geometric memory will create deeply complex encodings across many spectra. For now it is only an approximation: an EFFECTIVE approximation, but still an approximation of what's to come.

GEOLIP-BERT-8192 β€” Teacher-Distilled Geometric Memory

Extends BERT-large (512-token context) to an effective 8192-token context via geometric recurrent memory, trained by distillation from frozen long-context teachers.

Architecture

Document (up to 8192 tokens)
    β”‚
    β”œβ”€β”€ ModernBERT-large ─── 8192 ctx, full bidirectional ─── teacher
    β”‚   (frozen, 395M)
    β”‚
    β”œβ”€β”€ Longformer-large ─── 4096 ctx, sliding+global ──── teacher
    β”‚   (frozen, 435M)
    β”‚
    └── BERT-large ──────── 480 ctx Γ— 16 segments ──────── student
        (frozen, 334M)
        + Geometric Memory System (49,345,544 trainable params)
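The student path processes a long document as fixed-width segments. A minimal sketch of that segmentation (the 480-token width and 16-segment cap come from the diagram; the non-overlapping split and the `split_into_segments` helper are assumptions, not part of the repo):

```python
def split_into_segments(token_ids, seg_len=480, max_segments=16):
    """Split a long token-id sequence into fixed-width student segments."""
    segments = [token_ids[i:i + seg_len]
                for i in range(0, len(token_ids), seg_len)]
    return segments[:max_segments]

doc = list(range(8192))            # stand-in for 8192 token ids
segs = split_into_segments(doc)
print(len(segs), len(segs[0]))     # 16 480
```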

Memory System Components

| Component          | Params | Function                                          |
|--------------------|--------|---------------------------------------------------|
| Depth Compressor   | ~19M   | 8-layer CLS profile (8192-dim) β†’ 1024-dim anchor  |
| Geometric Bank     | ~17M   | 128-anchor store + 2-layer cross-attention        |
| GRU Gate           | ~6M    | Controls memory token updates between segments    |
| Layer Fusion       | ~1M    | Learned weighted sum of 8 BERT layers             |
| Teacher Projectors | ~2M    | Procrustes-initialized Linear(1024β†’1024) Γ— 2      |
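As a rough illustration of the GRU Gate row above, here is a hedged sketch of gating memory-token updates between segments with `torch.nn.GRUCell`; the `MemoryGate` class, its shapes, and the per-token application are assumptions, not the repo's implementation:

```python
import torch
import torch.nn as nn

class MemoryGate(nn.Module):
    """Hypothetical sketch: GRU-gated update of memory tokens per segment."""
    def __init__(self, dim=1024):
        super().__init__()
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, memory, segment_summary):
        # memory, segment_summary: (batch, n_mem, dim)
        b, n, d = memory.shape
        # treat each memory token as an independent GRU hidden state
        updated = self.gru(segment_summary.reshape(b * n, d),
                           memory.reshape(b * n, d))
        return updated.view(b, n, d)

gate = MemoryGate(dim=64)
mem = torch.zeros(2, 4, 64)        # previous memory tokens
summ = torch.randn(2, 4, 64)       # new per-segment evidence
new_mem = gate(mem, summ)
print(new_mem.shape)               # torch.Size([2, 4, 64])
```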

Multi-Layer Extraction

Extracts from BERT layers: (2, 5, 8, 11, 14, 17, 20, 23) spanning syntax β†’ mid-semantic β†’ deep-semantic β†’ task output.
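A hedged sketch of the extraction-plus-fusion step, assuming the standard Hugging Face `output_hidden_states` layout (index 0 is the embedding output, so layer k is `hidden_states[k]`); the softmax parameterization of the learned weights is an assumption, and fake tensors stand in for a real BertModel:

```python
import torch
import torch.nn as nn

LAYERS = (2, 5, 8, 11, 14, 17, 20, 23)

class LayerFusion(nn.Module):
    """Learned weighted sum over a fixed set of encoder layers."""
    def __init__(self, n_layers=len(LAYERS)):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(n_layers))

    def forward(self, hidden_states):
        # hidden_states: tuple of (batch, seq, dim) tensors, one per layer
        stacked = torch.stack([hidden_states[i] for i in LAYERS], dim=0)
        w = torch.softmax(self.weights, dim=0).view(-1, 1, 1, 1)
        return (w * stacked).sum(dim=0)

# toy check: 25 fake states = embeddings + 24 layers of BERT-large
fake = tuple(torch.randn(1, 8, 16) for _ in range(25))
fused = LayerFusion()(fake)
print(fused.shape)                 # torch.Size([1, 8, 16])
```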

Procrustes Pre-Alignment

Static orthogonal Procrustes computed once before training:

  • BERT β†’ ModernBERT: cos 0.003 β†’ 0.489
  • BERT β†’ Longformer: cos -0.001 β†’ 0.521

Used to initialize the teacher projectors, which are then fine-tuned during training.
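Orthogonal Procrustes has a closed-form SVD solution, which is presumably what seeds the projector weights; a minimal sketch (dimensions shrunk for the toy check; the real anchors are 1024-dim):

```python
import numpy as np

def orthogonal_procrustes(A, B):
    """Orthogonal R minimizing ||A @ R - B||_F, via SVD of A^T B."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 64))                   # student CLS vectors
R_true = np.linalg.qr(rng.standard_normal((64, 64)))[0]
B = A @ R_true                                       # toy teacher CLS vectors
R = orthogonal_procrustes(A, B)
print(np.allclose(R, R_true, atol=1e-6))             # True
```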

Training

  • Data: WikiText-103
  • Loss: InfoNCE distillation (student vs teacher CLS) + pentachoron CV
  • Teachers: ModernBERT-large (8192 ctx), Longformer-large (4096 ctx)
  • Student: BERT-large (frozen) + geometric memory (trainable)
  • Hardware: NVIDIA RTX PRO 6000 Blackwell (102 GB)

Geometric Regularization

Bank anchors regularized via Cayley-Menger pentachoron volumes (coefficient of variation β†’ target 0.20). This maintains uniform geometric structure in the anchor space, preventing collapse.
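The Cayley-Menger construction can be sketched as follows; only the volume formula and the CV target of 0.20 come from the text, while the sampling of 5-anchor subsets and the exact loss form are assumptions:

```python
import math
import numpy as np

def pentachoron_volume(points):
    """Volume of the 4-simplex on 5 points via the Cayley-Menger determinant."""
    n = len(points) - 1                                    # simplex dim: 4
    d2 = np.square(points[:, None] - points[None]).sum(-1) # squared distances
    cm = np.ones((n + 2, n + 2))
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2
    coeff = (-1) ** (n + 1) / (2 ** n * math.factorial(n) ** 2)
    return math.sqrt(max(coeff * np.linalg.det(cm), 0.0))

def cv_loss(volumes, target=0.20):
    """Penalize deviation of the volume coefficient of variation from target."""
    cv = np.std(volumes) / np.mean(volumes)
    return (cv - target) ** 2

# check: the unit 4-simplex (origin + 4 basis vectors) has volume 1/4! = 1/24
pts = np.vstack([np.zeros(4), np.eye(4)])
print(abs(pentachoron_volume(pts) - 1 / 24) < 1e-9)        # True
```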

Connection to GEOLIP-Bertenstein

This model applies the Bertenstein pattern to context length:

  • Bertenstein: frozen modal experts teach a shared geometric space (cross-modal)
  • GEOLIP-BERT-8192: frozen long-context experts teach a memory system (cross-context)

Both exploit the same insight: a frozen expert provides a stable reference frame that prevents geometric collapse during self-supervised training.


Usage

```python
from transformers import BertTokenizer
from deep_bert_v3 import DeepBertV3, DeepBertV3Config
from safetensors.torch import load_file

config = DeepBertV3Config()
model = DeepBertV3(config).to("cuda").eval()

# Load trained memory system weights
weights = load_file("checkpoints/best/memory_system.safetensors")
model.load_state_dict(weights, strict=False)

tokenizer = BertTokenizer.from_pretrained("google-bert/bert-large-uncased")
state = model.init_state(batch_size=1, device="cuda")

# Process document segment by segment, threading the recurrent memory state
for segment_text in document_segments:
    tokens = tokenizer(segment_text, return_tensors="pt", max_length=480,
                       padding="max_length", truncation=True).to("cuda")
    outputs, state = model(tokens["input_ids"], tokens["attention_mask"], state)

# Final output encodes full document context
document_embedding = outputs["memory_output"]  # (1, 1024)
```

License

Apache 2.0

