## NOTE

This version is not fully trained. It was trained on only a fraction of WikiText-103, so it will be highly unreliable overall, but the proof of concept is present and a full version will follow.

On that data it reaches 92% accuracy relative to ModernBERT and 71% relative to Longformer, measured on a slice of the WikiText-103 validation set (roughly 3,000 articles). Once fully saturated with the necessary information, this geometric memory will create deeply complex encodings across many spectra. For now it is only an approximation: an EFFECTIVE approximation, but still an approximation of what is to come.
# GEOLIP-BERT-8192: Teacher-Distilled Geometric Memory

Extends BERT-large (512 ctx) to an effective 8192-token context via geometric recurrent memory, trained by distillation from frozen long-context teachers.
## Architecture

```
Document (up to 8192 tokens)
│
├── ModernBERT-large ─── 8192 ctx, full bidirectional ─── teacher
│     (frozen, 395M)
│
├── Longformer-large ─── 4096 ctx, sliding+global ─────── teacher
│     (frozen, 435M)
│
└── BERT-large ───────── 480 ctx × 16 segments ────────── student
      (frozen, 334M)
      + Geometric Memory System (49,345,544 trainable params)
```
## Memory System Components

| Component | Params | Function |
|---|---|---|
| Depth Compressor | ~19M | 8-layer CLS profile (8192-dim) → 1024-dim anchor |
| Geometric Bank | ~17M | 128-anchor store + 2-layer cross-attention |
| GRU Gate | ~6M | Controls memory token updates between segments |
| Layer Fusion | ~1M | Learned weighted sum of 8 BERT layers |
| Teacher Projectors | ~2M | Procrustes-initialized Linear(1024→1024) × 2 |
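The GRU Gate's role between segments can be sketched with a stock `torch.nn.GRUCell`: the gate decides how much of each new segment summary overwrites the carried memory state. The exact wiring inside the memory system is not documented on this card, so treat the sizes and module layout below as assumptions.

```python
import torch

# Assumed sketch of the gated memory update between segments. The 1024-dim
# size matches the card's anchor space; the GRUCell stand-in is a guess.
mem_dim = 1024
gate = torch.nn.GRUCell(input_size=mem_dim, hidden_size=mem_dim)

memory = torch.zeros(1, mem_dim)               # state before the first segment
for _ in range(16):                            # 16 segments of up to 480 tokens
    segment_summary = torch.randn(1, mem_dim)  # stand-in for a fused CLS vector
    memory = gate(segment_summary, memory)     # gated update, shape preserved
```

The update keeps the memory a fixed-size vector regardless of document length, which is what lets a 512-ctx backbone accumulate 8192 tokens of context.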
## Multi-Layer Extraction

Extracts from BERT layers (2, 5, 8, 11, 14, 17, 20, 23), spanning syntax → mid-semantic → deep-semantic → task output.
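The extraction plus learned fusion can be sketched as follows. A tiny randomly initialized 24-layer BERT stands in for `bert-large-uncased` so the sketch runs without downloads; `EXTRACT_LAYERS` matches the card, while the softmax-weighted fusion is an assumption about how the Layer Fusion component works.

```python
import torch
from transformers import BertConfig, BertModel

# Layers listed on the card for extraction (transformer blocks, 1-indexed).
EXTRACT_LAYERS = (2, 5, 8, 11, 14, 17, 20, 23)

# Tiny 24-layer stand-in for bert-large-uncased (hidden_size 64 vs. 1024).
config = BertConfig(vocab_size=1000, hidden_size=64, num_hidden_layers=24,
                    num_attention_heads=4, intermediate_size=128)
model = BertModel(config).eval()

input_ids = torch.randint(0, 1000, (1, 16))
with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)

# hidden_states[0] is the embedding output, so block i is hidden_states[i].
cls_per_layer = torch.stack([out.hidden_states[i][:, 0] for i in EXTRACT_LAYERS])

# Assumed Layer Fusion: a learned softmax-weighted sum over the 8 layers.
fusion_logits = torch.nn.Parameter(torch.zeros(len(EXTRACT_LAYERS)))
weights = torch.softmax(fusion_logits, 0)
fused_cls = torch.einsum("l,lbd->bd", weights, cls_per_layer)

# Concatenating the 8 CLS vectors yields the depth profile fed to the
# Depth Compressor (8 x 1024 = 8192-dim with the real model).
depth_profile = cls_per_layer.permute(1, 0, 2).reshape(input_ids.size(0), -1)
```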
## Procrustes Pre-Alignment

Static orthogonal Procrustes computed once before training:

- BERT → ModernBERT: cos 0.003 → 0.489
- BERT → Longformer: cos -0.001 → 0.521

Used to initialize the teacher projectors, which are then fine-tuned during training.
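A minimal sketch of the orthogonal Procrustes initialization, shrunk to 8 dimensions for speed (the real projectors are Linear(1024→1024)). The synthetic student/teacher pairing below is an assumption made so the recovery is exact; real CLS pairs would only align partially, as the cosine numbers above show.

```python
import torch

# Orthogonal Procrustes: the rotation R minimizing ||A R - B||_F comes from
# the SVD of the cross-covariance A^T B.
def procrustes(A, B):
    U, _, Vt = torch.linalg.svd(A.T @ B)
    return U @ Vt

torch.manual_seed(0)
R_true = procrustes(torch.randn(8, 8), torch.randn(8, 8))  # a random rotation
A = torch.randn(256, 8)          # stand-in for student CLS embeddings
B = A @ R_true                   # synthetic, perfectly alignable teacher CLS

R = procrustes(A, B)
proj = torch.nn.Linear(8, 8, bias=False)  # the "teacher projector"
with torch.no_grad():
    proj.weight.copy_(R.T)       # nn.Linear computes x @ W.T, so store R.T

cos = torch.nn.functional.cosine_similarity(proj(A), B).mean()
```

After this initialization the projector already maps student space onto teacher space, so fine-tuning starts from a meaningful alignment instead of random weights.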
## Training

- Data: WikiText-103
- Loss: InfoNCE distillation (student vs. teacher CLS) + pentachoron CV
- Teachers: ModernBERT-large (8192 ctx), Longformer-large (4096 ctx)
- Student: BERT-large (frozen) + geometric memory (trainable)
- Hardware: NVIDIA RTX PRO 6000 Blackwell (102 GB)
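The distillation term can be sketched as in-batch InfoNCE between student and teacher CLS vectors: each student embedding must pick out its own document's teacher embedding against the rest of the batch. The temperature (0.07) and batch construction are assumptions, not taken from this card.

```python
import torch
import torch.nn.functional as F

# Assumed in-batch InfoNCE distillation: diagonal pairs are positives,
# every other document in the batch is a negative.
def info_nce(student, teacher, tau=0.07):
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    logits = s @ t.T / tau                  # (B, B) cosine similarities
    targets = torch.arange(s.size(0))       # positive = matching index
    return F.cross_entropy(logits, targets)

torch.manual_seed(0)
teacher_cls = torch.randn(32, 1024)
student_cls = teacher_cls + 0.1 * torch.randn(32, 1024)  # nearly aligned
loss_aligned = info_nce(student_cls, teacher_cls)
loss_random = info_nce(torch.randn(32, 1024), teacher_cls)
```

An aligned student drives the loss toward zero, while an unaligned one sits near the uniform baseline, which is what makes the objective a usable training signal.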
## Geometric Regularization

Bank anchors are regularized via Cayley-Menger pentachoron volumes, with the coefficient of variation pushed toward a target of 0.20. This maintains uniform geometric structure in the anchor space, preventing collapse.
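The regularizer can be sketched as follows: compute Cayley-Menger volumes of sampled 5-anchor subsets (pentachora, i.e. 4-simplices) and penalize the deviation of their coefficient of variation from 0.20. The sampling scheme and exact loss form are assumptions; only the CV target comes from the card.

```python
import torch

# Cayley-Menger volume of a 4-simplex from its 5 vertices (rows of p).
def pentachoron_volume(p):
    d2 = torch.cdist(p, p) ** 2      # squared pairwise distances, (5, 5)
    cm = torch.ones(6, 6)            # bordered Cayley-Menger matrix
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2
    # V_n^2 = (-1)^(n+1) / (2^n (n!)^2) * det(CM), with n = 4.
    v2 = torch.det(cm) * (-1) ** 5 / (2 ** 4 * 24 ** 2)
    return torch.clamp(v2, min=0.0).sqrt()

torch.manual_seed(0)
anchors = torch.randn(128, 16)       # toy bank: 128 anchors, 16-dim for speed
idx = torch.stack([torch.randperm(128)[:5] for _ in range(64)])
vols = torch.stack([pentachoron_volume(anchors[i]) for i in idx])

cv = vols.std() / vols.mean()        # coefficient of variation of volumes
cv_loss = (cv - 0.20) ** 2           # pull CV toward the 0.20 target
```

Holding the volume spread near a fixed CV keeps the anchors spread through space with similar local geometry, which is the anti-collapse property the section describes.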
## Connection to GEOLIP-Bertenstein

This model applies the Bertenstein pattern to context length:

- Bertenstein: frozen modal experts teach a shared geometric space (cross-modal)
- GEOLIP-BERT-8192: frozen long-context experts teach a memory system (cross-context)

Both exploit the same insight: a frozen expert provides a stable reference frame that prevents geometric collapse during self-supervised training.
## Related Repos

- AbstractPhil/geolip-bertenstein: multi-modal geometric fusion
- AbstractPhil/procrustes-analysis: 17-model Procrustes profiling
- AbstractPhil/geolip-bertenstein-cache: expert embedding caches
## Usage

```python
from transformers import BertModel, BertTokenizer
from deep_bert_v3 import DeepBertV3, DeepBertV3Config
from safetensors.torch import load_file

config = DeepBertV3Config()
model = DeepBertV3(config)

# Load trained memory system weights
state = load_file("checkpoints/best/memory_system.safetensors")
model.load_state_dict(state, strict=False)

tokenizer = BertTokenizer.from_pretrained("google-bert/bert-large-uncased")
state = model.init_state(batch_size=1, device="cuda")

# Process the document segment by segment
for segment_text in document_segments:
    tokens = tokenizer(segment_text, return_tensors="pt", max_length=480,
                       padding="max_length", truncation=True).to("cuda")
    outputs, state = model(tokens["input_ids"], tokens["attention_mask"], state)

# The final output encodes full document context
document_embedding = outputs["memory_output"]  # (1, 1024)
```
## License

Apache 2.0