The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models
Abstract
Foundation models for biology and physics optimize predictive accuracy, but their internal representations systematically fail to preserve the continuous geometry of the systems they model. We identify the root cause: the Geometric Alignment Tax, an intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks. Controlled ablations on synthetic dynamical systems demonstrate that replacing cross-entropy with a continuous head on an identical encoder reduces geometric distortion by up to 8.5x, while learned codebooks exhibit a non-monotonic double bind where finer quantization worsens geometry despite improving reconstruction. Under continuous objectives, three architectures differ by 1.3x; under discrete tokenization, they diverge by 3,000x. Evaluating 14 biological foundation models with rate-distortion theory and MINE, we identify three failure regimes: Local-Global Decoupling, Representational Compression, and Geometric Vacuity. A controlled experiment confirms that Evo 2's reverse-complement robustness on real DNA reflects conserved sequence composition, not learned symmetry. No model achieves simultaneously low distortion, high mutual information, and global coherence.
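For context on the evaluation tooling: MINE (Mutual Information Neural Estimation) lower-bounds the mutual information used in this analysis via the Donsker–Varadhan representation of the KL divergence, optimized over a critic $T_\theta$:

```latex
I(X; Z) \;=\; D_{\mathrm{KL}}\!\left(P_{XZ} \,\|\, P_X \otimes P_Z\right)
\;\ge\; \sup_{\theta}\; \mathbb{E}_{P_{XZ}}\!\left[T_\theta(x, z)\right]
\;-\; \log \mathbb{E}_{P_X \otimes P_Z}\!\left[e^{T_\theta(x, z)}\right]
```

The bound tightens as the critic class grows; in practice $T_\theta$ is a small neural network trained on joint versus shuffled (marginal) sample pairs. This is the standard MINE construction, not a detail specific to this paper.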
Community
TL;DR: Biological and physical foundation models systematically fail to preserve the continuous geometry of the natural world. The root cause? Forcing continuous physical manifolds through discrete categorical bottlenecks (tokenization + cross-entropy loss).
Key Highlights Across 14 Models:
• The Causal Proof: Swapping a discrete cross-entropy head for a continuous MSE head on identical encoders reduces geometric distortion by up to 8.5x.
• Scaling Makes It Worse: Evaluating the ESM-2 suite shows that geometric stability declines from 8M to 3B parameters. The apparent "recovery" at 15B is actually coherent global manifold drift.
• The Evo 2 Illusion: A controlled ablation proves that the reverse-complement robustness of Evo 2 (7B) on real DNA is an artifact of sequence-level k-mer histogram matching, not learned biophysics.
• The Information-Theoretic Trap: Using MINE and rate-distortion theory, we prove models inevitably fall into one of three failure regimes: Local-Global Decoupling (Evo 2), Representational Compression (OpenFold), or Geometric Vacuity (ProtMamba).
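The distortion mechanism these highlights describe can be illustrated without any trained model: snapping points on a smooth manifold to a finite codebook warps pairwise distances, while a continuous (here, orthogonal linear) map preserves them. Below is a minimal NumPy sketch; the circle manifold, codebook size, and distortion metric are illustrative assumptions, not the paper's exact experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Points on a smooth 1-D manifold (a circle in the plane), standing in
# for the continuous state space of a dynamical system.
t = np.sort(rng.uniform(0.0, 2.0 * np.pi, 200))
X = np.stack([np.cos(t), np.sin(t)], axis=1)

def pairwise(Z):
    """Euclidean pairwise-distance matrix."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def distortion(Z):
    """Mean relative discrepancy between pairwise distances in the
    representation Z and in the original space X (off-diagonal entries)."""
    D_x, D_z = pairwise(X), pairwise(Z)
    off = ~np.eye(len(X), dtype=bool)
    return np.abs(D_z[off] - D_x[off]).mean() / D_x[off].mean()

# "Continuous head": an orthogonal map, which preserves distances exactly.
Q, _ = np.linalg.qr(rng.normal(size=(2, 2)))
continuous = X @ Q

# "Discrete bottleneck": snap every point to its nearest of K codebook
# vectors -- a crude stand-in for vector-quantized tokenization.
K = 16
codebook = X[rng.choice(len(X), size=K, replace=False)]
nearest = np.argmin(((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1), axis=1)
quantized = codebook[nearest]

print(f"continuous distortion: {distortion(continuous):.4f}")
print(f"quantized  distortion: {distortion(quantized):.4f}")
```

The continuous map incurs essentially zero distortion, while the quantized one does not, regardless of how well each codebook vector reconstructs its assigned points.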
The Takeaway: Scale is not all you need for AI for Science. To build true Scientific AGI, we must move beyond the discrete tokenization playbook and design architectures that natively preserve continuous physical invariants.
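The composition argument behind the Evo 2 finding can be sketched in a few lines of pure Python: real genomic DNA approximately obeys Chargaff's second parity rule (within one strand, #A ≈ #T and #G ≈ #C), so a sequence and its reverse complement have nearly identical k-mer histograms, and "reverse-complement robustness" need not imply any learned strand symmetry. The toy sequence below is constructed to satisfy the parity rule exactly; it is an illustration, not the paper's ablation.

```python
from collections import Counter

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(COMP)[::-1]

def kmer_hist(seq: str, k: int) -> Counter:
    """Histogram of overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Toy sequence obeying Chargaff's second parity rule exactly:
# 4 each of A, T, G, C on a single strand.
seq = "ATGCGCATTAGCGCTA"
rc = revcomp(seq)

# Under exact parity, the mononucleotide (k=1) histograms coincide,
# even though seq and rc are different strings.
h_fwd = kmer_hist(seq, 1)
h_rc = kmer_hist(rc, 1)
print(seq, "->", rc)
print(h_fwd == h_rc)
```

A representation that tracks only such composition statistics will look "robust" to reverse complementation on real DNA while encoding no strand symmetry at all.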
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LGQ: Learning Discretization Geometry for Scalable and Stable Image Tokenization (2026)
- Symmetry in language statistics shapes the geometry of model representations (2026)
- Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement (2026)
- Social-JEPA: Emergent Geometric Isomorphism (2026)
- The Geometric Price of Discrete Logic: Context-driven Manifold Dynamics of Number Representations (2026)
- Mitigating Premature Discretization with Progressive Quantization for Robust Vector Tokenization (2026)
- Relative Geometry of Neural Forecasters: Linking Accuracy and Alignment in Learned Latent Geometry (2026)