The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models
Abstract
Foundation models for biology and physics optimize predictive accuracy, but their internal representations systematically fail to preserve the continuous geometry of the systems they model. We identify the root cause: the Geometric Alignment Tax, an intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks. Controlled ablations on synthetic dynamical systems demonstrate that replacing cross-entropy with a continuous head on an identical encoder reduces geometric distortion by up to 8.5x, while learned codebooks exhibit a non-monotonic double bind where finer quantization worsens geometry despite improving reconstruction. Under continuous objectives, three architectures differ by 1.3x; under discrete tokenization, they diverge by 3,000x. Evaluating 14 biological foundation models with rate-distortion theory and MINE, we identify three failure regimes: Local-Global Decoupling, Representational Compression, and Geometric Vacuity. A controlled experiment confirms that Evo 2's reverse-complement robustness on real DNA reflects conserved sequence composition, not learned symmetry. No model achieves simultaneously low distortion, high mutual information, and global coherence.
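For context on the evaluation tooling: MINE (Mutual Information Neural Estimation) lower-bounds the mutual information used in this analysis via the Donsker–Varadhan representation of the KL divergence, optimized over a critic $T_\theta$:

```latex
I(X; Z) \;=\; D_{\mathrm{KL}}\!\left(P_{XZ} \,\|\, P_X \otimes P_Z\right)
\;\ge\; \sup_{\theta}\; \mathbb{E}_{P_{XZ}}\!\left[T_\theta(x, z)\right]
\;-\; \log \mathbb{E}_{P_X \otimes P_Z}\!\left[e^{T_\theta(x, z)}\right]
```

The bound tightens as the critic class grows; in practice $T_\theta$ is a small neural network trained on joint versus shuffled (marginal) sample pairs. This is the standard MINE construction, not a detail specific to this paper.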
Community
TL;DR: Biological and physical foundation models systematically fail to preserve the continuous geometry of the natural world. The root cause? Forcing continuous physical manifolds through discrete categorical bottlenecks (tokenization + cross-entropy loss).
Key Highlights Across 14 Models:
• The Causal Proof: Swapping a discrete cross-entropy head for a continuous MSE head on identical encoders reduces geometric distortion by up to 8.5x.
• Scaling Makes It Worse: Evaluating the ESM-2 suite shows that geometric stability declines from 8M to 3B parameters. The apparent "recovery" at 15B is actually coherent global manifold drift.
• The Evo 2 Illusion: A controlled ablation proves that the reverse-complement robustness of Evo 2 (7B) on real DNA is an artifact of sequence-level k-mer histogram matching, not learned biophysics.
• The Information-Theoretic Trap: Using MINE and rate-distortion theory, we prove models inevitably fall into one of three failure regimes: Local-Global Decoupling (Evo 2), Representational Compression (OpenFold), or Geometric Vacuity (ProtMamba).
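The distortion mechanism these highlights describe can be illustrated without any trained model: snapping points on a smooth manifold to a finite codebook warps pairwise distances, while a continuous (here, orthogonal linear) map preserves them. Below is a minimal NumPy sketch; the circle manifold, codebook size, and distortion metric are illustrative assumptions, not the paper's exact experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Points on a smooth 1-D manifold (a circle in the plane), standing in
# for the continuous state space of a dynamical system.
t = np.sort(rng.uniform(0.0, 2.0 * np.pi, 200))
X = np.stack([np.cos(t), np.sin(t)], axis=1)

def pairwise(Z):
    """Euclidean pairwise-distance matrix."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def distortion(Z):
    """Mean relative discrepancy between pairwise distances in the
    representation Z and in the original space X (off-diagonal entries)."""
    D_x, D_z = pairwise(X), pairwise(Z)
    off = ~np.eye(len(X), dtype=bool)
    return np.abs(D_z[off] - D_x[off]).mean() / D_x[off].mean()

# "Continuous head": an orthogonal map, which preserves distances exactly.
Q, _ = np.linalg.qr(rng.normal(size=(2, 2)))
continuous = X @ Q

# "Discrete bottleneck": snap every point to its nearest of K codebook
# vectors -- a crude stand-in for vector-quantized tokenization.
K = 16
codebook = X[rng.choice(len(X), size=K, replace=False)]
nearest = np.argmin(((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1), axis=1)
quantized = codebook[nearest]

print(f"continuous distortion: {distortion(continuous):.4f}")
print(f"quantized  distortion: {distortion(quantized):.4f}")
```

The continuous map incurs essentially zero distortion, while the quantized one does not, regardless of how well each codebook vector reconstructs its assigned points.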
The Takeaway: Scale is not all you need for AI for Science. To build true Scientific AGI, we must move beyond the discrete tokenization playbook and design architectures that natively preserve continuous physical invariants.
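The composition argument behind the Evo 2 finding can be sketched in a few lines of pure Python: real genomic DNA approximately obeys Chargaff's second parity rule (within one strand, #A ≈ #T and #G ≈ #C), so a sequence and its reverse complement have nearly identical k-mer histograms, and "reverse-complement robustness" need not imply any learned strand symmetry. The toy sequence below is constructed to satisfy the parity rule exactly; it is an illustration, not the paper's ablation.

```python
from collections import Counter

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(COMP)[::-1]

def kmer_hist(seq: str, k: int) -> Counter:
    """Histogram of overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Toy sequence obeying Chargaff's second parity rule exactly:
# 4 each of A, T, G, C on a single strand.
seq = "ATGCGCATTAGCGCTA"
rc = revcomp(seq)

# Under exact parity, the mononucleotide (k=1) histograms coincide,
# even though seq and rc are different strings.
h_fwd = kmer_hist(seq, 1)
h_rc = kmer_hist(rc, 1)
print(seq, "->", rc)
print(h_fwd == h_rc)
```

A representation that tracks only such composition statistics will look "robust" to reverse complementation on real DNA while encoding no strand symmetry at all.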
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LGQ: Learning Discretization Geometry for Scalable and Stable Image Tokenization (2026)
- Symmetry in language statistics shapes the geometry of model representations (2026)
- Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement (2026)
- Social-JEPA: Emergent Geometric Isomorphism (2026)
- The Geometric Price of Discrete Logic: Context-driven Manifold Dynamics of Number Representations (2026)
- Mitigating Premature Discretization with Progressive Quantization for Robust Vector Tokenization (2026)
- Relative Geometry of Neural Forecasters: Linking Accuracy and Alignment in Learned Latent Geometry (2026)