Buckets: meka1018 / 2 · 29.6 GB · 14 files · Updated 8 days ago


Name	Size	Uploaded	Xet hash
.gitattributes	1.52 kB xet	8 days ago	818ba6de
README.md	6.79 kB xet	8 days ago	045410e5
ae.safetensors	335 MB xet	8 days ago	f73eecf7
config.json	1.44 kB xet	8 days ago	03fe3025
ema.safetensors	29.2 GB xet	8 days ago	990ed24a
generation_config.json	243 Bytes xet	8 days ago	c6387bc6
llm_config.json	663 Bytes xet	8 days ago	7803f0cf
merges.txt	1.67 MB xet	8 days ago	c4f1ad7c
model.safetensors.index.json	123 kB xet	8 days ago	123aec89
preprocessor_config.json	392 Bytes xet	8 days ago	3c58fcab
tokenizer.json	7.03 MB xet	8 days ago	adad3893
tokenizer_config.json	7.31 kB xet	8 days ago	79edf5aa
vit_config.json	205 Bytes xet	8 days ago	ad11816e
vocab.json	2.78 MB xet	8 days ago	9208e1be
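
As a quick sanity check, the per-file sizes in the listing can be summed and compared with the displayed total. The sketch below assumes the page uses decimal units (1 kB = 1000 bytes), which is not stated anywhere in the listing; the helper names `parse_size` and `total_bytes` are illustrative only.

```python
# Sizes copied from the listing above. Decimal units (1 kB = 1000 bytes)
# are an assumption about how the page formats sizes.
UNITS = {"Bytes": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9}

LISTING = """\
.gitattributes\t1.52 kB
README.md\t6.79 kB
ae.safetensors\t335 MB
config.json\t1.44 kB
ema.safetensors\t29.2 GB
generation_config.json\t243 Bytes
llm_config.json\t663 Bytes
merges.txt\t1.67 MB
model.safetensors.index.json\t123 kB
preprocessor_config.json\t392 Bytes
tokenizer.json\t7.03 MB
tokenizer_config.json\t7.31 kB
vit_config.json\t205 Bytes
vocab.json\t2.78 MB
"""


def parse_size(text: str) -> float:
    """Convert a string such as '1.52 kB' into a number of bytes."""
    value, unit = text.split()
    return float(value) * UNITS[unit]


def total_bytes(listing: str) -> float:
    """Sum the size column of a tab-separated 'name<TAB>size' listing."""
    total = 0.0
    for line in listing.strip().splitlines():
        _name, size = line.split("\t")
        total += parse_size(size)
    return total


if __name__ == "__main__":
    print(f"{total_bytes(LISTING) / 1e9:.2f} GB")
```

The sum comes to roughly 29.55 GB. Because `ema.safetensors` is itself rounded to one decimal place, this agrees with the displayed 29.6 GB total to within rounding, and almost all of the total is that single file.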

README.md

🥯 BAGEL • Unified Model for Multimodal Understanding and Generation

We present BAGEL, an open‑source multimodal foundation model with 7B active parameters (14B total) trained on large‑scale interleaved multimodal data. BAGEL outperforms current top‑tier open‑source VLMs such as Qwen2.5-VL and InternVL-2.5 on standard multimodal understanding leaderboards, and delivers text‑to‑image quality competitive with strong specialist generators such as SD3. In classical image‑editing scenarios, its qualitative results are superior to those of the leading open-source models. More importantly, it extends to free-form visual manipulation, multiview synthesis, and world navigation, capabilities that constitute "world-modeling" tasks beyond the scope of previous image-editing models.

This repository hosts the model weights for BAGEL. For installation, usage instructions, and further documentation, please visit our GitHub repository.
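
Before downloading the large weight file, it can help to inspect `model.safetensors.index.json`. The sketch below assumes that file follows the usual Hugging Face sharded-checkpoint layout, that is, a top-level `weight_map` object mapping each tensor name to the file that stores it. The actual contents of this repository's index are not shown on this page, and the function names are illustrative.

```python
import json
from collections import Counter
from pathlib import Path


def load_index(path: str) -> dict:
    """Read a checkpoint index file from disk and return the parsed JSON."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def tensors_per_file(index: dict) -> Counter:
    """Count how many tensors the index assigns to each weight file.

    Assumes the conventional layout: index["weight_map"] maps
    tensor name -> file name.
    """
    return Counter(index["weight_map"].values())


# Example with a made-up index; the tensor names are placeholders,
# not BAGEL's real parameter names.
example_index = {
    "metadata": {"total_size": 12},
    "weight_map": {
        "layer.0.weight": "ema.safetensors",
        "layer.0.bias": "ema.safetensors",
        "layer.1.weight": "ema.safetensors",
    },
}
counts = tensors_per_file(example_index)
```

With a local copy of the repository, `tensors_per_file(load_index("model.safetensors.index.json"))` would show which weight files the index expects to find.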

🧠 Method

BAGEL adopts a Mixture-of-Transformer-Experts (MoT) architecture to maximize the model’s capacity to learn from richly diverse multimodal information. Following the same principle of capacity maximization, it utilizes two separate encoders to capture pixel-level and semantic-level features of an image. The overall framework follows a Next Group of Token Prediction paradigm, where the model is trained to predict the next group of language or visual tokens as a compression target.
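
The paragraph above does not spell out how MoT assigns tokens to experts. One common reading of a mixture-of-transformers design, assumed here purely for illustration, is that each token is sent to exactly one expert chosen by its type rather than by a learned gate. The toy sketch below shows that routing idea and why "active" parameters can be half of the total. The expert names, the token-type mapping, and the 7B/7B split are all assumptions chosen to mirror the figures quoted in the introduction, not details confirmed by this README.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Expert:
    name: str
    n_params: int                   # toy parameter count (assumed)
    fn: Callable[[float], float]    # stand-in for a transformer block


# Hypothetical experts: two equally sized stacks.
EXPERTS: Dict[str, Expert] = {
    "understanding": Expert("understanding", 7_000_000_000, lambda x: 2.0 * x),
    "generation": Expert("generation", 7_000_000_000, lambda x: x + 1.0),
}

# Assumed hard routing by token type (no learned gate).
ROUTE: Dict[str, str] = {
    "text": "understanding",
    "vit": "understanding",   # semantic-level image features
    "vae": "generation",      # pixel-level image latents
}


def mot_layer(tokens: List[Tuple[str, float]]) -> List[float]:
    """Apply to each (kind, value) token the single expert its kind maps to."""
    return [EXPERTS[ROUTE[kind]].fn(value) for kind, value in tokens]


def active_params(kind: str) -> int:
    """Parameters touched when processing one token of the given kind."""
    return EXPERTS[ROUTE[kind]].n_params


def total_params() -> int:
    """Parameters stored across all experts."""
    return sum(expert.n_params for expert in EXPERTS.values())
```

Under these assumptions every token runs through 7B parameters while the checkpoint stores 14B, which is one way the "7B active (14B total)" description could arise.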

BAGEL scales MoT’s capacity through Pre-training, Continued Training, and Supervised Finetuning on trillions of interleaved multimodal tokens spanning language, image, video, and web data. It surpasses open models on standard understanding and generation benchmarks and demonstrates advanced in-context multimodal abilities like free-form image editing, future frame prediction, 3D manipulation, world navigation, and sequential reasoning.

🌱 Emerging Properties

As we scale up BAGEL’s pretraining with more multimodal tokens, we observe consistent performance gains across understanding, generation, and editing tasks. Different capabilities emerge at distinct training stages: multimodal understanding and generation appear early, followed by basic editing, while complex, intelligent editing emerges later. This staged progression suggests an emergent pattern in which advanced multimodal reasoning builds on well-formed foundational skills. Ablation studies further show that combining VAE and ViT features significantly improves intelligent editing, which underscores the importance of visual-semantic context for complex multimodal reasoning and supports its role in the emergence of advanced capabilities.

📊 Benchmarks

1. Visual Understanding

Model	MME ↑	MMBench ↑	MMMU ↑	MM-Vet ↑	MathVista ↑
Janus-Pro-7B	–	79.2	41.0	50.0	–
Qwen2.5-VL-7B	2347	83.5	58.6	67.1	68.2
BAGEL	2388	85.0	55.3	67.2	73.1
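
To read the table column by column, the short sketch below parses the rows and reports the best-scoring model per benchmark, treating a dash as a missing score. The numbers are copied from the table above; the function names are illustrative.

```python
TABLE = """\
Model\tMME\tMMBench\tMMMU\tMM-Vet\tMathVista
Janus-Pro-7B\t-\t79.2\t41.0\t50.0\t-
Qwen2.5-VL-7B\t2347\t83.5\t58.6\t67.1\t68.2
BAGEL\t2388\t85.0\t55.3\t67.2\t73.1
"""

MISSING = {"-", "\u2013"}  # hyphen or en dash marks an unreported score


def parse(table: str) -> dict:
    """Return {model: {benchmark: score or None}} from a tab-separated table."""
    header, *rows = table.strip().splitlines()
    benchmarks = header.split("\t")[1:]
    scores = {}
    for row in rows:
        model, *cells = row.split("\t")
        scores[model] = {
            bench: (None if cell in MISSING else float(cell))
            for bench, cell in zip(benchmarks, cells)
        }
    return scores


def best_per_benchmark(scores: dict) -> dict:
    """Return {benchmark: model with the highest reported score}."""
    best = {}
    for model, row in scores.items():
        for bench, value in row.items():
            if value is None:
                continue
            if bench not in best or value > scores[best[bench]][bench]:
                best[bench] = model
    return best


winners = best_per_benchmark(parse(TABLE))
```

On these figures BAGEL has the highest score on MME, MMBench, MM-Vet, and MathVista, while Qwen2.5-VL-7B leads on MMMU (58.6 versus 55.3).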

2. Text-to-Image Generation · GenEval

Model	Overall ↑
FLUX.1-dev	0.82
SD3-Medium	0.74
Janus-Pro-7B	0.80
BAGEL	0.88

3. Image Editing

Model	GEdit-Bench-EN (SC) ↑	GEdit-Bench-EN (PQ) ↑	GEdit-Bench-EN (O) ↑	IntelligentBench ↑
Step1X-Edit	7.09	6.76	6.70	14.9
Gemini-2-exp.	6.73	6.61	6.32	57.6
BAGEL	7.36	6.83	6.52	44.0
BAGEL+CoT	–	–	–	55.3

License

BAGEL is licensed under the Apache 2.0 license. It is finetuned from the Qwen2.5-7B-Instruct and siglip-so400m-14-384-flash-attn2 models and uses the FLUX.1-schnell VAE model, all of which are released under Apache 2.0.

✍️ Citation

@article{deng2025bagel,
  title   = {Emerging Properties in Unified Multimodal Pretraining},
  author  = {Deng, Chaorui and Zhu, Deyao and Li, Kunchang and Gou, Chenhui and Li, Feng and Wang, Zeyu and Zhong, Shu and Yu, Weihao and Nie, Xiaonan and Song, Ziang and Shi, Guang and Fan, Haoqi},
  journal = {arXiv preprint arXiv:2505.14683},
  year    = {2025}
}


Last updated: May 26
