Paper: Gemma: Open Models Based on Gemini Research and Technology (arXiv:2403.08295)

A multimodal vision-language model that combines the Google Gemma-3-270M language model with a CLIP vision encoder, trained on the full LLaVA-Instruct-150K dataset.
Training configuration, benchmark results, and architecture details for our trained model are summarized below; example predictions from real inference runs are provided in `inference_results/`.

| Parameter | Value |
|---|---|
| Training Samples | 157,712 (Full LLaVA dataset) |
| Epochs | 3 |
| Final Training Loss | 1.333 |
| Final Validation Loss | 1.430 |
| Total Parameters | 539M |
| Trainable Parameters | 18.6M (3.4%) |
| GPU | NVIDIA A100 40GB |
| Training Time | ~9 hours |
| Batch Size | 20 (effective: 40) |
| Precision | bf16-mixed |
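The `bf16-mixed` precision string and the `.ckpt` checkpoint format suggest PyTorch Lightning. The sketch below maps the settings above onto a Lightning trainer; the Lightning assumption, the single-GPU setup, and the gradient-accumulation factor of 2 are inferred, not confirmed by this repository.

```python
import pytorch_lightning as pl

# Sketch only: effective batch size 40 = per-device batch size 20 x accumulate_grad_batches=2
# (the accumulation factor is inferred from the table, not stated explicitly).
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,                   # NVIDIA A100 40GB
    precision="bf16-mixed",      # mixed bfloat16 precision
    max_epochs=3,
    accumulate_grad_batches=2,
)
# trainer.fit(lit_module, train_loader, val_loader)  # placeholder names
```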

| Benchmark | Score |
|---|---|
| Basic VQA | 53.8% (7/13 correct) |
| POPE Hallucination | 20.0% |
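POPE probes hallucination with yes/no questions about whether a given object appears in the image; the reported number is the share of answers matching the ground truth. A minimal scoring function over such predictions is sketched below; the answer parsing is a simplification and this is not the project's actual evaluation code.

```python
def pope_accuracy(predictions, labels):
    """Fraction of generated yes/no answers matching the ground-truth labels."""
    correct = 0
    for pred, label in zip(predictions, labels):
        # Normalize the generated answer to its first word, e.g. "Yes, it is." -> "yes".
        words = pred.strip().lower().split()
        answer = words[0].rstrip(".,") if words else ""
        correct += int(answer == label.strip().lower())
    return correct / len(labels) if labels else 0.0
```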

| Component | Details |
|---|---|
| Language Model | Google Gemma-3-270M with LoRA adapters |
| Vision Encoder | OpenAI CLIP ViT-Large/14 (frozen, 428M params) |
| Vision Projector | MLP (3.4M params) |
| LoRA | r=16, alpha=32, dropout=0.1 |
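A rough sketch of how these pieces fit together is shown below, using Hugging Face `transformers` and `peft`. It is an illustrative reconstruction, not the exact code in `src/models/multimodal_gemma.py`; in particular, the LoRA `target_modules` and the projector layout are assumptions.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel
from peft import LoraConfig, get_peft_model

# Frozen CLIP ViT-Large/14 vision encoder.
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
vision_encoder.requires_grad_(False)

# Gemma-3-270M language model; only LoRA adapters (r=16, alpha=32, dropout=0.1) are trained.
language_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")
hidden_size = language_model.config.hidden_size

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed; not stated above
)
language_model = get_peft_model(language_model, lora_config)

# MLP projector mapping CLIP image features into the Gemma embedding space.
projector = nn.Sequential(
    nn.Linear(vision_encoder.config.hidden_size, hidden_size),
    nn.GELU(),
    nn.Linear(hidden_size, hidden_size),
)
```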
To run inference with the released checkpoint:

```python
from PIL import Image
import torch

from src.models.multimodal_gemma import MultimodalGemma

# Load model (`config` is the same model configuration used during training)
model = MultimodalGemma(config)
checkpoint = torch.load("final_model.ckpt", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])
model.eval()

# Inference
image = Image.open("your_image.jpg").convert("RGB")
prompt = "What do you see in this image?"
with torch.no_grad():
    response = model.generate(image, prompt)
print(response)
```

| File | Size | Description |
|---|---|---|
| `final_model.ckpt` | 1.2GB | Full model checkpoint |
| `inference_results/` | 13.8MB | Example predictions with images |
License: Apache 2.0