Paper: Gemma: Open Models Based on Gemini Research and Technology (arXiv:2403.08295)

A multimodal vision-language model that combines the Google Gemma-3-270M language model with a CLIP vision encoder, trained on the full LLaVA-Instruct-150K dataset.
Training configuration, benchmark results, and architecture details for our trained model are summarized below; example predictions from real inference runs are provided in `inference_results/`.

| Parameter | Value |
|---|---|
| Training Samples | 157,712 (Full LLaVA dataset) |
| Epochs | 3 |
| Final Training Loss | 1.333 |
| Final Validation Loss | 1.430 |
| Total Parameters | 539M |
| Trainable Parameters | 18.6M (3.4%) |
| GPU | NVIDIA A100 40GB |
| Training Time | ~9 hours |
| Batch Size | 20 (effective: 40) |
| Precision | bf16-mixed |
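The `bf16-mixed` precision string and the `.ckpt` checkpoint format suggest PyTorch Lightning. The sketch below maps the settings above onto a Lightning trainer; the Lightning assumption, the single-GPU setup, and the gradient-accumulation factor of 2 are inferred, not confirmed by this repository.

```python
import pytorch_lightning as pl

# Sketch only: effective batch size 40 = per-device batch size 20 x accumulate_grad_batches=2
# (the accumulation factor is inferred from the table, not stated explicitly).
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,                   # NVIDIA A100 40GB
    precision="bf16-mixed",      # mixed bfloat16 precision
    max_epochs=3,
    accumulate_grad_batches=2,
)
# trainer.fit(lit_module, train_loader, val_loader)  # placeholder names
```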

| Benchmark | Score |
|---|---|
| Basic VQA | 53.8% (7/13 correct) |
| POPE Hallucination | 20.0% |
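POPE probes hallucination with yes/no questions about whether a given object appears in the image; the reported number is the share of answers matching the ground truth. A minimal scoring function over such predictions is sketched below; the answer parsing is a simplification and this is not the project's actual evaluation code.

```python
def pope_accuracy(predictions, labels):
    """Fraction of generated yes/no answers matching the ground-truth labels."""
    correct = 0
    for pred, label in zip(predictions, labels):
        # Normalize the generated answer to its first word, e.g. "Yes, it is." -> "yes".
        words = pred.strip().lower().split()
        answer = words[0].rstrip(".,") if words else ""
        correct += int(answer == label.strip().lower())
    return correct / len(labels) if labels else 0.0
```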

| Component | Details |
|---|---|
| Language Model | Google Gemma-3-270M with LoRA adapters |
| Vision Encoder | OpenAI CLIP ViT-Large/14 (frozen, 428M params) |
| Vision Projector | MLP (3.4M params) |
| LoRA | r=16, alpha=32, dropout=0.1 |
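A rough sketch of how these pieces fit together is shown below, using Hugging Face `transformers` and `peft`. It is an illustrative reconstruction, not the exact code in `src/models/multimodal_gemma.py`; in particular, the LoRA `target_modules` and the projector layout are assumptions.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel
from peft import LoraConfig, get_peft_model

# Frozen CLIP ViT-Large/14 vision encoder.
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
vision_encoder.requires_grad_(False)

# Gemma-3-270M language model; only LoRA adapters (r=16, alpha=32, dropout=0.1) are trained.
language_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")
hidden_size = language_model.config.hidden_size

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed; not stated above
)
language_model = get_peft_model(language_model, lora_config)

# MLP projector mapping CLIP image features into the Gemma embedding space.
projector = nn.Sequential(
    nn.Linear(vision_encoder.config.hidden_size, hidden_size),
    nn.GELU(),
    nn.Linear(hidden_size, hidden_size),
)
```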
To run inference with the released checkpoint:

```python
from PIL import Image
import torch

from src.models.multimodal_gemma import MultimodalGemma

# Load model (`config` is the same model configuration used during training)
model = MultimodalGemma(config)
checkpoint = torch.load("final_model.ckpt", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])
model.eval()

# Inference
image = Image.open("your_image.jpg").convert("RGB")
prompt = "What do you see in this image?"
with torch.no_grad():
    response = model.generate(image, prompt)
print(response)
```

| File | Size | Description |
|---|---|---|
| `final_model.ckpt` | 1.2GB | Full model checkpoint |
| `inference_results/` | 13.8MB | Example predictions with images |
License: Apache 2.0