adarshzolekar's Collections
Multimodal AI Models
Updated Jan 23
Purpose: Models that understand text, images, and audio together.
llava-hf/llava-1.5-7b-hf • Image-Text-to-Text • 7B • Updated Jun 6, 2025 • 3.53M downloads • 361 likes
Salesforce/blip-image-captioning-base • Image-to-Text • Updated Feb 3, 2025 • 2.55M downloads • 856 likes
google/pix2struct-base • Image-to-Text • 0.3B • Updated Dec 24, 2023 • 3.73k downloads • 79 likes
microsoft/kosmos-2-patch14-224 • Image-to-Text • 2B • Updated Nov 28, 2023 • 169k downloads • 184 likes
openbmb/MiniCPM-V-4_5 • Image-Text-to-Text • 9B • Updated Mar 10 • 130k downloads • 1.09k likes