Instructions to use litert-community/InternVL3-2B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- LiteRT-LM
How to use litert-community/InternVL3-2B with LiteRT-LM:
# LiteRT-LM runs on various platforms (Android, iOS, Windows, Linux, macOS, IoT, Web/WASM) # and supports many APIs (C++, Python, Kotlin, Swift, JavaScript, Flutter). # For platform-specific integration guides, please refer to the official developer website: # https://ai.google.dev/edge/litert-lm # To try LiteRT-LM, the easiest way is to use our CLI tool. # 1. Install the LiteRT-LM CLI tool: pip install litert-lm # 2. Download and run this model locally: # See: https://ai.google.dev/edge/litert-lm/cli litert-lm run \ --from-huggingface-repo=litert-community/InternVL3-2B \ model.litertlm \ --prompt="Write me a poem"
- LiteRT
How to use litert-community/InternVL3-2B with LiteRT:
# No code snippets available yet for this library. # To use this model, check the repository files and the library's documentation. # Want to help? PRs adding snippets are welcome at: # https://github.com/huggingface/huggingface.js
- Notebooks
- Google Colab
- Kaggle
InternVL3-2B β LiteRT-LM (on-device Vision-Language Model)
OpenGVLab/InternVL3-2B converted to the
LiteRT-LM (.litertlm) format for on-device image+text inference with Google's
LiteRT-LM runtime (the engine behind the official
litert-community/* models, and the same runtime that runs litert-community/FastVLM-0.5B).
InternVL3-2B is a compact vision-language model: an InternViT vision encoder + pixel-shuffle +
MLP projector feeding a Qwen2.5-1.5B language decoder. This bundle runs it through LiteRT-LM's
fast_vlm multimodal path β give it an image and a question, get a grounded answer, fully on-device.
| File | InternVL3-2B.litertlm (~1.43 GB) |
| Vision | InternViT-300M encoder + pixel-shuffle + MLP projector, int8 weights β single 448Γ448 image β 256 image tokens |
| Decoder | Qwen2.5-1.5B, int4 weights (symmetric, blockwise-32 + OCTAV optimal-clipping); input embedding INT8 (externalized section) |
| Compute | integer |
| Context (KV cache) | 2048 |
| Image input | resized to 448Γ448 (ImageNet normalization is baked into the vision encoder) |
| Base model | OpenGVLab/InternVL3-2B |
Performance (iPhone 17 Pro, CPU)
| Load | ~3β4 s |
| Decode | ~20 tok/s |
| Multi-turn text | works (ask follow-up questions about the same image) |
The image is described accurately and in detail. The vision tower converts bit-faithfully to the reference (float CPU-parity corr β 1.0); int8 vision weights keep grounding quality.
β οΈ Known limitation β one image per conversation on the GPU backend
Single-image VQA β the primary use case β works great on GPU (~45 tok/s on iPhone 17 Pro). But on the GPU (Metal) backend, a second image in the same conversation truncates the answer β ask about one image per chat (start a new conversation for a different image).
This is GPU-delegate-specific, not a model/bundle issue: on the CPU backend, multi-image works
perfectly (verified). The same GPU truncation reproduces with Apple's litert-community/FastVLM-0.5B,
so it is general to the runtime's GPU fast_vlm path, not specific to this model. (Ruled out as causes:
max_num_images β CPU works with it set to 1; and the vision encoder's 5D reshape β a 4D-clean rebuild
still truncates on GPU.) For reliable multi-image, run on the CPU backend.
Run on iPhone / macOS
Use the LiteRT-LM Swift runtime (swift-litert-lm /
the LiteRTDemo sample). Load InternVL3-2B.litertlm with the image (vision) tower enabled
(modalities [.vision]), attach a photo, and ask a question.
Note for app integrators: this is a vision-only bundle (no audio tower). Bring up the engine with the vision modality only (
Modality.textImage/[.vision]) β requesting the audio tower (.all) on a bundle with no audio section fails at session creation.
Run on Android β Google AI Edge Gallery
Update (July 2026): Google AI Edge Gallery v1.0.16+ can import litert-lm models directly from Hugging Face inside the app (tap +) β no computer or
adbneeded. The manual steps below are only required on older builds or for sideloading a local file.
Run this model with image input in the official Google AI Edge Gallery app β no custom app needed (the bundle carries the tokenizer, chat template, and image preprocessing config):
- Push the bundle onto the phone (or download it there directly from this repo):
adb push InternVL3-2B.litertlm /sdcard/Download/ - Open the Gallery app, tap the + icon (bottom-right) and pick
InternVL3-2B.litertlmin the file picker. - In the Import Model dialog, check "Support image" (required for image input), pick GPU (fast) or CPU, then tap Import.
- Open the Ask Image task, choose the imported model, attach a photo, and ask.
Tip: on the GPU backend use one image per conversation (a known GPU-delegate trait of
fast_vlmmodels); pick CPU if you want multiple images in one chat.
Run on desktop (LiteRT-LM CLI)
The same .litertlm bundle runs on macOS / Linux / Windows with the official
LiteRT-LM CLI β including as a
local OpenAI-compatible API server:
pip install litert-lm
litert-lm import --from-huggingface-repo litert-community/InternVL3-2B InternVL3-2B.litertlm internvl3-2b
litert-lm run internvl3-2b # interactive chat in the terminal
litert-lm serve # local OpenAI-compatible API server
Conversion notes
- LiteRT-LM
fast_vlmbundle: VISION_ENCODER ([1,448,448,3]β[1,256,4096]) + VISION_ADAPTER ([1,256,4096]β[1,256,1536]) + single-token EMBEDDER + PREFILL_DECODE (embeddings-input). - The vision encoder bakes InternVL's ImageNet normalization and the NCHW transpose into the graph
(the runtime feeds a
[0,1]NHWC image). - The InternViT attention is rewritten to be 4D-clean (qkv split before the head reshape, avoiding
the 5D
reshape(B,N,3,H,d)that GPU delegates reject) β numerically identical (corr β 1.0), but it keeps the vision encoder almost entirely on the GPU delegate. - Decoder exported with externalized embedder; InternVL's dynamic-NTK
rope_scalingis stripped to base RoPE (valid since the export cache β€ the base context window).
License
MIT (the InternVL model) + Apache-2.0 (the Qwen2.5 language component). See the base model card. Converted artifacts are released under the same terms.
- Downloads last month
- 30
Model tree for litert-community/InternVL3-2B
Base model
OpenGVLab/InternVL3-2B-Pretrained