---
datasets:
- Subh775/Traffic-Perception-VL
language:
- en
base_model:
- vikhyatk/moondream2
pipeline_tag: image-text-to-text
library_name: transformers
license: apache-2.0
tags:
- moondream2
- VLM
---

# Perception-moondream2

**Perception-moondream2** is a specialized Vision-Language Model (VLM) fine-tuned for dense urban traffic scene understanding. Built on the compact `moondream2` architecture, it analyzes CCTV and traffic camera feeds and generates detailed textual descriptions of traffic conditions.

## Model Details

- **Base Model:** [vikhyatk/moondream2](https://huggingface.co/vikhyatk/moondream2) (Revision: 2024-08-26)
- **Architecture:** Vision Encoder + Phi-1.5 Text Decoder
- **Task:** Dense Image Captioning & Visual Question Answering (VQA)
- **Language:** English

## Training Data

The model was fine-tuned on the [Subh775/Traffic-Perception-VL](https://huggingface.co/datasets/Subh775/Traffic-Perception-VL) dataset, which consists of complex, real-world urban traffic scenes (such as bustling streets in Bengaluru, India). Training focused on teaching the model to perceive and describe:

- **Vehicle Types & Colors:** Identifying auto-rickshaws, scooters, motorcycles, and cars.
- **Traffic Density & Flow:** Estimating congestion levels and movement.
- **Pedestrian Activity:** Tracking people walking on sidewalks or crossing streets.
- **Infrastructure:** Recognizing road layouts, lanes, shops, signage, and greenery.

## Intended Use Cases

- **Smart City Analytics:** Automated monitoring of CCTV feeds to detect congestion or accidents.
- **Traffic Management:** Generating real-time text logs of intersection activity.
- **Autonomous Driving Context:** Providing dense contextual descriptions for self-driving datasets.

## Usage

Because this model relies on the custom Moondream2 architecture, you need to pass `trust_remote_code=True` when loading it with the `transformers` library.

### Prerequisites

Make sure you have the required libraries installed (prefix the command with `!` when running it inside a notebook cell):

```bash
pip install transformers==4.44.2 "huggingface_hub<1.0" accelerate pillow einops
```

### Load Tokenizer & Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "Subh775/Perception-moondream2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # required for the custom Moondream2 model code
    torch_dtype=torch.float16,
    # device_map="auto" is not used here; the model is moved to the GPU explicitly below
)

# Move the model to the GPU and switch to inference mode
model = model.to("cuda")
model.eval()
```

### Inference

```python
image_path = "path_to_image"
image = Image.open(image_path).convert("RGB")

# Encode the image once; the encoding is passed to answer_question below
enc_image = model.encode_image(image)

# Give the model explicit instructions and explicitly ban the geographic bias.
prompt = (
    "Describe this traffic scene in detail. Focus strictly on the vehicles, "
    "pedestrians, infrastructure, and traffic density. Do not mention Bengaluru, "
    "India, or any specific geographic locations."
)

answer = model.answer_question(enc_image, prompt, tokenizer)

# Strip any location phrases that still appear in the output
banned_phrases = ["in Bengaluru, India", "in Bengaluru", "Bengaluru, India,", "Bengaluru,"]
for banned in banned_phrases:
    answer = answer.replace(banned, "")

print(answer.strip())
```
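
The plain `str.replace` loop above can leave double spaces or a stray space before punctuation, and it misses case variants such as "bengaluru". A minimal sketch of a stricter clean-up is shown below. It assumes the same location names as the banned phrases; `strip_location` is a hypothetical helper written for this example, not part of the model or of `transformers`.

```python
import re


def strip_location(text, places=("Bengaluru, India", "Bengaluru")):
    """Remove location mentions from a caption (hypothetical helper, not a model API)."""
    # Handle longer place names first so "Bengaluru, India" is removed as a unit.
    for place in sorted(places, key=len, reverse=True):
        pattern = re.compile(
            r"\s*\b(?:in\s+)?" + re.escape(place) + r"\b,?",
            re.IGNORECASE,
        )
        text = pattern.sub("", text)
    # Collapse repeated whitespace left behind by the removals.
    text = re.sub(r"\s{2,}", " ", text)
    # Drop any space that now sits directly before punctuation.
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return text.strip()


caption = "A busy street in Bengaluru, India, with several auto-rickshaws."
print(strip_location(caption))
# prints: A busy street with several auto-rickshaws.
```

In the inference snippet, `answer = strip_location(answer)` could stand in for the replace loop. This is still a text-level filter: it only removes the names it is given, and a sentence that depends on the removed phrase may read awkwardly afterwards.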