Model Card for qwen2-5-3b-vl-websight-plus-lora
Model Details
Model Description
This model is a Low-Rank Adaptation (LoRA) fine-tune of the Qwen/Qwen2.5-VL-3B-Instruct base model. It is specifically optimized for visual web development tasks, translating visual inputs such as hand-drawn wireframes, low-fidelity sketches, and digital web screenshots directly into structural HTML and Tailwind CSS code.
The "plus" in the model designation indicates a composite training methodology utilizing both the standard WebSight dataset and the custom Sketch2Code dataset, enhancing the model's spatial reasoning and abstract UI interpretation capabilities.
- Developed by: megistone (Mykyta Kyselov, https://orcid.org/0009-0005-3686-3419)
- Model type: Vision-Language Model (VLM) with PEFT/LoRA adapter
- Language(s) (NLP): English (Focus on HTML, CSS, and structural web markup)
- License: Apache 2.0 (Inherited from standard Qwen2.5 terms, subject to base model constraints)
- Finetuned from model: Qwen/Qwen2.5-VL-3B-Instruct
Model Sources
- Repository: https://huggingface.co/megistone/qwen2-5-3b-vl-websight-plus-lora
- Base Model Repository: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct
Uses
Direct Use
The primary function of this model is image-to-code generation. Users can input images of UI designs (sketches, wireframes, or rendered screenshots) accompanied by a text prompt (e.g., "Convert this wireframe to clean HTML/CSS code"), and the model will output corresponding semantic HTML5 and Tailwind CSS markup.
Downstream Use
Integration into automated prototyping workflows, design-to-code pipelines, or as a backend engine for UI/UX drafting applications where rapid iteration from whiteboard to functional prototype is required.
Out-of-Scope Use
This model is not intended for general-purpose conversational tasks, complex logic implementation (JavaScript/backend programming), or generating non-UI-related code. It should not be deployed in automated production pipelines without human-in-the-loop review, as generative AI code may contain rendering hallucinations or accessibility oversights.
Bias, Risks, and Limitations
While fine-tuned on diverse UI layouts, the model may exhibit biases toward common web design patterns present in the WebSight and Sketch2Code datasets. It may struggle with highly unconventional, abstract, or text-heavy layouts where optical character recognition (OCR) ambiguity exists. Generated code may not consistently adhere to strict WCAG accessibility guidelines without manual intervention.
Recommendations
Users are advised to manually review, debug, and validate all generated HTML/CSS code before deploying it in production environments. Output should be treated as a high-fidelity prototype rather than a final production artifact.
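As a lightweight complement to manual review, generated markup can be passed through an automated sanity check before a human looks at it. The sketch below uses only Python's standard-library html.parser to flag mismatched tags; the names `TagBalanceChecker` and `check_generated_html` are illustrative, and a production pipeline would likely use a full HTML validator instead.

```python
from html.parser import HTMLParser

# HTML void elements that never take a closing tag
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class TagBalanceChecker(HTMLParser):
    """Collects unmatched open/close tags as a cheap pre-review sanity check."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.errors.append(f"unexpected </{tag}>")

def check_generated_html(html: str) -> list:
    """Return a list of tag-balance problems; empty list means no issues found."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    return checker.errors + [f"unclosed <{t}>" for t in checker.stack]
```

For example, `check_generated_html('<div><p>hi</div>')` reports both the unexpected `</div>` and the unclosed tags left on the stack, while well-formed output yields an empty list. This catches only structural breakage, not the accessibility oversights mentioned above.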
How to Get Started with the Model
Use the following Python code to initialize the base model, load the LoRA weights, and run inference with the transformers and peft libraries. The example also uses the qwen_vl_utils helper package from the Qwen2.5-VL documentation to prepare image inputs.

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel
from qwen_vl_utils import process_vision_info

# Define model paths
base_model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
adapter_id = "megistone/qwen2-5-3b-vl-websight-plus-lora"

# Load processor and base model
processor = AutoProcessor.from_pretrained(base_model_id)
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Load the LoRA adapter on top of the frozen base weights
model = PeftModel.from_pretrained(base_model, adapter_id)

# Build a chat message containing the sketch and the instruction
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path_to_your_sketch.png"},
            {"type": "text", "text": "Convert this wireframe to clean HTML/CSS code"},
        ],
    }
]

# Apply the chat template, prepare inputs, and generate the markup
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(output)
Training Details
Training Data
The model was fine-tuned on a composite dataset strategy denoted as "websight-plus":
- WebSight Dataset: Utilized for foundational understanding of modern web layouts, syntax formatting, and high-fidelity screenshot-to-code translation.
- Sketch2Code Dataset: A dataset comprising hand-drawn wireframes paired with HTML and Tailwind CSS code, authored by Mykyta Kyselov and Illia Denysenko (2026). This dataset introduces imperfect human geometries, forcing the model to map abstract sketches to rigid structural code. The dataset is accessible via Kaggle and formally archived on Zenodo (DOI: 10.5281/zenodo.19741715).
Training Procedure
The training utilized Low-Rank Adaptation (LoRA) to efficiently update attention mechanisms without modifying the full 3 billion parameter count of the base model. Safetensors format was used for weight serialization.
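The parameterization behind this can be sketched in a few lines: a frozen weight matrix W is augmented by a low-rank product B @ A scaled by alpha / r, and only A and B are trained. The toy example below (pure Python, hypothetical 2x2 weight and rank-1 adapter) illustrates the math only; the actual rank, alpha, and target modules for this adapter live in its adapter_config.json and are not restated here.

```python
# Toy illustration of the LoRA update. W is (d_out x d_in) and frozen;
# A is (r x d_in), B is (d_out x r), and only A, B are trained, so the
# trainable parameter count drops from d_out * d_in to r * (d_out + d_in).

def matmul(X, Y):
    """Plain nested-list matrix multiplication."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_effective_weight(W, A, B, alpha):
    """Effective weight after merging: W + (alpha / r) * (B @ A)."""
    r = len(A)  # LoRA rank = number of rows in A
    delta = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j]
             for j in range(len(W[0]))] for i in range(len(W))]

# 2x2 frozen weight, rank-1 adapter (all values illustrative)
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]             # r = 1, d_in = 2
B = [[0.5], [0.25]]          # d_out = 2, r = 1
W_eff = lora_effective_weight(W, A, B, alpha=2.0)
```

Because the update is additive, the adapter can either be kept separate (as in the safetensors file shipped here) or merged into the base weights for deployment.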
Training Hyperparameters
- Training regime: Mixed precision (bfloat16)
- Architecture Updates: PEFT (Parameter-Efficient Fine-Tuning) via LoRA adapter targeting specific linear layers of the Qwen2.5-VL architecture.
- Adapter Size: Approximately 150 MB (adapter_model.safetensors)
Evaluation
Testing Data, Factors & Metrics
Evaluation was conducted on a held-out validation subset of the composite dataset to measure the model's ability to minimize language modeling loss given visual context.
Results
Based on the recorded TensorBoard telemetry:
- Evaluation Loss: Demonstrated steady convergence, decreasing from an initial value of approximately 0.56 to a final value of 0.4434 at step 200.
- Evaluation Runtime: Averaged approximately 284 seconds per evaluation phase.
- Throughput: A constant 0.176 samples per second and 0.176 steps per second during evaluation. The equal rates imply an evaluation batch size of 1, and 284 s x 0.176 samples/s suggests a held-out set of roughly 50 samples.
The learning curve indicates successful adaptation to the target domain without immediate signs of severe overfitting within the observed 200-step window.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: GPU (specific accelerator not recorded)
- Hours used: Not precisely logged; brief, owing to the parameter-efficient (LoRA) fine-tuning regime.
- Carbon Emitted: Comparatively small footprint, as only the lightweight LoRA adapter (approximately 161 MB of weights) was trained over a short convergence window.
Technical Specifications
Model Architecture and Objective
The model relies on the Qwen2.5 Vision-Language architecture, which integrates a visual encoder with a large language model. The objective function during fine-tuning was standard causal language modeling (next-token prediction) conditioned on the combined text prompt and processed visual tokens.
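In practice, the conditioning is typically enforced through label masking: positions covering the prompt and visual tokens are set to an ignore index so that only the HTML/Tailwind completion contributes to the loss. The sketch below illustrates that convention with toy per-token probabilities; the exact masking used in this training run is an assumption.

```python
import math

IGNORE_INDEX = -100  # convention used by transformers to skip positions in the loss

def masked_nll(token_probs, labels):
    """Average negative log-likelihood over positions whose label is not IGNORE_INDEX.

    token_probs[i] is the model's probability for the correct next token
    at position i; labels[i] is the target token id or IGNORE_INDEX.
    """
    losses = [-math.log(p) for p, y in zip(token_probs, labels) if y != IGNORE_INDEX]
    return sum(losses) / len(losses)

# First two positions stand in for prompt/image tokens and are masked out,
# so only the code-completion tokens (illustrative ids 17 and 42) are scored.
probs = [0.9, 0.8, 0.5, 0.25]    # p(correct token) at each position
labels = [-100, -100, 17, 42]
loss = masked_nll(probs, labels)
```

Here the loss averages -log(0.5) and -log(0.25) only; the well-predicted prompt positions are excluded, mirroring how the evaluation loss above reflects performance on the generated code rather than on the input.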
Compute Infrastructure
Software
- Hugging Face Transformers
- PEFT 0.18.1
- TensorBoard
- Safetensors
Acknowledgments
Contributors
This model is the result of a joint effort between the authors. Special thanks to Illia Denysenko for the development of the functional implementation framework and his significant role in the construction of the Sketch2Code dataset.
- Dataset Curation: Significant contribution to the collection, cleaning, and labeling of the Sketch2Code dataset (DOI: 10.5281/zenodo.19741715).
- Implementation: Development of the primary application and integration logic that utilizes this LoRA adapter for real-world design-to-code conversion.
Citation
Model Citation:
@misc{qwen2.5-vl-websight-plus-lora,
  author       = {megistone},
  title        = {qwen2-5-3b-vl-websight-plus-lora},
  year         = {2024},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Repository},
  howpublished = {\url{https://huggingface.co/megistone/qwen2-5-3b-vl-websight-plus-lora}}
}
Training Data Citation:
@dataset{kyselov_denysenko_2026_19741715,
  author    = {Kyselov, Mykyta and Denysenko, Illia},
  title     = {Sketch2Code: Hand-Drawn Wireframes to HTML/Tailwind Dataset},
  month     = jan,
  year      = 2026,
  publisher = {Zenodo},
  version   = {1.0.0},
  doi       = {10.5281/zenodo.19741715},
  url       = {https://doi.org/10.5281/zenodo.19741715}
}
APA for Training Data: Kyselov, M., & Denysenko, I. (2026). Sketch2Code: Hand-Drawn Wireframes to HTML/Tailwind Dataset (Version 1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.19741715