How to use with vLLM
Install from pip and serve the model
# Install vLLM from pip:
pip install vllm
# This repository contains only a LoRA adapter, so serve the base model with
# LoRA enabled and register the adapter under an alias (LoRA support for this
# architecture may depend on your vLLM version):
vllm serve "Qwen/Qwen2.5-VL-3B-Instruct" \
	--enable-lora \
	--lora-modules websight-plus="megistone/qwen2-5-3b-vl-websight-plus-lora"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "megistone/qwen2-5-3b-vl-websight-plus-lora",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
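
The same endpoint can also be called from Python with the openai client. This is a minimal sketch; the model name assumes the "websight-plus" alias registered via --lora-modules above.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="websight-plus",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    },
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
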
Use Docker
docker model run hf.co/megistone/qwen2-5-3b-vl-websight-plus-lora
Model Card for qwen2-5-3b-vl-websight-plus-lora

Model Details

Model Description

This model is a Low-Rank Adaptation (LoRA) fine-tune of the Qwen/Qwen2.5-VL-3B-Instruct base model. It is specifically optimized for visual web development tasks, translating visual inputs such as hand-drawn wireframes, low-fidelity sketches, and digital web screenshots directly into structural HTML and Tailwind CSS code.

The "plus" in the model designation indicates a composite training methodology utilizing both the standard WebSight dataset and the custom Sketch2Code dataset, enhancing the model's spatial reasoning and abstract UI interpretation capabilities.

  • Developed by: megistone (Mykyta Kyselov, https://orcid.org/0009-0005-3686-3419)
  • Model type: Vision-Language Model (VLM) with PEFT/LoRA adapter
  • Language(s) (NLP): English (Focus on HTML, CSS, and structural web markup)
  • License: Apache 2.0 for the LoRA adapter weights; downstream use remains subject to the base model's license terms.
  • Finetuned from model: Qwen/Qwen2.5-VL-3B-Instruct

Model Sources

  • Repository: https://huggingface.co/megistone/qwen2-5-3b-vl-websight-plus-lora

Uses

Direct Use

The primary function of this model is image-to-code generation. Users can input images of UI designs (sketches, wireframes, or rendered screenshots) accompanied by a text prompt (e.g., "Convert this wireframe to clean HTML/CSS code"), and the model will output corresponding semantic HTML5 and Tailwind CSS markup.

Downstream Use

Integration into automated prototyping workflows, design-to-code pipelines, or as a backend engine for UI/UX drafting applications where rapid iteration from whiteboard to functional prototype is required.

Out-of-Scope Use

This model is not intended for general-purpose conversational tasks, complex logic implementation (JavaScript or backend programming), or generating non-UI code. It should not be deployed in automated production pipelines without human-in-the-loop review, as the generated code may contain rendering hallucinations or accessibility oversights.

Bias, Risks, and Limitations

While fine-tuned on diverse UI layouts, the model may exhibit biases toward common web design patterns present in the WebSight and Sketch2Code datasets. It may struggle with highly unconventional, abstract, or text-heavy layouts where optical character recognition (OCR) ambiguity exists. Generated code may not consistently adhere to strict WCAG accessibility guidelines without manual intervention.

Recommendations

Users are advised to manually review, debug, and validate all generated HTML/CSS code before deploying it in production environments. Output should be treated as a high-fidelity prototype rather than a final production artifact.
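
As a lightweight, illustrative first pass (not a substitute for manual review), grossly malformed markup can be caught with a standard-library tag-balance check; the checker below treats the usual HTML void elements as self-closing:

from html.parser import HTMLParser

class TagBalanceChecker(HTMLParser):
    """Rough well-formedness check: flags mismatched or unclosed non-void tags."""
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.problems.append(f"unexpected </{tag}>")

generated_html = '<div class="p-4"><p>Hello</p></div>'  # placeholder for model output
checker = TagBalanceChecker()
checker.feed(generated_html)
print("problems:", checker.problems, "unclosed:", checker.stack)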

How to Get Started with the Model

Use the following Python code to initialize the base model and load the LoRA weights using the transformers and peft libraries.

import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Define model paths
base_model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
adapter_id = "megistone/qwen2-5-3b-vl-websight-plus-lora"

# Load processor and base model
processor = AutoProcessor.from_pretrained(base_model_id)
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_model_id, 
    device_map="auto", 
    torch_dtype=torch.bfloat16
)

# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_id)

# Example inference
image = Image.open("path_to_your_sketch.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this wireframe to clean HTML/CSS code"},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1024)
print(processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0])
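
For deployment, the adapter can optionally be merged into the base weights to remove the PEFT indirection at inference time. This is a standard peft operation, not a step prescribed by the model author:

# Merge the LoRA weights into the base model and drop the adapter wrappers
merged_model = model.merge_and_unload()
merged_model.save_pretrained("qwen2-5-vl-websight-plus-merged")
processor.save_pretrained("qwen2-5-vl-websight-plus-merged")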

Training Details

Training Data

The model was fine-tuned on a composite dataset strategy denoted as "websight-plus":

  1. WebSight Dataset: Utilized for foundational understanding of modern web layouts, syntax formatting, and high-fidelity screenshot-to-code translation.
  2. Sketch2Code Dataset: A dataset comprising hand-drawn wireframes paired with HTML and Tailwind CSS code, authored by Mykyta Kyselov and Illia Denysenko (2026). This dataset introduces imperfect human geometries, forcing the model to map abstract sketches to rigid structural code. The dataset is accessible via Kaggle and formally archived on Zenodo (DOI: 10.5281/zenodo.19741715).
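
To illustrate how such image/markup pairs feed the fine-tune, the sketch below maps one record to the chat format expected by the Qwen2.5-VL processor. The field names "image" and "html" are hypothetical; the actual Sketch2Code schema may differ.

def to_chat_example(record):
    # record["image"]: wireframe image (path or PIL.Image); record["html"]: target markup
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": record["image"]},
                    {"type": "text", "text": "Convert this wireframe to clean HTML/CSS code"},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": record["html"]}],
            },
        ]
    }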

Training Procedure

The training utilized Low-Rank Adaptation (LoRA) to efficiently update attention mechanisms without modifying the full 3 billion parameter count of the base model. Safetensors format was used for weight serialization.

Training Hyperparameters

  • Training regime: Mixed precision (bfloat16)
  • Architecture Updates: PEFT (Parameter-Efficient Fine-Tuning) via LoRA adapter targeting specific linear layers of the Qwen2.5-VL architecture.
  • Adapter Size: Approximately 150 MB (adapter_model.safetensors).
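
The exact adapter configuration is not published in this card. A representative PEFT setup targeting the attention projections would look like the following; the rank, scaling, and target modules are assumptions, not the recorded values:

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                                                     # assumed rank
    lora_alpha=32,                                            # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)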

Evaluation

Testing Data, Factors & Metrics

Evaluation was conducted on a held-out validation subset of the composite dataset to measure the model's ability to minimize language modeling loss given visual context.

Results

Based on the recorded TensorBoard telemetry:

  • Evaluation Loss: Demonstrated steady convergence, decreasing from an initial value of approximately 0.56 to a final value of 0.4434 at step 200.
  • Evaluation Runtime: Averaged approximately 284 seconds per evaluation phase.
  • Throughput: Held constant at 0.176 samples per second and 0.176 steps per second during evaluation (equal rates imply one sample per evaluation step).

The learning curve indicates successful adaptation to the target domain without immediate signs of severe overfitting within the observed 200-step window.
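
For intuition, the final evaluation loss corresponds to a token-level perplexity of roughly exp(0.4434) ≈ 1.56:

import math

eval_loss = 0.4434
print(f"perplexity ≈ {math.exp(eval_loss):.3f}")  # ≈ 1.558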

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: GPU acceleration
  • Hours used: Not precisely logged; the parameter-efficient fine-tuning (LoRA) run converged within a short window.
  • Carbon Emitted: Minimal comparative footprint given the small adapter size and short convergence window.

Technical Specifications

Model Architecture and Objective

The model relies on the Qwen2.5 Vision-Language architecture, which integrates a visual encoder with a large language model. The objective function during fine-tuning was standard causal language modeling (next-token prediction) conditioned on the combined text prompt and processed visual tokens.
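
As a sketch of that objective, the fine-tuning loss is the usual shifted cross-entropy with the prompt and visual tokens masked out. The label value -100 follows the standard Hugging Face ignore-index convention; the prompt-length cutoff is an assumption about the training setup:

import torch
import torch.nn.functional as F

def causal_lm_loss(logits, input_ids, prompt_len):
    # Shift so that the logits at position t predict the token at position t+1
    shift_logits = logits[:, :-1, :]
    labels = input_ids[:, 1:].clone()
    # Compute loss only on the target code, not the prompt or visual tokens
    labels[:, : prompt_len - 1] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )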

Compute Infrastructure

Software

  • Hugging Face Transformers
  • PEFT 0.18.1
  • TensorBoard
  • Safetensors

Acknowledgments

Contributors

This model is the result of a joint effort between the authors. Special thanks to Illia Denysenko for the development of the functional implementation framework and his significant role in the construction of the Sketch2Code dataset.

  1. Dataset Curation: Significant contribution to the collection, cleaning, and labeling of the Sketch2Code dataset (DOI: 10.5281/zenodo.19741715).
  2. Implementation: Development of the primary application and integration logic that utilizes this LoRA adapter for real-world design-to-code conversion.

Citation

Model Citation:

@misc{qwen2.5-vl-websight-plus-lora,
  author = {megistone},
  title = {qwen2-5-3b-vl-websight-plus-lora},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face Repository},
  howpublished = {\url{https://huggingface.co/megistone/qwen2-5-3b-vl-websight-plus-lora}}
}

Training Data Citation:

@dataset{kyselov_denysenko_2026_19741715,
  author       = {Mykyta Kyselov and Illia Denysenko},
  title        = {Sketch2Code: Hand-Drawn Wireframes to HTML/Tailwind Dataset},
  month        = jan,
  year         = 2026,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.19741715},
  url          = {https://doi.org/10.5281/zenodo.19741715}
}

APA for Training Data: Kyselov, M., & Denysenko, I. (2026). Sketch2Code: Hand-Drawn Wireframes to HTML/Tailwind Dataset (Version 1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.19741715
