Files changed (1)
  1. README.md +129 -13
README.md CHANGED
@@ -1,14 +1,130 @@
1
  ---
2
- license: mit
3
- datasets:
4
- - HuggingFaceM4/ChartQA
5
- language:
6
- - en
7
- metrics:
8
- - accuracy
9
- base_model:
10
- - Qwen/Qwen2-VL-2B-Instruct
11
- library_name: transformers
12
- tags:
13
- - code
14
- ---
1
+
2
+ # chart-vision-qwen
3
+
4
+ **Qwen2-VL-2B-Instruct fine-tuned on ChartQA** using LoRA adapters for chart question answering.
5
+
6
+ *Submitted as a deliverable for the Orange Problem in the NLP with DL course (UE23AM343BB1).*
7
+
8
+ **Team:** Langrangers (PES University)
9
+ - Aaron Thomas Mathew — PES1UG23AM005
10
+ - Aman Kumar Mishra — PES1UG23AM040
11
+ - Preetham VJ — PES1UG23AM913
12
+
13
+ **GitHub Repository:** [Aman-K-Mishra/orange-chartqa-slm](https://github.com/Aman-K-Mishra/orange-chartqa-slm)
14
+
15
  ---
16
+
17
+ ## Model Description
18
+
19
+ Given a chart image (bar chart, line chart, pie chart, etc.) and a natural language question, this model predicts the answer. It was fine-tuned from [Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) using LoRA (Low-Rank Adaptation) on the full ChartQA training split (28,299 samples).
20
+
21
+ | Property | Value |
22
+ |---|---|
23
+ | Base model | Qwen2-VL-2B-Instruct |
24
+ | Fine-tuning method | LoRA (PEFT) |
25
+ | Dataset | HuggingFaceM4/ChartQA |
26
+ | Training samples | 28,299 |
27
+ | Trainable parameters | 4.36M (0.20% of 2.21B) |
28
+ | Hardware | Tesla T4 (15.6 GB VRAM) |
29
+ | Epochs | 1 |
30
+
31
+ ---
32
+
33
+ ## Training Details
34
+
35
+ ### LoRA Configuration
36
+ | Parameter | Value | Reason |
37
+ |---|---|---|
38
+ | Rank (`r`) | 16 | rank=8 too small for chart reasoning; rank=32 risks OOM |
39
+ | Alpha | 32 | Standard `alpha = 2×rank` rule |
40
+ | Dropout | 0.05 | Light regularisation to prevent adapter overfitting |
41
+ | Target modules | q_proj, k_proj, v_proj, o_proj | Most impactful attention projections for VLMs |
42
+
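As a rough sanity check on the trainable-parameter figure, the adapter size implied by this configuration can be estimated by hand. The sketch below assumes the language-model dimensions commonly reported for Qwen2-VL-2B (hidden size 1536, 28 decoder layers, 2 key/value heads of dimension 128) and assumes the adapters attach only to the language model's attention projections. None of these dimensions are stated in this README, so check them against the base model's `config.json` before relying on the result.

```python
# Rough LoRA parameter count for the configuration in the table above.
# ASSUMPTION: the model dimensions below are not taken from this README.
HIDDEN = 1536        # assumed hidden size of the language model
KV_DIM = 2 * 128     # assumed: 2 key/value heads x head dimension 128
LAYERS = 28          # assumed number of decoder layers
RANK = 16            # from the table: LoRA rank r
ALPHA = 32           # from the table: LoRA alpha


def lora_params(in_features, out_features, rank=RANK):
    """Parameters added by one adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


per_layer = (
    lora_params(HIDDEN, HIDDEN)    # q_proj
    + lora_params(HIDDEN, KV_DIM)  # k_proj
    + lora_params(HIDDEN, KV_DIM)  # v_proj
    + lora_params(HIDDEN, HIDDEN)  # o_proj
)
total = per_layer * LAYERS
scaling = ALPHA / RANK  # factor applied to the low-rank update

print(f"{total:,} trainable parameters, scaling = {scaling}")
```

Under these assumed dimensions the total is 4,358,144, which is consistent with the 4.36M trainable parameters reported in the Model Description table.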
43
+ ### Training Hyperparameters
44
+ | Parameter | Value | Reason |
45
+ |---|---|---|
46
+ | Batch size | 1 | Avoids OOM on the T4 at max_length=768 |
47
+ | Gradient accumulation | 16 steps | Effective batch = 16 |
48
+ | Learning rate | 2e-4 | Standard for LoRA fine-tuning |
49
+ | Max sequence length | 768 | Compromise: 512 too short, 1024 causes OOM |
50
+ | Quantization | 8-bit (BitsAndBytes) | Full precision ≈ 16 GB; 8-bit ≈ 8 GB, safe for T4 |
51
+ | Image resolution | 256–512 visual tokens (28×28 px each) | Matches Qwen2-VL's 28×28-pixel token granularity; T4-safe |
52
+ | LR scheduler | Cosine annealing | Smooth decay over full epoch |
53
+
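The batch-size and scheduler rows can be worked through numerically. The following is a minimal sketch of the arithmetic, assuming one optimizer step per 16 accumulated samples and a plain cosine schedule with no warmup that decays to zero. The warmup and floor settings are assumptions, since this README does not state them, and the exact step count reported by a trainer may differ slightly.

```python
import math

TRAIN_SAMPLES = 28_299   # from the README
PER_DEVICE_BATCH = 1     # from the table
GRAD_ACCUM = 16          # from the table
BASE_LR = 2e-4           # from the table

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
total_steps = math.ceil(TRAIN_SAMPLES / effective_batch)  # optimizer steps in 1 epoch


def cosine_lr(step, total=total_steps, base_lr=BASE_LR):
    """Cosine annealing from base_lr to 0 (ASSUMPTION: no warmup, floor of 0)."""
    progress = min(step, total) / total
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


print("effective batch:", effective_batch, "optimizer steps:", total_steps)
for step in (0, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step):.2e}")
```

With these numbers a single epoch is roughly 1,769 optimizer steps, so the learning rate falls from 2e-4 at the start to about half of that near the midpoint.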
54
+ ---
55
+
56
+ ## How to Use
57
+
58
+ ### Installation
59
+
60
+ ```bash
61
+ pip install transformers peft bitsandbytes accelerate datasets pillow
62
+ ```
63
+
64
+ ### Load Adapters and Run Inference
65
+
66
+ ```python
67
+ from transformers import AutoProcessor, Qwen2VLForConditionalGeneration, BitsAndBytesConfig
68
+ from peft import PeftModel
69
+ from PIL import Image
70
+ import torch
71
+
72
+ BASE_MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"
73
+ ADAPTER_REPO = "preethamvj/chart-vision-qwen"
74
+
75
+ # Load base model in 8-bit
76
+ bnb_config = BitsAndBytesConfig(load_in_8bit=True)
77
+ model = Qwen2VLForConditionalGeneration.from_pretrained(
78
+     BASE_MODEL_ID,
79
+     quantization_config=bnb_config,
80
+     device_map="auto",
81
+     torch_dtype=torch.float16
82
+ )
83
+
84
+ # Load LoRA adapters
85
+ model = PeftModel.from_pretrained(model, ADAPTER_REPO)
86
+
87
+ # Optional: merge adapters into base weights for faster inference (merging into 8-bit weights can shift outputs slightly due to rounding)
88
+ model = model.merge_and_unload()
89
+
90
+ processor = AutoProcessor.from_pretrained(
91
+     BASE_MODEL_ID,
92
+     min_pixels=256 * 28 * 28,
93
+     max_pixels=512 * 28 * 28
94
+ )
95
+
96
+ # Inference
97
+ image = Image.open("your_chart.png").convert("RGB")
98
+ question = "What is the highest value in the chart?"
99
+
100
+ messages = [{
101
+     "role": "user",
102
+     "content": [
103
+         {"type": "image", "image": image},
104
+         {"type": "text", "text": question}
105
+     ]
106
+ }]
107
+
108
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
109
+ inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")
110
+
111
+ with torch.no_grad():
112
+     output = model.generate(**inputs, max_new_tokens=64)
113
+
114
+ answer = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)  # decode only the newly generated tokens
115
+ print("Answer:", answer.strip())
116
+ ```
117
+
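The `min_pixels` and `max_pixels` arguments passed to the processor bound how many visual tokens an image produces. The sketch below approximates that resizing rule in plain Python to show the effect on a chart of a given size. The function name `fit_to_pixel_budget` and the 800×600 example are hypothetical, and the logic is a simplified reconstruction rather than the processor's actual implementation.

```python
import math

FACTOR = 28                          # pixels per visual-token side
MIN_PIXELS = 256 * FACTOR * FACTOR   # same value as min_pixels above
MAX_PIXELS = 512 * FACTOR * FACTOR   # same value as max_pixels above


def fit_to_pixel_budget(height, width):
    """Approximate target size: multiples of FACTOR inside the pixel budget.

    ASSUMPTION: simplified reconstruction, not the library's own code.
    """
    h = max(FACTOR, round(height / FACTOR) * FACTOR)
    w = max(FACTOR, round(width / FACTOR) * FACTOR)
    if h * w > MAX_PIXELS:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt((height * width) / MAX_PIXELS)
        h = math.floor(height / scale / FACTOR) * FACTOR
        w = math.floor(width / scale / FACTOR) * FACTOR
    elif h * w < MIN_PIXELS:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(MIN_PIXELS / (height * width))
        h = math.ceil(height * scale / FACTOR) * FACTOR
        w = math.ceil(width * scale / FACTOR) * FACTOR
    return h, w


h, w = fit_to_pixel_budget(600, 800)  # hypothetical 800x600 chart image
visual_tokens = (h // FACTOR) * (w // FACTOR)
print(h, w, visual_tokens)
```

Keeping the upper bound at 512 tokens per image limits how much of the 768-token sequence budget the image consumes, and a larger `max_pixels` trades memory for finer detail on small chart labels.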
118
+ ---
119
+
120
+ ## Intended Use
121
+
122
+ This model is intended for chart question answering tasks — reading values, trends, comparisons, and facts from chart images. It is not designed for general visual question answering outside the chart domain.
123
+
124
+ ---
125
+
126
+ ## Limitations
127
+
128
+ - Trained for only 1 epoch due to compute constraints (T4 GPU)
129
+ - Loss shows high variance across steps, suggesting the learning rate may benefit from tuning in future runs
130
+ - Performance may degrade on chart types not well-represented in ChartQA (e.g., highly complex infographics)
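With a per-device batch size of 1, individual step losses are expected to be noisy, so it can help to look at a smoothed curve before concluding that the learning rate needs changing. The following is a minimal sketch of exponential-moving-average smoothing. The loss values are made up for illustration and are not taken from this model's training run.

```python
def ema(values, beta=0.9):
    """Exponential moving average; a higher beta gives a smoother curve."""
    smoothed, avg = [], None
    for v in values:
        avg = v if avg is None else beta * avg + (1 - beta) * v
        smoothed.append(avg)
    return smoothed


# Hypothetical per-step losses, NOT from the actual training log.
noisy_losses = [2.1, 0.4, 1.8, 0.3, 1.5, 0.2, 1.2, 0.3]
smooth = ema(noisy_losses)

for raw, s in zip(noisy_losses, smooth):
    print(f"raw={raw:.2f}  smoothed={s:.3f}")
```

If the smoothed curve still trends downward, the variance is more likely sampling noise from the small batch than a sign of an unstable learning rate.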