chamber111 committed on
Commit a249c70 · verified · 1 Parent(s): bdbb111

Update README.md

  1. README.md +54 -1
README.md CHANGED
@@ -31,4 +31,57 @@ As a result, VPPO-7B demonstrates significant performance improvements over stro
  ### Model Sources
 
  - **Repository:** [`VPPO-RL`](https://github.com/huaixuheqing/VPPO-RL)
- - **Paper:** `[Please Fill In: Link to the arXiv paper]`
+ - **Paper:** `[Please Fill In: Link to the arXiv paper]`
+
+ ## Training Details
+
+ ### Training Data
+
+ The model was fine-tuned on [**ViRL39K**](https://huggingface.co/datasets/chamber111/VPPO_ViRL39K_train), a diverse collection of multimodal reasoning problems. The original dataset is available on the Hugging Face Hub: [`TIGER-Lab/ViRL39K`](https://huggingface.co/datasets/TIGER-Lab/ViRL39K).
+
+ ### Training Procedure
+
+ The model was trained with our **Visually-Perceptive Policy Optimization (VPPO)** algorithm, a modification of the Group Relative Policy Optimization (GRPO) framework. The procedure generates responses, computes the token-level visual dependency of each response, and uses this dependency to shape the advantage and filter gradients during the policy update step.
+
+ #### Training Hyperparameters
+
+ - **Base Model:** Qwen2.5-VL-7B-Instruct
+ - **Algorithm:** VPPO
+ - **Epochs:** 2
+ - **Learning Rate:** 1e-6
+ - **Rollout Batch Size:** 384
+ - **Max Response Length:** 2048
+ - **Entropy Penalty Coefficient:** 0.06
+ - **Gradient Filtering Ratio (k):** 0.4
+ - **Advantage Shaping Min (β_min):** 0.9
+ - **Training Regime:** bf16 mixed precision
+
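The training procedure and the two VPPO-specific hyperparameters listed above (k = 0.4, β_min = 0.9) can be illustrated with a rough sketch. Only those two values come from this README. Everything else is an assumption made for illustration: the function name `shape_and_filter`, the upper bound `beta_max`, the linear rescaling of each response's mean dependency into `[beta_min, beta_max]`, and the per-response top-k rule for keeping tokens. The dependency scores themselves are taken as given inputs. This is not the released implementation; the [`VPPO-RL`](https://github.com/huaixuheqing/VPPO-RL) repository is the authoritative source.

```python
import math


def shape_and_filter(advantages, dependency, k=0.4, beta_min=0.9, beta_max=1.1):
    """Illustrative sketch only, not the VPPO-RL implementation.

    advantages: one group-relative advantage per response (as in GRPO).
    dependency: one list per response holding a visual-dependency score
                for every generated token (how the scores are computed is
                out of scope here).
    beta_max and the formulas below are assumptions for illustration.
    Returns (token_advantages, masks), each shaped like `dependency`.
    """
    # Advantage shaping (assumed form): responses whose tokens depend more
    # on the image get a larger multiplier in [beta_min, beta_max].
    mean_dep = [sum(row) / len(row) for row in dependency]
    lo, hi = min(mean_dep), max(mean_dep)
    span = hi - lo
    scales = [
        beta_min + (beta_max - beta_min) * ((m - lo) / span if span > 0 else 0.0)
        for m in mean_dep
    ]
    shaped = [a * s for a, s in zip(advantages, scales)]

    # Gradient filtering (assumed form): within each response, only the
    # top-k fraction of tokens by dependency contributes to the update.
    masks = []
    for row in dependency:
        n_keep = max(1, math.ceil(k * len(row)))
        ranked = sorted(range(len(row)), key=lambda t: row[t], reverse=True)
        keep = set(ranked[:n_keep])
        masks.append([t in keep for t in range(len(row))])

    # Filtered-out tokens receive zero advantage, hence zero gradient signal.
    token_advantages = [
        [a if kept else 0.0 for kept in mask] for a, mask in zip(shaped, masks)
    ]
    return token_advantages, masks
```

In a real policy-gradient loss, the mask would multiply the per-token loss terms before they are averaged; the sketch only shows which tokens would survive and with what advantage.
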
+ ## Evaluation
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ The model was evaluated on a suite of 8 multimodal reasoning benchmarks:
+ - **Math & Geometry:** Geo3k, We-Math, MathVerse, MathVision, DynaMath, MMK12
+ - **Logic:** LogicVista
+ - **Multi-discipline:** MMMU-Pro
+
+ #### Metrics
+
+ Performance is measured by **average accuracy@8**: the average success rate over 8 independent generations per problem (sampled at temperature 1.0), scored by exact match.
+
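The metric described above can be sketched as follows. The function name `average_accuracy_at_n` and the whitespace stripping before comparison are assumptions for illustration; the actual evaluation scripts may extract or normalize answers differently.

```python
def average_accuracy_at_n(generations, references, n=8):
    """Average accuracy@n with exact-match scoring (illustrative sketch).

    generations: one list per problem holding n sampled answers.
    references:  one gold answer per problem.
    Stripping surrounding whitespace before comparing is an assumption.
    """
    if not references or len(generations) != len(references):
        raise ValueError("need one non-empty list of generations per reference")

    per_problem = []
    for answers, gold in zip(generations, references):
        if len(answers) != n:
            raise ValueError(f"expected {n} generations per problem")
        correct = sum(a.strip() == gold.strip() for a in answers)
        per_problem.append(correct / n)  # success rate for this problem

    # Average the per-problem success rates over the benchmark.
    return sum(per_problem) / len(per_problem)
```

For example, a problem answered correctly in 6 of its 8 samples contributes 0.75 to the benchmark average.
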
+ ## Citation
+
+ If you use this model in your work, please cite our paper:
+
+ **BibTeX:**
+
+ <!-- ```bibtex
+ @article{yourname2025vppo,
+ title={Spotlight on Token Perception for Multimodal Reinforcement Learning},
+ author={[Please Fill In: Authors of the paper]},
+ journal={arXiv preprint arXiv:2510.XXXXX},
+ year={2025}
+ }
+ ``` -->