Eliovp committed
Commit 7906db9 · verified · 1 Parent(s): 275c832

Create README.md

Files changed (1)
  1. README.md +48 -0
README.md ADDED
@@ -0,0 +1,48 @@
+ ---
+ base_model:
+ - Qwen/Qwen3-0.6B
+ pipeline_tag: text-generation
+ library_name: transformers
+ tags:
+ - FP8
+ - OCP
+ - AMD
+ - ROCM
+ - Quark
+ - vllm
+ ---
+
+ # Qwen3-0.6B-FP8-KV
+
+ > **Lightweight OCP FP8_e4m3 quant of Qwen3-0.6B** with end-to-end KV-cache FP8 support, built with AMD Quark for ROCm.
+
+
+ ## Introduction
+ Qwen3-0.6B-FP8-KV is an OCP-standard FP8_e4m3 quantization of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B),
+ produced with **AMD Quark**.
+
+
+ ## Quantization Strategy
+ - **Quantizer**: AMD Quark v0.9+
+ - **Numeric Format**: OCP FP8_e4m3, symmetric per-tensor
+ - **Scope**: All `Linear` layers (excluding `lm_head`), activations **and the KV cache**
+ - **Block Size**: 128 (OCP-aligned)
+ - **Calibration**: 128 Pile samples
+ - **Metadata**: scales and block info in JSON; weights in SafeTensors
+
+
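The "symmetric per-tensor" FP8_e4m3 scheme named in the list above can be illustrated with a small NumPy simulation. This is only a sketch under stated assumptions: the scale is taken as the tensor's absolute maximum divided by 448 (the largest finite E4M3 value), which is a common convention but is not confirmed by the README as what Quark does, and the function names `round_to_e4m3`, `quantize_per_tensor`, and `dequantize` are hypothetical helpers, not part of any library.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude in OCP FP8 E4M3


def round_to_e4m3(x):
    """Simulate rounding to the nearest E4M3 value (3 mantissa bits, saturating)."""
    x = np.clip(np.asarray(x, dtype=np.float64), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    mag = np.abs(x)
    # frexp gives mag = m * 2**e with m in [0.5, 1), so floor(log2(mag)) == e - 1.
    _, e = np.frexp(mag)
    # Exponents below -6 (the smallest normal exponent) share the subnormal step.
    exp = np.maximum(e - 1, -6)
    step = np.ldexp(1.0, exp - 3)  # 8 representable steps per power of two
    return np.sign(x) * np.round(mag / step) * step


def quantize_per_tensor(w):
    """Assumed scheme: one symmetric scale for the whole tensor, scale = amax / 448."""
    amax = float(np.max(np.abs(w)))
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return round_to_e4m3(w / scale), scale


def dequantize(q, scale):
    """Map simulated FP8 values back to the original range."""
    return q * scale


if __name__ == "__main__":
    w = np.array([0.0, 0.1, -0.5, 2.0])
    q, scale = quantize_per_tensor(w)
    print("quantized:", q, "scale:", scale)
    print("round trip:", dequantize(q, scale))
```

Because a single scale covers the whole tensor, one large outlier stretches the range and leaves fewer representable steps for small values. That is the usual trade-off of per-tensor scaling against finer-grained (per-channel or per-block) scaling.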
+ ## Performance Snapshot
+
+ | Metric                          | FP16 Baseline | FP8_e4m3 Quantized |
+ |---------------------------------|--------------:|-------------------:|
+ | WikiText2 Perplexity            |         ~22.1 |              ~25.8 |
+ | Memory Footprint (relative)     |          1.0× |              0.50× |
+ | Inference Throughput (relative) |          1.0× |               1.3× |
+
+ ## Evaluation
+ We measured perplexity on WikiText2 (lower is better):
+ - FP16 (Qwen3-0.6B) → 22.1 PPL
+ - FP8_e4m3 (this model) → 25.8 PPL
+
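For readers comparing the two perplexity figures above, the sketch below shows the standard definition (the exponential of the mean negative log-likelihood per token) and the relative gap between the reported values. The README does not state the evaluation settings (context length, stride, tokenization), so the helper names `perplexity` and `relative_increase` are hypothetical and the code only illustrates the arithmetic, not the authors' evaluation script.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def relative_increase(baseline_ppl, quantized_ppl):
    """Fractional perplexity increase of the quantized model over the baseline."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl


if __name__ == "__main__":
    # Toy check: a model assigning probability 1/4 to every token has perplexity 4.
    print(perplexity([math.log(0.25)] * 4))
    # Gap between the reported figures (22.1 for FP16, 25.8 for FP8_e4m3).
    print(f"{relative_increase(22.1, 25.8):.1%}")
```

With the reported numbers, the FP8_e4m3 model's perplexity is roughly 17% higher than the FP16 baseline.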
+ ## License
+ This model inherits the Qwen3-0.6B license.