Commit 248e1b1 by MagistrTheOne · verified · 1 parent: cd4cebf

Update MagistrTheOne/RadonSAI-Small with safetensors weights

README.md CHANGED
@@ -1,126 +1,44 @@
1
- ---
2
- license: apache-2.0
3
- language:
4
- - ru
5
- - en
6
- tags:
7
- - mistral
8
- - russian
9
- - english
10
- - code
11
- - machine-learning
12
- - nlp
13
- - transformer
14
- - gpt2
15
- - small-model
16
- pipeline_tag: text-generation
17
- model-index:
18
- - name: RadonSAI-Small
19
- results:
20
- - task:
21
- type: text-generation
22
- name: Text Generation
23
- dataset:
24
- type: custom
25
- name: RADON Datasets
26
- metrics:
27
- - type: perplexity
28
- value: "TBD"
29
- name: Perplexity
30
- size_categories: 22M
31
- ---
32
-
33
- # RadonSAI-Small - 22M Parameter GPT2-based Russian-English Transformer
34
-
35
- ## Model Description
36
-
37
- RadonSAI-Small is a 22M parameter transformer model based on GPT2 architecture, optimized for Russian-English machine learning applications and development/testing purposes.
38
-
39
- ### Key Features
40
-
41
- - **Architecture**: GPT2-based with optimized parameters
42
- - **Parameters**: 21,764,608 parameters (22M)
43
- - **Context**: 512 tokens
44
- - **Tokenizer**: Optimized for Russian-English
45
- - **Status**: Ready for inference and fine-tuning
46
- - **Size**: Compact model for development and testing
47
-
48
- ### Model Weights
49
-
50
- This model contains properly initialized weights:
51
-
52
- - **Format**: Safetensors (.safetensors) + PyTorch (.bin)
53
- - **Dtype**: float32
54
- - **Initialization**: Random weights
55
- - **Size**: 86MB (22M parameters)
56
- - **Status**: Ready for inference and fine-tuning
57
-
58
- ### Usage
59
-
60
- ```python
61
- from transformers import AutoModelForCausalLM, AutoTokenizer
62
-
63
- # Load RadonSAI-Small
64
- model = AutoModelForCausalLM.from_pretrained("MagistrTheOne/RadonSAI-Small")
65
- tokenizer = AutoTokenizer.from_pretrained("MagistrTheOne/RadonSAI-Small")
66
-
67
- # Generate text
68
- prompt = "Машинное обучение - это"
69
- inputs = tokenizer(prompt, return_tensors="pt")
70
- outputs = model.generate(
71
- **inputs,
72
- max_length=100,
73
- temperature=0.7,
74
- do_sample=True,
75
- pad_token_id=tokenizer.eos_token_id
76
- )
77
- result = tokenizer.decode(outputs[0], skip_special_tokens=True)
78
- print(result)
79
- ```
80
-
81
- ### Model Architecture
82
-
83
  ```
84
- RadonSAI-Small:
85
- - Hidden size: 256
86
- - Layers: 6
87
- - Attention heads: 8
88
- - Intermediate size: 1,024
89
- - Vocabulary: 32,000
90
- - Context window: 512 tokens
91
- - Architecture: GPT2LMHeadModel
92
- ```
93
-
94
- ### Performance
95
-
96
- - **Speed**: Fast inference on CPU/GPU
97
- - **Memory**: 86MB memory usage
98
- - **Quality**: Development/testing model
99
- - **Languages**: English + Russian support
100
-
101
- ### Use Cases
102
-
103
- - **Development**: Quick prototyping and testing
104
- - **Learning**: Educational purposes
105
- - **Experimentation**: Model architecture research
106
- - **Resource-constrained**: Low-memory environments
107
-
108
- ### Citation
109
-
110
- ```bibtex
111
- @misc{radonsaismall2025,
112
- title={RadonSAI-Small: 22M Parameter GPT2-based Russian-English Transformer},
113
- author={MagistrTheOne},
114
- year={2025},
115
- url={https://huggingface.co/MagistrTheOne/RadonSAI-Small}
116
- }
117
- ```
118
-
119
- ### License
120
-
121
- Apache 2.0 License
122
-
123
- ### Contact
124
 
125
- - GitHub: [MagistrTheOne/Radon2BMistral](https://github.com/MagistrTheOne/Radon2BMistral)
126
- - Hugging Face: [MagistrTheOne/RadonSAI-Small](https://huggingface.co/MagistrTheOne/RadonSAI-Small)

1
+ # RadonSAI-Small
2
+
3
+ ## Overview
4
+ RadonSAI-Small is a lightweight variant of the Radon model family, based on the GPT-2 architecture.
5
+
6
+ ## Source Model
7
+ - **Source**: gpt2
8
+ - **Model Class**: GPT2LMHeadModel
9
+ - **Parameters**: 124M (actual size from source)
10
+ - **Architecture**: GPT-2 Small
11
+
12
+ ## Artifacts
13
+ - `model.safetensors` - Model weights in safetensors format (~480MB)
14
+ - `tokenizer.json` - Tokenizer configuration
15
+ - `tokenizer_config.json` - Tokenizer metadata
16
+ - `vocab.json` - Vocabulary file
17
+ - `merges.txt` - BPE merge rules
18
+ - `config.json` - Model configuration (normalized)
19
+
20
+ ## How to Verify
21
+ ```bash
22
+ # Run inference test
23
+ python3 tests/test_inference_small.py
24
  ```
25
 
26
+ ## Conversion Steps
27
+ 1. Download gpt2 from Hugging Face
28
+ 2. Convert weights to safetensors format
29
+ 3. Save tokenizer files
30
+ 4. Normalize config JSON with correct architectures and model_type
31
+ 5. Validate with inference test
32
+
33
+ ## Notes
34
+ - This variant uses the original parameter count of the source model (124M)
35
+ - Target label suggests 21M parameters, but actual size is 124M from gpt2
36
+ - To achieve the target 21M parameters, consider:
37
+   - Knowledge distillation from a larger model
38
+   - Pruning techniques
39
+   - Training from scratch with reduced architecture
40
+
41
+ ## File Sizes
42
+ - Total folder size: ~500MB
43
+ - Model weights: ~480MB
44
+ - Tokenizer files: ~20MB
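
The five conversion steps listed in the new README could be sketched roughly as follows. This is an illustration only, not the script the author ran: the function names `normalize_config` and `convert` and the `out_dir` default are hypothetical, and the short generation call at the end merely stands in for `tests/test_inference_small.py`, which is not part of this diff.

```python
import json
from pathlib import Path


def normalize_config(config: dict) -> dict:
    """Step 4: set the fields that transformers uses to pick the model class."""
    normalized = dict(config)
    normalized["architectures"] = ["GPT2LMHeadModel"]
    normalized["model_type"] = "gpt2"
    return normalized


def convert(source: str = "gpt2", out_dir: str = "RadonSAI-Small") -> None:
    # Imported lazily so normalize_config stays usable without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    out = Path(out_dir)

    # Steps 1-2: download the source checkpoint and write model.safetensors.
    model = AutoModelForCausalLM.from_pretrained(source)
    model.save_pretrained(out, safe_serialization=True)

    # Step 3: write the tokenizer files next to the weights.
    tokenizer = AutoTokenizer.from_pretrained(source)
    tokenizer.save_pretrained(out)

    # Step 4: rewrite config.json with explicit architectures and model_type.
    config_path = out / "config.json"
    config = normalize_config(json.loads(config_path.read_text()))
    config_path.write_text(json.dumps(config, indent=2, sort_keys=True) + "\n")

    # Step 5: reload from disk and generate a few tokens as a smoke test.
    reloaded = AutoModelForCausalLM.from_pretrained(out)
    inputs = tokenizer("Hello", return_tensors="pt")
    output_ids = reloaded.generate(
        **inputs, max_new_tokens=5, pad_token_id=tokenizer.eos_token_id
    )
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Calling `convert()` needs network access and the `transformers` and `torch` packages; `normalize_config` is plain Python and can be checked on its own.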
config.json CHANGED
@@ -1,17 +1,38 @@
1
  {
2
- "model_name": "radon",
3
- "model_type": "gpt2",
4
- "vocab_size": 32000,
5
- "hidden_size": 256,
6
- "num_layers": 6,
7
- "num_attention_heads": 8,
8
- "intermediate_size": 1024,
9
- "max_position_embeddings": 512,
10
- "dropout": 0.1,
11
- "attention_dropout": 0.1,
12
- "activation_function": "gelu",
13
- "layer_norm_eps": 1e-05,
14
  "initializer_range": 0.02,
15
  "use_cache": true,
16
- "torch_dtype": "float32"
17
- }
 
1
  {
2
+ "activation_function": "gelu_new",
3
+ "architectures": [
4
+ "GPT2LMHeadModel"
5
+ ],
6
+ "attn_pdrop": 0.1,
7
+ "bos_token_id": 50256,
8
+ "dtype": "float32",
9
+ "embd_pdrop": 0.1,
10
+ "eos_token_id": 50256,
11
  "initializer_range": 0.02,
12
+ "layer_norm_epsilon": 1e-05,
13
+ "model_type": "gpt2",
14
+ "n_ctx": 1024,
15
+ "n_embd": 768,
16
+ "n_head": 12,
17
+ "n_inner": null,
18
+ "n_layer": 12,
19
+ "n_positions": 1024,
20
+ "reorder_and_upcast_attn": false,
21
+ "resid_pdrop": 0.1,
22
+ "scale_attn_by_inverse_layer_idx": false,
23
+ "scale_attn_weights": true,
24
+ "summary_activation": null,
25
+ "summary_first_dropout": 0.1,
26
+ "summary_proj_to_labels": true,
27
+ "summary_type": "cls_index",
28
+ "summary_use_proj": true,
29
+ "task_specific_params": {
30
+ "text-generation": {
31
+ "do_sample": true,
32
+ "max_length": 50
33
+ }
34
+ },
35
+ "transformers_version": "4.57.0",
36
  "use_cache": true,
37
+ "vocab_size": 50257
38
+ }
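
The 124M figure quoted in the README can be cross-checked against the new `config.json` values. The sketch below assumes the standard GPT-2 layout: tied input and output embeddings, a fused QKV projection, and an MLP width of `4 * n_embd` when `n_inner` is null. The helper name `gpt2_param_count` is made up for this example.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer, n_inner=None):
    """Count trainable parameters of a GPT-2 style model from its config values."""
    # GPT-2 falls back to 4 * n_embd when n_inner is null in config.json.
    n_inner = n_inner if n_inner is not None else 4 * n_embd

    # Token and position embeddings; the LM head shares the token embedding.
    embeddings = vocab_size * n_embd + n_positions * n_embd

    # Each LayerNorm holds a weight and a bias vector.
    layer_norm = 2 * n_embd

    # Fused QKV projection plus the attention output projection (weights + biases).
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)

    # Two-layer MLP (weights + biases).
    mlp = (n_embd * n_inner + n_inner) + (n_inner * n_embd + n_embd)

    per_layer = 2 * layer_norm + attention + mlp
    # The final term is the LayerNorm applied after the last block.
    return embeddings + n_layer * per_layer + layer_norm


# Values taken from the new config.json in this commit.
print(gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12))
# → 124439808
```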
config.yaml ADDED
@@ -0,0 +1,9 @@
1
+ architecture: GPT2LMHeadModel
2
+ conversion_date: '2025-01-09'
3
+ format: safetensors
4
+ max_position_embeddings: 1024
5
+ model_name: RadonSAI-Small
6
+ model_type: gpt2
7
+ parameters: 124M
8
+ source_model: gpt2
9
+ vocab_size: 50257
generation_config.json CHANGED
@@ -2,5 +2,5 @@
2
  "_from_model_config": true,
3
  "bos_token_id": 50256,
4
  "eos_token_id": 50256,
5
- "transformers_version": "4.36.2"
6
  }
 
2
  "_from_model_config": true,
3
  "bos_token_id": 50256,
4
  "eos_token_id": 50256,
5
+ "transformers_version": "4.57.0"
6
  }
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:8a60b1808195c50cee30d3cfaf84aab26feb751225939e0a2da8f36af23eb7f5
3
- size 42515920
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c7d00560d8910fbed77ffad4065dee5011c41ba401b1064e749c498ba9e20373
3
+ size 497774208
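
The new LFS pointer reports 497,774,208 bytes. At 4 bytes per float32 value that is about 124.4M values, which matches the parameter count stated in the README. For a downloaded copy, the count can also be read from the file header without loading any tensors. The sketch below relies on the safetensors layout (an 8-byte little-endian header length followed by a JSON header); the function name `count_safetensors_params` is hypothetical.

```python
import json
import struct
from math import prod


def count_safetensors_params(path):
    """Sum tensor element counts using only the safetensors JSON header."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return sum(
        prod(entry["shape"])
        for name, entry in header.items()
        if name != "__metadata__"  # optional free-form metadata, not a tensor
    )


# Rough estimate from the LFS pointer alone (float32 = 4 bytes per value).
print(497_774_208 // 4)
# → 124443552
```

The estimate from the pointer is slightly higher than the exact count because the file size also includes the header itself.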
model_card.yaml ADDED
@@ -0,0 +1,25 @@
1
+ base_model: gpt2
2
+ inference:
3
+ parameters:
4
+ do_sample: true
5
+ max_new_tokens: 256
6
+ temperature: 0.7
7
+ top_p: 0.9
8
+ language:
9
+ - en
10
+ - ru
11
+ library_name: transformers
12
+ license: apache-2.0
13
+ model_type: gpt2
14
+ pipeline_tag: text-generation
15
+ tags:
16
+ - safetensors
17
+ - text-generation
18
+ - conversational
19
+ - machine-learning
20
+ - nlp
21
+ - transformer
22
+ - russian
23
+ - english
24
+ - small-model
25
+ - gpt2
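
The `inference.parameters` block in `model_card.yaml` maps directly onto keyword arguments of `generate` in `transformers`. A minimal sketch, assuming the repository loads through the Auto classes; the helper name `generate` and the `GENERATION_KWARGS` constant are illustrative. Because the tokenizer config in this commit defines no pad token, the end-of-sequence id is passed as `pad_token_id`, which is a common workaround rather than something the commit prescribes.

```python
# Sampling settings copied from inference.parameters in model_card.yaml.
GENERATION_KWARGS = {
    "do_sample": True,
    "max_new_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.9,
}


def generate(prompt, repo_id="MagistrTheOne/RadonSAI-Small"):
    # Imported lazily so the settings above can be inspected without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        pad_token_id=tokenizer.eos_token_id,  # no pad token is configured
        **GENERATION_KWARGS,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Since the weights are those of `gpt2`, which was trained mostly on English text, Russian prompts may produce weaker output than the `ru` language tag suggests.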
special_tokens_map.json CHANGED
@@ -1,23 +1,5 @@
1
  {
2
- "bos_token": {
3
- "content": "<|endoftext|>",
4
- "lstrip": false,
5
- "normalized": true,
6
- "rstrip": false,
7
- "single_word": false
8
- },
9
- "eos_token": {
10
- "content": "<|endoftext|>",
11
- "lstrip": false,
12
- "normalized": true,
13
- "rstrip": false,
14
- "single_word": false
15
- },
16
- "unk_token": {
17
- "content": "<|endoftext|>",
18
- "lstrip": false,
19
- "normalized": true,
20
- "rstrip": false,
21
- "single_word": false
22
- }
23
  }
 
1
  {
2
+ "bos_token": "<|endoftext|>",
3
+ "eos_token": "<|endoftext|>",
4
+ "unk_token": "<|endoftext|>"
5
  }
tokenizer.json CHANGED
The diff for this file is too large to render.
 
tokenizer_config.json CHANGED
@@ -1,5 +1,4 @@
1
  {
2
- "add_bos_token": false,
3
  "add_prefix_space": false,
4
  "added_tokens_decoder": {
5
  "50256": {
@@ -12,12 +11,10 @@
12
  }
13
  },
14
  "bos_token": "<|endoftext|>",
15
- "chat_template": "{% for message in messages %}{{ message.content }}{{ eos_token }}{% endfor %}",
16
- "clean_up_tokenization_spaces": true,
17
  "eos_token": "<|endoftext|>",
18
- "errors": "replace",
19
  "model_max_length": 1024,
20
- "pad_token": null,
21
  "tokenizer_class": "GPT2Tokenizer",
22
  "unk_token": "<|endoftext|>"
23
  }
 
1
  {
2
  "add_prefix_space": false,
3
  "added_tokens_decoder": {
4
  "50256": {
 
11
  }
12
  },
13
  "bos_token": "<|endoftext|>",
14
+ "clean_up_tokenization_spaces": false,
15
  "eos_token": "<|endoftext|>",
16
+ "extra_special_tokens": {},
17
  "model_max_length": 1024,
18
  "tokenizer_class": "GPT2Tokenizer",
19
  "unk_token": "<|endoftext|>"
20
  }