==========================================                                                                                                                                                                                                                                                                                                                                                                                                             
Continued Pretraining                                                                                                                                                                                                                                                                                                                                                                                                                                  
==========================================                                                                                                                                                                                                                                                                                                                                                                                                             
Base: unsloth/Qwen3.5-2B-Base                                                                                                                                                                                                                                                                                                                                                                                                                          
Corpus: /workspace/new/cpt-bartleby/                                                                                                                                                                                                                                                                                                                                                                                                                   
Output: staeiou/bartleby-dlo-qwen3.5-2b-cpt                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                       
→ No local vLLM detected, proceeding with pretraining
→ Starting continued pretraining...
BASE_MODEL=unsloth/Qwen3.5-2B-Base \                                                                                                                                                                                                                                                                                                                                                                                                                   
TOKENIZER_MODEL=unsloth/Qwen3.5-2B-Base \                                                                                                                                                                                                                                                                                                                                                                                                              
PRETRAIN_CORPUS_DIR=/workspace/new/cpt-bartleby/ \
PRETRAIN_OUTPUT_DIR=staeiou/bartleby-dlo-qwen3.5-2b-cpt \
PRETRAIN_MAX_SEQ_LENGTH=2048 \                                                                                                                                                                                                                                                                                                                                                                                                                         
PRETRAIN_MIN_DOC_CHARS=500 \                                                                                                                                                                                                                                                                                                                                                                                                                           
PRETRAIN_MAX_FILES=0 \                                                                                                                                                                                                                                                                                                                                                                                                                                 
PRETRAIN_PROGRESS_EVERY=25 \                                                                                                                                                                                                                                                                                                                                                                                                                           
PRETRAIN_LOG_EACH_FILE=0 \                                                                                                                                                                                                                                                                                                                                                                                                                             
PRETRAIN_TEXT_WORKERS=16 \                                                                                                                                                                                                                                                                                                                                                                                                                             
PRETRAIN_OCR_PDFS=1 \                                                                                                                                                                                                                                                                                                                                                                                                                                  
PRETRAIN_OCR_LANGUAGE=eng \                                                                                                                                                                                                                                                                                                                                                                                                                            
PRETRAIN_CACHE_DIR=.cache/pretrain \                                                                                                                                                                                                                                                                                                                                                                                                                   
PRETRAIN_DISABLE_CACHE=0 \                                                                                                                                                                                                                                                                                                                                                                                                                             
PRETRAIN_ATTN_IMPLEMENTATION= \                                                                                                                                                                                                                                                                                                                                                                                                                        
PRETRAIN_CACHE_FINGERPRINT= \                                                                                                                                                                                                                                                                                                                                                                                                                          
PRETRAIN_BLOCK_MIN_CHARS=40 \                                                                                                                                                                                                                                                                                                                                                                                                                          
PRETRAIN_MIN_ALPHA_RATIO=0.55 \                                                                                                                                                                                                                                                                                                                                                                                                                        
PRETRAIN_MAX_SYMBOL_RATIO=0.40 \                                                                                                                                                                                                                                                                                                                                                                                                                       
PRETRAIN_MAX_DIGIT_RATIO=0.40 \                                                                                                                                                                                                                                                                                                                                                                                                                        
PRETRAIN_MAX_SHORT_LINE_RATIO=0.67 \                                                                                                                                                                                                                                                                                                                                                                                                                   
PRETRAIN_MAX_CODE_LINE_RATIO=0.35 \                                                                                                                                                                                                                                                                                                                                                                                                                    
PRETRAIN_MAX_ADJACENT_REPEAT_SPAN=4 \                                                                                                                                                                                                                                                                                                                                                                                                                  
PRETRAIN_MIN_DUP_LINE_CHARS=24 \
PRETRAIN_PER_DEVICE_TRAIN_BATCH_SIZE=2 \
PRETRAIN_GRADIENT_ACCUMULATION_STEPS=8 \                                                                                                                                                                                                                                                                                                                                                                                                               
PRETRAIN_NUM_TRAIN_EPOCHS=4 \                                                                                                                                                                                                                                                                                                                                                                                                                          
PRETRAIN_LEARNING_RATE=2e-5 \                                                                                                                                                                                                                                                                                                                                                                                                                          
PRETRAIN_LR_SCHEDULER_TYPE=cosine \                                                                                                                                                                                                                                                                                                                                                                                                                    
PRETRAIN_WARMUP_RATIO=0.05 \                                                                                                                                                                                                                                                                                                                                                                                                                           
PRETRAIN_WEIGHT_DECAY=0.01 \                                                                                                                                                                                                                                                                                                                                                                                                                           
PRETRAIN_LOGGING_STEPS=10 \                                                                                                                                                                                                                                                                                                                                                                                                                            
PRETRAIN_SAVE_STEPS=200 \                                                                                                                                                                                                                                                                                                                                                                                                                              
python continued_pretrain.py                                                                                                                                                                                                                                                                                                                                                                                                                           
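The run is driven entirely by `PRETRAIN_*` environment variables. A minimal sketch of how a script like `continued_pretrain.py` might read them (the helper functions and defaults here are illustrative assumptions, not the script's actual code; the variable names are taken from the invocation above):

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    raw = os.environ.get(name, "").strip()
    return int(raw) if raw else default

def env_float(name: str, default: float) -> float:
    """Read a float setting from the environment, falling back to a default."""
    raw = os.environ.get(name, "").strip()
    return float(raw) if raw else default

# Names match the PRETRAIN_* variables above; defaults are hypothetical.
max_seq_length = env_int("PRETRAIN_MAX_SEQ_LENGTH", 2048)
learning_rate = env_float("PRETRAIN_LEARNING_RATE", 2e-5)
ocr_pdfs = env_int("PRETRAIN_OCR_PDFS", 0) == 1  # 0/1 flags become booleans
```

Empty values (like `PRETRAIN_ATTN_IMPLEMENTATION=` above) fall through to the default, which is consistent with the summary below reporting `ATTN_IMPL : eager` despite the variable being set to an empty string.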
================================================================================
BARTLEBY CONTINUED PRETRAINING
================================================================================                                                                                                                                                                                                                                                                                                                                                                       
BASE_MODEL   : unsloth/Qwen3.5-2B-Base
TOKENIZER    : unsloth/Qwen3.5-2B-Base
CORPUS_DIR   : /workspace/new/cpt-bartleby                                                                                                                                                                                                                                                                                                                                                                                                             
OUTPUT_DIR   : bartleby-cpt                                                                                                                                                                                                                                                                                                                                                                                                                            
MIN_DOC_CHARS: 500                                                                                                                                                                                                                                                                                                                                                                                                                                     
PROGRESS_EVERY: 25                                                                                                                                                                                                                                                                                                                                                                                                                                     
LOG_EACH_FILE : False                                                                                                                                                                                                                                                                                                                                                                                                                                  
LOG_SLOW_FILES_SECONDS : 10.0                                                                                                                                                                                                                                                                                                                                                                                                                          
CACHE_DIR     : .cache/pretrain                                                                                                                                                                                                                                                                                                                                                                                                                        
DISABLE_CACHE : False                                                                                                                                                                                                                                                                                                                                                                                                                                  
CLEANING      : {'block_min_chars': 40, 'min_alpha_ratio': 0.55, 'max_symbol_ratio': 0.4, 'max_digit_ratio': 0.4, 'max_short_line_ratio': 0.67, 'max_code_line_ratio': 0.35, 'max_adjacent_repeat_span': 4, 'min_dup_line_chars': 24}                                                                                                                                                                                                                  
ATTN_IMPL     : eager                                                                                                                                                                                                                                                                                                                                                                                                                                  
MAX_SEQ      : 2048                                                                                                                                                                                                                                                                                                                                                                                                                                    
TRAIN        : bs=2 grad_accum=8 eff_bs=16                                                                                                                                                                                                                                                                                                                                                                                                             
EPOCHS       : 4.0                                                                                                                                                                                                                                                                                                                                                                                                                                     
LR           : 2e-05 warmup=0.05 weight_decay=0.01 scheduler=cosine                                                                                                                                                                                                                                                                                                                                                                                    
================================================================================
Corpus size: chars=250143778 approx_tokens=62535944 avg_chars_per_doc=267819
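The `CLEANING` thresholds in the summary above describe per-block quality filters on the extracted text. A sketch of what such a filter might look like, using a few of the printed thresholds (the function and the exact character-class definitions are assumptions for illustration, not the script's implementation):

```python
def keep_block(block: str,
               block_min_chars: int = 40,
               min_alpha_ratio: float = 0.55,
               max_symbol_ratio: float = 0.40,
               max_digit_ratio: float = 0.40) -> bool:
    """Decide whether a text block looks like prose worth pretraining on.

    Thresholds mirror the CLEANING dict printed in the run summary;
    the ratio definitions here are illustrative assumptions.
    """
    text = block.strip()
    if len(text) < block_min_chars:      # drop very short fragments
        return False
    n = len(text)
    alpha = sum(c.isalpha() or c.isspace() for c in text) / n
    digit = sum(c.isdigit() for c in text) / n
    symbol = sum(not (c.isalnum() or c.isspace()) for c in text) / n
    return (alpha >= min_alpha_ratio
            and digit <= max_digit_ratio
            and symbol <= max_symbol_ratio)
```

Filters like this are what separate OCR noise (digit tables, page furniture, code-like debris) from the running prose the model should see; the remaining thresholds (`max_short_line_ratio`, `max_code_line_ratio`, duplicate-line spans) would apply analogous line-level checks.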
Saving extracted text cache to .cache/pretrain/bdd47a97e4dbc523/documents                                                                                                                                                                                                                                                                                                                                                                              
Saving the dataset (1/1 shards): 100%|██████████| 934/934 [00:00<00:00, 35502.75 examples/s]
                                                                                                                                                                                                                                                                                                                                                                                                                                                       
[1/5] Loading tokenizer...                                                                                                                                                                                                                                                                                                                                                                                                                             
Tokenizer load attempt: {'use_fast': True, 'trust_remote_code': True}                                                                                                                                                                                                                                                                                                                                                                                  
Token fingerprint cache_dir=.cache/pretrain/148836db8ace83d9                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                                       
[2/5] Tokenizing documents...                                                                                                                                                                                                                                                                                                                                                                                                                          
tokenize:   0%|                                                                                                                                                                  | 0/934 [00:00<?, ? examples/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (901961 > 262144). Running this sequence through the model will result in indexing errors
tokenize: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 934/934 [00:46<00:00, 20.21 examples/s]                                                                                                                                                                                                                                       
Tokenized corpus: tokens=60877244 approx_sequences_at_max_len=29725                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
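The "longer than the specified maximum sequence length" warning during step 2 is expected here: documents are tokenized without truncation and only later cut into fixed blocks, so no over-length sequence ever reaches the model. The reported `approx_sequences_at_max_len=29725` is consistent with floor-dividing the 60,877,244-token corpus by a 2048-token block size (a block size inferred from the numbers, not read from the log):

```python
def approx_sequences(total_tokens: int, max_len: int) -> int:
    """Count the full fixed-length blocks a corpus yields.

    Floor division drops the partial tail block (here 444 leftover
    tokens), matching the packed-block count reported in step 3.
    """
    return total_tokens // max_len

print(approx_sequences(60_877_244, 2048))  # → 29725
```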
[3/5] Packing into fixed-length blocks...                                                                                                                                                                                                                                                                                                                                                                                                              
pack: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 934/934 [00:24<00:00, 38.10 examples/s]
Filter: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 29725/29725 [00:22<00:00, 1295.70 examples/s]
Packed blocks: 29725                                                                                                                                                                                                                                                                                                                                                                                                                                   
Saving tokenized/packed cache to .cache/pretrain/148836db8ace83d9/packed_tokens                                                                                                                                                                                                                                                                                                                                                                        
Saving the dataset (1/1 shards): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 29725/29725 [00:00<00:00, 131749.31 examples/s]                                                                                                                                                                                                                                       
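Step 3's packing can be sketched as: concatenate all tokenized documents into one id stream, cut it into equal-length blocks, and keep only full blocks (which would explain why the `Filter` pass retains all 29725). For causal-LM pretraining, labels are just a copy of the inputs; the trainer shifts them internally. Block size and field names below are assumptions, not read from the script:

```python
from itertools import chain

def pack_blocks(token_lists, block_size=2048):
    """Concatenate per-document token ids and cut them into
    fixed-length blocks, dropping the incomplete tail block."""
    ids = list(chain.from_iterable(token_lists))
    n_blocks = len(ids) // block_size
    return [
        {"input_ids": ids[i * block_size:(i + 1) * block_size],
         "labels": ids[i * block_size:(i + 1) * block_size]}
        for i in range(n_blocks)
    ]

# Toy example: 8 tokens across 3 documents pack into 2 blocks of 4;
# document boundaries are ignored, as is typical for packed pretraining.
blocks = pack_blocks([[1, 2, 3], [4, 5, 6, 7], [8]], block_size=4)
```

Saving the packed result with `datasets`' `save_to_disk` (as the log does) means reruns with the same fingerprint skip both tokenization and packing.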
                                                                                                                                                                                                                                                                                                                                                                                                                                                       
[4/5] Loading model...                                                                                                                                                                                                                                                                                                                                                                                                                                 
Model load attempt: transformers AutoModelForCausalLM                                                                                                                                                                                                                                                                                                                                                                                                  
The fast path is not available because one of the required libraries is not installed. Falling back to torch implementation. To install follow https://github.com/fla-org/flash-linear-attention#installation and https://github.com/Dao-AILab/causal-conv1d
Loading weights: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 320/320 [00:00<00:00, 5889.76it/s]                                                                                                                                                                                                                                       
warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.                                                                                                                                                                                                                                                                                                                                                                    
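The learning-rate trajectory in the logs below (rising to roughly 2e-6 in the first epoch, then decaying toward ~1e-11 by epoch 4) is consistent with linear warmup followed by cosine decay to zero, the shape produced by transformers' `get_cosine_schedule_with_warmup`. A minimal sketch of that schedule; the warmup length and peak value here are illustrative, not the run's actual settings:

```python
import math

def lr_at(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to ~0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Illustrative: 3% warmup over the run's 7432 steps.
total, warmup, peak = 7432, 223, 2e-6
```

The deprecation warning above also suggests the script passed `warmup_ratio`; converting is just `warmup_steps = int(warmup_ratio * total_steps)`.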
                                                                                                                                                                                                                                                                                                                                                                                                                                                       
[5/5] Training causal LM CPTs...                                                                                                                                                                                                                                                                                                                                                                                                         
{'loss': '2.719', 'grad_norm': '0.3672', 'learning_rate': '4.839e-07', 'epoch': '0.005382'}                                                                                                                                                                                                                                                                                                                                                            
{'loss': '2.694', 'grad_norm': '0.3418', 'learning_rate': '1.022e-06', 'epoch': '0.01076'}                                                                                                                                                                                                                                                                                                                                                             
{'loss': '2.689', 'grad_norm': '0.3516', 'learning_rate': '1.559e-06', 'epoch': '0.01615'}                                                                                                                                                                                                                                                                                                                                                             
{'loss': '2.756', 'grad_norm': '0.375', 'learning_rate': '2.097e-06', 'epoch': '0.02153'}
[...]
{'loss': '2.315', 'grad_norm': '0.3926', 'learning_rate': '1.497e-08', 'epoch': '3.934'}
{'loss': '2.189', 'grad_norm': '0.4082', 'learning_rate': '1.264e-08', 'epoch': '3.94'}                                                                                                                                                                                                                                                                                                                                                                
{'loss': '2.08', 'grad_norm': '0.4023', 'learning_rate': '1.05e-08', 'epoch': '3.945'}                                                                                                                                                                                                                                                                                                                                                                 
{'loss': '2.272', 'grad_norm': '0.3828', 'learning_rate': '8.562e-09', 'epoch': '3.951'}                                                                                                                                                                                                                                                                                                                                                               
{'loss': '2.337', 'grad_norm': '0.4082', 'learning_rate': '6.82e-09', 'epoch': '3.956'}                                                                                                                                                                                                                                                                                                                                                                
{'loss': '2.196', 'grad_norm': '0.3848', 'learning_rate': '5.276e-09', 'epoch': '3.961'}                                                                                                                                                                                                                                                                                                                                                               
{'loss': '2.216', 'grad_norm': '0.4062', 'learning_rate': '3.929e-09', 'epoch': '3.967'}                                                                                                                                                                                                                                                                                                                                                               
{'loss': '2.208', 'grad_norm': '0.4062', 'learning_rate': '2.781e-09', 'epoch': '3.972'}                                                                                                                                                                                                                                                                                                                                                               
{'loss': '2.185', 'grad_norm': '0.3965', 'learning_rate': '1.831e-09', 'epoch': '3.977'}                                                                                                                                                                                                                                                                                                                                                               
{'loss': '2.198', 'grad_norm': '0.3887', 'learning_rate': '1.078e-09', 'epoch': '3.983'}                                                                                                                                                                                                                                                                                                                                                               
Writing model shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:03<00:00,  3.16s/it]
{'loss': '2.265', 'grad_norm': '0.416', 'learning_rate': '5.237e-10', 'epoch': '3.988'}
{'loss': '2.237', 'grad_norm': '0.3867', 'learning_rate': '1.673e-10', 'epoch': '3.994'}                                                                                                                                                                                                                                                                                                                                                               
{'loss': '2.18', 'grad_norm': '0.4082', 'learning_rate': '8.911e-12', 'epoch': '3.999'}                                                                                                                                                                                                                                                                                                                                                                
Writing model shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:03<00:00,  3.27s/it]                                                                                                                                                                                                                                                                
{'train_runtime': '1.114e+05', 'train_samples_per_second': '1.067', 'train_steps_per_second': '0.067', 'train_loss': '2.402', 'epoch': '4'}                                                                                                                                                                                                                                                                                                            
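The final stats are internally consistent and imply an effective batch size of 16 (an inference from the numbers, not a logged setting): 29725 packed blocks Γ— 4 epochs = 118,900 samples over 7432 steps, with ⌈29725/16βŒ‰ = 1858 steps per epoch, and 118,900 samples over the ~111,400 s runtime gives the reported 1.067 samples/s. A quick arithmetic check:

```python
import math

blocks, epochs, steps, runtime_s = 29_725, 4, 7_432, 1.114e5

batch = 16  # inferred effective batch size
steps_per_epoch = math.ceil(blocks / batch)
assert steps_per_epoch * epochs == steps  # 1858 * 4 == 7432

samples = blocks * epochs
print(round(samples / runtime_s, 3))  # samples/s, matches the logged 1.067
print(round(steps / runtime_s, 3))    # steps/s, matches the logged 0.067
```

The runtime also matches the progress bar's wall clock: 111,400 s β‰ˆ 30 h 57 min.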
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 7432/7432 [30:57:06<00:00, 14.99s/it]                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                       
Saving...                                                                                                                                                                                                                                                                                                                                                                                                                                              
Writing model shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:03<00:00,  3.22s/it]