==========================================
Continued Pretraining
==========================================
Base: unsloth/Qwen3.5-2B-Base
Corpus: /workspace/new/cpt-bartleby/
Output: staeiou/bartleby-dlo-qwen3.5-2b-cpt

✓ No local vLLM detected, proceeding with pretraining
✓ Starting continued pretraining...
BASE_MODEL=unsloth/Qwen3.5-2B-Base \
TOKENIZER_MODEL=unsloth/Qwen3.5-2B-Base \
PRETRAIN_CORPUS_DIR=/workspace/new/cpt-bartleby/ \
PRETRAIN_OUTPUT_DIR=staeiou/bartleby-dlo-qwen3.5-2b-cpt \
PRETRAIN_MAX_SEQ_LENGTH=2048 \
PRETRAIN_MIN_DOC_CHARS=500 \
PRETRAIN_MAX_FILES=0 \
PRETRAIN_PROGRESS_EVERY=25 \
PRETRAIN_LOG_EACH_FILE=0 \
PRETRAIN_TEXT_WORKERS=16 \
PRETRAIN_OCR_PDFS=1 \
PRETRAIN_OCR_LANGUAGE=eng \
PRETRAIN_CACHE_DIR=.cache/pretrain \
PRETRAIN_DISABLE_CACHE=0 \
PRETRAIN_ATTN_IMPLEMENTATION= \
PRETRAIN_CACHE_FINGERPRINT= \
PRETRAIN_BLOCK_MIN_CHARS=40 \
PRETRAIN_MIN_ALPHA_RATIO=0.55 \
PRETRAIN_MAX_SYMBOL_RATIO=0.40 \
PRETRAIN_MAX_DIGIT_RATIO=0.40 \
PRETRAIN_MAX_SHORT_LINE_RATIO=0.67 \
PRETRAIN_MAX_CODE_LINE_RATIO=0.35 \
PRETRAIN_MAX_ADJACENT_REPEAT_SPAN=4 \
PRETRAIN_MIN_DUP_LINE_CHARS=24 \
PRETRAIN_PER_DEVICE_TRAIN_BATCH_SIZE=2 \
PRETRAIN_GRADIENT_ACCUMULATION_STEPS=8 \
PRETRAIN_NUM_TRAIN_EPOCHS=4 \
PRETRAIN_LEARNING_RATE=2e-5 \
PRETRAIN_LR_SCHEDULER_TYPE=cosine \
PRETRAIN_WARMUP_RATIO=0.05 \
PRETRAIN_WEIGHT_DECAY=0.01 \
PRETRAIN_LOGGING_STEPS=10 \
PRETRAIN_SAVE_STEPS=200 \
python continued_pretrain.py
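Note: the sketch below shows one plausible way continued_pretrain.py could pick up these PRETRAIN_* knobs; the helper names are illustrative, not the script's actual code.

import os

def env_int(name, default):
    return int(os.environ.get(name, str(default)))

def env_float(name, default):
    return float(os.environ.get(name, str(default)))

# Hypothetical config parsing mirroring the invocation above.
per_device_bs = env_int("PRETRAIN_PER_DEVICE_TRAIN_BATCH_SIZE", 2)
grad_accum = env_int("PRETRAIN_GRADIENT_ACCUMULATION_STEPS", 8)
learning_rate = env_float("PRETRAIN_LEARNING_RATE", 2e-5)
max_seq_length = env_int("PRETRAIN_MAX_SEQ_LENGTH", 2048)
effective_bs = per_device_bs * grad_accum  # 2 * 8 = 16, as the banner reports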
================================================================================
BARTLEBY CONTINUED PRETRAINING
================================================================================
BASE_MODEL : unsloth/Qwen3.5-2B-Base
TOKENIZER : unsloth/Qwen3.5-2B-Base
CORPUS_DIR : /workspace/new/cpt-bartleby
OUTPUT_DIR : bartleby-cpt
MIN_DOC_CHARS: 500
PROGRESS_EVERY: 25
LOG_EACH_FILE : False
LOG_SLOW_FILES_SECONDS : 10.0
CACHE_DIR : .cache/pretrain
DISABLE_CACHE : False
CLEANING : {'block_min_chars': 40, 'min_alpha_ratio': 0.55, 'max_symbol_ratio': 0.4, 'max_digit_ratio': 0.4, 'max_short_line_ratio': 0.67, 'max_code_line_ratio': 0.35, 'max_adjacent_repeat_span': 4, 'min_dup_line_chars': 24}
ATTN_IMPL : eager
MAX_SEQ : 2048
TRAIN : bs=2 grad_accum=8 eff_bs=16
EPOCHS : 4.0
LR : 2e-05 warmup=0.05 weight_decay=0.01 scheduler=cosine
================================================================================
Corpus size: chars=250143778 approx_tokens=62535944 avg_chars_per_doc=267819
Saving extracted text cache to .cache/pretrain/bdd47a97e4dbc523/documents
Saving the dataset (1/1 shards): 100%|██████████| 934/934 [00:00<00:00, 35502.75 examples/s]
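The CLEANING thresholds in the banner suggest a block-level quality filter roughly like the sketch below. This is hypothetical; the script's real heuristics may differ in detail.

def keep_block(block: str, block_min_chars=40, min_alpha_ratio=0.55,
               max_symbol_ratio=0.40, max_digit_ratio=0.40) -> bool:
    text = block.strip()
    if len(text) < block_min_chars:
        return False  # too short to carry signal
    n = len(text)
    alpha = sum(c.isalpha() for c in text) / n
    digit = sum(c.isdigit() for c in text) / n
    symbol = sum(not (c.isalnum() or c.isspace()) for c in text) / n
    # Reject blocks dominated by OCR noise, number tables, or punctuation runs.
    return (alpha >= min_alpha_ratio
            and digit <= max_digit_ratio
            and symbol <= max_symbol_ratio)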
[1/5] Loading tokenizer...
Tokenizer load attempt: {'use_fast': True, 'trust_remote_code': True}
Token fingerprint cache_dir=.cache/pretrain/148836db8ace83d9
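The load the log reports maps onto the standard transformers call; a minimal equivalent, assuming the same kwargs as the attempt line above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "unsloth/Qwen3.5-2B-Base",
    use_fast=True,
    trust_remote_code=True,
)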
[2/5] Tokenizing documents...
tokenize:   0%|          | 0/934 [00:00<?, ? examples/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (901961 > 262144). Running this sequence through the model will result in indexing errors
tokenize: 100%|██████████| 934/934 [00:46<00:00, 20.21 examples/s]
Tokenized corpus: tokens=60877244 approx_sequences_at_max_len=29725
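The "longer than the specified maximum sequence length" warning is expected here: documents are tokenized whole, with no truncation, and only re-chunked in the packing step that follows. A sketch of that tokenize step, assuming a datasets.Dataset with a "text" column and the tokenizer from above:

def tokenize_fn(batch):
    # No truncation: even a 901961-token document is fine, because packing
    # will slice everything into 2048-token blocks in the next step.
    return tokenizer(batch["text"], truncation=False, add_special_tokens=False)

tokenized = dataset.map(tokenize_fn, batched=True, remove_columns=["text"])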
[3/5] Packing into fixed-length blocks...
pack: 100%|██████████| 934/934 [00:24<00:00, 38.10 examples/s]
Filter: 100%|██████████| 29725/29725 [00:22<00:00, 1295.70 examples/s]
Packed blocks: 29725
Saving tokenized/packed cache to .cache/pretrain/148836db8ace83d9/packed_tokens
Saving the dataset (1/1 shards): 100%|██████████| 29725/29725 [00:00<00:00, 131749.31 examples/s]
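Packing likely follows the classic group_texts pattern from the Hugging Face causal-LM examples: concatenate all token ids and slice into fixed 2048-token blocks, dropping the ragged tail. A sketch under that assumption:

from itertools import chain

BLOCK_SIZE = 2048

def pack_fn(batch):
    ids = list(chain.from_iterable(batch["input_ids"]))
    total = (len(ids) // BLOCK_SIZE) * BLOCK_SIZE  # drop the ragged tail
    blocks = [ids[i:i + BLOCK_SIZE] for i in range(0, total, BLOCK_SIZE)]
    return {"input_ids": blocks, "labels": [list(b) for b in blocks]}

packed = tokenized.map(pack_fn, batched=True,
                       remove_columns=tokenized.column_names)

The count is consistent with the corpus stats: 60877244 tokens / 2048 per block ≈ 29725 blocks, matching the figure above.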
[4/5] Loading model...
Model load attempt: transformers AutoModelForCausalLM
The fast path is not available because one of the required library is not installed. Falling back to torch implementation. To install follow https:
Loading weights: 100%|██████████| 320/320 [00:00<00:00, 5889.76it/s]
warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.
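The warmup_ratio deprecation is easy to future-proof: transformers derives the warmup step count as roughly ceil(ratio * total training steps), so the equivalent explicit setting for this run would be:

import math

total_steps = 7432                            # see the final progress bar below
warmup_steps = math.ceil(0.05 * total_steps)  # 372 steps for warmup_ratio=0.05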
[5/5] Training causal LM CPTs...
{'loss': '2.719', 'grad_norm': '0.3672', 'learning_rate': '4.839e-07', 'epoch': '0.005382'}
{'loss': '2.694', 'grad_norm': '0.3418', 'learning_rate': '1.022e-06', 'epoch': '0.01076'}
{'loss': '2.689', 'grad_norm': '0.3516', 'learning_rate': '1.559e-06', 'epoch': '0.01615'}
{'loss': '2.756', 'grad_norm': '0.375', 'learning_rate': '2.097e-06', 'epoch': '0.02153'}
[...]
{'loss': '2.315', 'grad_norm': '0.3926', 'learning_rate': '1.497e-08', 'epoch': '3.934'}
{'loss': '2.189', 'grad_norm': '0.4082', 'learning_rate': '1.264e-08', 'epoch': '3.94'}
{'loss': '2.08', 'grad_norm': '0.4023', 'learning_rate': '1.05e-08', 'epoch': '3.945'}
{'loss': '2.272', 'grad_norm': '0.3828', 'learning_rate': '8.562e-09', 'epoch': '3.951'}
{'loss': '2.337', 'grad_norm': '0.4082', 'learning_rate': '6.82e-09', 'epoch': '3.956'}
{'loss': '2.196', 'grad_norm': '0.3848', 'learning_rate': '5.276e-09', 'epoch': '3.961'}
{'loss': '2.216', 'grad_norm': '0.4062', 'learning_rate': '3.929e-09', 'epoch': '3.967'}
{'loss': '2.208', 'grad_norm': '0.4062', 'learning_rate': '2.781e-09', 'epoch': '3.972'}
{'loss': '2.185', 'grad_norm': '0.3965', 'learning_rate': '1.831e-09', 'epoch': '3.977'}
{'loss': '2.198', 'grad_norm': '0.3887', 'learning_rate': '1.078e-09', 'epoch': '3.983'}
Writing model shards: 100%|██████████| 1/1 [00:03<00:00, 3.16s/it]
{'loss': '2.265', 'grad_norm': '0.416', 'learning_rate': '5.237e-10', 'epoch': '3.988'}
{'loss': '2.237', 'grad_norm': '0.3867', 'learning_rate': '1.673e-10', 'epoch': '3.994'}
{'loss': '2.18', 'grad_norm': '0.4082', 'learning_rate': '8.911e-12', 'epoch': '3.999'}
Writing model shards: 100%|██████████| 1/1 [00:03<00:00, 3.27s/it]
{'train_runtime': '1.114e+05', 'train_samples_per_second': '1.067', 'train_steps_per_second': '0.067', 'train_loss': '2.402', 'epoch': '4'}
100%|██████████| 7432/7432 [30:57:06<00:00, 14.99s/it]
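The step count checks out against the packed dataset: with 29725 blocks, an effective batch of 16, and 4 epochs,

import math

steps_per_epoch = math.ceil(29725 / 16)  # 1858 optimizer steps per epoch
total_steps = steps_per_epoch * 4        # 7432, matching the progress bar
tokens_seen = 29725 * 2048 * 4           # ~243.5M training tokens over 4 epochs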
Saving...
Writing model shards: 100%|██████████| 1/1 [00:03<00:00, 3.22s/it]
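A quick smoke test of the saved checkpoint could look like the hypothetical snippet below. OUTPUT_DIR in the banner is bartleby-cpt, so load from there (or from the hub id if the run pushed it); the prompt is just an illustrative choice.

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bartleby-cpt")
model = AutoModelForCausalLM.from_pretrained("bartleby-cpt")
prompt = tok("I would prefer not to", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))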