Finetune loss never decreases

#36

by fahadh4ilyas - opened Jan 7

Jan 7

I tried to fine-tune this model using the diffusers repo, following the method from the blog post "Diffusers welcomes FLUX-2". However, the loss never decreased, even after 5 epochs. Is there a step I'm missing?

Here is the loss graph:

Here are the parameters:

#! /bin/bash

MLFLOW_TRACKING_URI=file:///nvme/fahadh/mlruns MLFLOW_EXPERIMENT_NAME=flux2-train accelerate launch \
  --config_file /nvme/fahadh/train-flux/config.yaml \
  /nvme/fahadh/train-flux/diffusers/examples/dreambooth/train_dreambooth_lora_flux2.py \
  --pretrained_model_name_or_path="/nvme/fahadh/models/FLUX.2-dev"  \
  --mixed_precision="bf16" \
  --gradient_checkpointing \
  --cache_latents \
  --offload \
  --remote_text_encoder \
  --caption_column="caption" \
  --dataset_name="/nvme/fahadh/datasets/image-dataset" \
  --output_dir="/nvme/fahadh/train-flux/flux2_LoRA-final" \
  --instance_prompt="" \
  --train_batch_size=2 \
  --guidance_scale=1 \
  --gradient_accumulation_steps=1 \
  --optimizer="prodigy" \
  --learning_rate=1.0 \
  --report_to="mlflow" \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=250 \
  --checkpoints_total_limit=2 \
  --num_train_epochs=5 \
  --rank=32 \
  --lora_alpha=32 \
  --lora_layers="attn.to_k,attn.to_q,attn.to_v,attn.to_out.0,attn.add_k_proj,attn.add_q_proj,attn.add_v_proj,attn.to_add_out,ff.net.0.proj,ff.net.2,ff_context.net.0.proj,ff_context.net.2" \
  --aspect_ratio_buckets="768,1376;1024,1024;1024,1536;1200,896;1376,768;1536,1024" \
  --seed="0" \
  --torch_clear_cache_step=50 \
  --bnb_quantization_config_path="/nvme/fahadh/train-flux/4bit_config.json"
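
Per-step diffusion loss is usually very noisy, because every step samples random timesteps and noise, so a raw curve can look flat even when there is a slow trend underneath. One way to check is to smooth the logged values before judging them. Below is a minimal sketch, assuming the per-step losses have been exported to a plain Python list; the function name `ema_smooth`, the `beta` value, and the sample numbers are hypothetical and not part of the training script.

```python
def ema_smooth(values, beta=0.98):
    """Return a bias-corrected exponential moving average of `values`.

    `beta` close to 1.0 smooths more aggressively. The division by
    (1 - beta ** step) removes the bias toward zero at the start.
    """
    smoothed = []
    avg = 0.0
    for step, value in enumerate(values, start=1):
        avg = beta * avg + (1.0 - beta) * value
        smoothed.append(avg / (1.0 - beta ** step))
    return smoothed


if __name__ == "__main__":
    # Hypothetical noisy losses, only to show the call.
    raw_losses = [0.52, 0.31, 0.78, 0.44, 0.36, 0.61, 0.29, 0.47]
    for step, value in enumerate(ema_smooth(raw_losses, beta=0.9), start=1):
        print(step, round(value, 4))
```

If the smoothed curve is still flat over several hundred steps, comparing sample images between checkpoints is probably a more reliable signal than the loss value alone.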

condzero

Mar 31

edited Mar 31

I'm using AdamW8bit with LR=5e-5, batch=2, resolution=1024, GRADIENT_ACCUMULATION_STEPS = 4, WEIGHTING_SCHEME = "logit_normal", LORA_RANK = 64, LORA_ALPHA = 64, and LORA_DROPOUT = 0.1. Contrary to what many people are saying, I'm doing this:

self.lr_scheduler = get_scheduler(
    "constant",
    optimizer=self.optimizer,
    num_warmup_steps=100,
    num_training_steps=self.max_train_steps,
)

I'm going for full training. You can optionally reduce rank / alpha from 64 to 32. This setup works for me on a Flux.2 Klein base 9B training run. I set max steps to 1500 and save a checkpoint every 500 steps, which makes it possible to go back to an earlier checkpoint if you think you're overfitting.
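
One thing worth checking about the call above, assuming `get_scheduler` is the helper from `diffusers.optimization` (the post does not say which one it is): as far as I know, the "constant" schedule there does not use `num_warmup_steps`, and "constant_with_warmup" is the name that ramps the learning rate up first. The sketch below reimplements the two multipliers in plain Python to show the difference, plus the effective batch size for the settings listed. The function names are hypothetical and this is not the library code itself.

```python
def constant_multiplier(step):
    """LR multiplier for a plain constant schedule: always 1.0."""
    return 1.0


def constant_with_warmup_multiplier(step, num_warmup_steps):
    """Linear ramp from 0 to 1 over the warmup steps, then constant at 1.0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    return 1.0


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of samples that contribute to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


if __name__ == "__main__":
    base_lr = 5e-5  # value from the post
    for step in (0, 50, 100, 1500):
        plain = base_lr * constant_multiplier(step)
        warm = base_lr * constant_with_warmup_multiplier(step, 100)
        print(step, plain, warm)
    # batch=2 with 4 accumulation steps on a single device
    print(effective_batch_size(2, 4))
```

With batch=2 and 4 accumulation steps, each optimizer update sees 8 samples per device, compared with 2 in the original script (batch 2, accumulation 1), which may be part of why the two runs behave differently.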
