SatDiff-LFM2.5-VL-450M-stage1

A LoRA fine-tune of LiquidAI/LFM2.5-VL-450M trained to emit per-claim, contract-framed evidence on Sentinel-2 imagery of regulated dams and tailings storage facilities. Built for the SatDiff submission to the Liquid AI "AI in Space" hackathon (Liquid Track, 2026).

This is the evidence-text writer component of the SatDiff pipeline. It does not perform threshold-based severity classification on its own — that work is done deterministically by a Python rules engine downstream of the model. See Stage 2 (negative result) below for the methodological reasoning.

What it does

Given (a) Sentinel-2 imagery (RGB + NIR composites for baseline + current pass) and (b) a contract memo prompt that lists per-claim evidence-sourcing rules, this model produces a structured per-claim evidence string for a 5-claim audit schema covering:

  1. Impoundment morphology (deposition asymmetry, footprint change)
  2. Pond management (pond-to-wall distance, area change, turbidity, NDWI)
  3. Retaining-wall integrity (gully count + width, NDMI on the wall face, SWIR anomaly)
  4. Deformation (declares "no SAR data available" when SAR is absent)
  5. Protected-zone encroachment (towns, residential extensions)

Headline result (held-out evaluation)

The base LFM2.5-VL-450M parrots the same three indices into every claim's evidence regardless of which physical signal the claim is about. Stage 1 fine-tuning — without any SatDiff-specific examples — teaches the model to read the per-claim sourcing rules from the prompt and cite the right diff fields.

metric (held-out backtest passes, 6 dates × 5 claims = 30 claims) base Stage 1
schema-valid passes 6/6 6/6
evidence-correct claims 0/30 (0 %) 30/30 (100 %)

The lift comes from generic VRSBench grounding, not from domain-specific examples.

Training

  • Framework: Liquid4All/leap-finetune (Ray Train + Accelerate, managed via uv). Not raw transformers + peft.
  • Base model: LiquidAI/LFM2.5-VL-450M.
  • Dataset: VRSBench (NeurIPS 2024) — 5 000 captioning + VQA samples (no [refer] grounding tasks).
  • Method: LoRA SFT, rank as configured in the recipe, 2 epochs.
  • Hardware: single RTX 4080 Laptop, 12 GB VRAM, WSL2 + CUDA 12.6.
  • Wall-clock: 38 m 15 s.
  • Eval loss: base ~3.21 → Stage 1 1.41 (−56 %).

Stage 2 (negative result, not shipped)

A second-stage fine-tune was attempted using 29 hand-authored examples (boundary, escalation, routine, catastrophic regimes) plus 17 auto-generated examples from the SatDiff Phase 2 backtest, split 35 train / 11 held-out before training. Stage 2 preserved Stage 1's 100 % evidence-correctness but worsened severity adherence on real held-out data (5 → 9 corrections by the downstream rules engine).

Diagnosis: the hand-authored cases used contrived metric values around the threshold cliffs (e.g. pond-to-wall = 24 m vs 26 m); the held-out real-data passes lived in a different distribution (pond-to-wall ≈ 9.9 m throughout the failure window). LoRA at this scale (35 examples × 3 epochs ≈ 31 effective steps) cannot reshape multi-tier threshold reasoning.

We ship Stage 1, not Stage 2. The Stage 2 checkpoint is intentionally not uploaded — it would confuse the model-card story.

This is the strongest possible validation of the SatDiff rules-engine architecture: severity, action, escalation, and overall-status are computed deterministically by phase2.aggregate.compute_severity from physical-diff numbers, regardless of what the model emits. Stage 2 attempting and failing to lift the model's threshold reasoning confirms that this work belongs in deterministic Python at our scale.

Files in this repo

  • fp16 transformers checkpoint (model.safetensors + config.json + tokenizer.json + chat_template.jinja + …) — load via transformers.AutoModelForImageTextToText.
  • GGUF pair (gguf/LFM2.5-VL-450M-stage1-Q8_0.gguf + gguf/mmproj-LFM2.5-VL-450M-stage1-Q8_0.gguf) — load via llama.cpp / llama-server. 362 MB Q8_0 backbone + 182 MB F16 mmproj = 544 MB total.

Reproducing the headline result

# 1. Pull the SatDiff repo + Stage 1 GGUF, bring up the stack:
git clone <satdiff-repo>
cd satdiff
docker compose up -d

# 2. Run the Phase 2 backtest against the Stage 1 model:
python -m phase2.cli --asset jagersfontein \
  --date-range 2021-06-15,2022-10-15 \
  --inference llama_server \
  --model WobblyDopamine/SatDiff-LFM2.5-VL-450M-stage1-Q8_0

# 3. Render the per-pass PDF audit reports:
python -m phase3 --asset jagersfontein --date-range 2021-06-15,2022-10-15

End-to-end latency on RTX 4080 Laptop (sm_89, CUDA 12.6) using the GGUF + llama-server path: 2.38 s/pass — 4.7× over the bf16 transformers path with no schema or evidence-quality regression.

Intended use

This model is the evidence-text writer for the SatDiff TSF/dam compliance pipeline. It is not a general-purpose VLM and is not a severity classifier on its own. Use it as part of the rules-engine pipeline described above.

Limitations

  • Trained on a small VRSBench slice; performance outside the SatDiff contract prompt's sourcing rules is unknown.
  • Held-out evaluation was on 6 Jagersfontein passes; transfer to other TSF/dam assets is plausible (the lift comes from generic EO grounding) but unmeasured.
  • Severity grading is performed by a downstream Python rules engine, not by this model. Do not delegate threshold reasoning to the fine-tuned weights at this scale.
  • Cloud-cover gating is upstream; the model is not robust to severe cloud occlusion.

License

This work is released under the LFM Open License v1.0, inherited from the base model LiquidAI/LFM2.5-VL-450M. See the upstream license at https://huggingface.co/LiquidAI/LFM2.5-VL-450M/blob/main/LICENSE.

Citation

@misc{satdiff2026,
  title={SatDiff: A Satellite-Readable Compliance Contract for Tailings and Dam Monitoring},
  author={Scholz, Peter},
  year={2026},
  howpublished={Liquid AI "AI in Space" Hackathon submission, Liquid Track},
  note={Fine-tune of LiquidAI/LFM2.5-VL-450M on VRSBench. Stage 1 evidence-correctness 0/30 → 30/30 on held-out tailings-dam passes. See https://huggingface.co/WobblyDopamine/SatDiff-LFM2.5-VL-450M-stage1.}
}

Acknowledgements

  • Liquid AI for the LFM2.5-VL-450M base model and the leap-finetune framework.
  • DPhi for the SimSat API and the hackathon platform.
  • Torres-Cruz & O'Donovan (2023) for the Scientific Reports reconstruction of the Jagersfontein failure that anchored this submission's validation.
Downloads last month
108
Safetensors
Model size
0.4B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for WobblyDopamine/SatDiff-LFM2.5-VL-450M-stage1

Adapter
(18)
this model