RESEARCH BENCHMARK, NOT FOR DEPLOYMENT. This package is a synthetic counterfactual benchmark for studying ordered-temporal evidence in pre-failure detection. It is not a browser-agent guardrail and has no real-world validation.
PreFailureNet / UI-PreFail-v2
PreFailureNet / UI-PreFail-v2 is a synthetic ordered temporal counterfactual benchmark for pre-failure detection in browser-agent-like workflows. It includes shortcut probes, baseline ladders, a reference Transformer, and calibrated gate diagnostics. It is not a production browser-agent guardrail.
Headline Result vs Shallow Baseline
Sources: transformer_v2_35e_panel.json, baselines.json (entry all_frames).
| Method | AP (failure vs stable) | n_test |
|---|---|---|
| Transformer v2, 35 epochs | 0.9926 | 256 |
| All-frames logistic regression (flattened frames) | 0.8762 | 256 |
A shallow logistic regression over flattened frames already reaches AP 0.8762 on this split, so the v2 headline reflects a benchmark that does not require deep temporal modeling. See v3 below for a split where this gap closes.
v3: xor_order_color defeats v2
v3 introduces the xor_order_color family (local commit 3a323b1, file src/prefailurenet/data/synthetic.py, function _temporal_order_v3_pair). The label is the XOR of an event-order bit and a per-pair color bit painted identically into every frame, so ordered-frame linear probes lose their handle while the pair structure stays intact.
On v3 (test seed 2920, n_test=256, 128/128 pairs), the all-frames ordered LR drops from AP 0.8762 on v2 to AP 0.5486 (rounded to 0.549 in docs/v3_split_evaluation.md). That is below the project's 0.60 hard gate and within 0.05 of chance. Source: `/Projects/PreFailureNet/runs/benchmark_v3/baselines_seed2920/baselines.json` (entry `all_frames`, field `ap_failure_vs_stable`), cross-referenced in `docs/v3_split_evaluation.md`.
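The XOR rule described above can be illustrated with a minimal sketch. It is not the repository's `_temporal_order_v3_pair` implementation: the function names are hypothetical, and it assumes the order bit and the color bit are each drawn uniformly and independently. Under that assumption, neither bit alone shifts the failure rate away from 0.5, while both bits together determine the label exactly.

```python
from itertools import product

def xor_label(order_bit: int, color_bit: int) -> int:
    """Hypothetical label rule: failure iff exactly one of the two bits is set."""
    return order_bit ^ color_bit

# Enumerate the four equally likely (order_bit, color_bit) combinations (assumed uniform).
rows = [(o, c, xor_label(o, c)) for o, c in product((0, 1), repeat=2)]

def failure_rate_given(index: int, value: int) -> float:
    """Failure rate among rows where the bit at `index` (0=order, 1=color) equals `value`."""
    matching = [label for *bits, label in rows if bits[index] == value]
    return sum(matching) / len(matching)

for index, name in enumerate(("order_bit", "color_bit")):
    for value in (0, 1):
        print(f"P(failure | {name}={value}) = {failure_rate_given(index, value)}")
```

A probe that can only weight the order evidence and the color evidence additively has nothing to latch onto here, which is consistent with the near-chance AP reported above; a model has to combine the two bits multiplicatively to recover the label.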
The v2 Transformer recipe retrained on v3 flatlines:
| Field | Value |
|---|---|
| precursor_detection_ap | 0.5068 |
| intervention_auroc | 0.5002 |
Source: ~/Projects/PreFailureNet/runs/benchmark_v3/transformer_v3/metrics.json (raw values 0.5068114175400058 and 0.500152587890625).
v3 inverts the v2 narrative. v2 results should be read as a defeated benchmark, not as evidence that the v2 Transformer architecture learns temporal order.
v3 will be published as a separate HF repo or addendum once a v3-targeted model exists. Currently, no model beats the v3 baseline.
Exact Positioning
This package is a benchmark and research artifact. The correct claim is:
Ordered temporal evidence is required under synthetic counterfactual controls (v2). v3 shows this benchmark is itself defeated by an XOR-color counterfactual; the v2 headline is therefore a benchmark, not a capability claim.
Do not read this release as evidence of production browser-agent safety, real-world website generalization, or a deployable intervention gate.
What This Package Contains
- A public claims ledger.
- A compact artifact index.
- Benchmark v2 summary metrics.
- Shortcut probes and baseline ladder.
- Pair-matching and temporal-signal diagnostics.
- Reference Transformer v2 held-out test metrics.
- Risk-only gate calibration diagnostics.
- Reproduction commands.
- SHA256 checksums for all shipped artifacts (CHECKSUMS.sha256).
Raw NPZ arrays and model checkpoint weights are not included in this compact HF package. They are reproducible from the commands below.
Tested Environment And Disk Requirements
This milestone was validated on Ubuntu 22.04 / Linux 6.8, x86_64, Python 3.10.12. Use Python 3.10 for reproduction: the project metadata allows Python >=3.10, but Python 3.10 is the only version used for release validation.
CUDA is optional for the published benchmark checks. The generator, diagnostics, shortcut baselines, HF package checker, and tests run on CPU. Transformer training can use CPU or CUDA depending on the local Torch install; no CUDA, Jetson, TensorRT, browser-extension, or live agent deployment result is claimed.
Disk planning:
- Compact HF package: under 1 MB.
- Reproduced v2 NPZ splits: about 40 MB for train/calibration/test.
- Existing runs/benchmark_v2 artifact directory: about 50 MB.
- Python environment with Torch, OpenCV, scikit-learn, and Pillow: budget 2-5 GB.
Why v1 Was Flawed
The original temporal counterfactual split was useful but flawed:
- It matched state, context, action tokens, and final frame.
- The label signal lived in 1-2 mid-sequence visual patches.
- Patch position and active steps were family-structured.
- A logistic regression on mid-frame pixels reached AP 0.9249 and pair-rank 0.9875.
- The strongest v1 Transformer reached AP 0.8746, so it still lost to the mid-frame LR probe.
Conclusion: v1 was shortcut-controlled, but it rewarded localized mid-frame patch detection more than temporal reasoning. v1 remains documented as a negative result and shortcut-control scaffold.
What v2 Fixes
UI-PreFail-v2 / temporal_order_v2 changes the label rule:
- Safe and unsafe samples have the same UI template/style/distractors.
- Safe and unsafe samples have identical state, context, and action tokens.
- Safe and unsafe samples have the same final frame.
- Safe and unsafe samples have the same unordered frame bag.
- The label depends on the order of two event-bearing frames.
This makes static, final-frame, non-vision, single mid-frame, and unordered-bag probes weak by construction. Ordered frame sequences and temporal deltas still contain useful signal. (v3 closes the remaining handle. See above.)
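The pairing constraints listed above can be made concrete with a toy example. This is a sketch under stated assumptions, not the benchmark generator: frames are 4-pixel lists, the event patterns are invented, and which ordering counts as "unsafe" is arbitrary here. It shows why an order-invariant (bag) feature cannot separate the pair while an ordered concatenation can.

```python
def bag_features(frames):
    """Order-invariant summary: element-wise sum of pixels across all frames."""
    return [sum(pixels) for pixels in zip(*frames)]

def ordered_features(frames):
    """Order-aware features: frames flattened in temporal order."""
    return [value for frame in frames for value in frame]

# Hypothetical 4-pixel frames; event_a and event_b are the two event-bearing frames.
background = [0, 0, 0, 0]
event_a = [1, 0, 0, 0]
event_b = [0, 0, 1, 0]
final = [1, 0, 1, 0]

# Same frames, same final frame, only the order of the two events differs.
safe = [background, event_a, event_b, final]
unsafe = [background, event_b, event_a, final]

print("same final frame:   ", safe[-1] == unsafe[-1])
print("same unordered bag: ", sorted(safe) == sorted(unsafe))
print("same bag features:  ", bag_features(safe) == bag_features(unsafe))
print("same ordered vector:", ordered_features(safe) == ordered_features(unsafe))
```

Only the last comparison distinguishes the two sequences, which matches the claim that ordered frame sequences still carry signal in v2.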
Held-Out v2 Pair Controls
Source: data_diagnostics.json.
| Control | Held-out test result |
|---|---|
| pair count | 128 |
| state-identical pairs | 128 / 128 |
| context-identical pairs | 128 / 128 |
| action-identical pairs | 128 / 128 |
| final-frame-identical pairs | 128 / 128 |
| unordered-frame-bag-identical pairs | 128 / 128 |
Shortcut Probe Table
Source: baselines.json.
| Probe | AP | Pair-rank | Interpretation |
|---|---|---|---|
| last-state LR | 0.5000 | 0.0000 | non-vision shortcut controlled |
| all-actions LR | 0.5000 | 0.0000 | action shortcut controlled |
| last-frame LR | 0.5005 | 0.0859 | final-frame shortcut controlled |
| single mid-frame LR | 0.5524 | 0.2734 | weak, not dominant |
| all-frame bag LR | 0.5048 | 0.5781 | unordered frame bag controlled |
| mid-frames LR | 0.7746 | 0.7891 | partially order-aware |
| all-frames LR | 0.8762 | 0.8828 | order-aware linear pixels work |
| frame-deltas LR | 0.8525 | 0.8516 | temporal deltas are useful |
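The two metric columns in the table above can be sketched as follows. The repository's exact metric code is not shown in this card, so both definitions are assumptions: AP is the usual step-wise sum of precision times recall increment with tied scores sharing one threshold, and pair-rank is taken to be the fraction of matched pairs in which the unsafe sample scores strictly higher than its safe twin. Under those assumptions, a probe whose inputs are identical within each pair produces tied scores, giving AP equal to the positive rate (0.5 on a balanced split) and pair-rank 0.0, the same pattern as the last-state and all-actions rows.

```python
def average_precision(labels, scores):
    """AP = sum over score thresholds of (recall increment) * precision; ties share a threshold."""
    n_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    ap, tp, prev_recall, i = 0.0, 0, 0.0, 0
    while i < len(ranked):
        j = i
        # Consume every sample tied at the current score before updating the metric.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            j += 1
        recall, precision = tp / n_pos, tp / j
        ap += (recall - prev_recall) * precision
        prev_recall, i = recall, j
    return ap

def pair_rank(pairs):
    """Assumed definition: share of (safe_score, unsafe_score) pairs with unsafe strictly higher."""
    return sum(unsafe > safe for safe, unsafe in pairs) / len(pairs)

# A probe that cannot tell pair members apart emits the same score for both.
tied_labels, tied_scores = [1, 0, 1, 0], [0.3, 0.3, 0.3, 0.3]
print(average_precision(tied_labels, tied_scores), pair_rank([(0.3, 0.3), (0.3, 0.3)]))

# A probe that separates every pair.
print(average_precision([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2]), pair_rank([(0.1, 0.9), (0.2, 0.8)]))
```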
Reference Results
Source: transformer_v2_35e_panel.json and transformer_v2_minrecall080_fpr020.json.
| Method | AP | Macro F1 (used classes) | Pair-rank | Pair intervention success | FPR @ predicted class | Recall @ predicted class |
|---|---|---|---|---|---|---|
| Transformer v2, 35 epochs | 0.9926 | 0.9911 | 1.0000 | 0.9844 | 0.0156 | 1.0000 |
Gate calibration selected on the calibration split with min_recall=0.80, max_false_intervention_rate=0.20:
| Split | Recall | FPR | Precision |
|---|---|---|---|
| calibration | 0.9922 | 0.0000 | 1.0000 |
| held-out test | 1.0000 | 0.0078 | 0.9922 |
This is a synthetic benchmark calibration result, not a deployment threshold.
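One way a gate threshold could be chosen under the two constraints above is sketched below. This is not `scripts/calibrate_gate.py`: the function name, the toy scores, and the tie-breaking rule (take the highest threshold that satisfies both constraints) are all assumptions made for illustration.

```python
def select_threshold(scores, labels, min_recall=0.80, max_false_intervention_rate=0.20):
    """Return (threshold, recall, false_intervention_rate) for the highest feasible
    threshold on the calibration split, or None if no threshold meets both constraints."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    for threshold in sorted(set(scores), reverse=True):
        fired = [score >= threshold for score in scores]
        true_pos = sum(f and y == 1 for f, y in zip(fired, labels))
        false_pos = sum(f and y == 0 for f, y in zip(fired, labels))
        recall = true_pos / n_pos
        fir = false_pos / n_neg
        if recall >= min_recall and fir <= max_false_intervention_rate:
            return threshold, recall, fir
    return None

# Toy calibration split: five failure samples (label 1) and five stable samples (label 0).
cal_scores = [0.95, 0.90, 0.85, 0.70, 0.40, 0.60, 0.30, 0.20, 0.10, 0.05]
cal_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

print(select_threshold(cal_scores, cal_labels))
```

The selected threshold would then be frozen and applied unchanged to the held-out test split, so that the reported test recall and FPR are not tuned on test data.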
Reproduction Commands
Run from the repository root:
PYTHONPATH=src python3 scripts/generate_synthetic.py \
--domain software --split temporal_order_v2 --split-role train \
--num-samples 512 --seed 910 --output-dir runs/benchmark_v2/data
PYTHONPATH=src python3 scripts/generate_synthetic.py \
--domain software --split temporal_order_v2 --split-role calibration \
--num-samples 256 --seed 1910 --output-dir runs/benchmark_v2/data
PYTHONPATH=src python3 scripts/generate_synthetic.py \
--domain software --split temporal_order_v2 --split-role test \
--num-samples 256 --seed 2910 --output-dir runs/benchmark_v2/data
PYTHONPATH=src python3 scripts/temporal_data_diagnostics.py \
--train runs/benchmark_v2/data/software_synthetic_n512_seed910.npz \
--test runs/benchmark_v2/data/software_synthetic_n256_seed2910.npz \
--output-dir runs/benchmark_v2/diagnostics
PYTHONPATH=src python3 scripts/temporal_baselines.py \
--train runs/benchmark_v2/data/software_synthetic_n512_seed910.npz \
--test runs/benchmark_v2/data/software_synthetic_n256_seed2910.npz \
--output-dir runs/benchmark_v2/baselines
PYTHONPATH=src python3 scripts/train.py --config configs/benchmark_v2_transformer.yaml
PYTHONPATH=src python3 scripts/temporal_model_eval.py \
--config configs/benchmark_v2_transformer.yaml \
--checkpoint runs/benchmark_v2/transformer_v2/best.pt \
--name transformer_v2_35e \
--split test \
--output-dir runs/benchmark_v2/model_panels
PYTHONPATH=src python3 scripts/calibrate_gate.py \
--config configs/benchmark_v2_transformer.yaml \
--checkpoint runs/benchmark_v2/transformer_v2/best.pt \
--output runs/benchmark_v2/gate_calibration/transformer_v2_minrecall080_fpr020.json \
--min-recall 0.80 \
--max-false-intervention-rate 0.20
PYTHONPATH=src python3 scripts/build_benchmark_v2_summary.py \
--root runs/benchmark_v2 \
--output runs/benchmark_v2/benchmark_v2_summary.json
Package validation:
PYTHONPATH=src python3 scripts/check_hf_package.py
Artifact Map
| File | Purpose |
|---|---|
| claims_ledger.md | Public claim boundary. |
| assets/release_visual.svg | Compact v1 flaw to v2 ordered-event benchmark visual. |
| artifact_index.json | Machine-readable package contents. |
| benchmark_v2_summary.json | Consolidated v2 metrics and decision. |
| baselines.json | Shortcut probes and baseline ladder. |
| baselines.md | Human-readable baseline table. |
| data_diagnostics.json | Pair controls and temporal signal diagnostics. |
| data_diagnostics.md | Human-readable diagnostics. |
| transformer_v2_35e_panel.json | Reference Transformer held-out test metrics. |
| transformer_v2_minrecall080_fpr020.json | Gate calibration diagnostics. |
| CHECKSUMS.sha256 | SHA256 checksums for all shipped non-README artifacts. |
Checksums
Verify with sha256sum -c CHECKSUMS.sha256 from the repo root.
| File | SHA256 |
|---|---|
| artifact_index.json | 7fb1523a96a0a7f23c4cf343fc6f5a27d7bd5fa4a844f92304b19afcca4d396e |
| claims_ledger.md | 9e20b26b5cec972be2882f4cb1333bbc18d15335c0199c78ba682113704bfb03 |
| assets/release_visual.svg | 8ac6b51e1449f19eac5edf1d4e5ce5d74c0b0f3a259c3ef8629c0d141999bbc3 |
| baselines.json | 29979a6d89071cf05e1b7ffa5ec27ae967430f24ef47b5e8d7601aef54625a93 |
| baselines.md | 051c5426158bf0957cac905844222f38ef18ea3634ba78b47e7a6e9036adf1f8 |
| benchmark_v2_summary.json | 428c87622bcb6191c4ce2081d0543da0b9a3e4c9425c10168d9e83ba4a2cbb22 |
| data_diagnostics.json | 85c704597307aa2484bda87fd5215f35cc3217cc05d1dc8a6a978d8f1792a5a9 |
| data_diagnostics.md | 361f8077142e9704fb6d45edf14c8d5c0b0c35e926a99487d78f6c2722b2ed67 |
| transformer_v2_35e_panel.json | 016dd558e2e08fa70cbb9b6f78bdd332e69d2416be78b13dad38affa7aa1574c |
| transformer_v2_minrecall080_fpr020.json | a85a02fe164e1493ccddc55d4275951a10422957e6ec62c15587be5a781d1a90 |
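Where `sha256sum` is unavailable, the same check can be approximated in Python. The sketch below assumes CHECKSUMS.sha256 uses the standard `sha256sum` line format (hex digest, whitespace, relative path); the helper names are hypothetical and not part of the package.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest_path, root="."):
    """Check each '<hex digest>  <relative path>' line; return {relative path: matches}."""
    results = {}
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary-mode entries with a leading '*'
        target = Path(root) / name
        # A missing file counts as a failed check rather than raising.
        results[name] = target.is_file() and sha256_of(target) == expected.lower()
    return results

# Example call from the repo root (assumes the manifest is present):
# failures = [name for name, ok in verify_manifest("CHECKSUMS.sha256").items() if not ok]
```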
Verified
- v2 generator exists.
- v2 matched ordered-event pairs exist.
- v2 shortcut probes exist.
- v2 baseline ladder exists.
- Transformer v2 result exists.
- Gate calibration result exists.
- v1 flaw is documented.
- v3 XOR counterfactual split exists locally and defeats both the ordered-LR baseline and the v2 Transformer recipe (see v3 section above).
Unverified / Not Claimed
- No production browser traffic.
- No live browser-agent deployment.
- No deployable browser-agent gate.
- No real-world website generalization.
- No claim that deep temporal reasoning is required.
- No claim that this benchmark proves browser-agent safety.
Limitations
- The benchmark is synthetic.
- The UI events are rendered primitives, not production browser traces.
- The reference result is on generated data only.
- A flattened all-frames LR is strong because it sees ordered frames; v2 tests ordered temporal evidence, not uniquely deep temporal reasoning. The v3 XOR split removes even this handle and currently has no model that beats baseline.
- Gate calibration is a benchmark diagnostic, not a safety policy.
- The class schema contains unused classes in this split.
License
Apache-2.0 for the package code, artifacts, and model card. Training and test data are synthetic and generated by the reproduction commands in this repo. No third-party dataset license applies.
Next Research Steps
- Add external browser-agent traces with consent and clear provenance.
- Add stronger adversarial temporal controls where full-frame linear probes degrade (v3 is the first such step).
- Add held-out UI event grammars and templates.
- Evaluate under real browser screenshots and DOM/action logs.
- Separate benchmark scoring from intervention-policy deployment work.
- Train a v3-targeted model that actually beats v3 baseline before publishing a v3 HF repo.
Evaluation results
All values are self-reported on UI-PreFail-v2 temporal_order_v2 (seed 2910 test split, n=256):
- AP (failure vs stable): 0.993
- Macro F1 (used classes): 0.991
- Pair-rank accuracy: 1.000
- Pair intervention success: 0.984
- FPR at predicted class: 0.016
- Recall at predicted class: 1.000
- Gate test recall: 1.000
- Gate test FPR (false intervention rate): 0.008