Mistral-7B-v0.1-NCA-ARC (LoRA adapter)
LoRA adapter (~167 MB) for golfoscar/mistral-7b-sft-arc (in-house SFT on top of mistralai/Mistral-7B-v0.1), reaching 0.6834 ± 0.0136 acc_norm on ARC-Challenge 25-shot — +6.91 pp over the pretrained Mistral-7B-v0.1 baseline (0.6143). Submitted for the Upstage AIRE LLM/Deep-Learning coding test.
A merged fp16 version is at golfoscar/mistral-7b-v0.1-nca-arc (~14 GB, self-contained).
Adapter specs
- LoRA r=16, α=32, dropout=0.05; target_modules="all-linear" (224 modules = 32 layers × 7 projections); 41.94 M trainable params (1.106 % of the 3.79 B base).
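The trainable-parameter count above can be cross-checked by hand. The sketch below assumes the public Mistral-7B-v0.1 layer shapes (hidden size 4096, MLP width 14336, key/value width 1024, 32 layers), which are not stated in this card, and uses the fact that a rank-r LoRA pair on a linear layer adds r × (in_features + out_features) weights.

```python
# Assumed Mistral-7B-v0.1 shapes (from the public base-model config, not from this card).
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS, RANK = 4096, 14336, 1024, 32, 16

# (in_features, out_features) of the seven linear projections in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Weights in one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


num_modules = LAYERS * len(PROJECTIONS)
trainable = LAYERS * sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())

print(num_modules)  # adapted modules
print(trainable)    # trainable LoRA weights
```

Under these assumptions the script gives 224 modules and 41,943,040 weights, matching the 41.94 M figure.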
Recipe summary
- Reference SFT on a curated 4-source science-MCQ corpus → golfoscar/mistral-7b-sft-arc (0.6323).
- NCA preference fine-tune with TRL's nca_pair (β=0.10) on 8,619 preference pairs → 0.6775.
- Hard-Margin NCA continuation (β=0.05, lr=1e-6, 2 epochs, this adapter) on the bottom-25% pairs by the prior model's implicit-reward margin (2,155 pairs, 70 % OpenBookQA-skewed) → 0.6834. load_best_model_at_end=True makes best and final adapters byte-identical.
The dominant lever is NCA's explicit suppression of rejected log-probabilities (8 nats over 269 steps while chosen log-prob stays stable). The hard-margin curriculum adds a smaller but consistent +0.59 pp (0.43σ, single-seed PARTIAL pass).
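To make the "explicit suppression" point concrete, here is a minimal pure-Python sketch of a pairwise NCA loss in the form I understand TRL's nca_pair to use. The function names and the per-example scalar interface are illustrative assumptions, not the training code behind this adapter.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def nca_pair_loss(policy_chosen_logp: float, policy_rejected_logp: float,
                  ref_chosen_logp: float, ref_rejected_logp: float,
                  beta: float = 0.10) -> float:
    """Pairwise NCA loss for one (chosen, rejected) pair (assumed form)."""
    # Implicit rewards: beta-scaled log-ratio between policy and reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return (
        -log_sigmoid(chosen_reward)            # push the chosen reward up
        - 0.5 * log_sigmoid(-chosen_reward)    # but keep it from running away
        - 0.5 * log_sigmoid(-rejected_reward)  # push the rejected reward down on its own
    )


# Same chosen log-prob, rejected log-prob lowered by 8 nats relative to the reference.
start = nca_pair_loss(-10.0, -10.0, -10.0, -10.0)
after = nca_pair_loss(-10.0, -18.0, -10.0, -10.0)
print(start, after)
```

Unlike a DPO-style sigmoid loss, which depends only on the difference between the two rewards, the last term here depends on the rejected reward alone, so the loss falls when the rejected log-probability drops even if the chosen one does not move. With both rewards at zero the loss equals 2 ln 2.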
Performance — ARC-Challenge 25-shot (acc_norm)
| Run | acc_norm | Δ baseline |
|---|---|---|
| Mistral-7B-v0.1 (pretrained) | 0.6143 | — |
| Reference SFT | 0.6323 | +1.80 pp |
| DPO (sigmoid, β=0.05) | 0.6544 ± 0.0139 | +4.01 pp |
| NCA (nca_pair, β=0.10) | 0.6775 ± 0.0137 | +6.32 pp |
| This adapter (NCA + hard-margin) | 0.6834 ± 0.0136 | +6.91 pp |
| Likra paper (independent pipeline, reference only) | 0.6630 | — |
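The deltas and error bars in the table can be recomputed from the accuracies alone. This sketch assumes the ± values are binomial standard errors over the 1,172 ARC-Challenge test questions and that the "0.43σ" figure divides the stage-3 gain by that standard error; both are my reading of the numbers, not something the card states.

```python
import math

N_TEST = 1172      # ARC-Challenge test questions
BASELINE = 0.6143  # pretrained Mistral-7B-v0.1 acc_norm


def binomial_stderr(acc: float, n: int = N_TEST) -> float:
    """Standard error of an accuracy estimated from n independent questions."""
    return math.sqrt(acc * (1.0 - acc) / n)


def delta_pp(acc: float, base: float = BASELINE) -> float:
    """Gain over the baseline in percentage points."""
    return (acc - base) * 100.0


def gain_in_sigma(new: float, old: float) -> float:
    """Gain of `new` over `old`, in units of the standard error of `new`."""
    return (new - old) / binomial_stderr(new)


for name, acc in [("DPO", 0.6544), ("NCA", 0.6775), ("NCA + hard-margin", 0.6834)]:
    print(f"{name}: {acc:.4f} ± {binomial_stderr(acc):.4f}, {delta_pp(acc):+.2f} pp")

print(f"stage 3 over stage 2: {gain_in_sigma(0.6834, 0.6775):.2f} sigma")
```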
How to use
Loading
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
base = AutoModelForCausalLM.from_pretrained(
"golfoscar/mistral-7b-sft-arc",
torch_dtype=torch.bfloat16, device_map="auto",
)
model = PeftModel.from_pretrained(base, "golfoscar/mistral-7b-v0.1-nca-arc-lora")
model.eval()
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", use_fast=False)
Reproducing 0.6834
pip install lm-eval==0.4.5 # GitHub repo: lm-evaluation-harness
python -m lm_eval \
--model hf \
--model_args "pretrained=golfoscar/mistral-7b-sft-arc,peft=golfoscar/mistral-7b-v0.1-nca-arc-lora,tokenizer=mistralai/Mistral-7B-v0.1,dtype=bfloat16,load_in_4bit=True,bnb_4bit_quant_type=nf4,bnb_4bit_compute_dtype=bfloat16,bnb_4bit_use_double_quant=True" \
--tasks arc_challenge \
--num_fewshot 25 \
--batch_size 64
Tested on A100 40 GB, torch==2.5.1+cu124, transformers==4.49.0, peft==0.15.2, bitsandbytes==0.43.3, lm-eval==0.4.5. On some CUDA driver / bitsandbytes combinations the live 4-bit + LoRA forward path may fail — in that case use the merged fp16 repo which packages the same parameters as a self-contained checkpoint.
Training data
Single in-house preference corpus, 8,619 (prompt, chosen, rejected) triples:
| Source | Pairs | License |
|---|---|---|
| ARC-Challenge (train+val) | 3,378 | CC-BY-SA 4.0 |
| OpenBookQA | 2,802 | Apache 2.0 |
| ARC-Easy (train+val) | 1,878 | CC-BY-SA 4.0 |
| QASC | 561 | CC-BY 4.0 |
Pairs are dropped if they have empty fields, chosen == rejected, or a prompt that exactly matches an ARC-Challenge test prompt (zero leakage). Kept pairs have word counts in [2, 50] and a length balance max/min ≤ 3.0 (guarding against the Smaug-style length-classifier failure), and are deduplicated at the exact-triple level. SciQ (CC-BY-NC license plus an empirical regression) and MedMCQA (out-of-distribution) are excluded.
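The filters can be sketched as a single pass over the triples. The function names are hypothetical, and two details are assumptions on my part: that the word-count range and the max/min ratio apply to the chosen and rejected answers (not the prompt), and that words are whitespace-separated tokens.

```python
def word_count(text: str) -> int:
    return len(text.split())


def keep_pair(prompt: str, chosen: str, rejected: str, test_prompts: set) -> bool:
    """Return True if a (prompt, chosen, rejected) triple passes every filter."""
    if not (prompt and chosen and rejected):
        return False  # empty field
    if chosen == rejected:
        return False  # no preference signal
    if prompt in test_prompts:
        return False  # exact-match leakage with a test prompt
    n_chosen, n_rejected = word_count(chosen), word_count(rejected)
    if not (2 <= n_chosen <= 50 and 2 <= n_rejected <= 50):
        return False  # word counts outside [2, 50]
    if max(n_chosen, n_rejected) / min(n_chosen, n_rejected) > 3.0:
        return False  # length imbalance a model could exploit
    return True


def filter_pairs(triples, test_prompts):
    """Apply the filters, then drop exact duplicate triples (first one wins)."""
    seen, kept = set(), []
    for triple in triples:
        triple = tuple(field.strip() for field in triple)
        if triple in seen or not keep_pair(*triple, test_prompts):
            continue
        seen.add(triple)
        kept.append(triple)
    return kept


raw = [
    ("Q1", "the sun heats water", "the moon heats water"),
    ("Q1", "the sun heats water", "the moon heats water"),       # duplicate
    ("Q2", "same answer", "same answer"),                        # chosen == rejected
    ("Q3", "", "some answer here"),                              # empty field
    ("TEST", "a valid answer", "another valid answer"),          # test leakage
    ("Q4", "yes", "no it is not"),                               # one-word answer
    ("Q5", "two words", "this answer has quite a few more words"),  # ratio 8/2 > 3
]
clean = filter_pairs(raw, test_prompts={"TEST"})
```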
Hard-margin subset (for stage 3): bottom 25 % by m = log π(chosen|x) − log π(rejected|x) under the stage-2 model (2,155 pairs; 413 with non-positive margin; 70.1 % OpenBookQA-skewed). Zero ARC-test leakage at both prompt-level and 13-gram level.
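Selecting the hard-margin subset amounts to sorting pairs by margin and keeping the lowest quarter. The sketch below takes the margins as precomputed floats; `select_hard_margin` is a hypothetical helper, and rounding the subset size to the nearest integer is an assumption chosen because it reproduces 2,155 from 8,619.

```python
def select_hard_margin(margins, fraction=0.25):
    """Indices of the `fraction` of pairs with the smallest margin, hardest first."""
    k = round(fraction * len(margins))
    order = sorted(range(len(margins)), key=lambda i: margins[i])
    return order[:k]


# Toy margins m = log pi(chosen|x) - log pi(rejected|x) for eight pairs.
margins = [0.5, -0.1, 2.0, 0.0, 1.5, 3.0, 0.2, 0.9]
hard = select_hard_margin(margins)
non_positive = sum(1 for i in hard if margins[i] <= 0.0)
print(hard, non_positive)
```

A non-positive margin means the stage-2 model still rates the rejected answer at least as likely as the chosen one, which is what the "413 with non-positive margin" count refers to.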
Training procedure
QLoRA on golfoscar/mistral-7b-sft-arc (4-bit NF4 base, compute_dtype=fp16, double-quant), LoRA r=16 / α=32 / dropout 0.05, target_modules="all-linear" (224 modules), paged_adamw_8bit optimizer, batch 2 × grad-accum 16 (effective 32), cosine schedule with warmup 0.03, max_length=512, seed=42.
| Stage | Loss | β | LR | Data | Epochs | Steps |
|---|---|---|---|---|---|---|
| NCA (stage 2) | nca_pair | 0.10 | 5e-6 | full 8,619 | 1 | 269 |
| Hard-margin NCA (stage 3, this adapter) | nca_pair | 0.05 | 1e-6 | bottom-25 % 2,155 | 2 | 122 |
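For a sense of how small the stage-3 learning rate stays, the cosine schedule with warmup 0.03 can be sketched in a few lines. This follows the usual linear-warmup-then-half-cosine shape; treating 0.03 as a ratio of total steps and rounding the warmup length up to whole steps are assumptions, and `cosine_lr` is a hypothetical helper rather than the trainer's own scheduler.

```python
import math


def cosine_lr(step: int, total_steps: int = 122, peak_lr: float = 1e-6,
              warmup_ratio: float = 0.03) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Stage-3 defaults: 122 steps, peak 1e-6.
for step in (0, 4, 63, 122):
    print(step, cosine_lr(step))
```

Under these assumptions stage 3 warms up over 4 steps, peaks at 1e-6, and decays to zero by step 122.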
Stage-3 training: 78 min on a single A100 40 GB, peak 5.51 GB VRAM, zero NaN/Inf, zero abort triggers, eval loss monotonically decreases from 1.3088 to 1.3008 across 6 evaluations.
Bias, Risks, and Limitations
- Domain-specialized: trained only on grade-school science MCQ. Math, code, chat, instruction-following, summarization are not characterized and likely worse than general-purpose Mistral fine-tunes.
- No instruction or safety alignment beyond science-MCQ preference. Do not use for safety-critical applications.
- Prompt format dependence: trained for the bare Question: {q}\nAnswer: template used by lm-evaluation-harness arc_challenge.yaml; other formats may degrade performance.
- Single seed (42): the +0.59 pp stage-3 gain over stage-2 is ~0.43σ on a single seed (PARTIAL pass, not significant at 1σ). Multi-seed replication is the documented next step.
- English only.
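Because of the prompt-format dependence noted above, it may help to build prompts with a small helper rather than by hand. `build_prompt` and `build_fewshot_prompt` are hypothetical names; the single space before each few-shot answer and the blank line between examples are assumptions about the harness defaults, not something this card specifies.

```python
def build_prompt(question: str) -> str:
    """Bare template the adapter was trained on."""
    return f"Question: {question}\nAnswer:"


def build_fewshot_prompt(shots, question: str) -> str:
    """Prepend solved (question, answer) examples, separated by blank lines (assumed)."""
    solved = [f"{build_prompt(q)} {a}" for q, a in shots]
    return "\n\n".join(solved + [build_prompt(question)])


prompt = build_fewshot_prompt(
    [("Which gas do plants absorb?", "carbon dioxide")],
    "What force pulls objects toward Earth?",
)
print(prompt)
```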
License
Released under Apache 2.0 (inherited from mistralai/Mistral-7B-v0.1). Training data: Apache 2.0 / CC-BY 4.0 / CC-BY-SA 4.0 mixture (all commercial-safe). No CC-BY-NC data was used.
Citation
@inproceedings{chen2024nca,
title={Noise Contrastive Alignment of Language Models with Explicit Rewards},
author={Chen, Huayu and others},
booktitle={NeurIPS 2024 / arXiv:2402.05369},
year={2024}
}
@article{dettmers2023qlora,
title={QLoRA: Efficient Finetuning of Quantized LLMs},
author={Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
journal={arXiv:2305.14314},
year={2023}
}
@article{clark2018arc,
title={Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
author={Clark, Peter and others},
journal={arXiv:1803.05457},
year={2018}
}
Author
- Hyeonseop Shin — hysp.shin@gmail.com / hyeonseop.shin@snu.ac.kr
- Submission for the Upstage AIRE LLM/Deep-Learning coding test, May 2026.
Evaluation results
- Normalized Accuracy (acc_norm) on AI2 Reasoning Challenge (ARC-Challenge), test set, self-reported: 0.683
- Accuracy (acc) on AI2 Reasoning Challenge (ARC-Challenge), test set, self-reported: 0.677