Llama-3.1-70B-Aurora-Chat v3
LoRA fine-tune of meta-llama/Llama-3.1-70B-Instruct specialized for the
ALCF Aurora supercomputer (Intel Xeon Sapphire
Rapids + Intel GPU Max 1550 / Ponte Vecchio, oneAPI / SYCL, PBS Pro).
Off-the-shelf code LLMs hallucinate Aurora specifics: they suggest nvcc instead of
icpx -fsycl, srun / aprun instead of mpiexec, NERSC's /global/cfs instead of
/lus/flare, and CUDA device strings instead of xpu. This adapter teaches the base
model the actual Aurora toolchain, file system layout, scheduler conventions, and
recommended PyTorch/TensorFlow/SYCL idioms.
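For context, here is a minimal sketch of the kind of Aurora-style idiom the adapter is meant to prefer: selecting the `xpu` device in PyTorch rather than `cuda`. This is illustrative only (not taken from the training data) and assumes an XPU-enabled PyTorch stack, e.g. with Intel Extension for PyTorch available.

```python
# Illustrative sketch: Aurora-style PyTorch device selection ('xpu', not 'cuda').
# Assumes an XPU-enabled PyTorch build; falls back to CPU elsewhere.
import torch

try:
    import intel_extension_for_pytorch as ipex  # on older stacks this registers the XPU backend
except ImportError:
    ipex = None

device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
x = torch.randn(2048, 2048, dtype=torch.bfloat16, device=device)
print(device, (x @ x.T).shape)
```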
Model summary
| Field | Value |
|---|---|
| Base model | meta-llama/Llama-3.1-70B-Instruct |
| Format | Merged 16-bit (HuggingFace Transformers / vLLM / TGI) |
| Fine-tuning | LoRA (PEFT): r=32, α=64, dropout 0.0, 2 epochs |
| Optimizer | AdamW fused, lr 2e-4 cosine, warmup 3%, batch 1 × grad-accum 8 |
| Precision / seq-len | bf16, 1,536 tokens |
| Training data | aurora-docs-distill-multirank: 4,495 ChatML rows |
| Train loss (final) | 0.5800 |
| Hardware | Aurora node, model-parallel across 1–12 PVC tiles via HF device_map="auto", IPEX + PyTorch 2.10 XPU backend |
| Eval (53-Q Aurora, 0–5) | pending |
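For reference, a minimal PEFT-style sketch of the configuration implied by the table above. The r/α/dropout, epochs, learning rate, schedule, warmup, batch/accumulation, precision, and sequence length come from the table; the target modules and the kwargs packaging are assumptions, not the actual training script.

```python
# Sketch of the LoRA setup implied by the model-summary table (not the exact training code).
from peft import LoraConfig

lora_cfg = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not stated in the table
)

# Trainer-style hyperparameters matching the table (assumed packaging).
train_kwargs = dict(
    num_train_epochs=2,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim="adamw_torch_fused",
    bf16=True,
)
max_seq_length = 1536
```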
Quick start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("shazzadulimun/llama31-70b-aurora-chat-v3")
mdl = AutoModelForCausalLM.from_pretrained(
    "shazzadulimun/llama31-70b-aurora-chat-v3", torch_dtype=torch.bfloat16, device_map="auto"
)

msgs = [{"role": "user", "content": "How do I launch one MPI rank per Aurora GPU tile?"}]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
ids = tok(prompt, return_tensors="pt", add_special_tokens=False).to(mdl.device)  # template already adds BOS

# Greedy decoding; for technical answers keep sampling off or temperature <= 0.1.
out = mdl.generate(**ids, max_new_tokens=400, do_sample=False)
print(tok.decode(out[0][ids.input_ids.shape[1]:], skip_special_tokens=True))
```
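Since the merged 16-bit checkpoint also targets vLLM (see the format row above), here is a hedged offline-inference sketch. It assumes a recent vLLM with the chat API; the tensor-parallel degree and sampling settings are illustrative, not tested values.

```python
# Minimal vLLM offline-inference sketch for the merged checkpoint (illustrative settings).
from vllm import LLM, SamplingParams

llm = LLM(model="shazzadulimun/llama31-70b-aurora-chat-v3",
          dtype="bfloat16", tensor_parallel_size=4)  # tensor_parallel_size is an assumption
params = SamplingParams(temperature=0.0, max_tokens=400)
out = llm.chat([{"role": "user", "content": "How do I launch one MPI rank per Aurora GPU tile?"}],
               sampling_params=params)
print(out[0].outputs[0].text)
```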
Training data
Distilled from openai/gpt-oss-120b on ALCF Sophia (vLLM) over 416 cleaned chunks of
docs.alcf.anl.gov/aurora. 4,495
training rows + 562 validation rows in ChatML format with embedded
chain-of-thought (**Reasoning:** / **Answer:**).
**Broad coverage, parallel-rank distillation.** 20 worker ranks each took a disjoint slice (~21 chunks) of the cleaned docs.alcf.anl.gov/aurora corpus and asked the teacher for chain-of-thought QA pairs. Disjoint slicing maximizes phrasing diversity (each rank sees fresh context) while still covering every chunk exactly once.
Full corpus + reproduction scripts: SIslamMun/Generator @ aurora-datasets-2026-04-30.
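A minimal sketch of the disjoint per-rank slicing described above: 416 cleaned chunks split across 20 worker ranks (~21 chunks each). Chunk loading and the teacher query are placeholders; the real pipeline lives in the SIslamMun/Generator repo.

```python
# Disjoint, contiguous per-rank slicing of the cleaned doc chunks (sketch only).
NUM_RANKS = 20

def slice_for_rank(chunks, rank, num_ranks=NUM_RANKS):
    """Return this rank's disjoint slice of the chunk list."""
    per_rank = -(-len(chunks) // num_ranks)  # ceil division
    return chunks[rank * per_rank : (rank + 1) * per_rank]

chunks = [f"chunk_{i:03d}" for i in range(416)]  # stand-in for the cleaned doc chunks
covered = sum((slice_for_rank(chunks, r) for r in range(NUM_RANKS)), [])
assert sorted(covered) == sorted(chunks)          # every chunk seen exactly once
print(len(slice_for_rank(chunks, 0)))             # ~21 chunks per rank
```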
Evaluation
Part of the v3 parameter-size sweep (1B–120B, all trained on the same dataset). The holdout scorecard will appear here once the full sweep completes.
Limitations
- Synthetic-data biases. The teacher (gpt-oss-120b) can confabulate plausible-looking but incorrect commands. Treat outputs as a verifiable first draft, not as authoritative.
- Doc snapshot is fixed at 2026-04-29. Module versions, queue names, and APIs change; anything published after that date isn't reflected here.
- Aurora-only. Specifics (/lus/flare, xpu, PBS queues) won't transfer to Frontier, Polaris, or other systems.
- Use temperature ≤ 0.1 for technical answers; higher temperatures invite invented flag names and paths (see the sketch below).
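If you do enable sampling, one way to pin the recommended settings is a GenerationConfig. The temperature follows the limitation above; the remaining values (top_p, max_new_tokens) are assumptions, not from this card.

```python
# Conservative decoding per the limitation above; pass as mdl.generate(..., generation_config=gen_cfg).
from transformers import GenerationConfig

gen_cfg = GenerationConfig(do_sample=True, temperature=0.1, top_p=0.9, max_new_tokens=400)
```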
Citation
```bibtex
@misc{aurora-llms-2026,
  title  = {Llama-3.1-70B-Aurora-Chat v3},
  author = {Islam Mun, Shazzadul},
  year   = {2026},
  url    = {https://huggingface.co/shazzadulimun/llama31-70b-aurora-chat-v3},
  note   = {LoRA fine-tune of Llama-3.1-70B-Instruct; data distilled from gpt-oss-120b on docs.alcf.anl.gov/aurora}
}
```
License
Apache-2.0 for the adapter weights and synthetic training data. Source corpus is public
ALCF user documentation. Base model retains its own license; see
meta-llama/Llama-3.1-70B-Instruct.