code-daemon-reranker-v1
A small, fast cross-encoder reranker purpose-built to re-order first-stage code-search hits for precision. It ships with the UltraCode MCP server as a TensorRT / OpenVINO / TVM engine, scoring (query, candidate) pairs after the embedding retriever has fetched a candidate pool.
A reranker is the second stage: the bi-encoder embed model
(code-daemon-embed-v1) retrieves a pool fast,
then this cross-encoder reads each (query, code) pair jointly and emits a single relevance logit,
pulling the best match to the top. Joint attention over the pair is far more precise than the cosine of
two independent vectors — at the cost of one forward pass per candidate, so it scores only a bounded
pool (~64), not the whole index.
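The two-stage flow above can be sketched as follows. This is a toy illustration under stated assumptions, not the daemon's code: `retrieve`, `rerank_pool`, and `score_pair` are hypothetical names, the vectors are made-up data, and `score_pair` stands in for one cross-encoder forward pass.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, pool_size=64):
    """Stage 1: rank every indexed doc by cosine, keep a bounded pool."""
    ranked = sorted(index, key=lambda d: cosine(query_vec, index[d]), reverse=True)
    return ranked[:pool_size]

def rerank_pool(query, pool, score_pair, top_k=10):
    """Stage 2: one scorer call per candidate, sort by logit, cut to top_k."""
    scored = [(score_pair(query, doc_id), doc_id) for doc_id in pool]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]

# Toy data: the retriever prefers a.py, the (stand-in) cross-encoder prefers b.py.
index = {"a.py": [1.0, 0.0], "b.py": [0.9, 0.1], "c.py": [0.0, 1.0]}
pool = retrieve([1.0, 0.0], index, pool_size=2)
toy_logits = {"a.py": 0.2, "b.py": 1.5}  # hypothetical cross-encoder outputs
top = rerank_pool("parse config", pool, lambda q, d: toy_logits[d], top_k=1)
```

The cost asymmetry is visible in the shapes: stage 1 compares precomputed vectors, while stage 2 calls the scorer once per pooled candidate, which is why the pool is capped.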
- ~117M params — XLM-RoBERTa 12 layers / 384 hidden, 250k multilingual SentencePiece vocab (the embedding table dominates the size).
- 2-input ONNX (input_ids, attention_mask; no token_type_ids) → a single relevance logit.
- Max sequence 256 tokens for the concatenated (query, document) pair.
- Listwise-trained — the key quality lever (below).
How it was made
Warm-started from cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
(a strong multilingual MS MARCO cross-encoder) and fine-tuned with a listwise loss — ListNet
top-1 softmax cross-entropy over each query's {1 positive + ≤8 hard negatives} group. The hard
negatives were mined by the code retriever
nomic-ai/CodeRankEmbed from CoIR (NL→code,
code-dominant 0.75 share): the documents the first stage confuses with the answer are exactly what the
reranker must learn to push down. Trained on ~99k query groups, 2 epochs, on a single A100.
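A minimal sketch of how such a training group might be assembled, assuming the retriever's ranked output for each query is already available. `mine_hard_negatives` and `build_group` are hypothetical helpers; any extra filtering the actual pipeline applied (for example, removing unlabeled true positives) is not described in this card and is not modeled here.

```python
def mine_hard_negatives(ranked_doc_ids, positive_id, max_negatives=8):
    """Keep the highest-ranked retrieved docs that are not the labeled positive."""
    negatives = [d for d in ranked_doc_ids if d != positive_id]
    return negatives[:max_negatives]

def build_group(query, ranked_doc_ids, positive_id, max_negatives=8):
    """One listwise group: the positive first, then up to max_negatives hard negatives."""
    negatives = mine_hard_negatives(ranked_doc_ids, positive_id, max_negatives)
    return {"query": query, "docs": [positive_id] + negatives, "positive_index": 0}

# Toy retriever output: the positive d7 was ranked second, behind the confusable d3.
group = build_group("read a json file", ["d3", "d7", "d1", "d9"],
                    positive_id="d7", max_negatives=2)
```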
Why listwise (not pointwise BCE): a listwise loss optimizes the order of the whole candidate group, not each pair in isolation, so it spends capacity on the top of the list — which is all a reranker is for. Against the same model trained with pointwise BCE, listwise lifted Hit@1 +0.12 and MRR +0.10 on our golden set; recall (Hit@5/@10) is unchanged because that ceiling is set by the first-stage retriever, not the reranker.
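To make the contrast concrete, here is a sketch of both losses for a single query group, assuming a one-hot target on the positive (in which case ListNet top-1 reduces to softmax cross-entropy over the group). The function names are illustrative; this is not the training code.

```python
import math

def listnet_top1_loss(logits, positive_index):
    """Softmax cross-entropy over one query group with a one-hot target."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[positive_index]

def pointwise_bce_loss(logits, positive_index):
    """Mean binary cross-entropy; each (query, doc) pair is scored on its own."""
    total = 0.0
    for i, x in enumerate(logits):
        y = 1.0 if i == positive_index else 0.0
        # Stable BCE-with-logits: softplus(x) - y * x
        total += max(x, 0.0) + math.log1p(math.exp(-abs(x))) - y * x
    return total / len(logits)

# Same ordering, all logits shifted by +10.
base, shifted = [2.0, 1.0, 0.0], [12.0, 11.0, 10.0]
listwise_gap = abs(listnet_top1_loss(base, 0) - listnet_top1_loss(shifted, 0))
pointwise_gap = abs(pointwise_bce_loss(base, 0) - pointwise_bce_loss(shifted, 0))
```

The listwise loss depends only on the gaps between logits within the group, so `listwise_gap` is zero up to rounding, while the pointwise loss also penalizes the absolute level of each logit. That is one way to see why the listwise objective spends its capacity on ordering.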
Built for speed
- Short context (256) — a (query, code-unit) pair is short; there is no long-document path.
- Bounded pool — the daemon reranks only the top ~64 fused candidates (the relevant file often sits at fused rank 30–60), then cuts back to top-k after reranking.
- Runs on the iGPU. On a box with both NVIDIA + Intel, the daemon routes the reranker to the Intel iGPU (OpenVINO fp16) and keeps CUDA free for the embedding model — the two search stages run on different devices in parallel. TensorRT (NVIDIA) and TVM/Vulkan engines are bundled for boxes without an Intel iGPU.
- fp16 — mmarco-format INT8 on the iGPU hits a known OpenVINO AccessViolation, so fp16 is shipped.
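The routing rule in the list above could look roughly like this. It is a sketch only: the device tags and return values are hypothetical labels, device detection is out of scope, and the final CPU fallback and its position in the order are assumptions rather than documented daemon behavior.

```python
def pick_rerank_backend(devices):
    """Choose a reranker engine from a set of detected device tags (hypothetical tags)."""
    if "intel_igpu" in devices:
        return "openvino_igpu_fp16"  # keeps the NVIDIA GPU free for the embedding model
    if "nvidia" in devices:
        return "tensorrt_fp16"
    if "vulkan" in devices:
        return "tvm_vulkan"
    return "openvino_cpu_fp16"       # assumed last resort; ordering is a guess

mixed_box = pick_rerank_backend({"nvidia", "intel_igpu"})
nvidia_only = pick_rerank_backend({"nvidia"})
```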
Intended use
Re-rank a candidate pool from a first-stage retriever for NL→code search (multilingual text works too). Feed (query, candidate) pairs, take the logit, sort descending. The score is a raw, unbounded logit (Identity head) — compare relatively within a query, not against a fixed threshold.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(".")  # bundled XLM-R SentencePiece
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def rerank(query, docs, max_len=256):
    # Tokenize each (query, doc) pair jointly; the model takes no token_type_ids.
    enc = tok([query] * len(docs), docs, padding=True, truncation=True,
              max_length=max_len, return_tensors="np", return_token_type_ids=False)
    logits = sess.run(None, {"input_ids": enc["input_ids"].astype(np.int64),
                             "attention_mask": enc["attention_mask"].astype(np.int64)})[0]
    # One logit per pair, sorted so that higher = more relevant.
    return sorted(zip(logits.reshape(-1).tolist(), docs), reverse=True)
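Because the logit is only meaningful relative to other candidates of the same query, a caller that wants a bounded display score could normalize within the pool. The helper below is a hypothetical add-on, not part of the model or the daemon, and it takes plain lists of logits so it runs without the model.

```python
import math

def within_query_softmax(logits):
    """Map one query's raw logits to probabilities that sum to 1 over its pool."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Two queries whose candidates have the same gaps but different absolute levels.
q1 = within_query_softmax([4.0, 2.0, 1.0])
q2 = within_query_softmax([-6.0, -8.0, -9.0])
```

A fixed threshold such as `logit > 0` would accept every candidate of the first query and none of the second, even though both pools are ranked with identical confidence gaps; after normalization the two score lists coincide.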
What's in this repo — ready-to-run compiled engines
Named per runtime × GPU arch × OS (single-profile — no length buckets):
- TensorRT: code-daemon-reranker-v1_{win_x64,linux_x64}_trt_sm_{86,89,120}.engine (NVIDIA, fp16; sm_86 ≈ RTX 30xx / A-series · sm_89 ≈ RTX 40xx / L4 · sm_120 ≈ RTX 50xx).
- OpenVINO: code-daemon-reranker-v1_ov_{cpu,igpu}_fp16_b16_s256.{xml,bin} (Intel CPU / iGPU).
- TVM: code-daemon-reranker-v1_*_tvm_vulkan.{dll,so} (Vulkan fallback for non-TRT / other GPUs).
- Tokenizer: sentencepiece.bpe.model + tokenizer_config.json (XLM-R SP; the daemon loads it directly).
- ONNX source: model.onnx, FP32 (the build source, also usable standalone with onnxruntime / optimum).
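For scripts that need to pick a file from this repo, the naming patterns for the TensorRT and OpenVINO engines can be expanded as below. The helper names are hypothetical, and the TVM pattern is left out because its wildcard portion is not spelled out in the listing.

```python
PREFIX = "code-daemon-reranker-v1"
TRT_OS_TAGS = ("win_x64", "linux_x64")
TRT_SM_ARCHS = (86, 89, 120)
OV_DEVICES = ("cpu", "igpu")

def trt_engine_name(os_tag, sm):
    """Expand the TensorRT pattern for one OS tag and one SM architecture."""
    if os_tag not in TRT_OS_TAGS:
        raise ValueError(f"no TensorRT engine listed for OS tag {os_tag!r}")
    if sm not in TRT_SM_ARCHS:
        raise ValueError(f"no TensorRT engine listed for sm_{sm}")
    return f"{PREFIX}_{os_tag}_trt_sm_{sm}.engine"

def openvino_ir_names(device):
    """Return the (.xml, .bin) pair for one OpenVINO target device."""
    if device not in OV_DEVICES:
        raise ValueError(f"no OpenVINO engine listed for device {device!r}")
    stem = f"{PREFIX}_ov_{device}_fp16_b16_s256"
    return stem + ".xml", stem + ".bin"

rtx40_linux = trt_engine_name("linux_x64", 89)
igpu_xml, igpu_bin = openvino_ir_names("igpu")
```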
Evaluation
Measured on the daemon's own search-gold golden set (26 NL→code queries — its real query
distribution), reranking the embed retriever's top-64 pool. Metrics are advisory; manual review of the
failures is the source of truth.
| metric | embed-only | + reranker | Δ |
|---|---|---|---|
| Hit@1 | 0.42 | 0.54 | +0.12 |
| Hit@3 | 0.62 | 0.77 | +0.15 |
| Hit@5 | 0.73 | 0.77 | +0.04 |
| Hit@10 | 0.81 | 0.81 | 0 (recall ceiling — set by the retriever) |
| MRR@10 | 0.55 | 0.65 | +0.10 |
| nDCG@10 | 0.59 | 0.67 | +0.08 |
The gains concentrate on the top of the list (Hit@1, MRR, nDCG) — the listwise signature. Hit@10 is flat because a reranker can only reorder what the retriever already fetched.
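For readers who want to reproduce numbers of this kind on their own pools, the metrics can be computed per query as sketched below and then averaged. The sketch assumes exactly one relevant document per query with binary relevance; whether the golden set follows that convention is not stated here.

```python
import math

def hit_at_k(ranked, relevant, k):
    """1.0 if the relevant doc appears in the top k, else 0.0."""
    return 1.0 if relevant in ranked[:k] else 0.0

def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the relevant doc, or 0.0 if it is outside the top k."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc == relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """With a single relevant doc the ideal DCG is 1, so nDCG = 1 / log2(rank + 1)."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc == relevant:
            return 1.0 / math.log2(rank + 1)
    return 0.0

before = ["b", "a", "c"]   # retriever order: relevant doc "a" at rank 2
after = ["a", "b", "c"]    # reranked order: "a" promoted to rank 1
missing = ["b", "c", "d"]  # "a" never fetched, so no reordering can help
```

Promoting the relevant document from rank 2 to rank 1 moves Hit@1 from 0 to 1 and MRR from 0.5 to 1, while a pool that never contained it scores 0 on every metric regardless of order.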
Performance
| Backend | Hardware | Rerank latency |
|---|---|---|
| OpenVINO fp16 | Intel iGPU (Xe) | ~550 ms / 64-candidate pool (~8.6 ms/candidate) |
The daemon runs the reranker on the iGPU so the NVIDIA GPU stays free for the embedding model; the TensorRT path is faster per pass where an NVIDIA GPU is used for reranking.
License & training data
Released under the MIT license (the mmarco base + XLM-R backbone are MIT/Apache; fine-tuned weights released MIT). Training-data transparency:
| Source | Note |
|---|---|
| cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 (warm-start base) | mMARCO ← MS MARCO → non-commercial research terms |
| CoIR (NL→code hard-negative mining) | code-retrieval corpora; mixed upstream provenance |
| hard-neg miner nomic-ai/CodeRankEmbed | MIT |
⚠️ The warm-start base derives from MS MARCO (non-commercial). Whether a fine-tuned model inherits dataset-use terms is legally unsettled; this is not legal advice. Retrain from a permissive base if strict compliance is required.
Attribution
Warm-started from cross-encoder/mmarco-mMiniLMv2-L12-H384-v1. Hard negatives mined with nomic-ai/CodeRankEmbed (MIT). Backbone: XLM-RoBERTa.