code-daemon-reranker-v1

A small, fast cross-encoder reranker purpose-built to re-order first-stage code-search hits for precision. It ships with the UltraCode MCP server as a TensorRT / OpenVINO / TVM engine, scoring (query, candidate) pairs after the embedding retriever has fetched a candidate pool.

A reranker is the second stage: the bi-encoder embed model (code-daemon-embed-v1) retrieves a pool fast, then this cross-encoder reads each (query, code) pair jointly and emits a single relevance logit, pulling the best match to the top. Joint attention over the pair is far more precise than the cosine of two independent vectors — at the cost of one forward pass per candidate, so it scores only a bounded pool (~64), not the whole index.

~117M params — XLM-RoBERTa 12 layers / 384 hidden, 250k multilingual SentencePiece vocab (the embedding table dominates the size).
2-input ONNX (input_ids, attention_mask; no token_type_ids) → a single relevance logit.
Max sequence 256 tokens for the concatenated (query, document) pair.
Listwise-trained — the key quality lever (below).
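
The size figure above can be sanity-checked with rough arithmetic. The sketch below assumes the standard XLM-R vocabulary size (250,002) and a feed-forward width of 4 × hidden (1536); neither is stated in this card. Biases, LayerNorm, position embeddings, and the scoring head are ignored, so treat the result as an estimate only.

```python
# Rough parameter estimate for a 12-layer / 384-hidden XLM-R style encoder.
# ASSUMPTIONS (not stated in this card): vocab = 250_002, FFN width = 4 * hidden.
VOCAB = 250_002
HIDDEN = 384
LAYERS = 12
FFN = 4 * HIDDEN


def estimate_params(vocab=VOCAB, hidden=HIDDEN, layers=LAYERS, ffn=FFN):
    """Return (embedding_params, encoder_params), weight matrices only."""
    embedding = vocab * hidden               # token embedding table
    attention = 4 * hidden * hidden          # Q, K, V and output projections
    feed_forward = 2 * hidden * ffn          # up and down projections
    encoder = layers * (attention + feed_forward)
    return embedding, encoder


embedding, encoder = estimate_params()
total = embedding + encoder
print(f"embedding ~{embedding / 1e6:.1f}M, encoder ~{encoder / 1e6:.1f}M, "
      f"total ~{total / 1e6:.1f}M, embedding share {embedding / total:.0%}")
```

Under these assumptions the embedding table accounts for roughly four fifths of the weights, which is consistent with the ~117M figure and the note that the table dominates the size.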

How it was made

Warm-started from cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 (a strong multilingual MS MARCO cross-encoder) and fine-tuned with a listwise loss — ListNet top-1 softmax cross-entropy over each query's {1 positive + ≤8 hard negatives} group. The hard negatives were mined by the code retriever nomic-ai/CodeRankEmbed from CoIR (NL→code, code-dominant 0.75 share): the documents the first stage confuses with the answer are exactly what the reranker must learn to push down. Trained on ~99k query groups, 2 epochs, on a single A100.

Why listwise (not pointwise BCE): a listwise loss optimizes the order of the whole candidate group, not each pair in isolation, so it spends capacity on the top of the list — which is all a reranker is for. Against the same model trained with pointwise BCE, listwise lifted Hit@1 +0.12 and MRR +0.10 on our golden set; recall (Hit@5/@10) is unchanged because that ceiling is set by the first-stage retriever, not the reranker.
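
To make the contrast concrete, here is a minimal sketch of the two objectives for a single query group, written in plain Python. It is not the training code: the function names, the example scores, and the absence of any temperature or batching are assumptions for illustration. With a one-hot target on the positive, the ListNet top-1 loss reduces to softmax cross-entropy over the group.

```python
import math


def listnet_top1_loss(logits, positive_index=0):
    """ListNet top-1 with a one-hot target: -log softmax(logits)[positive]."""
    m = max(logits)                                   # subtract max for stability
    log_z = m + math.log(sum(math.exp(s - m) for s in logits))
    return log_z - logits[positive_index]


def pointwise_bce_loss(logits, positive_index=0):
    """Mean binary cross-entropy, each (query, doc) pair judged in isolation."""
    total = 0.0
    for i, s in enumerate(logits):
        y = 1.0 if i == positive_index else 0.0
        # Numerically stable form of -[y*log(sigmoid(s)) + (1-y)*log(1-sigmoid(s))].
        total += max(s, 0.0) - s * y + math.log1p(math.exp(-abs(s)))
    return total / len(logits)


# Hypothetical scores for one group: index 0 is the positive, the rest are hard negatives.
close_call = [2.0, 1.9, -3.0, -3.0]   # one negative nearly ties the positive
clear_win = [2.0, -1.0, -3.0, -3.0]   # positive is well ahead of every negative

for name, group in (("close call", close_call), ("clear win", clear_win)):
    print(f"{name}: listwise={listnet_top1_loss(group):.3f} "
          f"bce={pointwise_bce_loss(group):.3f}")
```

The listwise loss depends only on score differences inside the group (adding a constant to every logit leaves it unchanged), so it trains the ordering rather than a calibrated per-pair probability.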

Built for speed

Short context (256) — a (query, code-unit) pair is short; there is no long-document path.
Bounded pool — the daemon reranks only the top ~64 fused candidates (the relevant file often sits at fused rank 30–60), then cuts back to top-k after reranking.
Runs on the iGPU. On a box with both NVIDIA + Intel, the daemon routes the reranker to the Intel iGPU (OpenVINO fp16) and keeps CUDA free for the embedding model — the two search stages run on different devices in parallel. TensorRT (NVIDIA) and TVM/Vulkan engines are bundled for boxes without an Intel iGPU.
fp16 — mmarco-format INT8 on the iGPU hits a known OpenVINO AccessViolation, so fp16 is shipped.
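
The bounded-pool step can be sketched as follows. This is an illustration under stated assumptions, not the daemon's code: `rerank_bounded`, `score_pairs`, `toy_scorer`, and the default `top_k=10` are hypothetical, and the word-overlap scorer merely stands in for the cross-encoder.

```python
def rerank_bounded(query, fused_hits, score_pairs, pool_size=64, top_k=10):
    """Score only the first `pool_size` fused hits, then cut back to `top_k`."""
    pool = list(fused_hits[:pool_size])      # anything past the pool is never scored
    if not pool:
        return []
    scores = score_pairs(query, pool)        # one score per (query, candidate) pair
    ranked = sorted(zip(scores, pool), key=lambda p: p[0], reverse=True)
    return ranked[:top_k]


def toy_scorer(query, docs):
    """Hypothetical stand-in for the cross-encoder: count of shared words."""
    q = set(query.lower().split())
    return [float(len(q & set(d.lower().split()))) for d in docs]


fused_hits = [f"unrelated snippet {i}" for i in range(100)]
fused_hits[40] = "parse json config file"     # inside the pool, deep in the fused list
fused_hits[70] = "parse json config loader"   # outside the pool, so it is never seen

top = rerank_bounded("parse json config", fused_hits, toy_scorer, top_k=3)
print(top[0])
```

A candidate at fused rank 40 can be pulled to the top, while one at rank 70 cannot be recovered, which is the trade-off behind the pool size.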

Intended use

Re-rank a candidate pool from a first-stage retriever for NL→code search (multilingual text works too). Feed (query, candidate) pairs, take the logit, sort descending. The score is a raw, unbounded logit (Identity head) — compare relatively within a query, not against a fixed threshold.

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tok  = AutoTokenizer.from_pretrained(".")            # bundled XLM-R SentencePiece
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def rerank(query, docs, max_len=256):
    """Return (logit, doc) pairs for one query, best match first."""
    if not docs:
        return []
    # Tokenize each (query, doc) pair jointly; the ONNX graph takes no token_type_ids.
    enc = tok([query] * len(docs), docs, padding=True, truncation=True,
              max_length=max_len, return_tensors="np", return_token_type_ids=False)
    logits = sess.run(None, {"input_ids":      enc["input_ids"].astype(np.int64),
                             "attention_mask": enc["attention_mask"].astype(np.int64)})[0]
    # Sort on the logit only, so tied scores never fall back to comparing docs.
    return sorted(zip(logits.reshape(-1).tolist(), docs),
                  key=lambda p: p[0], reverse=True)   # higher = more relevant
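
Because the logit is unbounded, one option for displaying scores is to normalize them within a single query's pool. The sketch below is an assumption for illustration, not part of the model or the daemon: `within_query_shares` is a hypothetical helper that applies a softmax across one query's (logit, doc) pairs.

```python
import math


def within_query_shares(scored):
    """scored: (logit, doc) pairs for ONE query. Returns (share, doc); shares sum to 1."""
    if not scored:
        return []
    m = max(s for s, _ in scored)                    # subtract max for stability
    exps = [math.exp(s - m) for s, _ in scored]
    z = sum(exps)
    return [(e / z, doc) for e, (_, doc) in zip(exps, scored)]


# Hypothetical logits for one query, already sorted best-first.
scored = [(4.2, "def load_config(path): ..."), (1.1, "class ConfigError: ..."),
          (-2.5, "def main(): ...")]
for share, doc in within_query_shares(scored):
    print(f"{share:.3f}  {doc}")
```

The shares are only meaningful relative to the other candidates of the same query; they do not make scores comparable across queries.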

What's in this repo — ready-to-run compiled engines

Named per runtime × GPU arch × OS (single-profile — no length buckets):

TensorRT: code-daemon-reranker-v1_{win_x64,linux_x64}_trt_sm_{86,89,120}.engine (NVIDIA, fp16; sm_86 ≈ RTX 30xx / A-series · sm_89 ≈ RTX 40xx / L4 · sm_120 ≈ RTX 50xx).
OpenVINO: code-daemon-reranker-v1_ov_{cpu,igpu}_fp16_b16_s256.{xml,bin} (Intel CPU / iGPU).
TVM: code-daemon-reranker-v1_*_tvm_vulkan.{dll,so} (Vulkan fallback for non-TRT / other GPUs).
Tokenizer: sentencepiece.bpe.model + tokenizer_config.json (XLM-R SP; the daemon loads it directly).
ONNX source: model.onnx, FP32 (the build source, also usable standalone with onnxruntime / optimum).
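
As an illustration of the TensorRT naming scheme, the sketch below builds the expected engine filename from an OS and a compute capability. How the daemon actually selects an engine is not described in this card; `trt_engine_name` is a hypothetical helper, and it assumes an exact match on the listed sm values with no fallback.

```python
import platform

SUPPORTED_SM = (86, 89, 120)                 # the sm values listed for the TensorRT engines
OS_TAGS = {"Windows": "win_x64", "Linux": "linux_x64"}


def trt_engine_name(sm, system=None):
    """Build the TensorRT engine filename for a compute capability and OS."""
    system = system or platform.system()     # "Windows", "Linux", ...
    os_tag = OS_TAGS.get(system)
    if os_tag is None:
        raise ValueError(f"no TensorRT engine listed for OS {system!r}")
    if sm not in SUPPORTED_SM:
        raise ValueError(f"no TensorRT engine listed for sm_{sm}")
    return f"code-daemon-reranker-v1_{os_tag}_trt_sm_{sm}.engine"


print(trt_engine_name(89, "Linux"))
```

When no TensorRT engine matches, the OpenVINO or TVM/Vulkan files are the alternatives described above.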

Evaluation

Measured on the daemon's own search-gold golden set (26 NL→code queries — its real query distribution), reranking the embed retriever's top-64 pool. Metrics are advisory; manual review of the failures is the source of truth.

metric	embed-only	+ reranker	Δ
Hit@1	0.42	0.54	+0.12
Hit@3	0.62	0.77	+0.15
Hit@5	0.73	0.77	+0.04
Hit@10	0.81	0.81	0 (recall ceiling — set by the retriever)
MRR@10	0.55	0.65	+0.10
nDCG@10	0.59	0.67	+0.08

The gains concentrate on the top of the list (Hit@1, Hit@3, MRR, nDCG), which is the listwise signature. Hit@10 is flat because a reranker can only reorder what the retriever already fetched: an answer missing from the 64-candidate pool cannot be recovered.
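
For reference, the metrics in the table can be computed from the 1-based rank of the gold item per query. This is a minimal sketch, not the daemon's evaluation script: it assumes exactly one relevant item per query (binary relevance) and uses `None` for a gold item that was not retrieved; the example ranks are made up.

```python
import math


def hit_at_k(ranks, k):
    """Fraction of queries whose gold item appears at rank <= k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr_at_k(ranks, k=10):
    """Mean reciprocal rank, counting 0 for gold items beyond rank k."""
    return sum(1.0 / r for r in ranks if r is not None and r <= k) / len(ranks)


def ndcg_at_k(ranks, k=10):
    """nDCG with one relevant item: ideal DCG is 1, so the gain is 1/log2(1 + rank)."""
    return sum(1.0 / math.log2(1 + r)
               for r in ranks if r is not None and r <= k) / len(ranks)


# Hypothetical gold-item ranks for four queries (None = not retrieved).
ranks = [1, 3, None, 2]
print(f"Hit@1={hit_at_k(ranks, 1):.2f} Hit@3={hit_at_k(ranks, 3):.2f} "
      f"MRR@10={mrr_at_k(ranks):.3f} nDCG@10={ndcg_at_k(ranks):.3f}")
```

With 26 queries, a single query moves Hit@k by about 0.04 (1/26), which is worth keeping in mind when reading the deltas.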

Performance

Backend	Hardware	Rerank latency
OpenVINO fp16	Intel iGPU (Xe)	~550 ms / 64-candidate pool (~8.6 ms/candidate)

The daemon runs the reranker on the iGPU so the NVIDIA GPU stays free for the embedding model; the TensorRT path is faster per pass where an NVIDIA GPU is used for reranking.

License & training data

Released under the MIT license (the mmarco base + XLM-R backbone are MIT/Apache; fine-tuned weights released MIT). Training-data transparency:

Source	Note
`cross-encoder/mmarco-mMiniLMv2-L12-H384-v1` (warm-start base)	trained on mMARCO, which is derived from MS MARCO (non-commercial research terms)
CoIR (NL→code hard-negative mining)	code-retrieval corpora; mixed upstream provenance
hard-neg miner `nomic-ai/CodeRankEmbed`	MIT

⚠️ The warm-start base derives from MS MARCO (non-commercial). Whether a fine-tuned model inherits dataset-use terms is legally unsettled; this is not legal advice. Retrain from a permissive base if strict compliance is required.

Attribution

Warm-started from cross-encoder/mmarco-mMiniLMv2-L12-H384-v1. Hard negatives mined with nomic-ai/CodeRankEmbed (MIT). Backbone: XLM-RoBERTa.
