code-reranker-v1

A cross-encoder reranker for code search, trained on CodeSearchNet pairs. Experimental: it regresses retrieval quality in our benchmarks rather than improving it. Published for reproducibility.

Status: Negative Result

This reranker regresses retrieval quality on our hard eval (55 confusable function pairs):

Config                       Recall@1   Delta
No reranker                  90.9%      baseline
Web-trained cross-encoder    80.0%      -10.9pp
This model (code-trained)     9.1%      -81.8pp
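For readers checking the arithmetic, here is a minimal sketch of how Recall@1 and the percentage-point deltas relate. The hit counts (50, 44, 5 out of 55) are assumptions back-computed from the percentages above; the raw counts are not reported here.

```python
def recall_at_1(hits: int, total: int) -> float:
    """Percentage of queries whose top-ranked result is the correct one."""
    return 100.0 * hits / total


TOTAL = 55  # size of the hard eval stated above

# ASSUMPTION: hit counts inferred from the published percentages,
# not figures taken from the eval logs.
assumed_hits = {
    "No reranker": 50,
    "Web-trained cross-encoder": 44,
    "This model (code-trained)": 5,
}

baseline = recall_at_1(assumed_hits["No reranker"], TOTAL)

rows = []
for config, hits in assumed_hits.items():
    recall = recall_at_1(hits, TOTAL)
    # Delta is in percentage points relative to the no-reranker baseline.
    rows.append((config, round(recall, 1), round(recall - baseline, 1)))

for config, recall, delta in rows:
    print(f"{config:<28}{recall:>6.1f}%  {delta:+.1f}pp")
```

Under these assumed counts the rounded values match the table, which suggests each percentage is a simple fraction of the 55 eval pairs.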

Root cause: the model was trained with random same-language negatives, which are too easy for a cross-encoder to separate from the positive. It learns surface-level language patterns instead of semantic code discrimination. A V2 trained with BM25 hard negatives may fix this.
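To illustrate what BM25 hard negatives could look like for a V2, here is a minimal, self-contained sketch. It is not the training code for this model: the `BM25` class, `mine_hard_negatives`, the tokenizer, the parameters `k1=1.5` and `b=0.75`, and the toy corpus are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics, so read_json -> read, json."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Tiny Okapi BM25 scorer (illustrative, not tuned)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lengths) / len(docs)
        n = len(docs)
        df = Counter(term for tf in self.tfs for term in tf)
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, i):
        tf, length = self.tfs[i], self.lengths[i]
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue  # term absent from this document
            norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total


def mine_hard_negatives(query, positive_idx, bm25, k=2):
    """Return the k highest-scoring documents that are NOT the positive."""
    scored = [(bm25.score(query, i), i)
              for i in range(len(bm25.tfs)) if i != positive_idx]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


# Hypothetical toy corpus; index 0 is the correct answer for the query.
corpus = [
    "def read_json(path): return json.load(open(path))",
    "def write_json(path, obj): json.dump(obj, open(path, 'w'))",
    "def read_csv(path): return list(csv.reader(open(path)))",
    "def add(x, y): return x + y",
    "def sort_desc(xs): return sorted(xs, reverse=True)",
]
query = "read a json file from path"

bm25 = BM25(corpus)
hard_negatives = mine_hard_negatives(query, positive_idx=0, bm25=bm25, k=2)
print([corpus[i] for i in hard_negatives])
```

In this toy setup the mined negatives are the lexically confusable `write_json` and `read_csv` functions, while unrelated functions such as `add` score zero. A random sampler would pick those unrelated functions just as often, which is the "too easy" case described above.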

Training

  • Architecture: Cross-encoder (BERT-base)
  • Data: 50,000 CodeSearchNet pairs + 7,500 docstring pairs
  • Epochs: 3
  • Negatives: Random same-language (this was the mistake)
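The negative-sampling choice above can be sketched as follows. This is a reconstruction under stated assumptions, not the actual training script: the field names (`query`, `code`, `language`), the one-negative-per-positive ratio, and the toy pairs are hypothetical.

```python
import random


def sample_random_negative(pairs, anchor_idx, rng):
    """Pick code from a different pair written in the same language, or None."""
    language = pairs[anchor_idx]["language"]
    candidates = [i for i, p in enumerate(pairs)
                  if i != anchor_idx and p["language"] == language]
    if not candidates:
        return None  # no other pair in this language to draw from
    return pairs[rng.choice(candidates)]["code"]


def build_examples(pairs, seed=0):
    """Emit (query, code, label) triples: one positive and one random negative."""
    rng = random.Random(seed)
    examples = []
    for i, pair in enumerate(pairs):
        examples.append((pair["query"], pair["code"], 1))
        negative = sample_random_negative(pairs, i, rng)
        if negative is not None:
            examples.append((pair["query"], negative, 0))
    return examples


# Hypothetical toy data standing in for CodeSearchNet-style pairs.
pairs = [
    {"language": "python", "query": "parse json file", "code": "def load(p): ..."},
    {"language": "python", "query": "reverse a list", "code": "def rev(xs): ..."},
    {"language": "go", "query": "open tcp socket", "code": "func dial() {}"},
    {"language": "go", "query": "sum int slice", "code": "func sum() {}"},
]

examples = build_examples(pairs)
for query, code, label in examples:
    print(label, query, "->", code)
```

Because the negative is drawn uniformly from the same language, it usually has nothing to do with the query, so the model can separate positives from negatives without comparing code semantics.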

Usage (if you want to experiment)

# In cqs — NOT default, opt-in only
CQS_RERANKER_MODEL=jamie8johnson/code-reranker-v1 cqs "query" --rerank

License

Apache 2.0.
