Hebrew Semantic Retrieval – 3rd Place Solution
Competition: Hebrew Semantic Retrieval Challenge by MAFAT DDR&D (Directorate of Defense Research & Development) in partnership with the Israel National NLP Program
Result: 🥉 3rd place – NDCG@20 = 0.652538 (private test set) · 0.432286 (public test set)
Author: kdbrodt
Overview
This repository contains the complete inference code and fine-tuned models for the 3rd-place solution to the Hebrew Semantic Retrieval Challenge. The challenge tasked participants with ranking Hebrew paragraphs from a 127,731-passage corpus in response to natural-language Hebrew queries, evaluated by NDCG@20.
The solution is a clean, end-to-end two-stage retrieve-then-rerank pipeline built entirely on the BAAI BGE-M3 model family, fine-tuned with the FlagEmbedding framework. Both the dense embedder and the cross-encoder reranker were fine-tuned directly on the competition's annotated Hebrew data.
The Challenge
| Property | Detail |
|---|---|
| Organizer | MAFAT DDR&D + Israel National NLP Program |
| Corpus size | 127,731 Hebrew paragraphs |
| Data sources | Hebrew Wikipedia, Kol-Zchut (legal/civil-rights), Knesset committee protocols |
| Evaluation metric | NDCG@20 |
| Phase I | Public leaderboard (Codabench) |
| Phase II | Private test set with additional human annotation of previously unseen retrievals |
| Relevance scale | 0–4 (human annotated) |
Solution Architecture
A straightforward two-stage pipeline: dense retrieval followed by cross-encoder reranking.
```
Query
  │
  ▼
[BGE-M3 Dense Retriever]        (fine-tuned, CLS pooling, FP16)
  │   cosine similarity over 127k passages
  ▼
Top-100 Candidates
  │
  ▼
[BGE-Reranker-v2-M3]            (fine-tuned binary classifier, FP16)
  │   query–passage pairs scored, max_length=512
  ▼
Final Top-20 Results
```
Stage 1 – Dense Retrieval
The fine-tuned bge-m3 encoder produces CLS-token embeddings (L2-normalized, FP16) for all corpus passages at preprocessing time. At query time, a single query embedding is computed and scored against all corpus embeddings via dot-product similarity (equivalent to cosine similarity on normalized vectors). The top-100 passages are selected for reranking.
| Property | Value |
|---|---|
| Model | test_encoder_only_base_bge_m3_new1 (fine-tuned BAAI/bge-m3) |
| Pooling | CLS token |
| Normalization | L2 |
| Precision | FP16 |
| Max length | 512 tokens |
| Batch size (corpus) | 64 |
| Retrieval pool | Top-100 candidates |
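For illustration, the Stage 1 scoring step looks roughly like the sketch below. This is a minimal sketch, not the exact code in model.py: the `embed` helper, the `corpus_emb` matrix, and the `query` variable are illustrative names, and it assumes the checkpoint loads as a standard Hugging Face encoder.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_DIR = "models/test_encoder_only_base_bge_m3_new1"  # fine-tuned BGE-M3

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModel.from_pretrained(MODEL_DIR, torch_dtype=torch.float16).cuda().eval()

@torch.no_grad()
def embed(texts):
    """CLS-pooled, L2-normalized embeddings for a batch of texts."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt").to("cuda")
    cls = model(**batch).last_hidden_state[:, 0]        # CLS-token pooling
    cls = torch.nn.functional.normalize(cls, dim=-1)    # L2 normalization
    return cls.float().cpu().numpy()

# corpus_emb: (127731, dim) matrix built once at preprocessing time
# (corpus encoded in batches of 64); `query` is the incoming query string.
scores = embed([query])[0] @ corpus_emb.T   # dot product == cosine on unit vectors
top100 = np.argsort(-scores)[:100]          # candidate indices passed to Stage 2
```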
Stage 2 – Cross-Encoder Reranking
The top-100 candidates are re-scored by the fine-tuned bge-reranker-v2-m3, a sequence classification model that takes concatenated [query, passage] pairs as input and outputs a relevance logit. Passages are sorted by length before scoring to minimize padding overhead. The top-20 by reranker score are returned.
| Property | Value |
|---|---|
| Model | test_encoder_only_base_bge_reranker_v2_m3_new1 (fine-tuned BAAI/bge-reranker-v2-m3) |
| Max length | 512 tokens |
| Batch size | 16 |
| Output | Top-20 by reranker logit |
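A minimal sketch of the batched reranking loop, assuming the standard Hugging Face sequence-classification head with a single relevance logit; the `rerank` helper and its names are illustrative, not the exact code in model.py.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

RERANKER_DIR = "models/test_encoder_only_base_bge_reranker_v2_m3_new1"

tok = AutoTokenizer.from_pretrained(RERANKER_DIR)
reranker = (AutoModelForSequenceClassification
            .from_pretrained(RERANKER_DIR, torch_dtype=torch.float16)
            .cuda().eval())

@torch.no_grad()
def rerank(query, passages, batch_size=16):
    """Score [query, passage] pairs; a higher logit means more relevant."""
    # Sort by passage length so each batch has similar lengths (less padding).
    order = sorted(range(len(passages)), key=lambda i: len(passages[i]))
    scores = [0.0] * len(passages)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batch = tok([query] * len(idx), [passages[i] for i in idx],
                    padding=True, truncation=True, max_length=512,
                    return_tensors="pt").to("cuda")
        logits = reranker(**batch).logits.squeeze(-1)   # one relevance logit per pair
        for i, s in zip(idx, logits.tolist()):
            scores[i] = s
    return scores

# Keep the 20 best of the 100 candidates by reranker score:
# top20 = sorted(range(len(scores)), key=lambda i: -scores[i])[:20]
```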
Fine-Tuning
Both models were fine-tuned on the competition's annotated Hebrew training set using the FlagEmbedding framework.
Training data construction:
- Every query–document pair with a positive relevance score (> 0) was treated as a positive example.
- Every pair with a score of 0 was treated as a negative example.
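The sketch below illustrates this split rule, assuming a FlagEmbedding-style JSONL training format (`{"query": ..., "pos": [...], "neg": [...]}`). The actual conversion is done by prepare.py; the row layout and file name here are illustrative.

```python
import json
from collections import defaultdict

# Illustrative rows (query, passage, relevance 0-4); real loading is in prepare.py.
annotations = [
    ("q1", "relevant passage", 3),
    ("q1", "irrelevant passage", 0),
]

# Split rule: relevance > 0 -> positive, relevance == 0 -> negative.
pos, neg = defaultdict(list), defaultdict(list)
for query, passage, relevance in annotations:
    (pos if relevance > 0 else neg)[query].append(passage)

# One JSONL record per query with its positives and negatives.
with open("train_data.jsonl", "w", encoding="utf-8") as f:
    for query in pos:
        record = {"query": query, "pos": pos[query], "neg": neg.get(query, [])}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```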
Embedder (bge-m3): Trained with KL-divergence loss to produce embeddings that better separate relevant from irrelevant documents.
Reranker (bge-reranker-v2-m3): Trained as a binary classifier on the same positive/negative pairs, learning to predict relevance probability directly.
| Hyperparameter | Value |
|---|---|
| Epochs | 2 |
| Batch size per device | 2 |
| Learning rate | 5e-6 |
| Hardware | 2 × NVIDIA Tesla V100-SXM2-32GB |
| Training time | ~1 hour |
Included Models (fine-tuned)
| Path in repo | Base model | Fine-tuning |
|---|---|---|
| `models/test_encoder_only_base_bge_m3_new1/` | `BAAI/bge-m3` | KL-divergence loss on competition data ✨ |
| `models/test_encoder_only_base_bge_reranker_v2_m3_new1/` | `BAAI/bge-reranker-v2-m3` | Binary classification on competition data ✨ |
Repository Structure
```
model.py     – Full inference pipeline (preprocess + predict)
prepare.py   – Data preparation script
train.sh     – Training script
models/
  test_encoder_only_base_bge_m3_new1/              – Fine-tuned BGE-M3 embedder ✨
  test_encoder_only_base_bge_reranker_v2_m3_new1/  – Fine-tuned BGE reranker ✨
```
Usage
The pipeline exposes two functions matching the competition API:
```python
from model import preprocess, predict

# Build corpus index (run once)
# corpus_dict: {doc_id: {"passage": "..."}, ...}
preprocessed = preprocess(corpus_dict)

# Query at inference time ("What are the rights of apartment renters?")
results = predict({"query": "מה הזכויות של שוכרי דירה?"}, preprocessed)
# Returns: [{"paragraph_uuid": "...", "score": 1.23}, ...] (top-20)
```
Requirements:

```
torch
transformers
numpy
```
Hardware: A CUDA-capable GPU is required. Inference takes less than 1.5 hours on a g5.xlarge instance.
Reproducing the Models
1. Prepare data:

   ```bash
   # Download the competition data and unzip it into the `hsrc/` folder first
   python prepare.py
   ```

2. Train:

   ```bash
   sh ./train.sh
   ```

   Training takes ~1 hour on 2 × V100-SXM2-32GB GPUs.
Technical Notes
- Both models are loaded in FP16 via `torch_dtype=torch.float16`, with `device_map` for automatic GPU placement.
- Corpus passages are sorted by length before embedding to reduce padding overhead during batch encoding.
- The reranker also sorts candidates by passage length before scoring batches.
- Fallback: if reranking fails, the pipeline falls back to returning the top-20 by dense retrieval score.
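The fallback amounts to a simple guard around Stage 2. The helpers below are hypothetical stand-ins for the real functions in model.py, shown only to illustrate the control flow:

```python
# Hypothetical stand-ins for the real pipeline pieces in model.py:
#   dense_top100(query, preprocessed) -> list of (uuid, dense_score), best first
#   rerank_top20(query, candidates)   -> list of (uuid, reranker_score), best first

def predict_with_fallback(query, preprocessed):
    candidates = dense_top100(query, preprocessed)   # Stage 1: dense retrieval
    try:
        return rerank_top20(query, candidates)       # Stage 2: cross-encoder rerank
    except Exception:
        # Fallback: if reranking fails, return the dense top-20 as-is.
        return candidates[:20]
```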
Results
| Phase | NDCG@20 | Rank |
|---|---|---|
| Public (Phase I) | 0.432286 | 🥉 3rd |
| Private (Phase II) | 0.652538 | 🥉 3rd |
The large gap between the public and private scores reflects the private phase's additional human annotation of retrieved documents that were previously unjudged; this significantly boosts NDCG for systems that surfaced relevant paragraphs missing from the original annotations.
Citation
If you use this solution or the models in this repository, please acknowledge the Hebrew Semantic Retrieval Challenge by MAFAT DDR&D and the Israel National NLP Program, and credit kdbrodt as the solution author.
Acknowledgements
- MAFAT DDR&D and the Israel National NLP Program for organizing the challenge and providing the annotated Hebrew corpus.
- The authors of `BAAI/bge-m3` and `BAAI/bge-reranker-v2-m3`.
- The FlagEmbedding team for the training framework.