Hebrew Semantic Retrieval – 3rd Place Solution
Competition: Hebrew Semantic Retrieval Challenge by MAFAT DDR&D (Directorate of Defense Research & Development) in partnership with the Israel National NLP Program
Result: 🥉 3rd place – NDCG@20 = 0.652538 (private test set) · 0.432286 (public test set)
Author: kdbrodt
Overview
This repository contains the complete inference code and fine-tuned models for the 3rd-place solution to the Hebrew Semantic Retrieval Challenge. The challenge tasked participants with ranking Hebrew paragraphs from a 127,731-passage corpus in response to natural-language Hebrew queries, evaluated by NDCG@20.
The solution is a clean, end-to-end two-stage retrieve-then-rerank pipeline built entirely on the BAAI BGE-M3 model family, fine-tuned with the FlagEmbedding framework. Both the dense embedder and the cross-encoder reranker were fine-tuned directly on the competition's annotated Hebrew data.
The Challenge
| Property | Detail |
|---|---|
| Organizer | MAFAT DDR&D + Israel National NLP Program |
| Corpus size | 127,731 Hebrew paragraphs |
| Data sources | Hebrew Wikipedia, Kol-Zchut (legal/civil-rights), Knesset committee protocols |
| Evaluation metric | NDCG@20 |
| Phase I | Public leaderboard (Codabench) |
| Phase II | Private test set with additional human annotation of previously unseen retrievals |
| Relevance scale | 0–4 (human annotated) |
Solution Architecture
A straightforward two-stage pipeline: dense retrieval followed by cross-encoder reranking.
```
Query
  │
  ▼
[BGE-M3 Dense Retriever]        (fine-tuned, CLS pooling, FP16)
  │   cosine similarity over 127k passages
  ▼
Top-100 Candidates
  │
  ▼
[BGE-Reranker-v2-M3]            (fine-tuned binary classifier, FP16)
  │   query–passage pairs scored, max_length=512
  ▼
Final Top-20 Results
```
Stage 1 – Dense Retrieval
The fine-tuned bge-m3 encoder produces CLS-token embeddings (L2-normalized, FP16) for all corpus passages at preprocessing time. At query time, a single query embedding is computed and scored against all corpus embeddings via dot-product similarity (equivalent to cosine similarity on normalized vectors). The top-100 passages are selected for reranking.
| Property | Value |
|---|---|
| Model | test_encoder_only_base_bge_m3_new1 (fine-tuned BAAI/bge-m3) |
| Pooling | CLS token |
| Normalization | L2 |
| Precision | FP16 |
| Max length | 512 tokens |
| Batch size (corpus) | 64 |
| Retrieval pool | Top-100 candidates |
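For illustration, the Stage 1 scoring step looks roughly like the sketch below. This is a minimal sketch, not the exact code in model.py: the `embed` helper, the `corpus_emb` matrix, and the `query` variable are illustrative names, and it assumes the checkpoint loads as a standard Hugging Face encoder.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_DIR = "models/test_encoder_only_base_bge_m3_new1"  # fine-tuned BGE-M3

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModel.from_pretrained(MODEL_DIR, torch_dtype=torch.float16).cuda().eval()

@torch.no_grad()
def embed(texts):
    """CLS-pooled, L2-normalized embeddings for a batch of texts."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt").to("cuda")
    cls = model(**batch).last_hidden_state[:, 0]        # CLS-token pooling
    cls = torch.nn.functional.normalize(cls, dim=-1)    # L2 normalization
    return cls.float().cpu().numpy()

# corpus_emb: (127731, dim) matrix built once at preprocessing time
# (corpus encoded in batches of 64); `query` is the incoming query string.
scores = embed([query])[0] @ corpus_emb.T   # dot product == cosine on unit vectors
top100 = np.argsort(-scores)[:100]          # candidate indices passed to Stage 2
```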
Stage 2 – Cross-Encoder Reranking
The top-100 candidates are re-scored by the fine-tuned bge-reranker-v2-m3, a sequence classification model that takes concatenated [query, passage] pairs as input and outputs a relevance logit. Passages are sorted by length before scoring to minimize padding overhead. The top-20 by reranker score are returned.
| Property | Value |
|---|---|
| Model | test_encoder_only_base_bge_reranker_v2_m3_new1 (fine-tuned BAAI/bge-reranker-v2-m3) |
| Max length | 512 tokens |
| Batch size | 16 |
| Output | Top-20 by reranker logit |
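A minimal sketch of the batched reranking loop, assuming the standard Hugging Face sequence-classification head with a single relevance logit; the `rerank` helper and its names are illustrative, not the exact code in model.py.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

RERANKER_DIR = "models/test_encoder_only_base_bge_reranker_v2_m3_new1"

tok = AutoTokenizer.from_pretrained(RERANKER_DIR)
reranker = (AutoModelForSequenceClassification
            .from_pretrained(RERANKER_DIR, torch_dtype=torch.float16)
            .cuda().eval())

@torch.no_grad()
def rerank(query, passages, batch_size=16):
    """Score [query, passage] pairs; a higher logit means more relevant."""
    # Sort by passage length so each batch has similar lengths (less padding).
    order = sorted(range(len(passages)), key=lambda i: len(passages[i]))
    scores = [0.0] * len(passages)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batch = tok([query] * len(idx), [passages[i] for i in idx],
                    padding=True, truncation=True, max_length=512,
                    return_tensors="pt").to("cuda")
        logits = reranker(**batch).logits.squeeze(-1)   # one relevance logit per pair
        for i, s in zip(idx, logits.tolist()):
            scores[i] = s
    return scores

# Keep the 20 best of the 100 candidates by reranker score:
# top20 = sorted(range(len(scores)), key=lambda i: -scores[i])[:20]
```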
Fine-Tuning
Both models were fine-tuned on the competition's annotated Hebrew training set using the FlagEmbedding framework.
Training data construction:
- Every query–document pair with a positive relevance score (> 0) was treated as a positive example.
- Every pair with a score of 0 was treated as a negative example.
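The sketch below illustrates this split rule, assuming a FlagEmbedding-style JSONL training format (`{"query": ..., "pos": [...], "neg": [...]}`). The actual conversion is done by prepare.py; the row layout and file name here are illustrative.

```python
import json
from collections import defaultdict

# Illustrative rows (query, passage, relevance 0-4); real loading is in prepare.py.
annotations = [
    ("q1", "relevant passage", 3),
    ("q1", "irrelevant passage", 0),
]

# Split rule: relevance > 0 -> positive, relevance == 0 -> negative.
pos, neg = defaultdict(list), defaultdict(list)
for query, passage, relevance in annotations:
    (pos if relevance > 0 else neg)[query].append(passage)

# One JSONL record per query with its positives and negatives.
with open("train_data.jsonl", "w", encoding="utf-8") as f:
    for query in pos:
        record = {"query": query, "pos": pos[query], "neg": neg.get(query, [])}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```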
Embedder (bge-m3): Trained with KL-divergence loss to produce embeddings that better separate relevant from irrelevant documents.
Reranker (bge-reranker-v2-m3): Trained as a binary classifier on the same positive/negative pairs, learning to predict relevance probability directly.
| Hyperparameter | Value |
|---|---|
| Epochs | 2 |
| Batch size per device | 2 |
| Learning rate | 5e-6 |
| Hardware | 2 × NVIDIA Tesla V100-SXM2-32GB |
| Training time | ~1 hour |
Included Models (fine-tuned)
| Path in repo | Base model | Fine-tuning |
|---|---|---|
| `models/test_encoder_only_base_bge_m3_new1/` | `BAAI/bge-m3` | KL-divergence loss on competition data ✨ |
| `models/test_encoder_only_base_bge_reranker_v2_m3_new1/` | `BAAI/bge-reranker-v2-m3` | Binary classification on competition data ✨ |
Repository Structure
```
model.py     – Full inference pipeline (preprocess + predict)
prepare.py   – Data preparation script
train.sh     – Training script
models/
  test_encoder_only_base_bge_m3_new1/              – Fine-tuned BGE-M3 embedder ✨
  test_encoder_only_base_bge_reranker_v2_m3_new1/  – Fine-tuned BGE reranker ✨
```
Usage
The pipeline exposes two functions matching the competition API:
```python
from model import preprocess, predict

# Build corpus index (run once)
# corpus_dict: {doc_id: {"passage": "..."}, ...}
preprocessed = preprocess(corpus_dict)

# Query at inference time ("What are the rights of apartment renters?")
results = predict({"query": "מה הזכויות של שוכרי דירה?"}, preprocessed)
# Returns: [{"paragraph_uuid": "...", "score": 1.23}, ...] (top-20)
```
Requirements:

```
torch
transformers
numpy
```
Hardware: A CUDA-capable GPU is required. Inference takes less than 1.5 hours on a g5.xlarge instance.
Reproducing the Models
1. Prepare data:

   ```bash
   # Download the competition data and unzip it into the `hsrc/` folder first
   python prepare.py
   ```

2. Train:

   ```bash
   sh ./train.sh
   ```

   Training takes ~1 hour on 2 × V100-SXM2-32GB GPUs.
Technical Notes
- Both models are loaded in FP16 via `torch_dtype=torch.float16`, with `device_map` for automatic GPU placement.
- Corpus passages are sorted by length before embedding to reduce padding overhead during batch encoding.
- The reranker also sorts candidates by passage length before scoring batches.
- Fallback: if reranking fails, the pipeline falls back to returning the top-20 by dense retrieval score.
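The fallback amounts to a simple guard around Stage 2. The helpers below are hypothetical stand-ins for the real functions in model.py, shown only to illustrate the control flow:

```python
# Hypothetical stand-ins for the real pipeline pieces in model.py:
#   dense_top100(query, preprocessed) -> list of (uuid, dense_score), best first
#   rerank_top20(query, candidates)   -> list of (uuid, reranker_score), best first

def predict_with_fallback(query, preprocessed):
    candidates = dense_top100(query, preprocessed)   # Stage 1: dense retrieval
    try:
        return rerank_top20(query, candidates)       # Stage 2: cross-encoder rerank
    except Exception:
        # Fallback: if reranking fails, return the dense top-20 as-is.
        return candidates[:20]
```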
Results
| Phase | NDCG@20 | Rank |
|---|---|---|
| Public (Phase I) | 0.432286 | 🥉 3rd |
| Private (Phase II) | 0.652538 | 🥉 3rd |
The large gap between the public and private scores reflects the private phase's additional human annotation of retrieved documents that were previously unjudged; this significantly boosts NDCG for systems that surfaced relevant paragraphs missing from the original annotations.
Citation
If you use this solution or the models in this repository, please acknowledge the Hebrew Semantic Retrieval Challenge by MAFAT DDR&D and the Israel National NLP Program, and credit kdbrodt as the solution author.
Acknowledgements
- MAFAT DDR&D and the Israel National NLP Program for organizing the challenge and providing the annotated Hebrew corpus.
- The authors of `BAAI/bge-m3` and `BAAI/bge-reranker-v2-m3`.
- The FlagEmbedding team for the training framework.