BEIR

university

AI & ML interests

BEIR (Benchmarking IR) is a heterogeneous benchmark for diverse sentence- or passage-level IR tasks. It provides a common, easy-to-use framework for cross-domain evaluation of retrieval models.
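As a rough illustration of what such an evaluation computes, here is a minimal pure-Python sketch of nDCG@10, the headline metric in the BEIR paper. It is not the BEIR library's own code. The function name `ndcg_at_k` and the toy data are hypothetical, and the sketch assumes relevance judgments and retrieval scores are stored as nested dictionaries keyed by query ID and document ID, with linear gain.

```python
import math
from typing import Dict


def ndcg_at_k(qrels: Dict[str, Dict[str, int]],
              results: Dict[str, Dict[str, float]],
              k: int = 10) -> float:
    """Mean nDCG@k over all judged queries (hypothetical helper, linear gain)."""
    scores = []
    for query_id, judged in qrels.items():
        # Rank the retrieved documents by descending score and keep the top k.
        ranked = sorted(results.get(query_id, {}).items(),
                        key=lambda item: item[1], reverse=True)[:k]
        # Discounted cumulative gain of the system ranking.
        dcg = sum(judged.get(doc_id, 0) / math.log2(rank + 2)
                  for rank, (doc_id, _) in enumerate(ranked))
        # DCG of the ideal ranking, built from the judgments alone.
        ideal = sorted(judged.values(), reverse=True)[:k]
        idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
        scores.append(dcg / idcg if idcg > 0 else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (made up for illustration): two queries, graded relevance labels.
qrels = {"q1": {"d1": 1, "d2": 0}, "q2": {"d3": 1}}
results = {"q1": {"d1": 0.9, "d2": 0.4}, "q2": {"d4": 0.8, "d3": 0.5}}
score = ndcg_at_k(qrels, results, k=10)
```

Averaging this score per dataset, and then across datasets from different domains, gives the kind of cross-domain comparison the benchmark is meant to support.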

nthakur

authored a paper 7 months ago

BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

Paper • 2508.06600 • Published Aug 8, 2025 • 41

nthakur

in BeIR/nfcorpus 7 months ago

Convert to Parquet

#3 opened 7 months ago by

lhoestq

nthakur

in BeIR/msmarco 9 months ago

Qrel file missing

#3 opened 9 months ago by

Aabylay

nthakur

authored 3 papers 10 months ago

FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

Paper • 2504.13128 • Published Apr 17, 2025 • 7

Chatbot Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses

Paper • 2504.20006 • Published Apr 28, 2025

Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval

Paper • 2505.16967 • Published May 22, 2025 • 24

nthakur

posted an update 11 months ago

Post

1862

Last year, I curated & generated a few multilingual SFT and DPO datasets by translating English SFT/DPO datasets into 9-10 languages using the mistralai/Mistral-7B-Instruct-v0.2 model.

I hope it helps the community for pretraining/instruction tuning multilingual LLMs! I added a small diagram to briefly describe which datasets are added and their sources.

Happy to collaborate, whether you want to use these datasets for instruction fine-tuning or extend the translations to newer English SFT/DPO datasets!

nthakur/multilingual-sft-and-dpo-datasets-67eaf56fe3feca5a57cf7d74

nthakur

authored a paper about 1 year ago

MMTEB: Massive Multilingual Text Embedding Benchmark

Paper • 2502.13595 • Published Feb 19, 2025 • 45

nthakur

authored 2 papers over 1 year ago

MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems

Paper • 2410.13716 • Published Oct 17, 2024

Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track

Paper • 2406.16828 • Published Jun 24, 2024 • 1

nthakur

posted an update almost 2 years ago

Post

3782

🦢 The SWIM-IR dataset contains 29 million text-retrieval training pairs across 27 diverse languages. It is one of the largest synthetic multilingual datasets generated using PaLM 2 on Wikipedia! 🔥🔥

The SWIM-IR dataset contains three subsets:
- Cross-lingual: nthakur/swim-ir-cross-lingual
- Monolingual: nthakur/swim-ir-monolingual
- Indic Cross-lingual: nthakur/indic-swim-ir-cross-lingual

Check it out:
https://huggingface.co/collections/nthakur/swim-ir-dataset-662ddaecfc20896bf14dd9b7

nthakur

authored 9 papers about 2 years ago

Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard

Paper • 2306.07471 • Published Jun 13, 2023

NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation

Paper • 2312.11361 • Published Dec 18, 2023 • 1

HAGRID: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution

Paper • 2307.16883 • Published Jul 31, 2023

Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks

Paper • 2010.08240 • Published Oct 16, 2020

Evaluating Embedding APIs for Information Retrieval

Paper • 2305.06300 • Published May 10, 2023 • 1

GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval

Paper • 2112.07577 • Published Dec 14, 2021

Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages

Paper • 2210.09984 • Published Oct 18, 2022 • 2

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Paper • 2104.08663 • Published Apr 17, 2021 • 3

Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval

Paper • 2311.05800 • Published Nov 10, 2023 • 4


Team members 2
