Learning to Detect Language Model Training Data via Active Reconstruction
Abstract
Active Data Reconstruction Attack (ADRA) uses reinforcement learning to identify training data by measuring how reconstructible a text is from model behavior, outperforming existing membership inference attacks.
Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce the Active Data Reconstruction Attack (ADRA), a family of MIAs that actively induces a model to reconstruct a given text through training. We hypothesize that training data are more reconstructible than non-members, and that this difference in reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in the weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To use RL effectively for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up. In particular, ADRA+ improves over Min-K%++ by 18.8% on BookMIA for pre-training detection and by 7.6% on AIME for post-training detection.
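To make the membership-inference step concrete, here is a minimal sketch of the final ranking stage the abstract describes: turning a reconstruction metric into a membership score over a pool of candidates. The function names (`reconstruction_score`, `rank_by_reconstructibility`) are hypothetical, and `difflib.SequenceMatcher` merely stands in for the paper's reconstruction metrics; the paper's actual pipeline additionally finetunes a policy with on-policy RL and contrastive rewards to elicit the reconstructions, which this sketch does not implement.

```python
# Hedged sketch: score membership by how well the (RL-finetuned) model
# reconstructed each candidate text. Higher similarity => more member-like.
# SequenceMatcher is a placeholder for the paper's reconstruction metrics.
from difflib import SequenceMatcher


def reconstruction_score(candidate: str, reconstruction: str) -> float:
    """Similarity in [0, 1] between a candidate text and the model's
    reconstruction of it (1.0 = verbatim recall)."""
    return SequenceMatcher(None, candidate, reconstruction).ratio()


def rank_by_reconstructibility(pool: dict[str, str]) -> list[tuple[str, float]]:
    """Rank a pool {candidate_text: model_reconstruction} so the most
    reconstructible (likely training-member) candidates come first."""
    scored = [(text, reconstruction_score(text, rec)) for text, rec in pool.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    pool = {
        "the quick brown fox jumps over the lazy dog":
            "the quick brown fox jumps over the lazy dog",  # near-verbatim recall
        "an unrelated sentence the model never saw":
            "something entirely different came out here",   # poor recall
    }
    for text, score in rank_by_reconstructibility(pool):
        print(f"{score:.2f}  {text}")
```

Under the paper's hypothesis, candidates whose reconstructions score highest would be flagged as training members; the adaptive variant (ADRA+) would additionally calibrate these scores across the candidate pool.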
Community
Many people are using RL to make models smarter.
We used RL to pull training data out of the models themselves.
Our results show that models know a lot more about their training data than most people think.
We develop Active Data Reconstruction Attack (ADRA) — a data detection method that uses RL to induce models to reconstruct data seen during training.
ADRA beats existing methods by an average of over 10% across pre-training, post-training, and distillation data detection.
Our work with @uwnlp, @Cornell, and @Berkeley_ai is now available.
Arxiv: https://arxiv.org/pdf/2602.19020
Joint work with @jxmnop @shmatikov @sewon__min @HannaHajishirzi
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Detecting RLVR Training Data via Structural Convergence of Reasoning (2026)
- AttenMIA: LLM Membership Inference Attack through Attention Signals (2026)
- Membership Inference on LLMs in the Wild (2026)
- Save the Good Prefix: Precise Error Penalization via Process-Supervised RL to Enhance LLM Reasoning (2026)
- PDR: A Plug-and-Play Positional Decay Framework for LLM Pre-training Data Detection (2026)
- Reinforced Fast Weights with Next-Sequence Prediction (2026)
- Powerful Training-Free Membership Inference Against Autoregressive Language Models (2026)