Learning to Detect Language Model Training Data via Active Reconstruction
Abstract
Active Data Reconstruction Attack (ADRA) uses reinforcement learning to identify training data by measuring how reconstructible a text is from model behavior, outperforming existing membership inference attacks.
Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce the Active Data Reconstruction Attack (ADRA), a family of MIAs that actively induces a model to reconstruct a given text through training. We hypothesize that training data are more reconstructible than non-members, and that this difference in reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in the weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To use RL effectively for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up. In particular, ADRA+ improves over Min-K%++ by 18.8% on BookMIA for pre-training detection and by 7.6% on AIME for post-training detection.
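To make the membership-inference step concrete, here is a minimal sketch of the final ranking stage the abstract describes: turning a reconstruction metric into a membership score over a pool of candidates. The function names (`reconstruction_score`, `rank_by_reconstructibility`) are hypothetical, and `difflib.SequenceMatcher` merely stands in for the paper's reconstruction metrics; the paper's actual pipeline additionally finetunes a policy with on-policy RL and contrastive rewards to elicit the reconstructions, which this sketch does not implement.

```python
# Hedged sketch: score membership by how well the (RL-finetuned) model
# reconstructed each candidate text. Higher similarity => more member-like.
# SequenceMatcher is a placeholder for the paper's reconstruction metrics.
from difflib import SequenceMatcher


def reconstruction_score(candidate: str, reconstruction: str) -> float:
    """Similarity in [0, 1] between a candidate text and the model's
    reconstruction of it (1.0 = verbatim recall)."""
    return SequenceMatcher(None, candidate, reconstruction).ratio()


def rank_by_reconstructibility(pool: dict[str, str]) -> list[tuple[str, float]]:
    """Rank a pool {candidate_text: model_reconstruction} so the most
    reconstructible (likely training-member) candidates come first."""
    scored = [(text, reconstruction_score(text, rec)) for text, rec in pool.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    pool = {
        "the quick brown fox jumps over the lazy dog":
            "the quick brown fox jumps over the lazy dog",  # near-verbatim recall
        "an unrelated sentence the model never saw":
            "something entirely different came out here",   # poor recall
    }
    for text, score in rank_by_reconstructibility(pool):
        print(f"{score:.2f}  {text}")
```

Under the paper's hypothesis, candidates whose reconstructions score highest would be flagged as training members; the adaptive variant (ADRA+) would additionally calibrate these scores across the candidate pool.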
Community
Many people are using RL to make models smarter.
We used RL to pull training data out of the models themselves.
Our results show that models know a lot more about their training data than most people think.
We develop Active Data Reconstruction Attack (ADRA) — a data detection method that uses RL to induce models to reconstruct data seen during training.
ADRA beats existing methods by an average of over 10% across pre-training, post-training, and distillation data detection.
Our work with @uwnlp, @Cornell, and @Berkeley_ai is now available.
Arxiv: https://arxiv.org/pdf/2602.19020
Joint work with @jxmnop @shmatikov @sewon__min @HannaHajishirzi
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Detecting RLVR Training Data via Structural Convergence of Reasoning (2026)
- AttenMIA: LLM Membership Inference Attack through Attention Signals (2026)
- Membership Inference on LLMs in the Wild (2026)
- Save the Good Prefix: Precise Error Penalization via Process-Supervised RL to Enhance LLM Reasoning (2026)
- PDR: A Plug-and-Play Positional Decay Framework for LLM Pre-training Data Detection (2026)
- Reinforced Fast Weights with Next-Sequence Prediction (2026)
- Powerful Training-Free Membership Inference Against Autoregressive Language Models (2026)