# RF-DETR (Segmentation)
RF-DETR is a real-time detection transformer family introduced in [RF-DETR: Neural Architecture Search for Real-Time Detection Transformers](https://arxiv.org/abs/2511.09554) by Robinson et al. It was integrated into 🤗 Transformers via PR #36895 and originally contributed by stevenbucaille.
## Model description
RF-DETR is an end-to-end instance segmentation model that combines ideas from LW-DETR and Deformable DETR: a DINOv2-with-registers style ViT backbone (with an RF-DETR windowing pattern for efficient attention), a multi-scale projector between encoder and decoder, and a multi-scale deformable DETR decoder extended with an instance-segmentation head.
Key Architectural Details:
- Backbone: DINOv2-with-registers style ViT with RF-DETR windowed / full attention alternation.
- Multi-scale fusion: RF-DETR multi-scale projector (C2f-style blocks in the LW-DETR lineage) to aggregate multi-level backbone features before the decoder.
- Decoder: Deformable DETR-style decoder with multi-scale deformable cross-attention; segmentation checkpoints add mask prediction on top of box/class outputs.
- Queries: DETR-style object queries with bipartite matching and auxiliary decoder losses.
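The windowed / full attention alternation in the backbone can be pictured with a minimal sketch. The window size, layer count, and even/odd schedule below are illustrative assumptions for the sake of the example, not the real RF-DETR configuration:

```python
import torch

def window_partition(x, window_size):
    # x: (B, H, W, C) grid of patch tokens -> (num_windows * B, window_size**2, C)
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
tokens = torch.randn(2, 16, 16, 64)  # batch of two 16x16 patch grids

for layer_idx in range(4):
    if layer_idx % 2 == 0:
        # Windowed layer: attention is restricted to local 4x4 windows,
        # so its cost grows linearly with the number of windows.
        w = window_partition(tokens, window_size=4)
        out, _ = attn(w, w, w)
    else:
        # Full layer: global attention over all 256 tokens mixes
        # information across the whole image.
        flat = tokens.reshape(2, 16 * 16, 64)
        out, _ = attn(flat, flat, flat)
```

Alternating cheap local layers with a few global ones is what keeps the ViT backbone real-time while still propagating image-wide context.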
Training Details:
- Segmentation losses: mask prediction losses (e.g. focal / dice style terms as configured) in addition to box and classification objectives, with auxiliary decoder supervision.
- Group DETR: parallel decoder copies during training for faster convergence.
- NAS (family-level): weight-sharing search over accuracy–latency knobs as in the RF-DETR paper, specialized to the target dataset distribution.
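As a rough illustration of the mask objectives mentioned above, here is a minimal sketch of per-instance dice and sigmoid-focal terms. These are generic reference implementations, not RF-DETR's internals; the actual weighting, normalization, and bipartite matching may differ:

```python
import torch
import torch.nn.functional as F

def dice_loss(pred_logits, target, eps=1.0):
    # pred_logits: (N, H*W) per-instance mask logits; target: (N, H*W) in {0, 1}
    p = pred_logits.sigmoid()
    num = 2 * (p * target).sum(-1)
    den = p.sum(-1) + target.sum(-1)
    # Dice rewards overlap between predicted and ground-truth masks.
    return (1 - (num + eps) / (den + eps)).mean()

def focal_loss(pred_logits, target, alpha=0.25, gamma=2.0):
    # Sigmoid focal loss: down-weights easy pixels via the (1 - p_t)**gamma factor.
    ce = F.binary_cross_entropy_with_logits(pred_logits, target, reduction="none")
    p = pred_logits.sigmoid()
    p_t = p * target + (1 - p) * (1 - target)
    a_t = alpha * target + (1 - alpha) * (1 - target)
    return (a_t * (1 - p_t) ** gamma * ce).mean()
```

In DETR-style training, terms like these are summed (with configured weights) with the box and classification losses at every decoder layer for auxiliary supervision.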
## How to use
You can use the raw model for instance segmentation: it predicts per-instance masks together with bounding boxes and class scores. See the model hub for all available RF-DETR checkpoints.
Here is how to use this model:

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, RfDetrForInstanceSegmentation

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("stevenbucaille/rf-detr-segmentation")
model = RfDetrForInstanceSegmentation.from_pretrained("stevenbucaille/rf-detr-segmentation")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale predictions to the original image size, given as (height, width)
target_sizes = [image.size[::-1]]
results = processor.post_process_instance_segmentation(
    outputs, target_sizes=target_sizes, threshold=0.5
)

for item in results:
    for k, v in item.items():
        if hasattr(v, "shape"):
            print(k, tuple(v.shape))
        else:
            print(k, v)
```
This should output:

```
segmentation (480, 640)
segments_info []
```
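The `segmentation` map assigns each pixel a segment id, and `segments_info` holds one entry per detected instance. As a sketch of how per-instance binary masks can be recovered from that structure (the ids, labels, and scores below are synthetic stand-ins, not real model output):

```python
import torch

# Synthetic post-processing result: -1 marks background pixels, other
# values are segment ids referenced by segments_info.
segmentation = torch.tensor([[-1, 0, 0],
                             [1, 1, -1]])
segments_info = [
    {"id": 0, "label_id": 17, "score": 0.97},
    {"id": 1, "label_id": 15, "score": 0.93},
]

for seg in segments_info:
    mask = segmentation == seg["id"]  # boolean mask for this instance
    print(seg["label_id"], int(mask.sum()))
```

Each boolean mask can then be used for cropping, visualization, or computing per-instance statistics.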
## Training data
These checkpoints are trained on the standard COCO 2017 instance segmentation label space (80 thing categories), as reflected in `config.id2label`.
## BibTeX entry and citation info
```bibtex
@misc{robinson2026rfdetrneuralarchitecturesearch,
  title={RF-DETR: Neural Architecture Search for Real-Time Detection Transformers},
  author={Isaac Robinson and Peter Robicheaux and Matvei Popov and Deva Ramanan and Neehar Peri},
  year={2026},
  eprint={2511.09554},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.09554},
}
```