RF-DETR (Segmentation)

RF-DETR is a real-time detection transformer family introduced in RF-DETR: Neural Architecture Search for Real-Time Detection Transformers by Robinson et al., and integrated into 🤗 Transformers via PR #36895. This model was contributed by stevenbucaille.

Model description

RF-DETR is an end-to-end instance segmentation model that combines ideas from LW-DETR and Deformable DETR: a DINOv2-with-registers style ViT backbone (with an RF-DETR windowing pattern for efficient attention), a multi-scale projector between encoder and decoder, and a multi-scale deformable DETR decoder extended with an instance-segmentation head.
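To make the windowed / full attention alternation concrete, here is an illustrative sketch of the token grouping only (the 4×4 grid and 2×2 window size are made up for illustration; the real backbone uses its own schedule): on windowed layers, patch tokens attend only within their window, while full-attention layers attend across the whole grid.

```python
import numpy as np

# 4x4 grid of patch token indices, split into non-overlapping 2x2 windows.
# On "windowed" layers, self-attention is computed within each window;
# on "full" layers it runs over all 16 tokens.
tokens = np.arange(16).reshape(4, 4)
win = 2
windows = (
    tokens.reshape(4 // win, win, 4 // win, win)
    .transpose(0, 2, 1, 3)          # (block_row, block_col, row, col)
    .reshape(-1, win * win)         # one row of token ids per window
)
print(windows.tolist())
# [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

Restricting attention to windows on most layers is what keeps the ViT backbone fast at detection-sized inputs.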

Key Architectural Details:

  • Backbone: DINOv2-with-registers style ViT with RF-DETR windowed / full attention alternation.
  • Multi-scale fusion: RF-DETR multi-scale projector (C2f-style blocks in the LW-DETR lineage) to aggregate multi-level backbone features before the decoder.
  • Decoder: Deformable DETR-style decoder with multi-scale deformable cross-attention; segmentation checkpoints add mask prediction on top of box/class outputs.
  • Queries: DETR-style object queries with bipartite matching and auxiliary decoder losses.

Training Details:

  • Segmentation losses: mask prediction losses (e.g. focal and Dice terms, as configured) in addition to the box and classification objectives, with auxiliary decoder supervision.
  • Group DETR: parallel decoder copies during training for faster convergence.
  • NAS (family-level): weight-sharing search over accuracy–latency knobs as in the RF-DETR paper, specialized to the target dataset distribution.
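As a sketch of one of the mask terms above, here is a standard soft Dice loss in NumPy (the exact formulation and weighting RF-DETR uses are set by the training config, so treat this as illustrative):

```python
import numpy as np

def dice_loss(pred, target, eps=1.0):
    """Soft Dice loss between a predicted mask probability map and a
    binary target mask (a common formulation; not necessarily RF-DETR's
    exact variant)."""
    pred = pred.reshape(-1)
    target = target.reshape(-1)
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# A 2x2 square instance inside a 4x4 image.
target = np.zeros((4, 4))
target[1:3, 1:3] = 1.0

print(round(float(dice_loss(target.copy(), target)), 4))  # 0.0 for a perfect mask
```

Dice-style terms are popular for masks because they are insensitive to the foreground/background pixel imbalance that plain cross-entropy suffers from.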

How to use

You can use the raw model for instance segmentation: it predicts per-instance masks together with bounding boxes and class scores. See the model hub for all available RF-DETR checkpoints.

Here is how to use this model:

from transformers import AutoImageProcessor, RfDetrForInstanceSegmentation
import torch
from PIL import Image
import requests

# Load a sample image from the COCO 2017 validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("stevenbucaille/rf-detr-segmentation")
model = RfDetrForInstanceSegmentation.from_pretrained("stevenbucaille/rf-detr-segmentation")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale predictions back to the original image size (height, width)
target_sizes = [image.size[::-1]]
results = processor.post_process_instance_segmentation(
    outputs, target_sizes=target_sizes, threshold=0.5
)
for item in results:
    for k, v in item.items():
        if hasattr(v, "shape"):
            print(k, tuple(v.shape))
        else:
            print(k, v)

This should output:

segmentation (480, 640)
segments_info []
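The post-processed result pairs a per-pixel segment-id map (`segmentation`) with a `segments_info` list describing each detected instance. A sketch of splitting that map into per-instance binary masks, using a made-up toy result in the same format (the ids, label ids, and scores here are invented):

```python
import numpy as np

# Toy 4x5 segment-id map: each pixel holds the id of the instance it
# belongs to, with -1 for background (depending on post-processing
# settings the background value may differ).
segmentation = np.array([
    [-1, -1,  0,  0, -1],
    [-1,  0,  0,  0, -1],
    [ 1,  1, -1, -1, -1],
    [ 1,  1, -1, -1, -1],
])
segments_info = [
    {"id": 0, "label_id": 17, "score": 0.97},  # hypothetical label id / score
    {"id": 1, "label_id": 17, "score": 0.93},
]

# Split the id map into one binary mask per detected instance.
for seg in segments_info:
    mask = segmentation == seg["id"]
    print(seg["id"], seg["label_id"], int(mask.sum()))
# 0 17 5
# 1 17 4
```

An empty `segments_info`, as in the sample output above, simply means no instance cleared the score threshold for that image.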

Training data

These checkpoints are trained on the standard COCO 2017 instance segmentation label space (80 thing categories) as reflected in config.id2label.

BibTeX entry and citation info

@misc{robinson2026rfdetrneuralarchitecturesearch,
      title={RF-DETR: Neural Architecture Search for Real-Time Detection Transformers},
      author={Isaac Robinson and Peter Robicheaux and Matvei Popov and Deva Ramanan and Neehar Peri},
      year={2026},
      eprint={2511.09554},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.09554},
}