Abstract

This work builds upon the Basketball Action Recognition Dataset (BARD), originally introduced to enable supervised learning for primary action recognition in NBA game footage. However, BARD's initial design lacks the granular annotations required to develop multi-stage computer vision pipelines involving object detection, jersey number recognition (JNR), and team attribution. To address these limitations, we present E-BARD (Extended Basketball Action Recognition Dataset), which bridges the gap between isolated action recognition and end-to-end scene-level reasoning through three key contributions. First, we introduce a new set of interrelated datasets that augment the original BARD videos with dense visual annotations, including detection data for key entities (ball, hoop, referee, player), team attribution based on uniform colors, and JNR, all integrated to directly support and enrich the original action captions. Second, we establish a comprehensive benchmark for these visual understanding tasks using representative state-of-the-art models: YOLO and RF-DETR for object detection; CLIP, SigLIP2, FashionCLIP, and the Perception Encoder for team color attribution; and olmOCR, Qwen2.5-VL-3B, and Qwen2.5-VL-7B for JNR. Finally, we propose a holistic, integrated approach based on Qwen2.5-VL, demonstrating the capacity of a unified multimodal framework to address all subtasks jointly. Ultimately, E-BARD provides a comprehensive benchmark for multi-task basketball video understanding.

Model Card for EBQwen2.5-VL-3B (E-BARD)

This repository hosts the EBQwen2.5-VL-3B model, a Vision-Language Model fine-tuned on the E-BARD (Extended Basketball Action Recognition Dataset) and BARD datasets.

EBQwen2.5-VL-3B

Unlike specialized, single-task models, EBQwen2.5-VL-3B was developed for end-to-end basketball scene understanding, enabling holistic multi-task learning: it handles multiple basketball-related computer vision tasks within a single unified framework.

Model Details

  • Developer: Gabriele Giudici (Author of E-BARD)
  • Model Type: Vision-Language Model (VLM)
  • Base Model: Qwen 2.5 VL Instruct 3B (~3B parameters)
  • Capabilities: Grounding, classification, OCR, video understanding
  • License: CC-BY-4.0
  • Finetuned From: GabrieleGiudici/BQwen2.5-VL-3B

Model Sources


Uses

Direct Use

EBQwen2.5-VL-3B can jointly perform multiple basketball-related vision tasks:

  • Object Detection / Grounding: Detect basketballs, hoops, players, and referees
  • Team Color Attribution: Classify player jersey colors
  • Jersey Number Recognition (JNR): Identify player numbers across single or multiple frames
  • Action Recognition: Caption fine-grained and coarse basketball events (e.g., 3PT shot, assists)
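Each of the tasks above is addressed through plain chat prompting. A minimal sketch of how per-task messages could be assembled in the Qwen2.5-VL chat format follows; the prompt strings are illustrative placeholders, not the canonical prompts used during fine-tuning (see the evaluation scripts for those).

```python
# Illustrative per-task prompts -- NOT the exact fine-tuning prompts.
TASK_PROMPTS = {
    "detection": "Detect all basketballs, hoops, players, and referees. "
                 "Return one bounding box per object.",
    "team_color": "What is the jersey color of the highlighted player?",
    "jnr": "Read the jersey number of the highlighted player. "
           "Answer 'NaN' if it is unreadable.",
    "action": "Describe the basketball action shown in these frames.",
}

def build_messages(task: str, image_paths: list) -> list:
    """Build a Qwen2.5-VL-style chat message for one of the four tasks."""
    content = [{"type": "image", "image": p} for p in image_paths]
    content.append({"type": "text", "text": TASK_PROMPTS[task]})
    return [{"role": "user", "content": content}]

# JNR supports multiple frames of the same player.
msgs = build_messages("jnr", ["frame_001.jpg", "frame_002.jpg"])
```

The resulting `msgs` list can be fed directly to the processor's `apply_chat_template`.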

Downstream Use

  • Integration into sports analytics pipelines
  • Tactical analysis and automated highlights generation
  • Comprehensive game understanding

Bias, Risks, and Limitations

  • Evaluated on 720p footage downscaled to 704×704; performance may drop on lower resolutions or different aspect ratios
  • Derived from 2024–2025 NBA footage; may be biased toward NBA-specific court layouts, camera angles, lighting, and uniforms
  • Multi-Object Detection: Although the model reasons well about scenes, dense multi-object detection remains challenging; predicting many bounding boxes in fast-paced basketball footage may yield lower recall than specialized CNN detectors

Training Details

Training Data

  • Combined BARD and E-BARD datasets from 60 full NBA games (2024–2025 season)
  • Annotations:
    • Action recognition captions
    • Object detection (22,210 annotations: players, referees, hoops, balls)
    • Team color attribution (15,295 annotations)
    • Jersey number frame stubs
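Combining these annotation streams yields one multi-task record per clip. The schema below is an assumption for exposition only; the released datasets define the actual field names and formats.

```python
# Hypothetical shape of one combined BARD/E-BARD training record.
# Field names and values are illustrative, not the official format.
record = {
    "video": "game_042_clip_0173.mp4",
    "actions": ["3PT shot made"],                       # action caption(s)
    "boxes": [                                          # object detection
        {"label": "player", "xyxy": [120, 88, 210, 340]},
        {"label": "ball",   "xyxy": [305, 150, 330, 175]},
    ],
    "team_color": "white",                              # team attribution
    "jersey_number": "23",                              # JNR ('NaN' if unreadable)
}

labels = {b["label"] for b in record["boxes"]}
```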

Training Procedure

  • Method: Supervised Fine-Tuning (SFT) with cross-entropy loss
  • Data Strategy: Interleaved multi-task training across modalities (action recognition, detection, team attribution, JNR) to mitigate catastrophic forgetting and improve generalization
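The interleaving strategy above can be sketched as a mixture sampler: each training step draws its task before drawing a sample, so no single task dominates a long stretch of updates. Task names mirror the card; the uniform weighting and toy datasets are assumptions, not the paper's actual mixture ratios.

```python
import random

# Toy stand-ins for the four task datasets (illustrative only).
task_datasets = {
    "action":     ["action_sample_%d" % i for i in range(5)],
    "detection":  ["det_sample_%d" % i for i in range(5)],
    "team_color": ["team_sample_%d" % i for i in range(5)],
    "jnr":        ["jnr_sample_%d" % i for i in range(5)],
}

def interleaved_batches(steps: int, seed: int = 0):
    """Yield (task, sample) pairs, drawing the task uniformly each step.

    Interleaving tasks within one run is what mitigates catastrophic
    forgetting, versus training each task sequentially."""
    rng = random.Random(seed)
    tasks = list(task_datasets)
    for _ in range(steps):
        task = rng.choice(tasks)
        yield task, rng.choice(task_datasets[task])

seen = {task for task, _ in interleaved_batches(100)}
```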

Evaluation & Results

Evaluated on the 10% held-out test splits of E-BARD and the human-validated BARD benchmarks (https://github.com/GabrieleGiudic/BARD/tree/master/validation):

  1. Action Recognition (BARD Benchmark)

    • Exact Action Match F1: 0.6092
    • Perfect Ordered Actions: 0.3750
    • Jersey Color (given action): 0.7972
  2. Team Color Attribution

    • Accuracy: 0.93
    • Macro F1: 0.93
    • Weighted F1: 0.93
  3. Jersey Number Recognition (JNR, multi-image)

    • Overall F1 (excluding NaN): 0.7339
    • The model outputs ‘NaN’ when the jersey number is unreadable
  4. Object Detection (@ IoU 0.5)

    • Player: Precision 0.889 | Recall 0.306 | F1 0.456
    • Referee: Precision 0.860 | Recall 0.383 | F1 0.530
    • Basketball: Precision 0.245 | Recall 0.158 | F1 0.192
    • Hoop: Precision 0.116 | Recall 0.069 | F1 0.087

Note: VLMs struggle to generate precise bounding-box coordinates; specialized detection models may perform better for dense tracking.
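For reference, per-class detection metrics of this kind are typically computed by greedily matching predicted boxes to ground truth at IoU ≥ 0.5 and then taking precision/recall/F1 over the matches. The sketch below shows that standard procedure; the paper's evaluation script may differ in details such as tie-breaking.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) pixel coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def detection_prf(preds, gts, thr=0.5):
    """Precision/recall/F1 for one class via greedy one-to-one matching."""
    matched_gt, tp = set(), 0
    for p in preds:
        best_j, best_iou = None, thr
        for j, g in enumerate(gts):
            if j not in matched_gt and iou(p, g) >= best_iou:
                best_j, best_iou = j, iou(p, g)
        if best_j is not None:        # matched: count a true positive
            matched_gt.add(best_j)
            tp += 1
    prec = tp / len(preds) if preds else 0.0
    rec = tp / len(gts) if gts else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```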


Loading and Inference with Hugging Face Transformers

See the evaluation folder at https://github.com/GabrieleGiudic/E-BARD/ for the exact benchmarking scripts described in the E-BARD paper. Note that the evaluation always references both https://github.com/GabrieleGiudic/E-BARD and https://github.com/GabrieleGiudic/BARD.
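For quick experimentation outside the benchmarking scripts, a minimal inference sketch with Transformers follows. It assumes a recent transformers release with Qwen2.5-VL support plus the qwen-vl-utils helper package, and a GPU with enough memory for BF16 weights; the prompt is illustrative, not a benchmark prompt.

```python
# Minimal single-image inference sketch (assumes GPU + downloaded weights).
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "GabrieleGiudici/EBQwen2.5-VL-3B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "frame.jpg"},  # path to a 704x704 game frame
        {"type": "text", "text": "Detect all players, referees, hoops, and basketballs."},
    ],
}]

# Standard Qwen2.5-VL preprocessing: chat template + vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```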

Safetensors: 4B params, BF16 tensors