Abstract

This work builds upon the Basketball Action Recognition Dataset (BARD), originally introduced to enable supervised learning for primary action recognition in NBA game footage. However, BARD's initial design lacks the granular annotations required to develop multi-stage computer vision pipelines involving object detection, jersey number recognition (JNR), and team attribution. To address these limitations, we present E-BARD (Extended Basketball Action Recognition Dataset), which bridges the gap between isolated action recognition and end-to-end scene-level reasoning through three key contributions. First, we introduce a new set of interrelated datasets that augment the original BARD videos with dense visual annotations, including detection data for key entities (ball, hoop, referee, player), team attribution based on uniform colors, and JNR, all integrated to directly support and enrich the original action captions. Second, we establish a comprehensive benchmark for these visual understanding tasks using representative state-of-the-art models: YOLO and RF-DETR for object detection; CLIP, SigLIP2, FashionCLIP, and the Perception Encoder for team color attribution; and olmOCR, Qwen2.5-VL-3B, and Qwen2.5-VL-7B for JNR. Finally, we propose a holistic, integrated approach based on Qwen2.5-VL, demonstrating the capacity of a unified multimodal framework to address all subtasks jointly. Ultimately, E-BARD provides a comprehensive benchmark for multi-task basketball video understanding.

Model Card for EBQwen2.5-VL-3B (E-BARD)

This repository hosts the EBQwen2.5-VL-3B model, a Vision-Language Model fine-tuned on the E-BARD (Extended Basketball Action Recognition Dataset) and BARD datasets.

EBQwen2.5-VL-3B

Unlike specialized, single-task models, EBQwen2.5-VL-3B was developed for end-to-end basketball scene understanding, enabling holistic multi-task learning: it handles multiple basketball-related computer vision tasks within a single unified framework.

Model Details

  • Developer: Gabriele Giudici (Author of E-BARD)
  • Model Type: Vision-Language Model (VLM)
  • Base Model: Qwen 2.5 VL Instruct 3B (~3B parameters)
  • Capabilities: Grounding, classification, OCR, video understanding
  • License: CC-BY-4.0
  • Finetuned From: GabrieleGiudici/BQwen2.5-VL-3B

Model Sources


Uses

Direct Use

EBQwen2.5-VL-3B can jointly perform multiple basketball-related vision tasks:

  • Object Detection / Grounding: Detect basketballs, hoops, players, and referees
  • Team Color Attribution: Classify player jersey colors
  • Jersey Number Recognition (JNR): Identify player numbers across single or multiple frames
  • Action Recognition: Caption fine-grained and coarse basketball events (e.g., 3PT shot, assists)
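Each of the tasks above is addressed through plain chat prompting. A minimal sketch of how per-task messages could be assembled in the Qwen2.5-VL chat format follows; the prompt strings are illustrative placeholders, not the canonical prompts used during fine-tuning (see the evaluation scripts for those).

```python
# Illustrative per-task prompts -- NOT the exact fine-tuning prompts.
TASK_PROMPTS = {
    "detection": "Detect all basketballs, hoops, players, and referees. "
                 "Return one bounding box per object.",
    "team_color": "What is the jersey color of the highlighted player?",
    "jnr": "Read the jersey number of the highlighted player. "
           "Answer 'NaN' if it is unreadable.",
    "action": "Describe the basketball action shown in these frames.",
}

def build_messages(task: str, image_paths: list) -> list:
    """Build a Qwen2.5-VL-style chat message for one of the four tasks."""
    content = [{"type": "image", "image": p} for p in image_paths]
    content.append({"type": "text", "text": TASK_PROMPTS[task]})
    return [{"role": "user", "content": content}]

# JNR supports multiple frames of the same player.
msgs = build_messages("jnr", ["frame_001.jpg", "frame_002.jpg"])
```

The resulting `msgs` list can be fed directly to the processor's `apply_chat_template`.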

Downstream Use

  • Integration into sports analytics pipelines
  • Tactical analysis and automated highlights generation
  • Comprehensive game understanding

Bias, Risks, and Limitations

  • Evaluated on 720p footage downscaled to 704×704; performance may drop on lower resolutions or different aspect ratios
  • Derived from 2024–2025 NBA footage; may be biased toward NBA-specific court layouts, camera angles, lighting, and uniforms
  • Multi-Object Detection: Although the model reasons well about scenes, dense multi-object detection remains challenging; predicting many bounding boxes in fast-paced basketball footage may yield lower recall than specialized CNN detectors

Training Details

Training Data

  • Combined BARD and E-BARD datasets from 60 full NBA games (2024–2025 season)
  • Annotations:
    • Action recognition captions
    • Object detection (22,210 annotations: players, referees, hoops, balls)
    • Team color attribution (15,295 annotations)
    • Jersey number frame stubs
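Combining these annotation streams yields one multi-task record per clip. The schema below is an assumption for exposition only; the released datasets define the actual field names and formats.

```python
# Hypothetical shape of one combined BARD/E-BARD training record.
# Field names and values are illustrative, not the official format.
record = {
    "video": "game_042_clip_0173.mp4",
    "actions": ["3PT shot made"],                       # action caption(s)
    "boxes": [                                          # object detection
        {"label": "player", "xyxy": [120, 88, 210, 340]},
        {"label": "ball",   "xyxy": [305, 150, 330, 175]},
    ],
    "team_color": "white",                              # team attribution
    "jersey_number": "23",                              # JNR ('NaN' if unreadable)
}

labels = {b["label"] for b in record["boxes"]}
```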

Training Procedure

  • Method: Supervised Fine-Tuning (SFT) with cross-entropy loss
  • Data Strategy: Interleaved multi-task training across modalities (action recognition, detection, team attribution, JNR) to mitigate catastrophic forgetting and improve generalization
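The interleaving strategy above can be sketched as a mixture sampler: each training step draws its task before drawing a sample, so no single task dominates a long stretch of updates. Task names mirror the card; the uniform weighting and toy datasets are assumptions, not the paper's actual mixture ratios.

```python
import random

# Toy stand-ins for the four task datasets (illustrative only).
task_datasets = {
    "action":     ["action_sample_%d" % i for i in range(5)],
    "detection":  ["det_sample_%d" % i for i in range(5)],
    "team_color": ["team_sample_%d" % i for i in range(5)],
    "jnr":        ["jnr_sample_%d" % i for i in range(5)],
}

def interleaved_batches(steps: int, seed: int = 0):
    """Yield (task, sample) pairs, drawing the task uniformly each step.

    Interleaving tasks within one run is what mitigates catastrophic
    forgetting, versus training each task sequentially."""
    rng = random.Random(seed)
    tasks = list(task_datasets)
    for _ in range(steps):
        task = rng.choice(tasks)
        yield task, rng.choice(task_datasets[task])

seen = {task for task, _ in interleaved_batches(100)}
```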

Evaluation & Results

Evaluated on the 10% held-out test splits of E-BARD and the human-validated BARD benchmarks (https://github.com/GabrieleGiudic/BARD/tree/master/validation):

  1. Action Recognition (BARD Benchmark)

    • Exact Action Match F1: 0.6092
    • Perfect Ordered Actions: 0.3750
    • Jersey Color (given action): 0.7972
  2. Team Color Attribution

    • Accuracy: 0.93
    • Macro F1: 0.93
    • Weighted F1: 0.93
  3. Jersey Number Recognition (JNR, multi-image)

    • Overall F1 (excluding NaN): 0.7339
    • The model outputs ‘NaN’ when the jersey number is unreadable
  4. Object Detection (@ IoU 0.5)

    • Player: Precision 0.889 | Recall 0.306 | F1 0.456
    • Referee: Precision 0.860 | Recall 0.383 | F1 0.530
    • Basketball: Precision 0.245 | Recall 0.158 | F1 0.192
    • Hoop: Precision 0.116 | Recall 0.069 | F1 0.087

Note: VLMs struggle to generate precise bounding-box coordinates; specialized detection models may perform better for dense tracking.
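For reference, per-class detection metrics of this kind are typically computed by greedily matching predicted boxes to ground truth at IoU ≥ 0.5 and then taking precision/recall/F1 over the matches. The sketch below shows that standard procedure; the paper's evaluation script may differ in details such as tie-breaking.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) pixel coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def detection_prf(preds, gts, thr=0.5):
    """Precision/recall/F1 for one class via greedy one-to-one matching."""
    matched_gt, tp = set(), 0
    for p in preds:
        best_j, best_iou = None, thr
        for j, g in enumerate(gts):
            if j not in matched_gt and iou(p, g) >= best_iou:
                best_j, best_iou = j, iou(p, g)
        if best_j is not None:        # matched: count a true positive
            matched_gt.add(best_j)
            tp += 1
    prec = tp / len(preds) if preds else 0.0
    rec = tp / len(gts) if gts else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```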


Loading and Inference with Hugging Face Transformers

See the evaluation folder at https://github.com/GabrieleGiudic/E-BARD/ for the exact benchmarking scripts described in the E-BARD paper. Note that the evaluation always references both https://github.com/GabrieleGiudic/E-BARD and https://github.com/GabrieleGiudic/BARD.
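For quick experimentation outside the benchmarking scripts, a minimal inference sketch with Transformers follows. It assumes a recent transformers release with Qwen2.5-VL support plus the qwen-vl-utils helper package, and a GPU with enough memory for BF16 weights; the prompt is illustrative, not a benchmark prompt.

```python
# Minimal single-image inference sketch (assumes GPU + downloaded weights).
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "GabrieleGiudici/EBQwen2.5-VL-3B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "frame.jpg"},  # path to a 704x704 game frame
        {"type": "text", "text": "Detect all players, referees, hoops, and basketballs."},
    ],
}]

# Standard Qwen2.5-VL preprocessing: chat template + vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```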

Safetensors: 4B params, BF16 tensors