Question about the evaluation metrics for captioning benchmarks

by ygyjrc - opened 18 days ago

Hi, thanks for releasing Marlin-2B and the evaluation results. I have a question regarding the metric used in the leaderboard figures.
In the captioning plots, the y-axis is labeled as “VideoEvalV2 mean / 10” for benchmarks such as DREAM-1K and CaReBench. I noticed that the reported scores do not match the official leaderboard scores, which use the Recall/Precision/F1 metric.
Could you clarify: What exactly is “VideoEvalV2”?
I’m very interested in video captioning tasks, so I’d really appreciate any clarification.

rethinkNow

Nemo Station org 12 days ago

This week we are releasing a series of blog posts on the benchmarks we use and our whole journey, which should shed more light on them. "VideoEvalV2" is our own benchmark, where we use an LLM as a judge on the videos directly rather than relying on text-based ground truth, as CaReBench and DREAM-1K do.

rethinkNow

Nemo Station org 9 days ago

Here is our blog post series: https://nemostation.com/blog/marlin-2b-the-map-was-wrong
We will release the benchmark by the end of the blog series.
