Question about the evaluation metrics for captioning benchmarks

#3
by ygyjrc - opened

Hi, thanks for releasing Marlin-2B and the evaluation results. I have a question regarding the metric used in the leaderboard figures.
In the captioning plots, the y-axis is labeled as “VideoEvalV2 mean / 10” for benchmarks such as DREAM-1K and CaReBench. I noticed that the reported scores do not match the official leaderboard scores, which use the Recall/Precision/F1 metric.
Could you clarify what exactly “VideoEvalV2” is?
I’m very interested in video caption tasks, so I’d really appreciate any clarification.

Nemo Station org

This week we are releasing a series of blog posts on the benchmarks we use and our whole journey, which will shed more light on them. "VideoEvalV2" is our own benchmark: we use an LLM as a judge on the videos directly, rather than the text-based ground truth used in CaReBench and DREAM-1K.
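
To illustrate why the two kinds of numbers would not line up, here is a rough sketch and not the actual VideoEvalV2 code. F1 is the standard harmonic mean of precision and recall. The judge side is an assumption: it supposes the judge returns one rating per video on a 0-10 scale and that the plotted value is the plain average. The function names and all input values are hypothetical.

```python
from statistics import mean


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_judge_score(scores: list[float]) -> float:
    """Average of per-video judge ratings (ASSUMED to be on a 0-10 scale)."""
    return mean(scores)


# Made-up illustrative inputs, not real benchmark results.
text_based = f1_score(precision=0.5, recall=0.25)   # value in [0, 1]
judge_based = mean_judge_score([6.0, 7.0, 8.0])     # value in [0, 10]

print(f"F1 against text ground truth: {text_based:.3f}")
print(f"Mean judge score out of 10:   {judge_based:.1f}")
```

The two outputs sit on different scales and measure different things, so a mismatch with the official leaderboard numbers would be expected under these assumptions.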

Nemo Station org

Here is our blog post series: https://nemostation.com/blog/marlin-2b-the-map-was-wrong
We will release the benchmark by the end of the blog series.
