FIBER

Fine-grained Video-Text Retrieval: A New Benchmark and Method

¹State Key Laboratory for Novel Software Technology, Nanjing University   ²Shanghai AI Laboratory

A comparison of coarse-grained and fine-grained video-text retrieval.

Abstract

The ability to perceive fine-grained spatial and temporal information is crucial for video-language retrieval. However, popular video retrieval benchmarks such as MSRVTT and MSVD lack detailed annotations and therefore cannot effectively evaluate the fine-grained retrieval ability of VLMs. As a result, VLMs with strong coarse-grained retrieval ability are never fully assessed on fine-grained video retrieval.

To address this problem, we present FIBER, a FIne-grained BEnchmark for text-to-video Retrieval, containing 1,000 videos sourced from the FineAction dataset. Uniquely, our fine-grained benchmark provides detailed human-annotated spatial and temporal annotations, making it possible to independently evaluate the spatial and temporal bias of VLMs from their performance. In addition, we employ a text embedding method to unlock the fine-grained video-language understanding capability of MLLMs.

Surprisingly, the experimental results show that our Video Large Language Embedder (VLLE) performs comparably to CLIP-based models on traditional benchmarks while exhibiting stronger fine-grained representation ability with smaller spatial-temporal bias.
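The text embedding method mentioned above treats a generative MLLM as an embedder. The paper's exact recipe is not reproduced here; the snippet below is only a minimal sketch of the general prompt-based, last-token-pooling approach to turning a decoder-only model into an embedder, where the model name, prompt template, and pooling choice are all illustrative assumptions.

```python
# Sketch: extract a fixed-size embedding from a decoder-only LM by pooling
# the last hidden state of the final prompt token. Model and prompt are
# placeholders, not the authors' released configuration.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Qwen/Qwen2-7B-Instruct"  # placeholder; any decoder-only LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

@torch.no_grad()
def embed_text(caption: str) -> torch.Tensor:
    # The instruction suffix is a guess at the general recipe, not FIBER's exact prompt.
    prompt = f"{caption}\nSummarize the above caption in one word:"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    # Last layer, last token -> one embedding vector per input.
    emb = out.hidden_states[-1][:, -1, :]
    return F.normalize(emb, dim=-1)
```

Retrieval then reduces to cosine similarity between such normalized text embeddings and video embeddings produced the same way from the MLLM's video-conditioned forward pass.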

Leaderboard

#L: Context length · R1: Recall@1 · R5: Recall@5 · R10: Recall@10 · T2V: Text to Video · V2T: Video to Text

By default, this leaderboard is sorted by R@1. To sort by another metric, please click on the corresponding cell.
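Recall@K is the standard retrieval metric: the fraction of queries whose ground-truth match ranks within the top K results. A minimal sketch, assuming the usual one-to-one text-video correspondence along the diagonal of the similarity matrix:

```python
import numpy as np

def recall_at_k(sim: np.ndarray, ks=(1, 5, 10)) -> dict:
    """sim[i, j] = similarity between text query i and video j;
    ground truth is the diagonal (text i matches video i)."""
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)          # candidates, most similar first
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(n)])
    return {f"R@{k}": 100.0 * np.mean(ranks < k) for k in ks}
```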

| # | Model | Organization | Params | #L | Date | FIBER T2V (R1/R5/R10) | FIBER V2T (R1/R5/R10) | FIBER-S T2V (R1/R5/R10) | FIBER-S V2T (R1/R5/R10) | FIBER-T T2V (R1/R5/R10) | FIBER-T V2T (R1/R5/R10) |
|---|-------|--------------|--------|----|------|-----------------------|-----------------------|-------------------------|-------------------------|-------------------------|-------------------------|
| 1 | InternVideo2-stage2 | Shanghai AI Lab | 1B | 512 | 2024/04/25 | 72.5 / 93.7 / 97.3 | 69.5 / 94.6 / 97.8 | 72.4 / 94.2 / 97.4 | 62.7 / 90.5 / 95.9 | 46.0 / 80.8 / 91.9 | 46.6 / 82.5 / 92.5 |
| 2 | InternVL2 (P) | Shanghai AI Lab | 8B | ∞ | 2024/07/04 | 71.6 / 92.2 / 97.0 | 71.6 / 92.8 / 97.0 | 76.1 / 94.1 / 97.6 | 74.3 / 94.5 / 97.6 | 46.8 / 76.8 / 89.1 | 46.1 / 77.5 / 89.5 |
| 3 | MiniCPM-V 2.6 (P) | OpenBMB | 8B | ∞ | 2024/08/06 | 71.0 / 92.2 / 97.0 | 69.3 / 92.8 / 97.1 | 71.7 / 93.6 / 98.0 | 67.6 / 92.3 / 97.7 | 50.5 / 82.9 / 92.1 | 46.1 / 80.9 / 93.3 |
| 4 | InternVL2 (A) | Shanghai AI Lab | 8B | ∞ | 2024/07/04 | 68.8 / 90.6 / 95.8 | 62.3 / 87.6 / 93.2 | 71.2 / 92.4 / 96.3 | 66.8 / 89.8 / 94.6 | 42.6 / 76.8 / 87.7 | 41.8 / 74.0 / 86.6 |
| 5 | LLaVA NeXT Video (P) | LLaVA NeXT Team | 7B | ∞ | 2024/05/10 | 66.9 / 89.4 / 96.0 | 62.7 / 89.2 / 95.4 | 68.0 / 92.0 / 96.2 | 65.0 / 90.0 / 95.9 | 43.3 / 76.9 / 88.9 | 40.1 / 75.4 / 88.7 |
| 6 | LanguageBind | Peking University | 528M | 77 | 2023/10/07 | 64.3 / 91.0 / 96.3 | 59.5 / 88.0 / 95.0 | 64.7 / 90.8 / 96.8 | 61.0 / 87.2 / 94.5 | 39.8 / 77.3 / 90.5 | 42.2 / 77.6 / 91.7 |
| 7 | Long-CLIP L/14 | Shanghai AI Lab | 428M | 248 | 2024/03/22 | 62.7 / 88.8 / 95.7 | 60.3 / 88.8 / 94.9 | 65.6 / 90.9 / 96.0 | 61.0 / 88.3 / 94.4 | 33.2 / 68.8 / 81.6 | 34.5 / 71.9 / 86.6 |
| 8 | MiniCPM-V 2.6 (A) | OpenBMB | 8B | ∞ | 2024/08/06 | 62.7 / 90.0 / 95.8 | 58.8 / 88.9 / 95.7 | 63.6 / 90.5 / 96.0 | 62.4 / 90.3 / 96.2 | 44.9 / 80.8 / 91.2 | 41.2 / 77.9 / 90.6 |
| 9 | Long-CLIP B/16 | Shanghai AI Lab | 150M | 248 | 2024/03/22 | 59.2 / 85.3 / 92.1 | 55.8 / 84.7 / 92.9 | 62.5 / 86.0 / 92.7 | 53.8 / 84.1 / 92.7 | 32.0 / 65.4 / 79.3 | 29.7 / 67.3 / 84.1 |
| 10 | LLaVA NeXT Video (A) | LLaVA NeXT Team | 7B | ∞ | 2024/05/10 | 52.2 / 84.3 / 91.3 | 53.7 / 82.9 / 90.5 | 57.0 / 86.1 / 94.1 | 55.0 / 83.5 / 91.9 | 36.1 / 70.6 / 84.7 | 34.7 / 67.5 / 83.1 |
| 11 | CLIP L/14 | OpenAI | 428M | 77 | 2021/02/26 | 51.2 / 83.4 / 90.6 | 54.7 / 86.9 / 93.6 | 49.0 / 81.9 / 91.4 | 55.4 / 85.6 / 93.0 | 33.5 / 70.3 / 84.0 | 39.7 / 76.2 / 87.9 |
| 12 | CLIP B/16 | OpenAI | 150M | 77 | 2021/02/26 | 45.7 / 79.6 / 89.1 | 48.4 / 82.4 / 90.8 | 45.6 / 79.0 / 89.2 | 47.6 / 80.9 / 90.8 | 30.3 / 65.1 / 79.8 | 35.8 / 71.0 / 85.8 |

Date indicates the release date of each open-source model. ∞ in the #L column indicates a very large (effectively unbounded) context length.
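The gap between a model's FIBER-S and FIBER-T scores gives a quick read on its spatial-versus-temporal balance. The snippet below illustrates this with two text-to-video R@1 pairs taken from the table above; treating the raw difference as a bias indicator is an illustrative simplification, not a metric defined on this page.

```python
# Illustrative only: FIBER-S vs FIBER-T text-to-video R@1, copied from the table.
scores = {
    "InternVideo2-stage2": {"FIBER-S": 72.4, "FIBER-T": 46.0},
    "CLIP L/14":           {"FIBER-S": 49.0, "FIBER-T": 33.5},
}
for model, s in scores.items():
    gap = s["FIBER-S"] - s["FIBER-T"]
    print(f"{model}: spatial-vs-temporal R@1 gap = {gap:.1f}")
```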

BibTeX

@misc{xu2024finegrainedvideotextretrievalnew,
  title={Fine-grained Video-Text Retrieval: A New Benchmark and Method}, 
  author={Yifan Xu and Xinhao Li and Yichun Yang and Rui Huang and Limin Wang},
  year={2024},
  eprint={2501.00513},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2501.00513}, 
}