Video-MME

The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Introduction

In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point of recent advances. However, the predominant focus remains on their static image understanding capabilities; their potential for processing sequential visual data is still insufficiently explored, and a comprehensive, high-quality assessment of this ability has been absent. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work distinguishes itself from existing benchmarks through four key features: 1) Diversity in video types, spanning 6 primary visual domains with 30 subfields to ensure broad scenario generalizability; 2) Duration in the temporal dimension, encompassing short-, medium-, and long-term videos ranging from 11 seconds to 1 hour, for robust contextual dynamics; 3) Breadth in data modalities, integrating multi-modal inputs beyond video frames, including subtitles and audio, to unveil the all-round capabilities of MLLMs; 4) Quality in annotations, relying on rigorous manual labeling by expert annotators for precise and reliable model assessment. In total, 900 videos spanning 254 hours are manually selected and annotated by repeatedly viewing all the video content, yielding 2,700 question-answer pairs.

With Video-MME, we extensively evaluate various state-of-the-art MLLMs, including the GPT-4 series and Gemini 1.5 Pro, as well as open-source image models such as InternVL-Chat-V1.5 and video models such as LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, attaining an average accuracy of 75% (versus 71.9% for GPT-4o) and significantly outperforming the open-source models. The results also demonstrate that Video-MME is a universal benchmark that applies to both image and video MLLMs. Further analysis indicates that subtitle and audio information can significantly enhance video understanding, and that performance declines as video duration increases for all models. These findings, together with our dataset, underscore the need for improvements in handling longer sequences and multi-modal data, shedding light on future MLLM development.
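
For readers who want to browse the data directly, below is a minimal sketch of loading the benchmark with the Hugging Face `datasets` library. The repository id (`lmms-lab/Video-MME`), split name, and field names (`duration`, `question`, `options`, `answer`) are assumptions about the public release rather than a documented API; check the official repository for the actual schema.

```python
# Minimal sketch of browsing the Video-MME question-answer pairs.
# Assumptions: the benchmark is hosted on Hugging Face as "lmms-lab/Video-MME"
# with a "test" split and fields named "duration", "question", "options",
# and "answer"; check the official release for the actual repo id and schema.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("lmms-lab/Video-MME", split="test")  # repo id assumed

# How many questions fall into each duration bucket (short/medium/long)?
print(Counter(sample["duration"] for sample in ds))

# Inspect one multiple-choice question.
sample = ds[0]
print(sample["question"])
for option in sample["options"]:  # option strings, e.g. "A. ..."
    print(option)
print("ground truth:", sample["answer"])
```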

Leaderboard

Accuracy scores on Video-MME are reported for short, medium, and long videos, both with and without the corresponding subtitles as input.

Short Video: < 2min          Medium Video: 4min ~ 15min          Long Video: 30min ~ 60min
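
Each cell of the leaderboard reports accuracy over the multiple-choice questions of one duration bucket; the "with subtitles" and "without subtitles" numbers come from two separate runs over the same questions. A minimal sketch of this aggregation is shown below; the record layout (`duration`, `answer`, `prediction`) is hypothetical and is not the official evaluation script.

```python
# Hypothetical sketch of the aggregation behind the leaderboard numbers.
# Each record carries the video's duration bucket, the ground-truth option
# letter, and the model's predicted option letter; field names are assumed.
from collections import defaultdict

def leaderboard_scores(records):
    """Return overall and per-duration accuracy, in percent."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        bucket = r["duration"]  # "short" | "medium" | "long"
        total[bucket] += 1
        correct[bucket] += int(r["prediction"] == r["answer"])
    scores = {b: 100.0 * correct[b] / total[b] for b in total}
    scores["overall"] = 100.0 * sum(correct.values()) / sum(total.values())
    return scores

# Toy example; "w/o subs" and "w subs" are simply two separate runs.
records = [
    {"duration": "short", "answer": "A", "prediction": "A"},
    {"duration": "medium", "answer": "C", "prediction": "B"},
    {"duration": "long", "answer": "D", "prediction": "D"},
]
print(leaderboard_scores(records))
# {'short': 100.0, 'medium': 0.0, 'long': 100.0, 'overall': 66.66...}
```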

The leaderboard is sorted by overall accuracy with subtitles. Each score cell lists accuracy without subtitles followed by accuracy with subtitles (w/o / w subs).

| # | Model | Organization | LLM Params | Frames | Date | Overall (w/o / w subs) | Short (w/o / w subs) | Medium (w/o / w subs) | Long (w/o / w subs) |
|---|-------|--------------|------------|--------|------|------------------------|----------------------|-----------------------|---------------------|
| 1 | Gemini 1.5 Pro | Google | - | 1/0.5 fps [1] | 2024-06-15 | 75.0 / 81.3 | 81.7 / 84.5 | 74.3 / 81.0 | 67.4 / 77.4 |
| 2 | Qwen2-VL | Alibaba | 72B | 768 [3] | 2024-08-19 | 71.2 / 77.8 | 80.1 / 82.2 | 71.3 / 76.8 | 62.2 / 74.3 |
| 3 | GPT-4o | OpenAI | - | 384 [2] | 2024-06-15 | 71.9 / 77.2 | 80.0 / 82.8 | 70.3 / 76.6 | 65.3 / 72.1 |
| 4 | LLaVA-Video | Bytedance & NTU S-Lab | 72B | 64 | 2024-08-28 | 70.6 / 76.9 | 81.4 / 82.8 | 68.9 / 75.6 | 61.5 / 72.5 |
| 5 | Gemini 1.5 Flash | Google | - | 1/0.5 fps [1] | 2024-06-15 | 70.3 / 75.0 | 78.8 / 79.8 | 68.8 / 74.7 | 61.1 / 68.8 |
| 6 | Oryx-1.5 | THU & Tencent & NTU | 34B | 128 | 2024-10-21 | 67.3 / 74.9 | 77.3 / 80.6 | 65.3 / 74.3 | 59.3 / 69.9 |
| 7 | Aria | Rhymes AI | 8x3.5B | 256 | 2024-10-11 | 67.6 / 72.1 | 76.9 / 78.3 | 67.0 / 71.7 | 58.8 / 66.3 |
| 8 | NVILA | NVIDIA | 7B | 1024 | 2024-11-06 | 64.2 / 70.0 | 75.7 / 77.6 | 62.2 / 69.0 | 54.8 / 63.3 |
| 9 | LLaVA-OneVision | Bytedance & NTU S-Lab | 72B | 32 | 2024-08-08 | 66.3 / 69.6 | 76.7 / 79.3 | 62.2 / 66.9 | 60.0 / 62.4 |
| 10 | GPT-4o mini | OpenAI | - | 250 | 2024-07-21 | 64.8 / 68.9 | 72.5 / 74.9 | 63.1 / 68.3 | 58.6 / 63.4 |
| 11 | ByteVideoLLM | Bytedance | 14B | 100 | 2024-10-21 | 64.6 / 68.8 | 74.4 / 77.1 | 62.9 / 69.1 | 56.4 / 60.2 |
| 12 | mPLUG-Owl3 | Alibaba | 7B | 128 | 2024-11-13 | 59.3 / 68.1 | 70.0 / 72.8 | 57.7 / 66.9 | 50.1 / 64.5 |
| 13 | VideoLLaMA 2 | Alibaba | 72B | 32 | 2024-08-29 | 62.4 / 64.7 | 69.8 / 72.0 | 59.9 / 63.0 | 57.6 / 59.0 |
| 14 | MiniCPM-V 2.6 | OpenBMB | 8B | 64 | 2024-08-12 | 60.9 / 63.7 | 71.3 / 73.5 | 59.4 / 61.1 | 51.8 / 56.3 |
| 15 | GPT-4V | OpenAI | - | 10 | 2024-06-15 | 59.9 / 63.3 | 70.5 / 73.2 | 55.8 / 59.7 | 53.5 / 56.9 |
| 16 | Claude 3.5 Sonnet | Anthropic | - | 20 | 2024-07-30 | 60.0 / 62.9 | 71.0 / 73.5 | 57.4 / 60.1 | 51.2 / 54.7 |
| 17 | TimeMarker | Meituan | 8B | 128 | 2024-11-07 | 57.3 / 62.8 | 71.0 / 75.8 | 54.4 / 60.7 | 46.4 / 51.9 |
| 18 | InternVL2 | Shanghai AI Lab | 34B | 16 | 2024-07-18 | 61.2 / 62.4 | 72.0 / 72.8 | 59.1 / 61.3 | 52.6 / 53.0 |
| 19 | Video-XL | SJTU & BAAI | 7B | 128 | 2024-10-24 | 55.5 / 61.0 | 64.0 / 67.4 | 53.2 / 60.7 | 49.2 / 54.9 |
| 20 | VITA | Tencent Youtu Lab & NJU | 8×7B | 32 | 2024-09-08 | 55.8 / 59.2 | 65.9 / 70.4 | 52.9 / 56.2 | 48.6 / 50.9 |
| 21 | Kangaroo | Meituan & UCAS | 8B | 64 | 2024-07-23 | 56.0 / 57.6 | 66.1 / 68.0 | 55.3 / 55.4 | 46.6 / 49.3 |
| 22 | Video-CCAM | QQMM | 14B | 96 | 2024-07-16 | 53.2 / 57.4 | 62.2 / 66.0 | 50.6 / 56.3 | 46.7 / 49.9 |
| 23 | Long-LLaVA | Amazon | 7B | 64 | 2024-09-09 | 52.9 / 57.1 | 61.9 / 66.2 | 51.4 / 54.7 | 45.4 / 50.3 |
| 24 | LongVA | NTU S-Lab | 7B | 128 | 2024-06-25 | 52.6 / 54.3 | 61.1 / 61.6 | 50.4 / 53.6 | 46.2 / 47.6 |
| 25 | InternVL-Chat-V1.5 | Shanghai AI Lab | 20B | 10 | 2024-06-15 | 50.7 / 52.4 | 60.2 / 61.7 | 46.4 / 49.1 | 45.6 / 46.6 |
| 26 | Qwen-VL-Max | Alibaba | - | 4 | 2024-06-15 | 51.3 / 51.2 | 55.8 / 57.6 | 49.2 / 48.9 | 48.9 / 47.0 |
| 27 | ShareGemini | XMU | 7B | 64 | 2024-06-20 | 43.2 / 47.9 | 49.1 / 52.8 | 41.3 / 47.3 | 39.1 / 43.4 |
| 28 | SliME | CASIA | 8B | 8 | 2024-07-16 | 45.3 / 47.2 | 53.3 / 55.4 | 42.7 / 44.4 | 39.8 / 41.7 |
| 29 | Chat-UniVi-v1.5 | PKU | 7B | 64 | 2024-06-15 | 40.6 / 45.9 | 45.7 / 51.2 | 40.3 / 44.6 | 35.8 / 41.8 |
| 30 | VideoChat2-Mistral | Shanghai AI Lab | 7B | 16 | 2024-06-15 | 39.5 / 43.8 | 48.3 / 52.8 | 37.0 / 39.4 | 33.2 / 39.2 |
| 31 | ShareGPT4Video | Shanghai AI Lab | 8B | 16 | 2024-06-17 | 39.9 / 43.6 | 48.3 / 53.6 | 36.3 / 39.3 | 35.0 / 37.9 |
| 32 | ST-LLM | PKU | 7B | 64 | 2024-06-15 | 37.9 / 42.3 | 45.7 / 48.4 | 36.8 / 41.4 | 31.3 / 36.9 |
| 33 | Qwen-VL-Chat | Alibaba | 7B | 4 | 2024-06-15 | 41.1 / 41.9 | 46.9 / 47.3 | 38.7 / 40.4 | 37.8 / 37.9 |
| 34 | Video-LLaVA | PKU | 7B | 8 | 2024-06-15 | 39.9 / 41.6 | 45.3 / 46.1 | 38.0 / 40.7 | 36.2 / 38.1 |

"-" in the LLM Params column indicates closed-source models whose sizes are undisclosed.

[1] Short and medium videos are sampled at 1 fps, while long videos are sampled at 0.5 fps to ensure the stability of the API.
[2] Videos shorter than 384 seconds are sampled at 1 fps; for longer videos, 384 frames are extracted uniformly. All frames are resized to 512x512 resolution to fit within GPT-4o's maximum context length.
[3] Videos are sampled at 2 fps, with an upper limit of 768 frames.
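
A sketch of these three sampling policies is given below, assuming access to each video's length and native frame rate. Falling back to uniform sampling once the 768-frame cap is exceeded is our reading of note [3], not a documented detail of any model's actual preprocessing pipeline.

```python
# Sketch of the three frame-sampling policies described in notes [1]-[3].
# Inputs: video length in seconds and its native frame rate.

def fps_sampling(duration_s: float, native_fps: float, sample_fps: float) -> list[int]:
    """Fixed-rate sampling, e.g. 1 fps (note [1], short/medium videos)."""
    step = native_fps / sample_fps
    return [round(i * step) for i in range(int(duration_s * sample_fps))]

def uniform_sampling(duration_s: float, native_fps: float, num_frames: int) -> list[int]:
    """Spread a fixed frame budget evenly, e.g. 384 frames (note [2], long videos)."""
    total = int(duration_s * native_fps)
    return [int(i * total / num_frames) for i in range(num_frames)]

def capped_fps_sampling(duration_s: float, native_fps: float,
                        sample_fps: float = 2.0, max_frames: int = 768) -> list[int]:
    """2 fps with a 768-frame cap (note [3]); the fallback to uniform
    sampling when the cap is exceeded is an assumption."""
    indices = fps_sampling(duration_s, native_fps, sample_fps)
    if len(indices) <= max_frames:
        return indices
    return uniform_sampling(duration_s, native_fps, max_frames)

# A 30-minute video at 30 fps: 2 fps sampling would yield 3600 frames,
# so the cap triggers and 768 frames are taken uniformly instead.
print(len(capped_fps_sampling(30 * 60, 30.0)))  # 768
```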

Benchmark

Data Examples

Benchmark Statistics


(Left) Video Category Hierarchy: Video-MME consists of 6 key domains and 30 subcategories of video types.
(Right) Video Duration and Task Type Distributions: Video-MME spans a full spectrum of video lengths and assesses various core abilities of MLLMs.

Benchmark Comparison


Analysis of certificate length (in seconds), i.e., the minimum span of video a human needs to watch to verify the answer to a question. Avg. V.L.: average video length; Med. C.L.: median certificate length; Avg. C.L.: average certificate length.


The comparison of various benchmarks covers several key aspects: the total number of videos, the number of clips, the average video duration, the annotation method (M for manual, A for automated), the average number of tokens per QA pair, the average number of subtitle tokens, whether the videos cover multiple duration levels, whether the videos are sourced from a broad range of open domains, and whether subtitles and audio are provided. Note that if a dataset includes multiple task formats, the comparison focuses solely on its multiple-choice segment.

Experiment Results

Different Question Types


Evaluation results of four representative MLLMs.

Different Video Duration Types

Evaluation results of Gemini 1.5 Pro.

Citation


    @article{fu2024video,
      title={Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis},
      author={Fu, Chaoyou and Dai, Yuhan and Luo, Yongdong and Li, Lei and Ren, Shuhuai and Zhang, Renrui and Wang, Zihan and Zhou, Chenyu and Shen, Yunhang and Zhang, Mengdan and others},
      journal={arXiv preprint arXiv:2405.21075},
      year={2024}
    }