XGC-AVis: Towards Audio-Visual Content Understanding with a Multi-Agent Collaborative System

Anonymous authors

[Figure: Architecture of XGC-AVis Agent. Labels: XGC-AVis Agent; Baseline model.]

Leaderboard

Methods                  A/V Recognition  A/V Localization  A/V Perception  A/V Reasoning  Average

Closed-Source MLLMs
XGC-AVis (ours)          59.5             51.7              51.0            69.3           58.2
Gemini 2.0 Flash         54.4             49.3              40.2            65.9           51.4
ChatGPT-4o               44.9             47.9              36.6            60.1           46.7
Claude 3.7 Sonnet        28.7             29.5              15.1            49.6           29.8

Open-source OLMs
Qwen2.5-Omni (7B)        55.4             37.5              45.1            57.9           49.8
VideoLLaMA2 (7B)         48.0             41.7              40.7            40.2           41.4
Unified-IO-2 XXL (8B)    36.5             27.4              34.2            33.0           33.3
GroundingGPT (7B)        40.5             37.8              27.2            36.2           32.8
PandaGPT (7B)            38.2             28.8              28.9            35.4           32.1
Video-SALMONN (7B)       34.5             36.5              26.2            36.0           31.5
Bubogpt (7B)             18.2             14.9              17.4            14.2           16.2

VLMs with subtitles
InternVL3 (8B)           47.6             50.3              40.3            52.2           46.2
LLaVA-OneVision (8B)     47.0             42.7              41.3            48.0           44.4
InternVL2.5 (8B)         37.2             42.0              34.2            48.6           40.2

QA Examples

Audio Source Recognition
Music Recognition
Counting
Audio Source Localization
Speaker Localization
Music Localization
A/V Content Matching
Music Temporal Matching
Audio Temporal Matching
Speech Temporal Matching
Distortion Type Classification
Distortion Localization
A/V Overall Quality
Music Understanding
Event Causal Reasoning
Human Interaction Reasoning
Identity Reasoning
Audio Causal Reasoning
A/V Prediction
Emotion Reasoning