XGC-AVis: Towards Audio-Visual Content Understanding with a Multi-Agent Collaborative System

Anonymous authors

[Figure: Architecture of the XGC-AVis Agent.]

Methods	A/V Recognition	A/V Localization	A/V Perception	A/V Reasoning	Average
Closed-Source MLLMs
XGC-AVis (ours)	59.5	51.7	51.0	69.3	58.2
Gemini 2.0 Flash	54.4	49.3	40.2	65.9	51.4
ChatGPT-4o	44.9	47.9	36.6	60.1	46.7
Claude 3.7 Sonnet	28.7	29.5	15.1	49.6	29.8
Open-Source OLMs
Qwen2.5-Omni (7B)	55.4	37.5	45.1	57.9	49.8
VideoLLaMA2 (7B)	48.0	41.7	40.7	40.2	41.4
Unified-IO-2 XXL (8B)	36.5	27.4	34.2	33.0	33.3
GroundingGPT (7B)	40.5	37.8	27.2	36.2	32.8
PandaGPT (7B)	38.2	28.8	28.9	35.4	32.1
Video-SALMONN (7B)	34.5	36.5	26.2	36.0	31.5
BuboGPT (7B)	18.2	14.9	17.4	14.2	16.2
VLMs with subtitles
InternVL3 (8B)	47.6	50.3	40.3	52.2	46.2
LLaVA-OneVision (8B)	47.0	42.7	41.3	48.0	44.4
InternVL2.5 (8B)	37.2	42.0	34.2	48.6	40.2