Architecture of XGC-AVis Agent.

| Methods | A/V Recognition | A/V Localization | A/V Perception | A/V Reasoning | Average |
|---|---|---|---|---|---|
| Closed-source MLLMs | | | | | |
| XGC-AVis (ours) | 59.5 | 51.7 | 51.0 | 69.3 | 58.2 |
| Gemini 2.0 Flash | 54.4 | 49.3 | 40.2 | 65.9 | 51.4 |
| ChatGPT-4o | 44.9 | 47.9 | 36.6 | 60.1 | 46.7 |
| Claude 3.7 Sonnet | 28.7 | 29.5 | 15.1 | 49.6 | 29.8 |
| Open-source OLMs | | | | | |
| Qwen2.5-Omni (7B) | 55.4 | 37.5 | 45.1 | 57.9 | 49.8 |
| VideoLLaMA2 (7B) | 48.0 | 41.7 | 40.7 | 40.2 | 41.4 |
| Unified-IO-2 XXL (8B) | 36.5 | 27.4 | 34.2 | 33.0 | 33.3 |
| GroundingGPT (7B) | 40.5 | 37.8 | 27.2 | 36.2 | 32.8 |
| PandaGPT (7B) | 38.2 | 28.8 | 28.9 | 35.4 | 32.1 |
| Video-SALMONN (7B) | 34.5 | 36.5 | 26.2 | 36.0 | 31.5 |
| BuboGPT (7B) | 18.2 | 14.9 | 17.4 | 14.2 | 16.2 |
| VLMs with subtitles | | | | | |
| InternVL3 (8B) | 47.6 | 50.3 | 40.3 | 52.2 | 46.2 |
| LLaVA-OneVision (8B) | 47.0 | 42.7 | 41.3 | 48.0 | 44.4 |
| InternVL2.5 (8B) | 37.2 | 42.0 | 34.2 | 48.6 | 40.2 |