Evaluating whether vision–language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information. We introduce SEAM, a benchmark that pairs semantically equivalent inputs across four domains that have existing standardized textual and visual notations. By employing distinct notation systems across modalities, in contrast to OCR-based image-text pairing, SEAM provides a rigorous comparative assessment of the textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21 contemporary models, we observe systematic modality imbalance: vision frequently lags language in overall performance, even though the problems contain semantically equivalent information, and cross-modal agreement is relatively low. Our error analysis reveals two main drivers: textual perception failures arising from tokenization of domain notations, and visual perception failures that induce hallucinations. We also show that our results are largely robust to visual transformations. SEAM establishes a controlled, semantically equivalent setting for measuring and improving modality-agnostic reasoning.
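To make the setup concrete, below is a minimal sketch of how a single item could be posed in the three modalities reported in the leaderboard. The `SeamItem` schema and `build_prompts` helper are illustrative assumptions, not the benchmark's actual data format or prompting code.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class SeamItem:
    # Hypothetical schema: the same underlying problem encoded in a standardized
    # textual notation and rendered as an image, sharing one question and answer key.
    question: str        # identical wording in every modality
    text_notation: str   # domain-standard textual encoding of the problem
    image_path: str      # rendering of the same problem as a figure
    gold_answer: str

def build_prompts(item: SeamItem) -> Dict[str, Dict[str, Optional[str]]]:
    """Three modality variants per item: language-only (L), vision-only (V),
    and both together (VL)."""
    return {
        "L":  {"text": f"{item.text_notation}\n\n{item.question}", "image": None},
        "V":  {"text": item.question, "image": item.image_path},
        "VL": {"text": f"{item.text_notation}\n\n{item.question}", "image": item.image_path},
    }
```

Because the textual notation and the image carry the same information, any gap between the L and V columns below reflects the representation rather than the task.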
Leaderboard of VLMs across the language (L), vision (V), and vision-language (VL) modalities. Within each group, models are sorted by L-V agreement. A sketch of how these metrics can be computed from per-item answers follows the table.
Model | L Accuracy | V Accuracy | VL Accuracy | Avg Accuracy | L-V Agreement | L-VL Agreement | V-VL Agreement | All Agreement |
---|---|---|---|---|---|---|---|---|
Proprietary Models | ||||||||
GPT-5-mini | 0.787 | 0.653 | 0.830 | 0.756 | 0.630 | 0.846 | 0.653 | 0.584 |
GPT-5 | 0.804 | 0.632 | 0.857 | 0.765 | 0.627 | 0.876 | 0.657 | 0.596 |
Claude-3.7-Sonnet | 0.743 | 0.591 | 0.679 | 0.671 | 0.594 | 0.715 | 0.624 | 0.506 |
Claude-4.1-Opus | 0.827 | 0.578 | 0.814 | 0.740 | 0.575 | 0.844 | 0.580 | 0.523 |
Claude-4-Sonnet | 0.808 | 0.545 | 0.803 | 0.719 | 0.569 | 0.834 | 0.566 | 0.508 |
Claude-3.5-Sonnet | 0.665 | 0.560 | 0.514 | 0.580 | 0.537 | 0.549 | 0.508 | 0.378 |
GPT-4o | 0.635 | 0.482 | 0.627 | 0.581 | 0.503 | 0.686 | 0.532 | 0.410 |
GPT-5-nano | 0.699 | 0.510 | 0.753 | 0.654 | 0.500 | 0.771 | 0.516 | 0.432 |
GPT-4o-mini | 0.555 | 0.411 | 0.529 | 0.498 | 0.480 | 0.650 | 0.518 | 0.379 |
Claude-3.5-Haiku | 0.530 | 0.433 | 0.496 | 0.486 | 0.479 | 0.556 | 0.534 | 0.346 |
Open-Source Models | ||||||||
Qwen2.5-VL-72B-Instruct | 0.547 | 0.475 | 0.519 | 0.514 | 0.447 | 0.504 | 0.532 | 0.318 |
InternVL3-78B | 0.525 | 0.427 | 0.482 | 0.478 | 0.447 | 0.498 | 0.487 | 0.293 |
gemma-3-27b-it | 0.516 | 0.428 | 0.450 | 0.465 | 0.447 | 0.497 | 0.575 | 0.325 |
gemma-3-12b-it | 0.458 | 0.401 | 0.429 | 0.429 | 0.419 | 0.474 | 0.543 | 0.297 |
InternVL-2.5-78B | 0.448 | 0.414 | 0.459 | 0.440 | 0.415 | 0.485 | 0.523 | 0.309 |
InternVL3-8B | 0.382 | 0.357 | 0.386 | 0.375 | 0.388 | 0.425 | 0.456 | 0.229 |
Llama-3.2-90B-Vision-Instruct | 0.434 | 0.384 | 0.439 | 0.419 | 0.384 | 0.460 | 0.443 | 0.253 |
Qwen2.5-Omni-7B | 0.363 | 0.354 | 0.364 | 0.360 | 0.353 | 0.375 | 0.375 | 0.183 |
Qwen2.5-VL-7B-Instruct | 0.303 | 0.350 | 0.359 | 0.337 | 0.347 | 0.389 | 0.437 | 0.216 |
InternVL-2.5-8B | 0.324 | 0.337 | 0.334 | 0.332 | 0.324 | 0.340 | 0.436 | 0.196 |
Llama-3.2-11B-Vision-Instruct | 0.289 | 0.330 | 0.323 | 0.314 | 0.287 | 0.303 | 0.401 | 0.152 |
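As a minimal sketch, the leaderboard columns can be reproduced from per-item model answers as below. The `ItemResult` record is a hypothetical format, and the sketch assumes an agreement column counts the fraction of items whose final answers coincide across the listed modalities; consult the paper for the exact definitions.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple

@dataclass
class ItemResult:
    # Hypothetical per-item record: gold answer plus the model's final answer
    # under each modality, e.g. {"L": "A", "V": "B", "VL": "A"}.
    gold: str
    answers: Dict[str, str]

def accuracy(results: List[ItemResult], modality: str) -> float:
    """Fraction of items answered correctly under one modality."""
    return sum(r.answers[modality] == r.gold for r in results) / len(results)

def agreement(results: List[ItemResult], modalities: Tuple[str, ...]) -> float:
    """Fraction of items whose final answers coincide across the given modalities."""
    return sum(len({r.answers[m] for m in modalities}) == 1 for r in results) / len(results)

def leaderboard_row(results: List[ItemResult]) -> Dict[str, float]:
    """Assemble one leaderboard row: per-modality accuracy, average accuracy,
    pairwise agreements, and all-modality agreement."""
    mods = ("L", "V", "VL")
    row = {f"{m} Accuracy": accuracy(results, m) for m in mods}
    row["Avg Accuracy"] = sum(row.values()) / len(mods)
    for a, b in combinations(mods, 2):
        row[f"{a}-{b} Agreement"] = agreement(results, (a, b))
    row["All Agreement"] = agreement(results, mods)
    return row
```

Note that agreement here is answer-level, so two modalities can agree while both being wrong; this is why a model can have moderate accuracies but a low All Agreement score.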
@inproceedings{
tang2025seam,
title={{SEAM}: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models},
author={Zhenwei Tang and Difan Jiao and Blair Yang and Ashton Anderson},
booktitle={Second Conference on Language Modeling},
year={2025},
url={https://openreview.net/forum?id=lI4LgGv4sX}
}