SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models

Zhenwei Tang, Difan Jiao, Blair Yang, Ashton Anderson
CSSLab, Department of Computer Science, University of Toronto
Coolwei AI Lab
[COLM '25] Second Conference on Language Modeling

Abstract

Evaluating whether vision–language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information. We introduce SEAM, a benchmark that pairs semantically equivalent inputs across four domains with existing standardized textual and visual notations. By employing distinct notation systems across modalities, in contrast to OCR-based image-text pairing, SEAM provides a rigorous comparative assessment of the textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21 contemporary models, we observe systematic modality imbalance: vision frequently lags language in overall performance, despite the problems containing semantically equivalent information, and cross-modal agreement is relatively low. Our error analysis reveals two main drivers: textual perception failures from tokenization in domain notations and visual perception failures that induce hallucinations. We also show that our results are largely robust to visual transformations. SEAM establishes a controlled, semantically equivalent setting for measuring and improving modality-agnostic reasoning.

Overview

Main results showing SEAM benchmark performance

SEAM comprises 16 tasks across four domains (chess, chemistry, music, and graph theory), each with paired visual-spatial and textual-symbolic representations that are semantically equivalent.
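To make the three evaluation conditions concrete, below is a minimal sketch of how one paired item might be queried under the language (L), vision (V), and vision-language (VL) settings. The `PairedItem` schema, its field names, and the FEN chess example are illustrative assumptions, not the benchmark's actual data format.

```python
from dataclasses import dataclass

# Hypothetical schema for a SEAM-style paired item (illustrative only; the
# field names and the FEN example are assumptions, not the official format).
@dataclass
class PairedItem:
    domain: str          # e.g., "chess", "chemistry", "music", "graph"
    question: str        # task question shared across all modalities
    text_notation: str   # textual-symbolic representation (e.g., a FEN string)
    image_path: str      # visual-spatial representation (rendered image)
    answer: str          # gold answer

def build_prompt(item: PairedItem, modality: str):
    """Return (prompt_text, image_or_None) for the L, V, or VL condition."""
    if modality == "L":    # language only: textual notation + question
        return f"{item.text_notation}\n\n{item.question}", None
    if modality == "V":    # vision only: image + question
        return item.question, item.image_path
    if modality == "VL":   # both representations together
        return f"{item.text_notation}\n\n{item.question}", item.image_path
    raise ValueError(f"unknown modality: {modality}")

item = PairedItem(
    domain="chess",
    question="Which side, if any, is in check? Answer White, Black, or Neither.",
    text_notation="rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
    image_path="boards/start_position.png",
    answer="Neither",
)
print(build_prompt(item, "L")[0])  # text-only prompt for the L condition
```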

Leaderboard

Leaderboard of VLMs across the language (L), vision (V), and vision-language (VL) modalities. Within each group, models are sorted by L-V agreement.

| Model | L Accuracy | V Accuracy | VL Accuracy | Avg Accuracy | L-V Agreement | L-VL Agreement | V-VL Agreement | All Agreement |
|---|---|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | | | |
| GPT-5-mini | 0.787 | 0.653 | 0.830 | 0.756 | 0.630 | 0.846 | 0.653 | 0.584 |
| GPT-5 | 0.804 | 0.632 | 0.857 | 0.765 | 0.627 | 0.876 | 0.657 | 0.596 |
| Claude-3.7-Sonnet | 0.743 | 0.591 | 0.679 | 0.671 | 0.594 | 0.715 | 0.624 | 0.506 |
| Claude-4.1-Opus | 0.827 | 0.578 | 0.814 | 0.740 | 0.575 | 0.844 | 0.580 | 0.523 |
| Claude-4-Sonnet | 0.808 | 0.545 | 0.803 | 0.719 | 0.569 | 0.834 | 0.566 | 0.508 |
| Claude-3.5-Sonnet | 0.665 | 0.560 | 0.514 | 0.580 | 0.537 | 0.549 | 0.508 | 0.378 |
| GPT-4o | 0.635 | 0.482 | 0.627 | 0.581 | 0.503 | 0.686 | 0.532 | 0.410 |
| GPT-5-nano | 0.699 | 0.510 | 0.753 | 0.654 | 0.500 | 0.771 | 0.516 | 0.432 |
| GPT-4o-mini | 0.555 | 0.411 | 0.529 | 0.498 | 0.480 | 0.650 | 0.518 | 0.379 |
| Claude-3.5-Haiku | 0.530 | 0.433 | 0.496 | 0.486 | 0.479 | 0.556 | 0.534 | 0.346 |
| Open-Source Models | | | | | | | | |
| Qwen2.5-VL-72B-Instruct | 0.547 | 0.475 | 0.519 | 0.514 | 0.447 | 0.504 | 0.532 | 0.318 |
| InternVL3-78B | 0.525 | 0.427 | 0.482 | 0.478 | 0.447 | 0.498 | 0.487 | 0.293 |
| gemma-3-27b-it | 0.516 | 0.428 | 0.450 | 0.465 | 0.447 | 0.497 | 0.575 | 0.325 |
| gemma-3-12b-it | 0.458 | 0.401 | 0.429 | 0.429 | 0.419 | 0.474 | 0.543 | 0.297 |
| InternVL-2.5-78B | 0.448 | 0.414 | 0.459 | 0.440 | 0.415 | 0.485 | 0.523 | 0.309 |
| InternVL3-8B | 0.382 | 0.357 | 0.386 | 0.375 | 0.388 | 0.425 | 0.456 | 0.229 |
| Llama-3.2-90B-Vision-Instruct | 0.434 | 0.384 | 0.439 | 0.419 | 0.384 | 0.460 | 0.443 | 0.253 |
| Qwen2.5-Omni-7B | 0.363 | 0.354 | 0.364 | 0.360 | 0.353 | 0.375 | 0.375 | 0.183 |
| Qwen2.5-VL-7B-Instruct | 0.303 | 0.350 | 0.359 | 0.337 | 0.347 | 0.389 | 0.437 | 0.216 |
| InternVL-2.5-8B | 0.324 | 0.337 | 0.334 | 0.332 | 0.324 | 0.340 | 0.436 | 0.196 |
| Llama-3.2-11B-Vision-Instruct | 0.289 | 0.330 | 0.323 | 0.314 | 0.287 | 0.303 | 0.401 | 0.152 |
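The agreement columns above compare final answers on the same items across modalities. Below is a minimal sketch of how accuracy and pairwise agreement could be computed from per-item predictions; the data layout (one answer list per modality, aligned by item index) and the toy values are illustrative assumptions, not the benchmark's official evaluation code.

```python
# Minimal sketch: accuracy and cross-modal agreement from per-item answers.
# The input layout below is an assumption for illustration.
from itertools import combinations

def accuracy(preds, gold):
    """Fraction of items answered correctly."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def agreement(preds_a, preds_b):
    """Fraction of items where two modalities give the same final answer."""
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)

gold = ["A", "C", "B", "D"]
answers = {
    "L":  ["A", "C", "B", "A"],
    "V":  ["A", "B", "B", "D"],
    "VL": ["A", "C", "B", "D"],
}

for m, preds in answers.items():
    print(f"{m} accuracy: {accuracy(preds, gold):.2f}")
for m1, m2 in combinations(answers, 2):
    print(f"{m1}-{m2} agreement: {agreement(answers[m1], answers[m2]):.2f}")

# "All agreement": items where L, V, and VL all give the same answer.
all_agree = sum(len({answers[m][i] for m in answers}) == 1 for i in range(len(gold)))
print(f"All agreement: {all_agree / len(gold):.2f}")
```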

Accuracy vs Agreement

Accuracy vs Agreement scatter plot

Correlation between cross-modal final-answer agreement (language vs. vision inputs) and average accuracy across models, color-coded by model family. The Random Baseline shows how often two answer sets of identical accuracy would agree by chance alone, and can be read as a lower bound on the cross-modal agreement of real multimodal models. Observed agreement is relatively low, often not far from this baseline, suggesting that models process information quite differently across modalities and have substantial room to improve at integrating reasoning and leveraging abilities across representations.
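One common way to formalize such a chance-agreement baseline, under the assumption of k-way multiple choice with independent errors spread uniformly over the wrong options, is sketched below. The formula and the choice of k are illustrative assumptions, not necessarily the exact definition used in the figure.

```python
# Chance agreement between two answer sets of equal accuracy `acc` on k-way
# multiple choice, assuming independent errors distributed uniformly over the
# k - 1 wrong options (an illustrative assumption, not necessarily the
# figure's exact Random Baseline definition).
def random_agreement(acc: float, k: int) -> float:
    return acc**2 + (1 - acc) ** 2 / (k - 1)

# Example: at 60% accuracy on 4-option questions, two independent answer sets
# would agree about 41% of the time by chance.
print(f"{random_agreement(0.60, 4):.3f}")  # 0.413
```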

BibTeX

@inproceedings{tang2025seam,
  title={{SEAM}: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models},
  author={Zhenwei Tang and Difan Jiao and Blair Yang and Ashton Anderson},
  booktitle={Second Conference on Language Modeling},
  year={2025},
  url={https://openreview.net/forum?id=lI4LgGv4sX}
}