VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models

1 School of Computer and Communication Sciences, EPFL · 2 National and Kapodistrian University of Athens · 3 Meta GenAI · 4 University College London · 5 University of Oxford
* Work done at Meta as an intern

What is VGRP-Bench?

VGRP-Bench is a benchmark containing 20 visual grid reasoning puzzles with diverse difficulty levels that pose significant challenges for current Large Vision-Language Models (LVLMs).


Benchmark Overview. (a) We present a benchmark for LVLMs with 20 diverse visual grid reasoning puzzles. (b) We evaluate top models (GPT-4o, Gemini, Llama 3.2, Gemini-Thinking) on perception, puzzle-solving, and rule-following. We also explore improvement techniques: (c) Solution Supervised Fine-Tuning (S-SFT) and (d) Reasoning Supervised Fine-Tuning (R-SFT).

Abstract

Large Vision-Language Models (LVLMs) struggle with puzzles, which require precise perception, rule comprehension, and logical reasoning. Assessing and enhancing their performance in this domain is crucial, as it reflects their ability to engage in structured reasoning — an essential skill for real-world problem-solving. However, existing benchmarks primarily evaluate pre-trained models without additional training or fine-tuning, often lack a dedicated focus on reasoning, and fail to establish a systematic evaluation framework. To address these limitations, we introduce VGRP-Bench, a Visual Grid Reasoning Puzzle Benchmark featuring 20 diverse puzzles. VGRP-Bench spans multiple difficulty levels and includes extensive experiments not only on existing chat LVLMs (e.g., GPT-4o) but also on reasoning LVLMs (e.g., Gemini-Thinking). Our results reveal that even state-of-the-art LVLMs struggle with these puzzles, highlighting fundamental limitations in their puzzle-solving capabilities. Most importantly, through systematic experiments, we identify and analyze key factors influencing LVLMs' puzzle-solving performance, including the number of clues, grid size, and rule complexity. Furthermore, we explore two Supervised Fine-Tuning (SFT) strategies that can be used in post-training: SFT on solutions (S-SFT) and SFT on synthetic reasoning processes (R-SFT). While both methods significantly improve performance on trained puzzles, they exhibit limited generalization to unseen ones. We will release VGRP-Bench to facilitate further research on LVLMs for complex, real-world problem-solving.
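The two fine-tuning strategies described above differ only in their supervision target: S-SFT trains the model to emit the final solution grid directly, while R-SFT trains it to first emit a synthetic step-by-step reasoning trace that ends in the same solution. A minimal sketch of how such training targets might be assembled (the function name, field names, and text format here are illustrative assumptions, not the paper's exact data format):

```python
def make_sft_targets(puzzle_prompt, solution_grid, reasoning_steps):
    """Build supervision targets for the two SFT variants.

    S-SFT: the training target is only the solved grid.
    R-SFT: the training target is a reasoning trace followed by
    the same solved grid.
    """
    # Serialize the grid as whitespace-separated rows.
    solution_text = "\n".join(
        " ".join(str(cell) for cell in row) for row in solution_grid
    )
    s_sft_target = solution_text
    r_sft_target = "\n".join(reasoning_steps) + "\n" + solution_text
    return {"prompt": puzzle_prompt, "s_sft": s_sft_target, "r_sft": r_sft_target}
```

The point of the contrast is that both variants share the same prompt and the same final answer; R-SFT simply prepends intermediate reasoning to the target sequence.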

Puzzles with diverse rules

Qualitative Results


Example model outputs on various puzzles from VGRP-Bench.

Result Summary on Easy Level. Puzzle-solving rate of state-of-the-art chat LVLMs on easy-level puzzles associated with each rule. Please refer to the experiment section for detailed result analysis. Note that the y-axis of this plot ranges from 0 to 45%, not 0 to 100%.

Details on these rule axes can be found in the paper: Sum and Subtraction (S.S.), (Comp.), Unidirectionality (Uni.), Hard Perception (Non-desc.), Matching (Match.), Connected Components (Connect.).


Interactive Demo

Try out our interactive Sudoku puzzle demo below to experience how LVLMs approach these puzzles. You can play the puzzles yourself and compare your reasoning with the model's approach.
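When checking your own attempt (or a model's output) against the rules, a completed Sudoku grid is valid exactly when every row, every column, and every 3x3 box contains the digits 1 through 9 once each. A minimal validity checker along these lines (a generic sketch, not the benchmark's evaluation code):

```python
def is_valid_sudoku_solution(grid):
    """Return True if a completed 9x9 grid satisfies the Sudoku rules:
    each row, each column, and each 3x3 box contains 1-9 exactly once."""
    target = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r + dr][c + dc] for dr in range(3) for dc in range(3)}
        for r in range(0, 9, 3)
        for c in range(0, 9, 3)
    ]
    # A set of 9 cells equals {1..9} only if all digits appear once.
    return all(group == target for group in rows + cols + boxes)
```

The same pattern (collect each constraint group, compare against the allowed symbol set) extends naturally to other grid puzzles with row/column/region constraints.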

BibTeX

If you find this work useful in your research, please consider citing the following BibTeX entry:

@article{ren2025vgrp-bench,
  title={VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models},
  author={Yufan Ren and Konstantinos Tertikas and Shalini Maiti and Junlin Han and Tong Zhang and Sabine S{\"u}sstrunk and Filippos Kokkinos},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2025}
}