VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models

Yufan Ren¹*, Konstantinos Tertikas², Shalini Maiti³,⁴, Junlin Han³,⁵
Tong Zhang¹, Sabine Süsstrunk¹, Filippos Kokkinos³

¹School of Computer and Communication Sciences, EPFL ²National and Kapodistrian University of Athens ³Meta GenAI ⁴University College London ⁵University of Oxford

* Work done during Yufan's internship at Meta


What is VGRP-Bench?

VGRP-Bench is a benchmark containing 20 visual grid reasoning puzzles with diverse difficulty levels that pose significant challenges for current Large Vision-Language Models (LVLMs).

Benchmark Overview. (a) We present a benchmark for LVLMs with 20 diverse visual grid reasoning puzzles. (b) We evaluate top models (GPT-4o, Gemini, Llama 3.2, Gemini-Thinking) on perception, puzzle-solving, and rule-following. We also explore improvement techniques: (c) Solution Supervised Fine-Tuning (S-SFT) and (d) Reasoning Supervised Fine-Tuning (R-SFT).

Abstract

Large Vision-Language Models (LVLMs) struggle with puzzles, which require precise perception, rule comprehension, and logical reasoning. Assessing and enhancing their performance in this domain is crucial, as it reflects their ability to engage in structured reasoning, an essential skill for real-world problem-solving. However, existing benchmarks primarily evaluate pre-trained models without additional training or fine-tuning, often lack a dedicated focus on reasoning, and fail to establish a systematic evaluation framework. To address these limitations, we introduce VGRP-Bench, a Visual Grid Reasoning Puzzle Benchmark featuring 20 diverse puzzles. VGRP-Bench spans multiple difficulty levels and includes extensive experiments not only on existing chat LVLMs (e.g., GPT-4o) but also on reasoning LVLMs (e.g., Gemini-Thinking). Our results reveal that even state-of-the-art LVLMs struggle with these puzzles, highlighting fundamental limitations in their puzzle-solving capabilities. Most importantly, through systematic experiments, we identify and analyze key factors influencing LVLMs' puzzle-solving performance, including the number of clues, grid size, and rule complexity. Furthermore, we explore two Supervised Fine-Tuning (SFT) strategies that can be used in post-training: SFT on solutions (S-SFT) and SFT on synthetic reasoning processes (R-SFT). While both methods significantly improve performance on trained puzzles, they exhibit limited generalization to unseen ones. We will release VGRP-Bench to facilitate further research on LVLMs for complex, real-world problem-solving.

Puzzles with diverse rules

1. Aquarium: You need to fill each aquarium with water up to a certain level or leave it empty. The numbers on the sides indicate how many filled (water) cells must be in each row and column.

2. Battle-Ships: You need to place ships in a grid based on row and column hints. The hints indicate how many ship cells are in each row and column. Ships cannot touch each other, even diagonally.

3. Binairo: You have to fill a grid with white (w) and black (b) pieces. No more than two pieces of the same color can be adjacent (horizontally or vertically).

4. Colored-Sudoku: You have to fill in an NxN grid with digits from 1 to N. Each row, column, and block must have unique digits. Additionally, cells with the same color must have the same digit, and cells with different colors must have different digits.

5. Field-Explorer: You need to identify mine locations in a grid based on revealed numbers. Each revealed number indicates how many mines are adjacent to that cell (including diagonals).

6. Futoshiki (Unequal): You have to enter a numerical digit from 1 through N in each cell of an NxN grid. The rules are: unique numbers in each row and column; inequality signs between cells must be respected.

7. Hitori: You need to shade some cells in the grid such that no number appears more than once in each row and column among unshaded cells. The rules are: shaded cells cannot be adjacent; all unshaded cells must be connected.

8. Jigsaw-Sudoku: You have to enter a numerical digit from 1 through N in each cell of an NxN grid. The rules are: unique numbers in each row, column, and region. Each region is a connected group of cells.

9. Kakurasu: You need to shade some cells in a grid where the sum of the weights of selected cells in each row and column matches the given clues. The weights increase from left to right (for rows) and top to bottom (for columns), starting from 1.

10. Kakuro: You have to fill in the grid with numbers (1 to N) such that each row and column adds up to the specified sum. The rules are: adjacent numbers should not be the same; numbers add up to the given sum for each row and column.

11. Killer-Sudoku: You have to enter a numerical digit from 1 through N in each cell of an NxN grid. The board is divided into cages based on cell color. The rules are: unique numbers in each row, column, and sqrt(N)xsqrt(N) block; the sum of the numbers in each cage must equal the target sum.

12. Light-Up: You have to place light bulbs in the grid such that all empty cells are illuminated. Light bulbs illuminate their entire row and column until blocked by a wall; numbered walls must have exactly that many bulbs adjacent to them; bulbs cannot illuminate each other.

13. Nonogram: You need to fill in cells in a grid based on numbers at the side of the grid. For each row or column, the numbers indicate the lengths of consecutive shaded cells in that row/column, which must appear in the given order.

14. Odd-Even-Sudoku: You have to enter a numerical digit from 1 through N in each cell of an NxN grid. The rules are: unique numbers in each row, column, and sqrt(N)xsqrt(N) block. Additionally, white cells must contain even numbers, and black cells must contain odd numbers.

15. Renzoku (Neighbors): You have to enter a numerical digit from 1 through N in each cell of an NxN grid. The rules are: unique numbers in each row and column; a dot between two cells indicates that their numbers must be consecutive; without a dot, the two numbers must be non-consecutive.

16. Skyscraper: You have to enter a numerical digit from 1 through N in each cell of an NxN grid. The numbers indicate the heights of the skyscrapers. The numbers on the sides of the grid indicate how many skyscrapers you would see when looking in the direction of the arrow.

17. Star-Battle: You have to place stars on the grid such that each row, column, and region contains exactly one star. An additional rule is that stars cannot touch each other, not even diagonally.

18. Sudoku: You have to enter a numerical digit from 1 through N in each cell of an NxN grid made up of N sqrt(N)xsqrt(N) blocks. The rule is: unique numbers in each row, column, and block.

19. Thermometers: You need to fill thermometers. The numbers on the sides indicate how many filled cells must be in each row and column. In the end, all thermometers must be filled from their bulb (start) to their top, without gaps.

20. Trees-and-Tents: You need to place tents on a grid with trees. Each tree must be paired with exactly one tent that is horizontally or vertically adjacent to it (a 1-to-1 relationship). Tents cannot touch each other, even diagonally. The numbers on the sides indicate how many tents must be in each row and column.
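Several of the rules above reduce to short mechanical checks. The following Python sketch illustrates three of them: the Sudoku uniqueness rule, the Skyscraper visibility clue, and the Kakurasu weighted sums. It is an illustration only and not the benchmark's evaluation code; the function names (`is_valid_sudoku`, `visible_skyscrapers`, `kakurasu_sums`) and the list-of-lists grid representation are assumptions made for this example.

```python
from math import isqrt


def is_valid_sudoku(grid):
    """Check a filled NxN grid: every row, column, and block holds 1..N once.

    `grid` is a list of N rows, each a list of N integers (assumed format).
    """
    n = len(grid)
    b = isqrt(n)  # block side length, sqrt(N)
    if b * b != n or any(len(row) != n for row in grid):
        return False
    target = set(range(1, n + 1))
    rows_ok = all(set(row) == target for row in grid)
    cols_ok = all({grid[r][c] for r in range(n)} == target for c in range(n))
    blocks_ok = all(
        {grid[r][c] for r in range(br, br + b) for c in range(bc, bc + b)} == target
        for br in range(0, n, b)
        for bc in range(0, n, b)
    )
    return rows_ok and cols_ok and blocks_ok


def visible_skyscrapers(heights):
    """Count skyscrapers visible when looking along `heights` from its start.

    A building is visible if it is taller than every building before it.
    """
    count, tallest = 0, 0
    for h in heights:
        if h > tallest:
            count += 1
            tallest = h
    return count


def kakurasu_sums(shaded):
    """Return (row_sums, col_sums) for a grid of 0/1 shading flags.

    In a row, a shaded cell in column j (1-based) contributes weight j;
    in a column, a shaded cell in row i (1-based) contributes weight i.
    """
    n_rows, n_cols = len(shaded), len(shaded[0])
    row_sums = [
        sum(j + 1 for j in range(n_cols) if shaded[i][j]) for i in range(n_rows)
    ]
    col_sums = [
        sum(i + 1 for i in range(n_rows) if shaded[i][j]) for j in range(n_cols)
    ]
    return row_sums, col_sums
```

For example, `visible_skyscrapers([2, 1, 4, 3])` returns 2, because only the buildings of height 2 and 4 are not hidden by a taller one in front of them. A candidate Kakurasu shading is correct when `kakurasu_sums` returns exactly the row and column clues of the puzzle.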

Qualitative Results

Example model outputs on various puzzles from VGRP-Bench.

Result Summary on Easy Level. Puzzle-solving rate of state-of-the-art chat LVLMs on easy-level puzzles associated with each rule. Please refer to the experiment section of the paper for a detailed analysis of the results. Note that the score axis of this plot ranges from 0 to 45%, not from 0 to 100%.

Details on these rule axes can be found in the paper. For example, S.S. stands for Sum and Subtraction, and Match. stands for the capability to match two components.

Interactive Demo

Try out our interactive Sudoku puzzle demo below to experience how LVLMs approach these puzzles. You can play the puzzles yourself and compare your reasoning with the model's approach.

BibTeX

If you find this work useful in your research, please consider citing the following BibTeX entry:

@article{ren2025vgrp-bench,
    title={VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models},
    author={Yufan Ren and Konstantinos Tertikas and Shalini Maiti and Junlin Han and Tong Zhang and Sabine Süsstrunk and Filippos Kokkinos},
    journal={arXiv preprint arXiv:2503.23064},
    year={2025}
}