Visual reasoning benchmark for reading analog gauges

GaugeBench evaluates whether multimodal models can accurately interpret real-world gauges under challenging visual conditions. We provide curated analog gauge images, scoring scripts, and a transparent leaderboard to track frontier model progress over time.

Leaderboard

See how leading multimodal models perform on GaugeBench. Scores are exact-match accuracy on gauge-reading tasks.
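As a rough illustration of what exact-match accuracy could look like in practice, here is a minimal sketch. The `Reading` type, the case-insensitive unit comparison, and the function names are assumptions made for this example; they are not taken from the GaugeBench scoring scripts, which may normalise values and units differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Reading:
    """Hypothetical container for one gauge reading (not the official schema)."""
    value: float
    unit: str


def is_exact_match(pred: Reading, gold: Reading) -> bool:
    # Assumption: values must be numerically equal and units must match
    # after trimming whitespace and ignoring case.
    same_value = pred.value == gold.value
    same_unit = pred.unit.strip().lower() == gold.unit.strip().lower()
    return same_value and same_unit


def exact_match_accuracy(preds: Sequence[Reading], golds: Sequence[Reading]) -> float:
    # Fraction of predictions that match their reference reading exactly.
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(is_exact_match(p, g) for p, g in zip(preds, golds))
    return hits / len(golds)
```

Under this definition a prediction with the right number but the wrong unit counts as a miss, which is consistent with asking for both the value and the units.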

Dataset

GaugeBench covers consumer, industrial, and scientific gauges with scales in psi, bar, kPa, and custom units. We emphasise lighting variety, occlusions, and reflections to stress-test visual reasoning.

Pressure Gauges

Classic dial-style gauges photographed on real devices, with varying needle positions, glare, and bezels.

Industrial Panels

High-range indicators with dual units and multi-needle layouts to probe parsing of dense readouts.

Low-Light Scenarios

Dimly lit or partially obscured gauges that challenge models to reason under visual noise.

Custom Scales

Non-linear and color-coded scales requiring reasoning beyond uniform tick spacing.
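To show why non-uniform tick spacing matters, the sketch below converts the same needle position into a reading on a linear scale and on a logarithmic scale. A logarithmic scale is only one assumed example of a non-linear scale; the function names and the `fraction` parameter (needle position along the scale arc, from 0 at the minimum to 1 at the maximum) are hypothetical and not part of GaugeBench.

```python
def linear_reading(fraction: float, lo: float, hi: float) -> float:
    # Uniform tick spacing: the value grows in proportion to needle travel.
    return lo + fraction * (hi - lo)


def log_reading(fraction: float, lo: float, hi: float) -> float:
    # Logarithmic spacing: equal needle travel multiplies the value
    # by a constant factor, so both endpoints must be positive.
    if lo <= 0 or hi <= 0:
        raise ValueError("a logarithmic scale needs positive endpoints")
    return lo * (hi / lo) ** fraction


# A needle at mid-travel on a scale running from 1 to 100:
print(linear_reading(0.5, 1.0, 100.0))  # 50.5
print(log_reading(0.5, 1.0, 100.0))     # 10.0
```

A model that assumes uniform spacing would read the mid-travel needle as about 50 on a scale where the correct value is 10, which is the kind of error this category is meant to expose.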

Questions we ask

Exact readout: Does the model return the precise gauge value and units?
Range awareness: Are minimum and maximum dial ranges identified correctly?
Consistency: Can the model produce structured JSON suitable for automatic scoring?
Robustness: Do predictions remain stable under reflections, blur, and cropped images?
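The questions above imply a machine-readable answer with a value, units, and dial range. The sketch below parses and validates such an answer before scoring. The field names (`value`, `unit`, `range_min`, `range_max`) are an assumed schema for illustration only; the actual format expected by the GaugeBench evaluation script may differ.

```python
import json
from typing import Optional

# Hypothetical field names; not taken from the GaugeBench scoring guide.
NUMERIC_FIELDS = ("value", "range_min", "range_max")


def parse_prediction(raw: str) -> Optional[dict]:
    """Return a normalised prediction dict, or None if the output is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # the model did not return valid JSON
    if not isinstance(data, dict):
        return None
    for key in NUMERIC_FIELDS:
        number = data.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(number, bool) or not isinstance(number, (int, float)):
            return None
    unit = data.get("unit")
    if not isinstance(unit, str) or not unit.strip():
        return None
    if data["range_min"] >= data["range_max"]:
        return None  # the dial range must be non-empty
    result = {key: float(data[key]) for key in NUMERIC_FIELDS}
    result["unit"] = unit.strip()
    return result
```

Returning `None` for malformed output lets a scorer count unparseable responses as misses instead of crashing, which keeps the consistency question measurable.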

Try it yourself

Download the dataset and evaluation script to benchmark your own multimodal model.

Partner with GaugeBench

We welcome research collaborations and eval contributions. Reach out if you are integrating GaugeBench into internal eval suites or would like to feature your model results on the public leaderboard.
