Visual reasoning benchmark for reading analog gauges
GaugeBench evaluates whether multimodal models can accurately interpret real-world gauges under challenging visual conditions. We provide curated analog gauge images, scoring scripts, and a transparent leaderboard to track frontier model progress over time.
Leaderboard
See how leading multimodal models perform on GaugeBench. Scores reflect exact-match accuracy on gauge reading tasks.
Dataset
GaugeBench covers consumer, industrial, and scientific gauges spanning psi, bar, kPa, and custom scales. We emphasise lighting variety, occlusions, and reflections to stress-test visual reasoning.

Pressure Gauges
Classic dial-style gauges sourced from real devices, with varying needle positions, glare, and bezel styles.

Industrial Panels
High-range indicators with dual units and multi-needle layouts to probe parsing of dense readouts.

Low-Light Scenarios
Dimly lit or partially obscured gauges that challenge models to reason under visual noise.

Custom Scales
Non-linear and color-coded scales requiring reasoning beyond uniform tick spacing.
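
To illustrate why non-linear scales need more than uniform tick spacing, here is a minimal sketch that compares a linear and a logarithmic mapping from needle angle to value. The function names, the 270-degree sweep, and the 1 to 100 range are illustrative assumptions, not part of the GaugeBench dataset or scoring scripts.

```python
import math


def linear_reading(angle_deg, angle_min, angle_max, value_min, value_max):
    """Map a needle angle to a value, assuming uniform tick spacing."""
    fraction = (angle_deg - angle_min) / (angle_max - angle_min)
    return value_min + fraction * (value_max - value_min)


def log_reading(angle_deg, angle_min, angle_max, value_min, value_max):
    """Map a needle angle to a value on a logarithmic scale (value_min > 0)."""
    fraction = (angle_deg - angle_min) / (angle_max - angle_min)
    log_min = math.log10(value_min)
    log_max = math.log10(value_max)
    return 10 ** (log_min + fraction * (log_max - log_min))


# Hypothetical gauge: needle at the midpoint of a 270-degree sweep, scale 1..100.
mid_linear = linear_reading(135, 0, 270, 1, 100)
mid_log = log_reading(135, 0, 270, 1, 100)
print(mid_linear, mid_log)
```

The same needle position reads 50.5 on the linear scale and 10.0 on the logarithmic one, so a model that assumes even spacing would be off by a factor of five on this hypothetical dial.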
Questions we ask
- Exact readout: Does the model return the precise gauge value and units?
- Range awareness: Are minimum and maximum dial ranges identified correctly?
- Consistency: Can the model produce structured JSON suitable for automatic scoring?
- Robustness: Do predictions remain stable under reflections, blur, and cropped images?
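
The exact-readout and consistency questions above can be sketched as a small scoring routine. This is not the official GaugeBench evaluation script: the JSON keys `value` and `units`, the case-insensitive unit comparison, and all function names are assumptions made for illustration.

```python
import json


def parse_prediction(raw_text):
    """Return the model reply as a dict, or None if it is not a JSON object."""
    try:
        parsed = json.loads(raw_text)
    except (json.JSONDecodeError, TypeError):
        return None
    return parsed if isinstance(parsed, dict) else None


def is_exact_match(prediction, reference):
    """True when both the numeric value and the units agree with the reference."""
    if prediction is None:
        return False
    try:
        value_ok = float(prediction["value"]) == float(reference["value"])
    except (KeyError, TypeError, ValueError):
        return False
    # Assumption: units are compared case-insensitively after trimming whitespace.
    predicted_units = str(prediction.get("units", "")).strip().lower()
    reference_units = str(reference["units"]).strip().lower()
    return value_ok and predicted_units == reference_units


def exact_match_accuracy(raw_replies, references):
    """Fraction of replies that parse as JSON and match the reference exactly."""
    if not references:
        return 0.0
    hits = sum(
        is_exact_match(parse_prediction(reply), reference)
        for reply, reference in zip(raw_replies, references)
    )
    return hits / len(references)


# Hypothetical model replies: one correct, one not JSON, one with a wrong value.
replies = [
    '{"value": 42, "units": "psi"}',
    "about 3 bar",
    '{"value": 2.5, "units": "bar"}',
]
references = [
    {"value": 42, "units": "psi"},
    {"value": 3, "units": "bar"},
    {"value": 2.0, "units": "bar"},
]
accuracy = exact_match_accuracy(replies, references)
print(f"{accuracy:.3f}")
```

Under this sketch a free-text reply such as "about 3 bar" scores zero even though the value is right, which is why the consistency question asks for structured JSON in the first place.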
Try it yourself
Download the dataset and evaluation script to benchmark your own multimodal model.
Partner with GaugeBench
We welcome research collaborations and eval contributions. Reach out if you are integrating GaugeBench into internal eval suites or would like to feature your model results on the public leaderboard.