Benchmarks · May 25, 2026 · 4 min read

Pre-Registered Held-Out Evaluation

The strictest generalization test: questions sourced after training freeze, scored under pre-specified exclusion rules.

On 18 April 2026 we sourced 22 benchmark questions from Emerson, Marist, PPIC, USC CEPP, UT Tyler, Change Research, and the Ohio Library Council, spanning five states and four categories. The set was recorded only after training and calibration were frozen.

An automated pre-filter dropped 8 of the 22 items (past-election ground truths and extreme prior-delta outliers). The remaining 14 items were scored under the pre-specified rules.
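The pre-filter described above could be sketched roughly as follows. This is a minimal illustration, not the production filter: the `Item` fields, the definition of prior delta as the absolute gap between a prior estimate and the observed value, and the `MAX_PRIOR_DELTA` threshold are all assumptions, since the actual cutoff and schema are not stated here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical threshold in percentage points; the real cutoff is not published here.
MAX_PRIOR_DELTA = 25.0


@dataclass(frozen=True)
class Item:
    item_id: str
    is_past_election: bool  # assumed flag: ground truth is an already-decided election
    prior: float            # assumed prior estimate, in percent
    observed: float         # published ground-truth value, in percent


def exclusion_reason(item: Item, max_delta: float = MAX_PRIOR_DELTA) -> Optional[str]:
    """Return the name of the rule that excludes an item, or None if it is scored."""
    if item.is_past_election:
        return "past_election_ground_truth"
    if abs(item.observed - item.prior) > max_delta:
        return "extreme_prior_delta"
    return None


def prefilter(
    items: Iterable[Item], max_delta: float = MAX_PRIOR_DELTA
) -> Tuple[List[Item], Dict[str, str]]:
    """Split items into those to score and a map of dropped item id -> reason."""
    kept: List[Item] = []
    dropped: Dict[str, str] = {}
    for item in items:
        reason = exclusion_reason(item, max_delta)
        if reason is None:
            kept.append(item)
        else:
            dropped[item.item_id] = reason
    return kept, dropped
```

Recording a reason for each dropped item, rather than silently discarding it, keeps the exclusions auditable against the rules that were fixed in advance.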

Held-out results

Held-out error is higher, by design, than our best in-panel Texas numbers. We publish both because the gap reflects the real-world spread between well-specified state panels and noisier edge cases.

Overall calibrated MAE: 10.68%
Ex-electoral subset: 9.97%
Political approval subset: 7.24%
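For readers who want to reproduce this kind of breakdown on their own data, here is a minimal sketch of an overall MAE with an ex-electoral subset and a single-category subset. It assumes predictions are already calibrated and that each scored item carries a category label; the labels `"electoral"` and `"political_approval"` and the example rows are illustrative placeholders, not the held-out data.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, float, float]  # (category, predicted_pct, actual_pct)


def mean_absolute_error(pairs: Iterable[Tuple[float, float]]) -> float:
    """Mean of |predicted - actual| over all pairs, in percentage points."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("cannot compute MAE over an empty set")
    return sum(abs(pred - actual) for pred, actual in pairs) / len(pairs)


def mae_report(rows: Iterable[Row]) -> Dict[str, float]:
    """Overall MAE plus the two subset views used in the results above."""
    rows = list(rows)
    all_pairs: List[Tuple[float, float]] = [(p, a) for _, p, a in rows]
    # "Ex-electoral" is read here as: every item except electoral ones (an assumption).
    ex_electoral = [(p, a) for c, p, a in rows if c != "electoral"]
    approval = [(p, a) for c, p, a in rows if c == "political_approval"]
    return {
        "overall": mean_absolute_error(all_pairs),
        "ex_electoral": mean_absolute_error(ex_electoral),
        "political_approval": mean_absolute_error(approval),
    }


# Illustrative rows only; these are NOT the published held-out items.
example_rows: List[Row] = [
    ("electoral", 48.0, 54.0),
    ("political_approval", 41.0, 44.0),
    ("policy", 60.0, 55.0),
]
report = mae_report(example_rows)
```

Because the subsets differ in size, the overall figure is not a simple average of the subset figures; each is computed over its own items.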

Product validation

Customer-facing benchmark tables and live predictions are available at lewsearch.com/methodology. Due diligence materials are available on request.