How often does our AI agree with humans?
Every employer who views a ProveIQ-evaluated submission is asked to independently rate the same work — before they see Claude's score. We track the delta, publish the rate, and don't hide bad days. Our target is ≥80% sustained. This is the single number that matters in the first 90 days.
How we measure it
1. Candidate submits work. The submission enters our pipeline and is scored by our frontier evaluation layer against the employer's rubric.
2. Employer is prompted to rate independently. Before the AI score is revealed, the employer scores the submission themselves on the same 0-100 scale.
3. We compute the delta. Agreement = |AI score − employer score| ≤ 1.0 points. Both scores are stored in EmployerScoreRating.
4. We publish the rate. Rolling 90-day window. Per-domain breakdown available to employers in /admin/agreement-rate. Public aggregate at /api/public/agreement-rate.
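The measurement steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production pipeline: the `ScorePair` class is a hypothetical stand-in for an EmployerScoreRating record, and its field names, the function names, and the choice to return `None` for an empty window are all invented for the example. Only the 1.0-point tolerance and the 90-day window come from the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta

AGREEMENT_TOLERANCE = 1.0  # points on the 0-100 scale, per the definition above
WINDOW_DAYS = 90           # rolling window length, per the text


@dataclass
class ScorePair:
    """Hypothetical stand-in for one EmployerScoreRating record."""
    domain: str
    ai_score: float
    employer_score: float
    rated_at: datetime


def is_agreement(pair: ScorePair) -> bool:
    """True when |AI score - employer score| is within the tolerance."""
    return abs(pair.ai_score - pair.employer_score) <= AGREEMENT_TOLERANCE


def agreement_rate(pairs, now, window_days=WINDOW_DAYS):
    """Fraction of in-window pairs that agree, or None if the window is empty."""
    cutoff = now - timedelta(days=window_days)
    recent = [p for p in pairs if cutoff <= p.rated_at <= now]
    if not recent:
        return None  # assumption: no data is reported as None, not as 0%
    return sum(is_agreement(p) for p in recent) / len(recent)


def agreement_rate_by_domain(pairs, now, window_days=WINDOW_DAYS):
    """Per-domain agreement rates; domains with no in-window data are omitted."""
    grouped = defaultdict(list)
    for p in pairs:
        grouped[p.domain].append(p)
    rates = {}
    for domain, domain_pairs in grouped.items():
        rate = agreement_rate(domain_pairs, now, window_days)
        if rate is not None:
            rates[domain] = rate
    return rates
```

Calling `agreement_rate(pairs, datetime.now())` would give the aggregate figure, and `agreement_rate_by_domain` the per-domain breakdown. Filtering by timestamp at query time keeps the window truly rolling, with no stored counters to reset.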
What the thresholds mean
- ≥80%: PMF gate met. AI scoring is trusted by employers. Marketing spend unfrozen.
- 75–80%: Warning band. Per-domain investigation. No marketing scale-up.
- <75%: Alert. Domain flagged for AI retraining. Marketing spend reassessed.
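As a rough sketch, the three bands could be mapped to a status label like this. The function name and the label strings are hypothetical, and placing a rate of exactly 80% in the top band is an assumption drawn from the "≥80%" wording, since the stated bands overlap at that value.

```python
def classify_agreement_rate(rate_percent: float) -> str:
    """Map an agreement rate (0-100) to one of the three bands in the text."""
    if not 0.0 <= rate_percent <= 100.0:
        raise ValueError("rate_percent must be between 0 and 100")
    if rate_percent >= 80.0:
        return "gate_met"  # assumption: exactly 80% counts as meeting the gate
    if rate_percent >= 75.0:
        return "warning"   # 75% up to, but not including, 80%
    return "alert"         # below 75%


# Example usage with illustrative values, not real measurements
for rate in (83.5, 77.0, 61.2):
    print(rate, classify_agreement_rate(rate))
```

Checking the bands from highest to lowest means each rate falls into exactly one band, which resolves the overlap at the 80% boundary.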
Why publish this?
The only durable proof that AI evaluation works is the measured delta between the AI's scores and those of the humans who graded the same work. Every other metric (signups, NPS, headline scores) is downstream of this one.
We publish the rate even on bad days. Constitution §6.2 mandates it. If the rate drops, we tell you, we investigate, and we publish the fix.