## TLDR
Score every completed task out of 10 with evidence. 8/10 minimum to move on. Below that? Fix it before starting the next task. Confidence scoring prevents cascade failure — broken code building on broken code until the whole thing is a mess.
## The Format

```markdown
## Confidence: 8/10

**Completed:**
- [x] POST /auth/login working
- [x] POST /auth/register working
- [x] Error handling for 401/409/422
- [x] Unit tests passing (8/8)
- [x] Smoke tested in browser

**Deferred (by design, not failure):**
- [ ] Rate limiting → Sprint 2

**Concerns:**
- Token refresh not implemented yet (not in scope)
```

## The Scale
| Score | Meaning | Action |
|---|---|---|
| 10/10 | Everything working, fully tested, no concerns | Move on |
| 8–9/10 | Core requirements met, minor gaps noted | Move on |
| 6–7/10 | Partial completion, some requirements missing | Fix before next task |
| Below 6 | Significant problems, unstable foundation | Stop, fix, rescore |
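If you want to automate the gate, the scale above maps directly to a tiny decision function. A minimal sketch (the function name and return strings are hypothetical, not from any real tool):

```python
def action_for(score: int) -> str:
    """Map a confidence score (0-10) to the next action from the scale."""
    if score >= 8:
        return "move on"               # 8-10: core requirements met
    if score >= 6:
        return "fix before next task"  # 6-7: partial completion
    return "stop, fix, rescore"        # below 6: unstable foundation
```

For example, `action_for(9)` returns `"move on"`, while `action_for(5)` returns `"stop, fix, rescore"`.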
## Why 8/10, Not Lower
Everything in software depends on things that came before. Broken auth = broken everything that needs auth. Incorrect schema = bad data everywhere. 7/10 on Task 3 compounds to 4/10 by Task 10.
The 8/10 rule keeps compounding errors from snowballing.
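One way to see the compounding is to treat a score as a rough reliability fraction. The decay rate below is an illustrative assumption, not a measured figure:

```python
def compounded(base: float, decay: float, later_tasks: int) -> float:
    """Illustrative only: treat a task score as a reliability fraction and
    assume each later task built on it loses a further `decay` factor."""
    return base * decay ** later_tasks

# A 7/10 foundation at Task 3, with each of the next seven tasks keeping
# ~92% of the quality built on top of it, lands near 4/10 by Task 10:
print(round(compounded(0.7, 0.92, 7) * 10, 1))  # 3.9
```

The exact numbers don't matter; the shape does. Anything below the gate decays toward failure, while an 8/10 foundation under the same assumption still sits above 4.5/10 seven tasks later.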
## What Counts Toward the Score
**High-weight items:**
- Core requirements met (from the task spec)
- Tests written and passing
- Smoke test or browser test done
- Error cases handled
**Doesn't lower score:**
- Features explicitly deferred to a future sprint
- Nice-to-haves that weren't in the spec
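A hypothetical rubric sketch makes the rule concrete: deferred items are dropped before scoring, so they can't drag the number down. The checklist keys and equal weighting here are illustrative assumptions:

```python
def score(checklist: dict) -> float:
    """Score out of 10. Values: True (done), False (missing), "deferred".
    Explicitly deferred items are excluded before scoring, so deferring
    a feature by design never lowers the result."""
    counted = {k: v for k, v in checklist.items() if v != "deferred"}
    done = sum(1 for v in counted.values() if v is True)
    return round(10 * done / len(counted), 1)

print(score({
    "core requirements met": True,
    "tests written and passing": True,
    "smoke test done": True,
    "error cases handled": False,
    "rate limiting": "deferred",  # Sprint 2: excluded, not penalized
}))  # 7.5
```

Note the missing error handling costs a quarter of the score, while the deferred rate limiting costs nothing.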
## When AI Overscores
AI sometimes gives 9/10 when the reality is 6/10. Challenge it:
> "You gave 9/10 but the error handling isn't complete and there are no tests for the edge cases. What would it take to honestly get to 8/10?"
AI will usually acknowledge the gap and either do the work or provide a more honest score. Your verification matters — don't skip running the app and checking things manually.
## Scoring Your Own Work
You can also score tasks yourself after review:
> "This looks like 7/10 to me. The login works, but the error messages are not specific enough and I see no tests for the token expiry case. Can you address those?"

AI will fix the gaps. Then rescore. This is the feedback loop that keeps quality consistent.