## TLDR
Score every completed task out of 10 with evidence. 8/10 minimum to move on. Below that? Fix it before starting the next task. Confidence scoring prevents cascade failure — broken code building on broken code until the whole thing is a mess.
## The Format

```markdown
## Confidence: 8/10

**Completed:**
- [x] POST /auth/login working
- [x] POST /auth/register working
- [x] Error handling for 401/409/422
- [x] Unit tests passing (8/8)
- [x] Smoke tested in browser

**Deferred (by design, not failure):**
- [ ] Rate limiting → Sprint 2

**Concerns:**
- Token refresh not implemented yet (not in scope)
```

## The Scale
| Score | Meaning | Action |
|---|---|---|
| 10/10 | Everything working, fully tested, no concerns | Move on |
| 8–9/10 | Core requirements met, minor gaps noted | Move on |
| 6–7/10 | Partial completion, some requirements missing | Fix before next task |
| Below 6 | Significant problems, unstable foundation | Stop, fix, rescore |
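If you want to automate the gate, the scale above maps directly to a tiny decision function. A minimal sketch (the function name and return strings are hypothetical, not from any real tool):

```python
def action_for(score: int) -> str:
    """Map a confidence score (0-10) to the next action from the scale."""
    if score >= 8:
        return "move on"               # 8-10: core requirements met
    if score >= 6:
        return "fix before next task"  # 6-7: partial completion
    return "stop, fix, rescore"        # below 6: unstable foundation
```

For example, `action_for(9)` returns `"move on"`, while `action_for(5)` returns `"stop, fix, rescore"`.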
## Why 8/10, Not Lower
Everything in software depends on things that came before. Broken auth = broken everything that needs auth. Incorrect schema = bad data everywhere. 7/10 on Task 3 compounds to 4/10 by Task 10.
The 8/10 rule keeps compounding errors from snowballing.
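One way to see the compounding is to treat a score as a rough reliability fraction. The decay rate below is an illustrative assumption, not a measured figure:

```python
def compounded(base: float, decay: float, later_tasks: int) -> float:
    """Illustrative only: treat a task score as a reliability fraction and
    assume each later task built on it loses a further `decay` factor."""
    return base * decay ** later_tasks

# A 7/10 foundation at Task 3, with each of the next seven tasks keeping
# ~92% of the quality built on top of it, lands near 4/10 by Task 10:
print(round(compounded(0.7, 0.92, 7) * 10, 1))  # 3.9
```

The exact numbers don't matter; the shape does. Anything below the gate decays toward failure, while an 8/10 foundation under the same assumption still sits above 4.5/10 seven tasks later.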
## What Counts Toward the Score
**High-weight items:**
- Core requirements met (from the task spec)
- Tests written and passing
- Smoke test or browser test done
- Error cases handled
**Doesn't lower score:**
- Features explicitly deferred to a future sprint
- Nice-to-haves that weren't in the spec
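A hypothetical rubric sketch makes the rule concrete: deferred items are dropped before scoring, so they can't drag the number down. The checklist keys and equal weighting here are illustrative assumptions:

```python
def score(checklist: dict) -> float:
    """Score out of 10. Values: True (done), False (missing), "deferred".
    Explicitly deferred items are excluded before scoring, so deferring
    a feature by design never lowers the result."""
    counted = {k: v for k, v in checklist.items() if v != "deferred"}
    done = sum(1 for v in counted.values() if v is True)
    return round(10 * done / len(counted), 1)

print(score({
    "core requirements met": True,
    "tests written and passing": True,
    "smoke test done": True,
    "error cases handled": False,
    "rate limiting": "deferred",  # Sprint 2: excluded, not penalized
}))  # 7.5
```

Note the missing error handling costs a quarter of the score, while the deferred rate limiting costs nothing.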
## When AI Overscores
AI sometimes gives 9/10 when the reality is 6/10. Challenge it:
> "You gave 9/10 but the error handling isn't complete and there are no tests for the edge cases. What would it take to honestly get to 8/10?"
AI will usually acknowledge the gap and either do the work or provide a more honest score. Your verification matters — don't skip running the app and checking things manually.
## Scoring Your Own Work
You can also score tasks yourself after review:
> "This looks like 7/10 to me. The login works, but the error messages are not specific enough and I see no tests for the token expiry case. Can you address those?"

AI will fix the gaps. Then rescore. This is the feedback loop that keeps quality consistent.