AI Coding Guide

AI-Assisted Software Engineering

Confidence Scoring

By Richard Osborne, CTO at Visual Hive


TLDR

Score every completed task out of 10 with evidence. 8/10 minimum to move on. Below that? Fix it before starting the next task. Confidence scoring prevents cascade failure — broken code building on broken code until the whole thing is a mess.

The Format

## Confidence: 8/10

**Completed:**
- [x] POST /auth/login working
- [x] POST /auth/register working
- [x] Error handling for 401/409/422
- [x] Unit tests passing (8/8)
- [x] Smoke tested in browser

**Deferred (by design, not failure):**
- [ ] Rate limiting → Sprint 2

**Concerns:**
- Token refresh not implemented yet (not in scope)
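If you want reports in this format generated consistently, a small helper can render them. The following is a minimal sketch, not part of the original workflow: the `ConfidenceReport` class and its field names are hypothetical, chosen only to mirror the three sections shown above.

```python
from dataclasses import dataclass, field


@dataclass
class ConfidenceReport:
    """Hypothetical container for one task's confidence report."""

    score: int  # 0-10, as in the format above
    completed: list[str] = field(default_factory=list)
    deferred: list[str] = field(default_factory=list)  # deferred by design
    concerns: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the report as text in the format shown above."""
        lines = [f"## Confidence: {self.score}/10", "", "**Completed:**"]
        lines += [f"- [x] {item}" for item in self.completed]
        lines += ["", "**Deferred (by design, not failure):**"]
        lines += [f"- [ ] {item}" for item in self.deferred]
        lines += ["", "**Concerns:**"]
        lines += [f"- {item}" for item in self.concerns]
        return "\n".join(lines)


report = ConfidenceReport(
    score=8,
    completed=["POST /auth/login working", "Unit tests passing (8/8)"],
    deferred=["Rate limiting → Sprint 2"],
    concerns=["Token refresh not implemented yet (not in scope)"],
)
print(report.render())
```

Keeping deferred items in their own list makes it harder to confuse planned deferrals with unfinished work.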

The Scale

| Score | Meaning | Action |
|---|---|---|
| 10/10 | Everything working, fully tested, no concerns | Move on |
| 8–9/10 | Core requirements met, minor gaps noted | Move on |
| 6–7/10 | Partial completion, some requirements missing | Fix before next task |
| Below 6 | Significant problems, unstable foundation | Stop, fix, rescore |
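The scale reduces to a simple lookup. Here is a rough sketch, assuming whole-number scores from 0 to 10; the function name `action_for` is hypothetical.

```python
def action_for(score: int) -> str:
    """Map a confidence score (0-10) to the action from the scale."""
    if not 0 <= score <= 10:
        raise ValueError(f"score must be between 0 and 10, got {score}")
    if score >= 8:
        return "Move on"
    if score >= 6:
        return "Fix before next task"
    return "Stop, fix, rescore"


for s in (10, 8, 7, 5):
    print(f"{s}/10 -> {action_for(s)}")
```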

Why 8/10, Not Lower

Everything in software depends on what came before. Broken auth means everything that needs auth is broken. An incorrect schema means bad data everywhere. A 7/10 on Task 3 can compound to a 4/10 by Task 10, because every later task inherits the earlier gaps.

The 8/10 rule stops those errors from compounding.

What Counts Toward the Score

High-weight items:

  • Core requirements met (from the task spec)
  • Tests written and passing
  • Smoke test or browser test done
  • Error cases handled

Doesn't lower the score:

  • Features explicitly deferred to a future sprint
  • Nice-to-haves that weren't in the spec
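One way to apply this distinction is to list only the items that should actually pull a score down. The sketch below is illustrative: `ChecklistItem`, its fields, and `blocking_gaps` are hypothetical names, and the "Dark mode" entry is an invented nice-to-have used only as an example.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    """Hypothetical record for one line of a task checklist."""

    description: str
    in_spec: bool           # False for nice-to-haves outside the task spec
    done: bool = False
    deferred: bool = False  # explicitly moved to a future sprint


def blocking_gaps(items: list[ChecklistItem]) -> list[ChecklistItem]:
    """Return items that should lower the score: in spec, not done, not deferred."""
    return [i for i in items if i.in_spec and not i.done and not i.deferred]


items = [
    ChecklistItem("Core requirements met", in_spec=True, done=True),
    ChecklistItem("Error cases handled", in_spec=True, done=False),
    ChecklistItem("Rate limiting", in_spec=True, deferred=True),
    ChecklistItem("Dark mode", in_spec=False),  # invented nice-to-have
]
for gap in blocking_gaps(items):
    print("Gap:", gap.description)
```

Only "Error cases handled" is reported here. The deferred item and the out-of-spec item are ignored, matching the two lists above.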

When AI Overscores

AI sometimes gives 9/10 when the reality is 6/10. Challenge it:

"You gave 9/10 but the error handling isn't complete and there are no tests for the edge cases. What would it take to honestly get to 8/10?"

AI will usually acknowledge the gap and either do the work or give a more honest score. Your own verification still matters: run the app and check things manually.

Scoring Your Own Work

You can also score tasks yourself after review:

This looks like 7/10 to me. The login works but
error messages are not specific enough and I see no
tests for the token expiry case. Can you address those?

AI will fix the gaps. Then rescore. This is the feedback loop that keeps quality consistent.
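The loop itself is small enough to sketch in code. This is an illustration under stated assumptions, not a prescribed implementation: `score_task` and `fix_gaps` are hypothetical callbacks standing in for your review and the AI's fixes, and the `max_rounds` cap is an added safeguard that the guide does not mention.

```python
from typing import Callable

MIN_SCORE = 8  # the guide's minimum score to move on


def run_until_confident(
    score_task: Callable[[], int],
    fix_gaps: Callable[[], None],
    max_rounds: int = 5,  # assumption: give up after a few fix attempts
) -> int:
    """Score, fix, and rescore until the task reaches MIN_SCORE."""
    score = score_task()
    rounds = 0
    while score < MIN_SCORE and rounds < max_rounds:
        fix_gaps()
        score = score_task()
        rounds += 1
    if score < MIN_SCORE:
        raise RuntimeError(f"still at {score}/10 after {max_rounds} fix rounds")
    return score


# Stub callbacks: successive reviews give 6, then 7, then 8.
scores = iter([6, 7, 8])
fixes: list[str] = []
final = run_until_confident(
    score_task=lambda: next(scores),
    fix_gaps=lambda: fixes.append("fix"),
)
print(f"Final score: {final}/10 after {len(fixes)} fix rounds")
```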
