ai-agent-dev-workflow

Spec Validation Checklist

Run this checklist during Phase 4 after the spec is drafted and clarifications are resolved. Each point must be verified with specific evidence from the spec file.

Pass criteria: all 8 points pass, or remaining failures are documented in the spec’s Notes section with rationale.

Max 2 fix iterations. If issues persist after 2 rounds, mark the spec as Ready-with-warnings and proceed.

1. Testability

Every acceptance criterion can be converted to a test.

How to check:

For each Given/When/Then, ask: “Can I write a test that passes when this works and fails when it doesn’t?”
The test complexity hint — (1 test), (parameterized), (integration) — should be present and accurate
Red flags: criteria using “appropriate”, “graceful”, “intuitive”, “fast” without a measurable threshold

Common misses:

“System handles errors properly” — HOW? What does the user see?
“Performance is acceptable” — What’s the threshold?
“UI is responsive” — At what breakpoint? What changes?
“Data is validated” — Which fields? What rules? What error?

Self-evaluation warning: It’s tempting to say “all criteria are testable” without actually imagining the test. For each criterion, mentally construct the specific assertion — don’t just pattern-match the Given/When/Then structure.

Adversarial check technique: For each criterion, ask: “Could an implementation pass this criterion while being completely wrong?” If yes, the criterion is too vague. Example: “Then the system responds appropriately” passes for ANY response. “Then the system returns a 400 status with field-level error messages” can only pass when correct.

2. Completeness

Every user story has acceptance criteria. Every P1 story has both happy-path and error-path criteria.

How to check:

Count stories vs criteria. P1 stories should have 2+ criteria (at least one happy path, one error path)
P2/P3 stories need at least 1 criterion each
No story should have an empty acceptance criteria section

Variant completeness: If a story’s behavior branches by N variants (scenarios, user roles, platforms, modes), each variant needs at least one criterion. Count the variants mentioned in the story, count the criteria covering each — flag uncovered variants. Example: a story says “generates output for 4 scenarios” but only 2 of 4 have criteria — the other 2 are untested.

Common misses:

Happy path only (no error scenarios for P1 stories)
Missing “what if the user cancels midway?”
Missing “what if input is empty/invalid?”
Missing “what if the operation partially succeeds?”
Story branches by N variants but only some have criteria

3. Clarity

No vague adjectives without measurable targets. No jargon without definition. No pronouns with ambiguous referents.

How to check:

Search for: fast, slow, large, small, many, few, appropriate, suitable, robust, scalable, intuitive, seamless, modern, responsive, secure, efficient, reliable
Each must have a measurable qualifier or be removed/replaced
Domain terms should be self-evident from context or defined on first use

Common misses:

“Large files” — how large? 1MB? 1GB? 100GB?
“Recent items” — how recent? Last hour? Last 7 days?
“Quickly loads” — under 200ms? Under 1s? Under 5s?
“Secure access” — authentication? authorization? encryption?

4. Scope

Boundaries are explicit. There is an “Out of Scope” section. P1/P2/P3 boundaries are defensible.

How to check:

“Out of Scope” section exists and has at least one item (for medium+ features)
No “while we’re at it” additions hiding in P1 stories
Could someone argue a P2 should be P1? If yes, the “Why P_” rationale should address it

Common misses:

Scope defined by what’s IN but not what’s OUT
P3 stories that are actually essential for P1 to be usable
Feature creep disguised as “edge case handling”
Unbounded requirements (“supports all formats”)

5. Independence

User stories can be implemented and tested independently. P1 stories deliver standalone value.

How to check:

For each P1 story, ask: “Can I ship this alone and demo it?”
No circular dependencies between stories
P2/P3 may depend on P1 but not on each other
Stories don’t share setup assumptions (“US2 assumes US1 ran first”)

Common misses:

Stories that are really sub-tasks of a single story
A “view” story that assumes a “create” story shipped first (acceptance criteria should handle the empty state)
Stories that require shared infrastructure not in any story’s acceptance criteria

6. Priority

Stories are prioritized P1/P2/P3. The P1 set forms a coherent MVP that delivers user value on its own.

How to check:

At least one P1 story exists
The P1 set alone would be a useful, shippable increment
Every story has a “Why P_” justification
Everything is NOT P1 — if it is, priorities aren’t real

Priority dependency check: If a P1 story depends on work that’s currently P2 or P3 (or doesn’t exist yet), flag it:

Option A: Promote the dependency to P1
Option B: Restructure the P1 story to remove the dependency
Option C: Document the dependency in Notes with a plan

Common misses:

All stories marked P1 (no real prioritization)
P1 stories that can’t function without P2 stories
Priority based on technical ordering instead of user value

7. Edge Cases

Key failure modes and boundary conditions are identified. At minimum: empty state, error state, boundary values.

How to check:

Edge Cases section is non-empty
Cross-reference with taxonomy categories 5 (Error Handling) and 6 (Edge Cases & Boundaries)
Each P1 story has at least one edge case
Domain-specific concerns from PROJECT.md are addressed
Category tags [Category N] are present for traceability

Common misses:

Only feature-level edge cases (missing domain interaction)
Timezone/locale edge cases for time-sensitive features
Concurrent operation scenarios for shared data
Empty state (no data yet) vs error state (failed to load)

8. Resolution

No unresolved [NEEDS CLARIFICATION] markers remain.

How to check:

Search the spec for [NEEDS CLARIFICATION]
If any remain, they must be either:
- Resolved (marker replaced with the answer)
- Deferred with rationale in Notes section: [DEFERRED: not clarified — <reason>]

Common misses:

Markers buried in nested sections
Markers resolved in Clarifications section but not updated in the original location
New ambiguities introduced during Phase 4 fixes

Project-Specific Validation

After the 8 generic points above, check the spec against project-specific criteria:

Load PROJECT.md “Domain-Specific Concerns” section
Verify each concern is addressed (in Edge Cases, Acceptance Criteria, or explicitly in Out of Scope)
Load PROJECT.md “Quality Standards” section
Verify each standard is met

Document project-specific failures the same way as generic failures — fix or add to Notes with rationale.

Handling Failures

If all 8 points pass: Mark spec status as Ready. Proceed to Phase 5 (Handoff).

If points fail (iteration 1): Fix the spec directly. Re-run the checklist. Most failures are fixable without user input (adding missing error criteria, removing vague adjectives, adding Out of Scope items).

If points still fail (iteration 2): Fix what you can. Document remaining issues in the spec’s Notes section with rationale for why they couldn’t be resolved. Mark spec status as Ready-with-warnings. Proceed to Phase 5 with a warning in the handoff summary.

Do not iterate more than twice. Diminishing returns. The remaining issues are likely ambiguities that need human input during TDD implementation, not spec-level resolution.

This site is open source. Improve this page.