Step 5: Test

Part of: AI Workflow Framework

Where You Are

You’ve just finished Build (Step 4). You should have:

Platform artifacts — prompts, skills, agents, and configs generated for your platform
Context artifacts — style guides, reference materials, and examples
Design Spec ([name]/design-spec.md) — the architecture blueprint and artifact inventory
Workflow Requirements ([name]/requirements.md) — which owns the Acceptance Criteria, Example Scenarios, and Golden Examples captured during Deconstruct

Your first run is a test, not a deployment. The goal is to verify that the workflow produces good output before you share it with your team or use it on real work.

How the Skill Works

The skill runs seven phases. The sections that follow expand on each:

Load artifacts and spec — Read the Design Spec, the Workflow Requirements (for acceptance criteria, test scenarios, and golden examples), and locate your platform artifacts.
Smoke test — Run the workflow once with a realistic scenario. Check that it runs, produces output, and uses the right format.
Full eval suite — Run each test scenario from the Workflow Requirements. Score each output on a 1–5 scale across the evaluation dimensions.
Building block evals — Test individual components (skills, context, agents) in isolation to pinpoint weak links.
Establish baseline — Calculate average scores across all scenarios and dimensions. Record for future comparison.
Diagnose and fix — Map problems to building blocks (generic output → context issue, skipped steps → prompt issue, etc.) and identify what to fix in Build.
Readiness decision — Ready to deploy? Move to Run. Not ready? Return to Build with specific targets.

The skill introduces the testing vocabulary in plain language as it goes, so you don’t need to know the terms in advance. A scenario (E1, E2, …) is one realistic test input. The eval suite is the workflow run across all of your scenarios. A baseline is the saved scorecard you compare against later to catch quality slipping.

Your First Run

Start with a single test — pick one realistic scenario and run the workflow end to end. This is your smoke test. You are checking the basics:

Does it run at all? — Can you execute every step without errors?
Does it produce output? — Is there a result, or does it stall?
Is the output in the right format? — Does it look like what you expected?

If any of these fail, go back to Build and fix the obvious issue before continuing. Common first-run problems:

Missing context files the model references but cannot find
MCP connections that are not configured or are not responding
Skills that are installed but not correctly linked to the workflow
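
If your workflow can be called from code, the three basic checks can be sketched as below. This is a minimal illustration, not part of the skill: `run_workflow` is a hypothetical callable that wraps your workflow, and "right format" is approximated by checking that a few expected section markers appear in the output.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class SmokeResult:
    ran: bool            # the workflow executed without raising
    has_output: bool     # it returned non-empty text
    right_format: bool   # every expected marker appears in the output
    detail: str = ""

    @property
    def passed(self) -> bool:
        return self.ran and self.has_output and self.right_format


def smoke_test(run_workflow: Callable[[str], str],
               scenario_input: str,
               required_markers: Sequence[str]) -> SmokeResult:
    """Run one scenario and apply the three basic checks in order.

    run_workflow is a hypothetical wrapper around your workflow;
    required_markers is an assumed stand-in for "the right format".
    """
    try:
        output = run_workflow(scenario_input)
    except Exception as exc:  # any error means "does not run"
        return SmokeResult(False, False, False, f"error: {exc}")
    if not output or not output.strip():
        return SmokeResult(True, False, False, "empty output")
    missing = [m for m in required_markers if m not in output]
    if missing:
        return SmokeResult(True, True, False, f"missing: {missing}")
    return SmokeResult(True, True, True, "ok")
```

The checks run in the same order as the questions above, so the `detail` field names the first basic that failed.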

Structured Evaluation

Once the smoke test passes, move to a full evaluation using the criteria captured during Deconstruct. Your Workflow Requirements includes evaluation dimensions (the qualities you care about — accuracy, tone, completeness, specificity), test scenarios (realistic inputs that exercise different parts of the workflow), and — where you supplied them — Golden Examples: real past outputs you’d consider “exactly right.” Golden examples are the strongest evaluation tool you have, because scoring becomes “compare against this reference” instead of “how does it feel?”

Run the Eval Suite

For each test scenario:

Run the workflow with the test input
Let the AI grade first: the model scores the output against the acceptance criteria (and the golden example, if one exists) and gives a one-line justification quoting specific evidence. You then confirm or adjust each score, so every scenario gets a consistent, evidence-based starting point while you remain the final judge of quality

Score the output on each evaluation dimension using a 1–5 scale:

Score	Meaning
5	Excellent — ready to use as-is
4	Good — minor edits only
3	Acceptable — needs some rework but the structure is right
2	Weak — significant gaps, wrong direction on one or more dimensions
1	Failure — output is unusable or fundamentally off-target

Note specific issues — What exactly was wrong? Which dimension scored low and why?

Record your scores. The Test Results file opens with machine-readable frontmatter — per-scenario scores, averages, the environment tested in, and the readiness verdict — so Improve can later diff a regression run against this baseline mechanically instead of comparing recollections.
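
To make "machine-readable frontmatter" concrete, here is a rough sketch of how per-scenario scores could be rendered into a YAML-style header. The field names (`environment`, `verdict`, `scenarios`, `average`) are assumptions chosen for illustration. The Test Results file the skill actually writes may use a different schema.

```python
from statistics import mean


def render_frontmatter(scores: dict[str, dict[str, int]],
                       environment: str,
                       verdict: str) -> str:
    """Render scores as a simple YAML-style frontmatter block.

    All field names here are hypothetical, not the skill's real schema.
    scores maps scenario id -> {dimension: score on the 1-5 scale}.
    """
    lines = ["---",
             f"environment: {environment}",
             f"verdict: {verdict}",
             "scenarios:"]
    for scenario, dims in scores.items():
        for dim, score in dims.items():
            if not 1 <= score <= 5:
                raise ValueError(f"{scenario}/{dim}: score {score} is outside 1-5")
        lines.append(f"  {scenario}:")
        lines.extend(f"    {dim}: {score}" for dim, score in dims.items())
        lines.append(f"    average: {mean(list(dims.values())):.2f}")
    lines.append("---")
    return "\n".join(lines)
```

Because the header is structured rather than free text, a later run can be compared against it field by field.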

Evaluate Building Blocks in Isolation

If the overall workflow scores poorly, test individual building blocks separately to isolate the problem:

Test a skill by running it with sample inputs outside the full workflow. Does it produce the expected output on its own?
Test context by asking the model a question that should be answerable from your reference materials. Does it find and use the right information?
Test an agent by giving it a single task from the workflow. Does it use its tools correctly? Does it make reasonable decisions?

Isolating building blocks helps you find the weak link without guessing.

Establish Your Baseline

After running the full eval suite, calculate an average score across all scenarios and dimensions. This is your baseline — the starting quality level of your workflow.

Record the baseline alongside your individual scores. You will use this number in two ways:

During this test cycle — to measure whether your fixes are improving things
During Improve (Step 7) — to detect quality regression over time
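
The baseline arithmetic can be sketched as follows. The nested-dictionary layout and the dimension names in the usage note are assumptions for illustration. The calculation itself is just the average described above, with per-scenario and per-dimension averages added because they make weak spots easier to see.

```python
from statistics import mean


def baseline(scores: dict[str, dict[str, int]]) -> dict:
    """Average 1-5 scores overall, per scenario, and per dimension.

    scores maps scenario id -> {dimension: score}; this layout is an
    assumption for illustration, not a format the skill prescribes.
    """
    per_scenario = {s: mean(list(d.values())) for s, d in scores.items()}
    dimensions = sorted({dim for d in scores.values() for dim in d})
    per_dimension = {
        dim: mean([d[dim] for d in scores.values() if dim in d])
        for dim in dimensions
    }
    # Overall average is taken over every individual score.
    overall = mean([v for d in scores.values() for v in d.values()])
    return {
        "overall": overall,
        "per_scenario": per_scenario,
        "per_dimension": per_dimension,
    }
```

For example, with E1 scored 4, 5, and 3 and E2 scored 3, 4, and 3 on accuracy, tone, and completeness, the overall baseline is 22 / 6, or about 3.67. The per-dimension averages (3.5, 4.5, and 3) show that completeness is the weakest area.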

Diagnose and Fix

When something is off, the fix depends on what went wrong. Use this table to map problems to building blocks:

Problem	What to fix
Output is generic or off-brand	Add more context — examples, style guides, reference materials
Steps are skipped or misunderstood	Refine the prompt — make the instructions more explicit
A step needs domain expertise the AI does not have	Build a skill for that step — codify the expertise into a reusable routine
The AI needs to make decisions that cannot be scripted in advance	Convert from prompt to agent so the AI can plan its own approach
Output format is wrong	Check the prompt’s output format instructions — add explicit formatting examples
The model ignores your reference materials	Check that context files are correctly linked and formatted — the model may not be finding them
Tool connections fail during execution	Verify MCP connections — test each tool integration independently

After each fix, re-run the affected test scenarios and compare scores to your previous run. You are looking for improvement on the dimensions that scored low.
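
Comparing a re-run to the previous run can be sketched as a per-dimension difference. The score layout (scenario id mapped to dimension scores) is an assumption for illustration.

```python
def compare_runs(previous: dict[str, dict[str, int]],
                 current: dict[str, dict[str, int]]) -> dict[tuple[str, str], int]:
    """Return score changes (current minus previous) per scenario and dimension.

    Only pairs scored in both runs and whose score changed are included.
    Positive values are improvements; negative values are regressions.
    """
    deltas = {}
    for scenario, dims in current.items():
        for dim, score in dims.items():
            before = previous.get(scenario, {}).get(dim)
            if before is not None and score != before:
                deltas[(scenario, dim)] = score - before
    return deltas
```

A negative entry on a dimension you did not touch is a sign that the fix had side effects elsewhere.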

Code-First Troubleshooting

If you chose the code-first architecture approach during Design, you may encounter additional issues:

Problem	What to check
API calls return errors	Verify API keys, rate limits, and request format match the provider’s current spec
Agent does not use tools	Check that tools are correctly registered in the agent configuration and that permissions are granted
Multi-agent handoffs fail	Verify the output format of each agent matches the expected input format of the next agent in the pipeline
Scheduled runs produce different results	Check for time-dependent context (dates, market data) that may have changed between runs
SDK version mismatch	Ensure your SDK version matches the documentation the model used during Build — update if needed
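
For the multi-agent handoff row, one way to catch mismatches before running anything is to declare what each stage produces and requires, then check adjacent pairs. This is a sketch under assumptions: the `Stage` type and its field names are hypothetical, and a real pipeline may need to validate full schemas rather than field names.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Stage:
    """Hypothetical description of one agent in a pipeline."""
    name: str
    requires: frozenset = frozenset()   # fields this agent expects as input
    produces: frozenset = frozenset()   # fields this agent emits as output


def check_handoffs(pipeline: Sequence[Stage]) -> list[str]:
    """List every handoff where the next agent expects a field
    that the previous agent does not produce."""
    problems = []
    for upstream, downstream in zip(pipeline, pipeline[1:]):
        missing = downstream.requires - upstream.produces
        if missing:
            problems.append(
                f"{upstream.name} -> {downstream.name}: missing {sorted(missing)}"
            )
    return problems
```

An empty list means every adjacent pair lines up on field names. It does not prove that the values themselves are well formed.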

Readiness Decision

After testing and iterating, you reach one of two outcomes:

Ready to deploy. You can run the workflow on a new scenario and trust the output without heavy editing. Your eval scores are at or above the threshold you set. Move to Step 6: Run.

Not ready. One or more dimensions consistently score below your threshold. Go back to Step 4: Build, fix the identified building blocks, and return to Test. Re-run the full eval suite — do not skip scenarios that passed previously, because a fix in one area can affect others.
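
The decision rule can be sketched as a threshold check per dimension. Two assumptions are made here for illustration: "consistently below" is interpreted as the dimension's average across scenarios falling below the threshold, and the threshold is a number you choose yourself.

```python
from statistics import mean


def readiness(scores: dict[str, dict[str, int]],
              threshold: float) -> tuple[str, list[str]]:
    """Return ("ready", []) or ("not ready", [dimensions below threshold]).

    A dimension counts as below threshold when its average across all
    scenarios is less than the threshold (an assumed interpretation).
    """
    dimensions = sorted({dim for d in scores.values() for dim in d})
    below = [
        dim for dim in dimensions
        if mean([d[dim] for d in scores.values() if dim in d]) < threshold
    ]
    return ("not ready", below) if below else ("ready", [])
```

The returned list doubles as the set of specific targets to take back to Build.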

How to Use This

This step is facilitated by the test skill from the AI Workflow Framework. See Set Up the Skills for installation instructions across all supported platforms.

How to start: Say “run the test skill” (or “test the workflow”). This works on every platform. On Claude Code or Cowork with the plugin installed, you can also type /handsonai:test.

Start with this prompt:

Test my workflow against the acceptance criteria in my Workflow Requirements.

The skill guides you through the smoke test, eval suite, building block evals, baseline establishment, and diagnosis process.

Example prompts

"Test my workflow against the evaluation criteria"
→ Guides you through the smoke test, eval suite, and baseline

"My workflow output is too generic — help me diagnose"
→ Runs targeted building block evals to find the weak link

Related

Deconstruct Workflows — where acceptance criteria, example scenarios, and golden examples are captured
Build — where you fix issues identified during testing
Run — the next step once your workflow passes testing
Improve — where you re-run evals to detect regression on deployed workflows