
Step 5: Test

Part of: AI Workflow Framework

You’ve just finished Build (Step 4). You should have:

  • Platform artifacts — prompts, skills, agents, and configs generated for your platform
  • Context artifacts — style guides, reference materials, and examples
  • Design Spec ([name]-design-spec.md) — which includes the evaluation criteria and test scenarios defined during Design

Your first run is a test, not a deployment. The goal is to verify that the workflow produces good output before you share it with your team or use it on real work.

The skill runs seven phases. The sections that follow expand on each:

  1. Load artifacts and spec — Read the Design Spec (for evaluation criteria and test scenarios) and locate your platform artifacts.
  2. Smoke test — Run the workflow once with a realistic scenario. Check that it runs, produces output, and uses the right format.
  3. Full eval suite — Run each test scenario from the Design Spec. Score each output on a 1–5 scale across the evaluation dimensions.
  4. Building block evals — Test individual components (skills, context, agents) in isolation to pinpoint weak links.
  5. Establish baseline — Calculate average scores across all scenarios and dimensions. Record for future comparison.
  6. Diagnose and fix — Map problems to building blocks (generic output → context issue, skipped steps → prompt issue, etc.) and identify what to fix in Build.
  7. Readiness decision — Ready to deploy? Move to Run. Not ready? Return to Build with specific targets.

Start with a single test — pick one realistic scenario and run the workflow end to end. This is your smoke test. You are checking the basics:

  • Does it run at all? — Can you execute every step without errors?
  • Does it produce output? — Is there a result, or does it stall?
  • Is the output in the right format? — Does it look like what you expected?

If any of these fail, go back to Build and fix the obvious issue before continuing. Common first-run problems:

  • Missing context files the model references but cannot find
  • MCP connections that are not configured or are not responding
  • Skills that are installed but not correctly linked to the workflow
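The first of these problems can be caught before the smoke test with a quick file check. The following is a minimal sketch, not part of the framework: the function name `find_missing_files` and the file paths are hypothetical, so substitute the files your prompts actually reference.

```python
from pathlib import Path


def find_missing_files(expected_paths, base_dir="."):
    """Return the expected paths that do not exist as files under base_dir."""
    base = Path(base_dir)
    return [p for p in expected_paths if not (base / p).is_file()]


# Hypothetical file names; replace with the context files your workflow references.
expected = ["context/style-guide.md", "context/examples.md"]
missing = find_missing_files(expected)
if missing:
    print("Missing context files:", ", ".join(missing))
else:
    print("All expected context files found.")
```

This only confirms the files exist on disk. It does not confirm that the model can reach them, so still watch for missing-context symptoms during the run itself.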

Once the smoke test passes, move to a full evaluation using the criteria defined during Design. Your Design Spec includes evaluation dimensions (the qualities you care about — accuracy, tone, completeness, specificity) and test scenarios (realistic inputs that exercise different parts of the workflow).

For each test scenario:

  1. Run the workflow with the test input

  2. Score the output on each evaluation dimension using a 1–5 scale:

    Score | Meaning
    5 | Excellent — ready to use as-is
    4 | Good — minor edits only
    3 | Acceptable — needs some rework but the structure is right
    2 | Weak — significant gaps, wrong direction on one or more dimensions
    1 | Failure — output is unusable or fundamentally off-target
  3. Note specific issues — What exactly was wrong? Which dimension scored low and why?

Record your scores. These become your baseline for measuring improvement (during this test cycle) and regression (during Improve).
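One way to keep the scores and issue notes together is a small record per scenario. This is a sketch under assumptions: the `ScenarioResult` class, the scenario name, and the example scores are all hypothetical, and a spreadsheet works just as well.

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioResult:
    scenario: str
    scores: dict  # dimension name -> integer score from 1 to 5
    notes: list = field(default_factory=list)

    def __post_init__(self):
        # Reject anything outside the 1-5 scale so bad entries fail early.
        for dimension, score in self.scores.items():
            if not isinstance(score, int) or not 1 <= score <= 5:
                raise ValueError(
                    f"{dimension}: score must be an integer from 1 to 5, got {score!r}"
                )

    def low_dimensions(self, threshold=3):
        """Dimensions that scored below the threshold and deserve a note."""
        return sorted(d for d, s in self.scores.items() if s < threshold)


# Hypothetical scenario and scores, for illustration only.
result = ScenarioResult(
    scenario="hypothetical scenario A",
    scores={"accuracy": 4, "tone": 2, "completeness": 3, "specificity": 4},
    notes=["tone: too formal for the intended audience"],
)
print(result.low_dimensions())
```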

If the overall workflow scores poorly, test individual building blocks separately to isolate the problem:

  • Test a skill by running it with sample inputs outside the full workflow. Does it produce the expected output on its own?
  • Test context by asking the model a question that should be answerable from your reference materials. Does it find and use the right information?
  • Test an agent by giving it a single task from the workflow. Does it use its tools correctly? Does it make reasonable decisions?

Isolating building blocks helps you find the weak link without guessing.

After running the full eval suite, calculate an average score across all scenarios and dimensions. This is your baseline — the starting quality level of your workflow.

Record the baseline alongside your individual scores. You will use this number in two ways:

  1. During this test cycle — to measure whether your fixes are improving things
  2. During Improve (Step 7) — to detect quality regression over time
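The baseline calculation is a plain average, which can be sketched as follows. The scenario names and scores are made up for illustration, and the per-dimension breakdown is an optional addition that the framework does not require.

```python
from statistics import mean

# Hypothetical scores: scenario -> dimension -> score on the 1-5 scale.
scores = {
    "scenario A": {"accuracy": 4, "tone": 3, "completeness": 4},
    "scenario B": {"accuracy": 5, "tone": 2, "completeness": 3},
}


def baseline(scores):
    """Average of every score across all scenarios and dimensions."""
    return mean(s for dims in scores.values() for s in dims.values())


def dimension_averages(scores):
    """Average per dimension, to show which quality pulls the baseline down."""
    by_dimension = {}
    for dims in scores.values():
        for name, score in dims.items():
            by_dimension.setdefault(name, []).append(score)
    return {name: mean(values) for name, values in by_dimension.items()}


print(baseline(scores))            # 21 / 6 = 3.5
print(dimension_averages(scores))
```

A single overall number can hide a weak dimension, so it may help to record the per-dimension averages next to the baseline.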

When something is off, the fix depends on what went wrong. Use this table to map problems to building blocks:

Problem | What to fix
Output is generic or off-brand | Add more context — examples, style guides, reference materials
Steps are skipped or misunderstood | Refine the prompt — make the instructions more explicit
A step needs domain expertise the AI does not have | Build a skill for that step — codify the expertise into a reusable routine
The AI needs to make unpredictable decisions | Convert from prompt to agent — let the AI plan its approach
Output format is wrong | Check the prompt’s output format instructions — add explicit formatting examples
The model ignores your reference materials | Check that context files are correctly linked and formatted — the model may not be finding them
Tool connections fail during execution | Verify MCP connections — test each tool integration independently

After each fix, re-run the affected test scenarios and compare scores to your previous run. You are looking for improvement on the dimensions that scored low.
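Comparing two runs comes down to a per-dimension difference. A minimal sketch, assuming you keep per-dimension averages for each run; the function name `compare_runs` and the numbers are hypothetical.

```python
def compare_runs(previous, current):
    """Per-dimension change (current minus previous) for dimensions in both runs."""
    return {dim: current[dim] - previous[dim] for dim in previous if dim in current}


# Hypothetical per-dimension averages before and after a fix.
previous = {"accuracy": 4.0, "tone": 2.5, "completeness": 3.5}
current = {"accuracy": 4.0, "tone": 3.5, "completeness": 3.0}

for dim, delta in compare_runs(previous, current).items():
    status = "improved" if delta > 0 else "regressed" if delta < 0 else "unchanged"
    print(f"{dim}: {delta:+.1f} ({status})")
```

In this made-up example the fix raised tone but lowered completeness, which is the kind of side effect a re-run is meant to surface.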

If you chose the code-first architecture approach during Design, you may encounter additional issues:

Problem | What to check
API calls return errors | Verify API keys, rate limits, and request format match the provider’s current spec
Agent does not use tools | Check that tools are correctly registered in the agent configuration and that permissions are granted
Multi-agent handoffs fail | Verify the output format of each agent matches the expected input format of the next agent in the pipeline
Scheduled runs produce different results | Check for time-dependent context (dates, market data) that may have changed between runs
SDK version mismatch | Ensure your SDK version matches the documentation the model used during Build — update if needed
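For the multi-agent handoff check, a small validation step between agents can show where the formats diverge. The sketch below assumes agents exchange JSON objects, which your pipeline may not do. The function `check_handoff` and the key names are hypothetical.

```python
import json


def check_handoff(raw_output, required_keys):
    """List the problems that would stop the next agent from using this output."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"output is not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["output is not a JSON object"]
    return [f"missing key: {key}" for key in required_keys if key not in payload]


# Hypothetical output from one agent and the fields the next agent needs.
raw = '{"summary": "Draft ready", "sources": []}'
print(check_handoff(raw, ["summary", "sources", "audience"]))
```

An empty list means the handoff has the expected shape. It says nothing about whether the content itself is good, which is what the eval scores are for.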

After testing and iterating, you reach one of two outcomes:

Ready to deploy. You can run the workflow on a new scenario and trust the output without heavy editing. Your eval scores are at or above the threshold you set. Move to Step 6: Run.

Not ready. One or more dimensions consistently score below your threshold. Go back to Step 4: Build, fix the identified building blocks, and return to Test. Re-run the full eval suite — do not skip scenarios that passed previously, because a fix in one area can affect others.
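The readiness decision can be expressed as a threshold check per dimension. This is a sketch with a hypothetical function name, averages, and threshold. It treats a score at or above the threshold as passing, as described above.

```python
def readiness(dimension_averages, threshold):
    """Return (ready, failing), where failing lists dimensions below the threshold."""
    failing = sorted(d for d, avg in dimension_averages.items() if avg < threshold)
    return (not failing, failing)


# Hypothetical averages and threshold; use the threshold you set for your workflow.
ready, failing = readiness(
    {"accuracy": 4.5, "tone": 2.5, "completeness": 3.5}, threshold=3.5
)
print("Ready to deploy" if ready else f"Return to Build; fix: {', '.join(failing)}")
```

The numbers only cover the score criterion. Whether you trust the output on a new scenario without heavy editing remains a judgment call.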

This step is facilitated by the test skill from the AI Workflow Framework. See Set Up the Skills for installation instructions across all supported platforms.

Command: /handsonai:test (Claude Code) — or invoke by name on any other platform.

Platform compatibility: Claude Code ✓  |  Claude.ai ✓  |  Claude Cowork ✓  |  ChatGPT ✓  |  Gemini ✓  |  M365 Copilot ✓  |  Cursor / Codex / Antigravity ✓

Start with this prompt:

Test my workflow against the evaluation criteria in the Design Spec.

The skill guides you through the smoke test, eval suite, building block evals, baseline establishment, and diagnosis process.

"Test my workflow against the evaluation criteria"
→ Guides you through the smoke test, eval suite, and baseline
"My workflow output is too generic — help me diagnose"
→ Runs targeted building block evals to find the weak link
  • Design Your AI Workflow — where evaluation criteria and test scenarios are defined
  • Build — where you fix issues identified during testing
  • Run — the next step once your workflow passes testing
  • Improve — where you re-run evals to detect regression on deployed workflows