Step 5: Test¶
Part of: Business-First AI Framework
Where You Are¶
You've just finished Build (Step 4). You should have:
- Platform artifacts — prompts, skills, agents, and configs generated for your platform
- Context artifacts — style guides, reference materials, and examples
- AI Building Block Spec (`[name]-building-block-spec.md`) — which includes the evaluation criteria and test scenarios defined during Design
Your first run is a test, not a deployment. The goal is to verify that the workflow produces good output before you share it with your team or use it on real work.
Your First Run¶
Start with a single test — pick one realistic scenario and run the workflow end to end. This is your smoke test. You are checking the basics:
- Does it run at all? — Can you execute every step without errors?
- Does it produce output? — Is there a result, or does it stall?
- Is the output in the right format? — Does it look like what you expected?
If any of these fail, go back to Build and fix the obvious issue before continuing. Common first-run problems:
- Missing context files the model references but cannot find
- MCP connections that are not configured or are not responding
- Skills that are installed but not correctly linked to the workflow
Don't optimize yet
The first run is about confirming the workflow functions. Resist the urge to fine-tune output quality — that comes next. Get it running, then evaluate.
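The smoke-test checklist above can be sketched as a small script. This is a minimal sketch, assuming your workflow can be invoked as a Python function — `run_workflow` below is a hypothetical stand-in for however your platform actually exposes the workflow (API call, CLI, or SDK):

```python
# Hypothetical stand-in for invoking your workflow. Replace the body with
# your platform's actual API or CLI call.
def run_workflow(scenario: str) -> str:
    return f"# Draft\n\nOutput for: {scenario}"

def smoke_test(scenario: str) -> bool:
    try:
        output = run_workflow(scenario)          # Does it run at all?
    except Exception as exc:
        print(f"FAIL: workflow raised an error: {exc}")
        return False
    if not output or not output.strip():         # Does it produce output?
        print("FAIL: workflow returned empty output")
        return False
    if not output.lstrip().startswith("#"):      # Right format? (here: markdown with a heading)
        print("FAIL: output is not in the expected format")
        return False
    print("PASS: smoke test")
    return True

smoke_test("Draft a kickoff summary for a new client onboarding")
```

The format check is whatever "right format" means for your workflow — a markdown heading is just an example.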
Structured Evaluation¶
Once the smoke test passes, move to a full evaluation using the criteria defined during Design. Your AI Building Block Spec includes evaluation dimensions (the qualities you care about — accuracy, tone, completeness, specificity) and test scenarios (realistic inputs that exercise different parts of the workflow).
Run the Eval Suite¶
For each test scenario:

- Run the workflow with the test input
- Score the output on each evaluation dimension using a 1-5 scale:

| Score | Meaning |
|---|---|
| 5 | Excellent — ready to use as-is |
| 4 | Good — minor edits only |
| 3 | Acceptable — needs some rework but the structure is right |
| 2 | Weak — significant gaps, wrong direction on one or more dimensions |
| 1 | Failure — output is unusable or fundamentally off-target |

- Note specific issues — What exactly was wrong? Which dimension scored low and why?
Record your scores. These become your baseline for measuring improvement (during this test cycle) and regression (during Improve).
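One lightweight way to record scores is a nested mapping keyed by scenario and dimension, with a parallel note for each low score. The scenario names, dimensions, and numbers below are illustrative, not prescribed by the framework:

```python
# Eval scores per scenario and dimension (all names/values illustrative).
scores = {
    "new-client-proposal": {"accuracy": 4, "tone": 3, "completeness": 4},
    "renewal-proposal":    {"accuracy": 5, "tone": 4, "completeness": 3},
}

# Specific issues noted during review, keyed by scenario.
issues = {
    "new-client-proposal": "Tone too casual for an executive audience",
    "renewal-proposal":    "Missing pricing section",
}

# Flag every score below an acceptable floor so fixes can be targeted.
for scenario, dims in scores.items():
    for dim, score in dims.items():
        if score < 4:
            print(f"{scenario} / {dim}: {score} -> {issues[scenario]}")
```

A spreadsheet works just as well; what matters is that scores and issue notes are kept together per scenario.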
Evaluate Building Blocks in Isolation¶
If the overall workflow scores poorly, test individual building blocks separately to isolate the problem:
- Test a skill by running it with sample inputs outside the full workflow. Does it produce the expected output on its own?
- Test context by asking the model a question that should be answerable from your reference materials. Does it find and use the right information?
- Test an agent by giving it a single task from the workflow. Does it use its tools correctly? Does it make reasonable decisions?
Isolating building blocks helps you find the weak link without guessing.
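A tiny harness makes these isolation checks repeatable. This sketch assumes each building block can be exercised as a callable; the three stand-ins below are hypothetical placeholders, not real platform interfaces:

```python
def check_block(name, run, sample_input, looks_right):
    """Run one building block on its own and report pass/fail."""
    try:
        output = run(sample_input)
    except Exception as exc:
        return f"{name}: FAIL (raised {exc!r})"
    return f"{name}: PASS" if looks_right(output) else f"{name}: FAIL (bad output)"

# Hypothetical stand-ins for a skill, a context lookup, and an agent task.
results = [
    check_block("skill",   lambda s: s.upper(), "draft",
                lambda o: o == "DRAFT"),
    check_block("context", lambda q: "our refund policy allows 30 days",
                "What is our refund policy?", lambda o: "refund" in o),
    check_block("agent",   lambda t: {"status": "done"},
                "summarize meeting", lambda o: o.get("status") == "done"),
]
for line in results:
    print(line)
```

The `looks_right` check encodes "expected output" per block, which is how the harness finds the weak link without re-running the whole workflow.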
Establish Your Baseline¶
After running the full eval suite, calculate an average score across all scenarios and dimensions. This is your baseline — the starting quality level of your workflow.
Record the baseline alongside your individual scores. You will use this number in two ways:
- During this test cycle — to measure whether your fixes are improving things
- During Improve (Step 7) — to detect quality regression over time
What's a passing score?
There is no universal threshold. A workflow that drafts internal meeting notes might be fine at 3.5 average. A workflow that generates client-facing proposals might need 4.5. You decide what "ready" means based on how much manual editing you are willing to accept.
Diagnose and Fix¶
When something is off, the fix depends on what went wrong. Use this table to map problems to building blocks:
| Problem | What to fix |
|---|---|
| Output is generic or off-brand | Add more context — examples, style guides, reference materials |
| Steps are skipped or misunderstood | Refine the prompt — make the instructions more explicit |
| A step needs domain expertise the AI does not have | Build a skill for that step — codify the expertise into a reusable routine |
| The AI needs to make unpredictable decisions | Convert from prompt to agent — let the AI plan its approach |
| Output format is wrong | Check the prompt's output format instructions — add explicit formatting examples |
| The model ignores your reference materials | Check that context files are correctly linked and formatted — the model may not be finding them |
| Tool connections fail during execution | Verify MCP connections — test each tool integration independently |
After each fix, re-run the affected test scenarios and compare scores to your previous run. You are looking for improvement on the dimensions that scored low.
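Comparing runs can be as simple as a per-dimension diff of the two score sets. The numbers below are illustrative; note that this also surfaces regressions, not just improvements:

```python
# Scores for one scenario before and after a fix (illustrative values).
before = {"accuracy": 4, "tone": 3, "completeness": 4}
after  = {"accuracy": 4, "tone": 4, "completeness": 3}

# Report the direction of change per dimension.
for dim in before:
    delta = after[dim] - before[dim]
    label = "improved" if delta > 0 else "regressed" if delta < 0 else "unchanged"
    print(f"{dim}: {before[dim]} -> {after[dim]} ({label})")
```

Here the tone fix worked, but completeness slipped, which is exactly why re-running previously passing scenarios matters.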
Code-First Troubleshooting¶
If you chose the code-first architecture approach during Design, you may encounter additional issues:
| Problem | What to check |
|---|---|
| API calls return errors | Verify API keys, rate limits, and request format match the provider's current spec |
| Agent does not use tools | Check that tools are correctly registered in the agent configuration and that permissions are granted |
| Multi-agent handoffs fail | Verify the output format of each agent matches the expected input format of the next agent in the pipeline |
| Scheduled runs produce different results | Check for time-dependent context (dates, market data) that may have changed between runs |
| SDK version mismatch | Ensure your SDK version matches the documentation the model used during Build — update if needed |
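For the multi-agent handoff row in particular, a lightweight guard is to validate that each agent's output carries the fields the next agent expects before passing it along. This is a sketch; the field names are hypothetical, and a real pipeline might use a schema library instead:

```python
# Fields the downstream agent expects, with their types (hypothetical).
REQUIRED_FIELDS = {"summary": str, "action_items": list}

def validate_handoff(payload: dict) -> list:
    """Return a list of problems; an empty list means the handoff is well-formed."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(f"wrong type for {field}: {type(payload[field]).__name__}")
    return problems

print(validate_handoff({"summary": "Q3 review", "action_items": []}))  # []
print(validate_handoff({"summary": 42}))
```

Running this check at every handoff turns a silent mid-pipeline failure into an explicit, diagnosable error.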
Readiness Decision¶
After testing and iterating, you reach one of two outcomes:
Ready to deploy. You can run the workflow on a new scenario and trust the output without heavy editing. Your eval scores are at or above the threshold you set. Move to Step 6: Run.
Not ready. One or more dimensions consistently score below your threshold. Go back to Step 4: Build, fix the identified building blocks, and return to Test. Re-run the full eval suite — do not skip scenarios that passed previously, because a fix in one area can affect others.
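The readiness decision reduces to comparing your current eval average against the threshold you chose. A minimal sketch with illustrative numbers:

```python
# The threshold you decided "ready" means (illustrative).
THRESHOLD = 4.0

# All dimension scores from the latest full eval run (illustrative).
scores = [4, 4, 5, 3, 4, 4]
average = sum(scores) / len(scores)  # 4.0

if average >= THRESHOLD:
    print(f"Ready to deploy (avg {average:.2f} >= {THRESHOLD})")
else:
    print(f"Not ready (avg {average:.2f} < {THRESHOLD}), back to Build")
```

A single low dimension can hide inside a passing average, so it is worth also checking that no individual dimension consistently falls below your floor.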
2-4 iterations is normal
Most workflows need multiple rounds of Build-then-Test before they are ready for deployment. Each iteration should be targeted — fix a specific issue, re-test, and measure improvement. If you have been through four iterations and scores are not improving, consider going back to Design (Step 3) to re-examine your architecture decisions.
How to Use This¶
This step is facilitated by the `test` skill in the Business-First AI Framework. See Get the Skills for installation instructions across all supported platforms.
Start with this prompt:
The skill guides you through the smoke test, eval suite, building block evals, baseline establishment, and diagnosis process.
Related¶
- Design Your AI Workflow — where evaluation criteria and test scenarios are defined
- Build — where you fix issues identified during testing
- Run — the next step once your workflow passes testing
- Improve — where you re-run evals to detect regression on deployed workflows