Guardrails

What It Is

Guardrails are automated rules and constraints that govern what an agent can and cannot do. They operate continuously during agent execution, checking inputs, outputs, and actions against defined policies — and blocking or modifying anything that violates those policies.

Unlike human-in-the-loop controls, guardrails are automated. They don’t require a human to review every action — they enforce rules programmatically, allowing the agent to operate autonomously within defined boundaries.

Why It Matters

As agents gain more autonomy — making decisions, calling tools, taking actions — the potential for harm increases. An agent without guardrails can hallucinate confidently, leak sensitive data, make unauthorized purchases, or provide advice it shouldn’t.

Guardrails are the difference between a useful autonomous system and a liability. They let you grant agents more autonomy (which makes them more useful) while maintaining safety (which makes them trustworthy). Production agent systems always need guardrails — the question is not whether to add them, but which ones and where.

How It Works

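Guardrails can be applied at multiple points in the agent pipeline. Input and output guards sit on either side of the model, as the diagram shows; action guardrails apply separately, at the point where the agent calls a tool: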

┌────────┐   ┌──────────┐   ┌────────┐   ┌──────────┐   ┌────────┐
│ Input  │──▸│  Input   │──▸│ Agent  │──▸│  Output  │──▸│ Output │
│        │   │  Guards  │   │ (LLM)  │   │  Guards  │   │        │
└────────┘   └──────────┘   └────────┘   └──────────┘   └────────┘
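One way to picture this flow in code is a wrapper that runs a list of input guards, calls the agent, then runs a list of output guards. The sketch below is a minimal illustration under stated assumptions, not a reference implementation: `GuardResult`, `run_guards`, `guarded_call`, and the toy guards are hypothetical names, and `echo_agent` stands in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class GuardResult:
    """Outcome of one guard: the (possibly modified) text, or a block reason."""
    text: str
    blocked_reason: Optional[str] = None


# A guard takes text and returns a GuardResult (hypothetical interface).
Guard = Callable[[str], GuardResult]


def run_guards(text: str, guards: List[Guard]) -> GuardResult:
    """Apply guards in order; stop at the first one that blocks."""
    for guard in guards:
        result = guard(text)
        if result.blocked_reason is not None:
            return result
        text = result.text  # a guard may modify the text it passes along
    return GuardResult(text)


def guarded_call(user_input: str, agent: Callable[[str], str],
                 input_guards: List[Guard], output_guards: List[Guard]) -> str:
    """Input -> input guards -> agent -> output guards -> output."""
    checked_in = run_guards(user_input, input_guards)
    if checked_in.blocked_reason is not None:
        return f"Request declined: {checked_in.blocked_reason}"
    draft = agent(checked_in.text)
    checked_out = run_guards(draft, output_guards)
    if checked_out.blocked_reason is not None:
        return "Sorry, I can't share that response."
    return checked_out.text


# Toy guards and a stand-in agent, for illustration only.
def max_length_guard(text: str) -> GuardResult:
    if len(text) > 200:
        return GuardResult(text, blocked_reason="input too long")
    return GuardResult(text)


def strip_whitespace_guard(text: str) -> GuardResult:
    return GuardResult(text.strip())


def echo_agent(prompt: str) -> str:
    return f"You said: {prompt}"
```

With these definitions, `guarded_call("  hello  ", echo_agent, [max_length_guard, strip_whitespace_guard], [])` returns `"You said: hello"`, while an input longer than 200 characters is declined before the agent is ever called.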

Types of guardrails

Input guardrails — Filter or reject problematic inputs before they reach the agent:

Block prompt injection attempts
Reject off-topic requests (“I can help with order management, but I can’t help with medical advice”)
Sanitize sensitive data (redact credit card numbers before processing)
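Two of these input checks can be sketched with the standard library alone. This is a rough illustration, assuming a keyword allowlist in place of a real topic classifier and a simple regex in place of a proper card detector; `ALLOWED_KEYWORDS`, `redact_cards`, `is_on_topic`, and `check_input` are hypothetical names. Prompt injection detection is not covered here, since it generally needs more than pattern matching.

```python
import re

# Rough pattern for 13-16 digit card numbers with optional spaces or dashes.
# Assumption: a real system would add a Luhn check and a dedicated PII detector,
# since this pattern also matches other long digit runs.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")

# Hypothetical keyword allowlist standing in for a real topic classifier.
ALLOWED_KEYWORDS = {"order", "refund", "return", "exchange", "shipping", "delivery"}

OFF_TOPIC_REPLY = "I can help with order management, but I can't help with that."


def redact_cards(text: str) -> str:
    """Replace anything that looks like a card number with a placeholder."""
    return CARD_PATTERN.sub("[REDACTED CARD]", text)


def is_on_topic(text: str) -> bool:
    """True if the text contains at least one allowlisted keyword."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & ALLOWED_KEYWORDS)


def check_input(text: str):
    """Return (ok, text): sanitized text if ok, otherwise a refusal message."""
    if not is_on_topic(text):
        return False, OFF_TOPIC_REPLY
    return True, redact_cards(text)
```

The scope check runs first so that off-topic requests are refused without further processing; on-topic text is redacted before it reaches the agent.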

Output guardrails — Check agent responses before they reach the user:

Block responses containing personally identifiable information (PII)
Ensure responses stay within the agent’s approved topic area
Verify factual claims against a knowledge base
Enforce tone and brand voice guidelines
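The PII check in this list might look like the following sketch, which blocks a response outright and substitutes a fallback message. The two regexes are illustrative assumptions only, and `PII_PATTERNS`, `OutputVerdict`, and `check_output` are hypothetical names; fact verification and tone checks would need a knowledge base or a classifier and are not shown.

```python
import re
from typing import List, NamedTuple

# Illustrative patterns only. Assumption: production systems use a dedicated
# PII detection service rather than two regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

FALLBACK = "I'm sorry, I can't share that information."


class OutputVerdict(NamedTuple):
    allowed: bool
    text: str              # what the user actually sees
    violations: List[str]  # names of the patterns that matched


def check_output(response: str) -> OutputVerdict:
    """Block the response if any PII pattern matches; otherwise pass it through."""
    violations = [name for name, pat in PII_PATTERNS.items() if pat.search(response)]
    if violations:
        return OutputVerdict(False, FALLBACK, violations)
    return OutputVerdict(True, response, [])
```

Returning the list of violations alongside the verdict lets the calling code log why a response was blocked without exposing the blocked text to the user.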

Action guardrails — Restrict what tools the agent can call and with what parameters:

Limit refund amounts (“agent can issue refunds up to $100; anything higher requires approval”)
Restrict database access to read-only queries
Block destructive operations (delete, overwrite)
Enforce rate limits on API calls
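Action guardrails usually sit between the agent's proposed tool call and its execution. The sketch below shows a tool allowlist, a crude read-only SQL check, and a sliding-window rate limit. It assumes tool calls arrive as a name plus an argument dictionary; the tool names, `RateLimiter`, `is_read_only`, and `authorize` are all hypothetical.

```python
import time
from collections import deque

# Hypothetical tool names. Assumption: the agent emits tool calls as
# (tool_name, arguments) pairs that pass through this check before running.
ALLOWED_TOOLS = {"query_db", "send_email"}
WRITE_KEYWORDS = ("insert", "update", "delete", "drop", "alter", "truncate")


class RateLimiter:
    """Sliding-window limiter: at most max_calls per window_seconds."""

    def __init__(self, max_calls: int, window_seconds: float):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.calls = deque()

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window_seconds:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            return False
        self.calls.append(now)
        return True


def is_read_only(sql: str) -> bool:
    """Crude check: a single SELECT statement with no write keywords."""
    stripped = sql.strip().rstrip(";").lower()
    if ";" in stripped or not stripped.startswith("select"):
        return False
    return not any(word in stripped.split() for word in WRITE_KEYWORDS)


def authorize(tool: str, args: dict, limiter: RateLimiter, now=None):
    """Return (allowed, reason) for a proposed tool call."""
    if tool not in ALLOWED_TOOLS:
        return False, f"tool '{tool}' is not permitted"
    if tool == "query_db" and not is_read_only(args.get("sql", "")):
        return False, "only read-only queries are permitted"
    if not limiter.allow(now):
        return False, "rate limit exceeded"
    return True, "ok"
```

A keyword check like `is_read_only` is easy to bypass on its own, so it is best treated as a second layer on top of database credentials that only grant read access. A parameter cap such as the refund limit would be one more branch of the same shape in `authorize`.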

Constitutional guardrails — Baked into the model’s behavior through system prompts or fine-tuning:

Anthropic’s Constitutional AI trains models to follow a set of principles
System prompts that define the agent’s role and boundaries
Instructions like “Never provide medical, legal, or financial advice”

Example

Customer exchange scenario

An exchange-processing agent has these guardrails:

Guardrail	Type	Rule
Refund cap	Action	Cannot issue refunds over $200 without escalation
Final sale block	Action	Cannot process returns on items marked “final sale”
PII filter	Output	Redacts credit card numbers and SSNs from responses
Scope limit	Input	Rejects requests unrelated to order management
Policy compliance	Output	Verifies that the response cites the correct return policy

A customer asks: “Can I return this final-sale item?” The final sale block, an action guardrail, stops the return from being processed, and the agent responds: “I’m sorry, final-sale items are not eligible for return or exchange per our policy. I can help you find an alternative product if you’d like.”
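The two action rules from the table could be expressed as a single decision function. This is a sketch under assumptions: the `OrderItem` record and its field names are hypothetical, and the refund amount is taken to be the item price.

```python
from dataclasses import dataclass

REFUND_CAP = 200.00  # from the table: larger refunds require escalation


@dataclass
class OrderItem:
    # Hypothetical record shape; field names are assumptions for illustration.
    sku: str
    price: float
    final_sale: bool


def check_return(item: OrderItem) -> str:
    """Return 'block', 'escalate', or 'allow' for a requested return."""
    if item.final_sale:
        return "block"      # final sale block: never processed
    if item.price > REFUND_CAP:
        return "escalate"   # refund cap: hand off for approval
    return "allow"
```

In the scenario above, the final-sale item yields `"block"`, and the agent then explains the policy instead of starting the return.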

Code generation agent

A coding agent has guardrails that prevent it from generating insecure code or writing files outside its workspace:

Block — SQL queries built with string concatenation (SQL injection risk)
Warn — API keys or secrets hardcoded in source code
Enforce — All file writes must go through a sandbox directory
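These three rules can be approximated with pattern checks and a path check. The sketch below is heuristic and assumes a sandbox root of `/tmp/agent_sandbox`; the regexes, `review_code`, and `is_inside_sandbox` are hypothetical, and a real system would more likely rely on an AST-based linter and OS-level isolation. The SQL pattern only catches `+` concatenation, not f-strings or `format` calls.

```python
import re
from pathlib import Path

# Heuristic patterns, assumed for illustration; they will miss many cases.
SQL_CONCAT = re.compile(
    r"""["'][^"'\n]*\b(?:SELECT|INSERT|UPDATE|DELETE)\b[^"'\n]*["']\s*\+""",
    re.IGNORECASE,
)
HARDCODED_SECRET = re.compile(
    r"""(?:api_key|secret|password|token)\s*=\s*["'][^"'\n]+["']""",
    re.IGNORECASE,
)

SANDBOX = Path("/tmp/agent_sandbox")  # hypothetical sandbox root


def review_code(source: str) -> dict:
    """Return {'block': [...], 'warn': [...]} findings for generated code."""
    findings = {"block": [], "warn": []}
    if SQL_CONCAT.search(source):
        findings["block"].append("SQL built with string concatenation")
    if HARDCODED_SECRET.search(source):
        findings["warn"].append("hardcoded secret")
    return findings


def is_inside_sandbox(target: str) -> bool:
    """True only if target resolves to a path under SANDBOX."""
    root = SANDBOX.resolve()
    # Resolving first collapses ".." segments and symlinks, so a path such as
    # "../../etc/passwd" cannot escape the sandbox root.
    resolved = (root / target).resolve()
    return resolved == root or root in resolved.parents
```

A write to `notes/out.txt` passes the sandbox check, while `../../etc/passwd` and the absolute path `/etc/passwd` are both rejected.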

When to Use It

Any production agent system — guardrails are not optional for deployed agents
Agents with access to tools that can take real-world actions (payments, emails, database writes)
Customer-facing agents where brand safety matters
Regulated industries (healthcare, finance, legal) with compliance requirements
Multi-agent systems where individual agents need scoped permissions

Related Patterns

Human-in-the-Loop: Guardrails handle routine constraints automatically; human review handles exceptions and edge cases
Tool Use: Action guardrails govern which tools the agent can access and with what parameters
Reflection: Self-reflection is a soft guardrail; automated guardrails are hard constraints