A Motivational Example
John built a travel-planning AI agent that worked flawlessly during development.
Three weeks after release, users suddenly reported:
- repeated itinerary steps
- broken tool calls
- incorrect weather reasoning
- the agent getting stuck in loops
John’s code hadn’t changed.
The foundation model (FM) had. And because his tests only checked exact outputs, nothing caught the regression.
This is the modern testing problem:
You’re not just testing code anymore.
You’re testing behavior.
Why Agent Testing Is Different
Agents introduce variability everywhere:
- prompts
- FM reasoning
- tool usage
- branching workflows
- external context
- FM updates
Traditional:
assertEqual(actual, expected)
Agent reality:
assert "intent" in output.lower()
You shift from exact outputs → signals of correct behavior.
Where Edge Cases Come From (4 Sources)
- Prompt brittleness
- Tool invocation failures
- Workflow loops
- FM non-determinism
These are the real failure modes of agent systems.
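The last two are especially testable. As a sketch of a loop guard, assume a hypothetical Agent whose run result exposes a trace of tool-call steps (max_steps, result.steps, step.tool, and step.args are stand-ins for whatever your framework provides):

def test_agent_terminates(agent):
    # A query that previously triggered the looping bug.
    result = agent.run("plan a 3-day Tokyo itinerary", max_steps=10)
    # Signal 1: the run stayed within its step budget.
    assert len(result.steps) <= 10
    # Signal 2: no tool was called twice with identical arguments.
    calls = [(step.tool, repr(step.args)) for step in result.steps]
    assert len(calls) == len(set(calls))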
How to Unit Test Agent Behavior (Compact Guide)
1. Freeze randomness (deterministic mode)
agent = Agent(..., temperature=0)
This makes tests reproducible in principle; hosted FMs can still vary slightly between calls, so treat temperature=0 as reducing noise, not eliminating it.
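In pytest, a shared fixture keeps that configuration in one place. A minimal sketch, assuming a hypothetical Agent constructor; MODEL_ID is a placeholder, and the seed parameter only helps where the provider supports one:

import pytest

MODEL_ID = "your-model-here"  # placeholder

@pytest.fixture
def agent():
    # temperature=0 plus a fixed seed pins down sampling where supported
    return Agent(model=MODEL_ID, temperature=0, seed=42)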
2. Test signals, not sentences
response = agent.run("Find flights to LAX")
assert "flight" in response.lower()
assert "lax" in response.lower()
3. Parameterize inputs
@pytest.mark.parametrize("query", [
    "find flights",
    "get flights",
    "need flights",
])
def test_flight_query(agent, query):
    assert "flight" in agent.run(query).lower()
4. Mock tools, not the FM
mock_price_api.return_value = {"total": 199}
response = agent.run("book a flight")
assert "199" in response
5. Include negative tests
with pytest.raises(ToolResponseError):
    agent.run("query with invalid schema")
The Blind Spot: Prompt Testing
A recent study found:
Only ~1% of tests examine prompt behavior.
But prompts break silently when:
- the FM updates
- tools change
- workflows shift
A simple prompt regression test:
response = agent.run("summarize: apple")
assert "fruit" in response.lower()
The Mindset Shift
Assertions validate outputs.
Signals validate behavior.
Unit testing agents is about defining what “good reasoning” looks like — and checking for it.
Credit
This post is inspired by the paper “An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications” by Hasan et al., which analyzed 15,964 tests across real agent projects.
