A Motivational Example
John built a travel-planning AI agent that worked flawlessly during development.
Three weeks after release, users suddenly reported:
- repeated itinerary steps
- broken tool calls
- incorrect weather reasoning
- the agent getting stuck in loops
John’s code hadn’t changed.
The foundation model (FM) had. And because his tests only checked exact outputs, nothing caught the regression.
This is the modern testing problem:
You’re not just testing code anymore.
You’re testing behavior.
Why Agent Testing Is Different
Agents introduce variability everywhere:
- prompts
- FM reasoning
- tool usage
- branching workflows
- external context
- FM updates
Traditional:
assertEqual(actual, expected)
Agent reality:
assert "intent" in output.lower()
You shift from exact outputs → signals of correct behavior.
Where Edge Cases Come From (4 Sources)
- Prompt brittleness
- Tool invocation failures
- Workflow loops
- FM non-determinism
These are the real failure modes of agent systems.
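The last two are especially testable. As a sketch of a loop guard, assume a hypothetical Agent whose run result exposes a trace of tool-call steps (max_steps, result.steps, step.tool, and step.args are stand-ins for whatever your framework provides):

def test_agent_terminates(agent):
    # A query that previously triggered the looping bug.
    result = agent.run("plan a 3-day Tokyo itinerary", max_steps=10)
    # Signal 1: the run stayed within its step budget.
    assert len(result.steps) <= 10
    # Signal 2: no tool was called twice with identical arguments.
    calls = [(step.tool, repr(step.args)) for step in result.steps]
    assert len(calls) == len(set(calls))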
How to Unit Test Agent Behavior (Compact Guide)
1. Freeze randomness (deterministic mode)
agent = Agent(..., temperature=0)
This makes tests reproducible in principle; hosted FMs can still vary slightly between calls, so treat temperature=0 as reducing noise, not eliminating it.
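In pytest, a shared fixture keeps that configuration in one place. A minimal sketch, assuming a hypothetical Agent constructor; MODEL_ID is a placeholder, and the seed parameter only helps where the provider supports one:

import pytest

MODEL_ID = "your-model-here"  # placeholder

@pytest.fixture
def agent():
    # temperature=0 plus a fixed seed pins down sampling where supported
    return Agent(model=MODEL_ID, temperature=0, seed=42)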
2. Test signals, not sentences
response = agent.run("Find flights to LAX")
assert "flight" in response.lower()
assert "lax" in response.lower()
3. Parameterize inputs
@pytest.mark.parametrize("query", [
    "find flights",
    "get flights",
    "need flights",
])
def test_flight_query(agent, query):
    assert "flight" in agent.run(query).lower()
4. Mock tools, not the FM
mock_price_api.return_value = {"total": 199}
response = agent.run("book a flight")
assert "199" in response
5. Include negative tests
with pytest.raises(ToolResponseError):
    agent.run("query with invalid schema")
The Blind Spot: Prompt Testing
A recent study found:
Only ~1% of tests examine prompt behavior.
But prompts break silently when:
- the FM updates
- tools change
- workflows shift
A simple prompt regression test:
response = agent.run("summarize: apple")
assert "fruit" in response.lower()
The Mindset Shift
Assertions validate outputs.
Signals validate behavior.
Unit testing agents is about defining what “good reasoning” looks like — and checking for it.
Credit
This post is inspired by the paper “An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications” by Hasan et al., which analyzed 15,964 tests across real agent projects.
