FILTERING AT ALL FOUR AI AGENT DATA BOUNDARIES
Secrets and adversarial content arrive through user prompts, tool arguments, tool results, and model responses — four distinct boundaries each requiring independent filtering. Here is why single-point defenses fail and what all-boundary enforcement looks like in practice.
THE PROBLEM WITH SINGLE-POINT FILTERING
Most teams add a content filter at the most obvious place: the user input. If someone types a secret into the chat interface, the filter catches it and stops the request. Problem solved.
Except secrets and adversarial content do not only arrive through the user. They arrive through tool results. A database query returns a record containing an embedded API key. A web scraper fetches a page containing injected instructions. A retrieval step returns a document that includes both legitimate content and a hidden command. The agent processes all of this in the same context window — indistinguishable from trusted input.
An AI agent operates across four distinct data boundaries. Filtering at only one of them is not defense in depth. It is defense at one point and an open path everywhere else.
THE FOUR BOUNDARIES
Data flows into and out of an agent run at four distinct points. Each is an independent interception opportunity — and an independent attack surface.
Boundary 1: User prompt → Model
Inbound policy fires here
User input, conversation history
Boundary 2: Model → Tool input
Pre-execution policy fires here
Tool call arguments constructed by the model
Boundary 3: Tool result → Model
Post-execution policy fires here
Database results, API responses, retrieved content
Boundary 4: Model output → User
Outbound policy fires here
Final model response textEach boundary is architecturally distinct. Filtering at one does not protect the others.
WHAT EACH BOUNDARY PROTECTS
1. User Prompt Input
The most obvious boundary — but still frequently skipped in favor of assuming users are trusted. Filtering here catches secrets in the user message before they reach the model, preventing inadvertent inclusion in logs, traces, and conversation history. It also catches direct injection attempts: adversarial instructions embedded in the user prompt designed to override the agent’s system behavior.
2. Tool Input Arguments
The model constructs tool arguments from context. If the context contains leaked credentials or injected instructions, those can propagate into tool calls — causing a database query to contain an injected payload, or a write tool to receive malformed input. Filtering here intercepts what the model does with bad input before it reaches external systems.
3. Tool Result Content
This is the highest-risk boundary in enterprise deployments. Tool results are treated by the model as trusted context — which makes them the preferred vector for indirect prompt injection. A result that contains a secret and returns it into the model’s context window will end up in the final response. A result containing embedded instructions may redirect agent behavior entirely. Filtering here prevents untrusted tool output from injecting into the agent’s reasoning.
4. Model Response Output
The final gate. Even after filtering inputs and results, the model may surface sensitive content it retrieved earlier — credentials from a tool result it processed ten steps back, or PII from a document it summarized mid-run. Output filtering is the last opportunity to prevent that content from reaching the user or downstream systems.
THE INDIRECT INJECTION PROBLEM
Boundaries 3 and 4 exist specifically because of indirect injection — the most dangerous attack vector in agentic systems and the one most implementations ignore.
Direct injection is easy to understand: a user sends a malicious prompt. Indirect injection is more subtle: a tool result, memory record, or retrieved document contains instructions that the model treats as authoritative. The user did not write it. The agent fetched it from a source it was told to trust.
The correct countermeasure is to treat all tool output as untrusted at the architecture layer — regardless of whether the model treats it as trusted. That means filtering at boundary 3.
HOW RUAKIEL IMPLEMENTS IT
Ruakiel’s firewall operates as a native capability wired into the agent lifecycle. It intercepts all four boundaries by implementing the corresponding hooks that the agent framework exposes.
Tenant administrators configure rules that apply block, redact, or flag actions to matched patterns. System rules — seeded automatically for every tenant — cover credentials, API keys, private keys, connection strings, and common secret formats. Tenant rules extend or override these without touching system defaults.
Rules are loaded once per agent run, not per hook invocation. This matters at production scale: an agent that calls twelve tools in a single run should pay one database read, not twelve.
Per-run lifecycle:
agent_start → load rules once from DB → cache
boundary 1 hook → filter prompt ↑ rules cached
boundary 2 hook → filter tool args ↑ rules cached
boundary 3 hook → filter tool result ↑ rules cached
boundary 4 hook → filter response ↑ rules cachedTHREE ACTIONS, NOT ONE
Filtering is not binary. Each firewall rule specifies one of three actions:
- Block. The content is replaced with a notice. The agent or user receives a message explaining that the content was blocked. This is appropriate for content that should never transit the system under any circumstances.
- Redact. The matched pattern is replaced with a placeholder (e.g.
[REDACTED]). The rest of the content is preserved and continues through the pipeline. This is appropriate for secrets that may appear in legitimate content. - Flag.The content passes through unchanged but is logged with a structured security event. This is appropriate for content that is worth auditing but should not interrupt the agent’s operation.
All events — block, redact, flag — are emitted as structured log entries with tenant ID, request ID, boundary, and rule metadata. This makes the firewall auditable, not just defensive.
THE PRINCIPLE
Data that enters or exits an AI agent cannot be trusted to be clean. Input may contain secrets the user typed without thinking. Tool results may contain secrets the database returned without concern for who is reading them. The model may reproduce either. None of these assumptions can be fixed by a better system prompt.
The architecture is the enforcement layer. Applying rules at every boundary — before and after, input and output — is what makes that enforcement reliable. Filtering at one boundary and trusting the rest is not defense in depth. It is a single gate on a four-sided perimeter.