← Back to Blog
·8 min read·Ruakiel Team

FILTERING AT ALL FOUR AI AGENT DATA BOUNDARIES

Secrets and adversarial content arrive through user prompts, tool arguments, tool results, and model responses — four distinct boundaries each requiring independent filtering. Here is why single-point defenses fail and what all-boundary enforcement looks like in practice.

SecurityFirewallDefense in Depth

THE PROBLEM WITH SINGLE-POINT FILTERING

Most teams add a content filter at the most obvious place: the user input. If someone types a secret into the chat interface, the filter catches it and stops the request. Problem solved.

Except secrets and adversarial content do not only arrive through the user. They arrive through tool results. A database query returns a record containing an embedded API key. A web scraper fetches a page containing injected instructions. A retrieval step returns a document that includes both legitimate content and a hidden command. The agent processes all of this in the same context window — indistinguishable from trusted input.

An AI agent operates across four distinct data boundaries. Filtering at only one of them is not defense in depth. It is defense at one point and an open path everywhere else.

THE FOUR BOUNDARIES

Data flows into and out of an agent run at four distinct points. Each is an independent interception opportunity — and an independent attack surface.

Boundary 1: User prompt → Model Inbound policy fires here User input, conversation history Boundary 2: Model → Tool input Pre-execution policy fires here Tool call arguments constructed by the model Boundary 3: Tool result → Model Post-execution policy fires here Database results, API responses, retrieved content Boundary 4: Model output → User Outbound policy fires here Final model response text

Each boundary is architecturally distinct. Filtering at one does not protect the others.

WHAT EACH BOUNDARY PROTECTS

1. User Prompt Input

The most obvious boundary — but still frequently skipped in favor of assuming users are trusted. Filtering here catches secrets in the user message before they reach the model, preventing inadvertent inclusion in logs, traces, and conversation history. It also catches direct injection attempts: adversarial instructions embedded in the user prompt designed to override the agent’s system behavior.

2. Tool Input Arguments

The model constructs tool arguments from context. If the context contains leaked credentials or injected instructions, those can propagate into tool calls — causing a database query to contain an injected payload, or a write tool to receive malformed input. Filtering here intercepts what the model does with bad input before it reaches external systems.

3. Tool Result Content

This is the highest-risk boundary in enterprise deployments. Tool results are treated by the model as trusted context — which makes them the preferred vector for indirect prompt injection. A result that contains a secret and returns it into the model’s context window will end up in the final response. A result containing embedded instructions may redirect agent behavior entirely. Filtering here prevents untrusted tool output from injecting into the agent’s reasoning.

4. Model Response Output

The final gate. Even after filtering inputs and results, the model may surface sensitive content it retrieved earlier — credentials from a tool result it processed ten steps back, or PII from a document it summarized mid-run. Output filtering is the last opportunity to prevent that content from reaching the user or downstream systems.

THE INDIRECT INJECTION PROBLEM

Boundaries 3 and 4 exist specifically because of indirect injection — the most dangerous attack vector in agentic systems and the one most implementations ignore.

Direct injection is easy to understand: a user sends a malicious prompt. Indirect injection is more subtle: a tool result, memory record, or retrieved document contains instructions that the model treats as authoritative. The user did not write it. The agent fetched it from a source it was told to trust.

The correct countermeasure is to treat all tool output as untrusted at the architecture layer — regardless of whether the model treats it as trusted. That means filtering at boundary 3.

HOW RUAKIEL IMPLEMENTS IT

Ruakiel’s firewall operates as a native capability wired into the agent lifecycle. It intercepts all four boundaries by implementing the corresponding hooks that the agent framework exposes.

Tenant administrators configure rules that apply block, redact, or flag actions to matched patterns. System rules — seeded automatically for every tenant — cover credentials, API keys, private keys, connection strings, and common secret formats. Tenant rules extend or override these without touching system defaults.

Rules are loaded once per agent run, not per hook invocation. This matters at production scale: an agent that calls twelve tools in a single run should pay one database read, not twelve.

Per-run lifecycle: agent_start → load rules once from DB → cache boundary 1 hook → filter prompt ↑ rules cached boundary 2 hook → filter tool args ↑ rules cached boundary 3 hook → filter tool result ↑ rules cached boundary 4 hook → filter response ↑ rules cached

THREE ACTIONS, NOT ONE

Filtering is not binary. Each firewall rule specifies one of three actions:

  • Block. The content is replaced with a notice. The agent or user receives a message explaining that the content was blocked. This is appropriate for content that should never transit the system under any circumstances.
  • Redact. The matched pattern is replaced with a placeholder (e.g. [REDACTED]). The rest of the content is preserved and continues through the pipeline. This is appropriate for secrets that may appear in legitimate content.
  • Flag.The content passes through unchanged but is logged with a structured security event. This is appropriate for content that is worth auditing but should not interrupt the agent’s operation.

All events — block, redact, flag — are emitted as structured log entries with tenant ID, request ID, boundary, and rule metadata. This makes the firewall auditable, not just defensive.

THE PRINCIPLE

Data that enters or exits an AI agent cannot be trusted to be clean. Input may contain secrets the user typed without thinking. Tool results may contain secrets the database returned without concern for who is reading them. The model may reproduce either. None of these assumptions can be fixed by a better system prompt.

The architecture is the enforcement layer. Applying rules at every boundary — before and after, input and output — is what makes that enforcement reliable. Filtering at one boundary and trusting the rest is not defense in depth. It is a single gate on a four-sided perimeter.