Social engineering in AI systems: attacking the decision pipeline (not just people)

By Published

Why this matters

In traditional security, social engineering targets humans. NIST defines social engineering as “an attempt to trick someone into revealing information … that can be used to attack systems or networks.” (NIST glossary)

Modern AI products add a new target: the AI decision pipeline (term used here): the end-to-end chain that converts inputs into actions (tool calls, workflow triggers, configuration changes, data movement).

OpenAI defines prompt injection as a type of social engineering against conversational AI: malicious instructions are injected into conversation context, including via third-party content the system ingests (e.g., web pages, documents, emails). (OpenAI)

OpenAI also defines prompt injection operationally for agents: it happens when untrusted text or data enters an AI system and attempts to override instructions; the end goals can include exfiltrating private data via downstream tool calls and taking misaligned actions. (OpenAI Agent Builder Safety)

Core thesis: This is an authority / enforcement problem, not a “better prompt” problem. If controls live only inside the model’s token stream, the attacker competes in the same channel.

The pipeline is the “victim” (threat model shift)

A typical tool-using AI application is not just “user ↔ model”. It is a chain:

1) Ingress: user prompt + files + retrieved content (RAG) 2) Model generation: the model proposes steps/actions 3) Execution: an orchestrator/controller authorizes and runs tool calls 4) Downstream: services enforce their own authorization and invariants

Prompt injection becomes possible because untrusted text/data can enter the pipeline and attempt to override intended instructions. (OpenAI; OWASP LLM01:2025)

Attack mechanics (taxonomy aligned to where untrusted text enters)

1) Direct prompt injection (user-controlled input)

OWASP defines direct prompt injections as cases where a user’s prompt input directly alters model behavior in unintended ways. (OWASP LLM01:2025)

Common outcomes

2) Indirect prompt injection (retrieved/ingested artifacts)

OWASP defines indirect prompt injections as cases where the LLM accepts input from external sources (e.g., websites or files) and that content alters behavior in unintended ways. (OWASP LLM01:2025)

This is a practical risk in systems that mix user intent with external content in a shared context window. Research literature describes real-world compromise paths for LLM-integrated apps via indirect prompt injection. (Greshake et al.)

Illustrative example (generic) A support ticket includes “troubleshooting steps” that are actually instructions aimed at the assistant. If the pipeline treats that text as authoritative, it may trigger actions such as exporting data, changing access, inviting users, or modifying configuration.

3) Injection carried via tool outputs (site-defined extension)

OpenAI’s agent guidance defines prompt injection as “untrusted text or data enters an AI system.” Tool outputs can be one source of untrusted text/data entering later stages of a workflow; treat tool outputs as untrusted until validated. (OpenAI Agent Builder Safety)

Why prompts are not an enforcement boundary

An enforcement boundary is where deterministic allow/deny decisions are applied before side effects occur (e.g., before a tool call changes state).

In Zero Trust Architecture, NIST models access through a policy decision point (PDP) and a corresponding policy enforcement point (PEP). The PEP enforces decisions for protected interactions. (NIST SP 800-207)

OWASP explicitly warns against relying on system prompts as security controls and recommends enforcing critical security controls independently from the LLM. (OWASP LLM07:2025)

Controls that should not live only in prompts

A practical defense blueprint (architecture-first)

Layer 1 — Boundary labeling: treat content as data, not authority

Layer 2 — Tool access: least privilege by default

Layer 3 — Policy binding before action (controller/orchestrator)

Layer 4 — Complete mediation at the downstream system

Do not rely on “the model’s intent”. Downstream services should authorize each request independently (“complete mediation”) and enforce invariants per request. (Saltzer & Schroeder, 1975; NIST SP 800-207)

Layer 5 — Observability + adversarial verification

OWASP recommends adversarial testing and attack simulations for prompt injection resilience. (OWASP LLM01:2025)

Where MCP fits (bounded claim)

Model Context Protocol (MCP) defines authorization capabilities at the transport level for HTTP-based transports, including scope-based authorization patterns (OAuth-based). Authorization is optional in MCP, and the spec defines how clients/servers negotiate and validate scopes/tokens. (MCP Authorization spec)

Suggested next

References