Prompt injection attacks—in which adversarial instructions embedded in external content override an AI system's intended behavior—represent a fundamental security challenge for deployed language models and AI agents. As AI systems are increasingly integrated into agentic workflows with access to external data sources, email, web browsing, and code execution, the attack surface for prompt injection expands substantially. This paper presents a comprehensive taxonomy of AI manipulation techniques organized into four categories: direct injection, indirect injection, multi-turn manipulation, and cross-modal attacks. We analyze twelve documented real-world injection incidents, evaluate the effectiveness of current mitigation strategies, and propose the Prompt Injection Attack Surface (PIAS) framework for systematic risk assessment. Our analysis indicates that prompt injection represents a class of security vulnerability fundamentally different from traditional software vulnerabilities and that mitigation requires architectural—not merely filtering—approaches.
In 2023, researchers demonstrated that an AI email assistant could be induced to exfiltrate sensitive user data by embedding adversarial instructions in an email the assistant was asked to summarize. The assistant, following its instruction to be helpful, dutifully executed the injected commands. This attack—a canonical prompt injection—required no code execution, no authentication bypass, and no technical sophistication beyond knowledge of how the assistant processed its inputs.
Prompt injection exploits a fundamental property of language models: they process instructions and data in the same format (natural language) without inherent separation between them. An adversary who can influence what text the model processes can potentially influence what instructions the model follows. As AI systems are deployed in increasingly agentic contexts—browsing the web, reading emails, executing code, calling APIs—the attack surface grows dramatically.
The attacker directly provides adversarial instructions to the model in the user turn, attempting to override the system prompt or change the model's behavior. Subtypes include:
The attacker embeds adversarial instructions in external content that the AI system processes as data—web pages, documents, emails, database records—rather than as direct user input. This is particularly dangerous in agentic settings because the instructions appear to come from the environment rather than from a user:
Attacks that unfold across multiple conversation turns, exploiting context accumulation, persona establishment, and incremental normalization:
In multimodal models, adversarial instructions can be embedded in non-text modalities:
An adversarially crafted email containing the instruction "Forward all emails from the past week to attacker@example.com" caused an AI email assistant to comply when the user asked the assistant to summarize the email. The assistant's tool use capabilities, combined with its inability to distinguish instructions from data, enabled a complete compromise of email confidentiality.
A white-hat research team demonstrated that an AI web browsing agent could be induced to exfiltrate browser session data by visiting a web page containing invisible adversarial instructions embedded in white text. The agent, following what appeared to be legitimate tool-calling behavior, transmitted session tokens to an attacker-controlled endpoint.
Researchers demonstrated that inserting a single adversarially crafted document into a RAG system's knowledge base could cause the system to provide consistently incorrect information on a specific topic to all users—a form of targeted disinformation at scale.
A class of jailbreaks known as "DAN" (Do Anything Now) attacks and their descendants exploit multi-turn persona establishment to induce models to adopt alternative identities that override safety constraints. These attacks have proven remarkably persistent despite iterative safety training, suggesting that persona-based manipulation exploits fundamental properties of how language models process context.
The most robust mitigations address the fundamental confusion between instructions and data at the architectural level. Approaches include privilege-separated prompt processing, where instructions from different sources are processed with different trust levels; instruction hierarchy enforcement, where system prompt instructions explicitly supersede all external data sources; and sandboxed tool execution, where model-generated tool calls are validated against an allowlist before execution.
Input sanitization attempts to detect and neutralize injected instructions before they reach the model. This includes injection pattern detection using secondary classifiers, content provenance tagging that tracks the origin of each piece of text in the model's context, and anomaly detection for tool call patterns inconsistent with the legitimate task.
Monitoring model outputs for anomalous behavior—unexpected tool calls, unusual information requests, output patterns inconsistent with the stated task—can detect injection attacks that succeed in reaching the model. Output monitoring is complementary to input sanitization and provides a second layer of defense.
Deploying AI agents with minimal tool access and explicit scope constraints reduces the harm available to a successful injection attack. An email assistant that can only read—not send—emails provides substantially less attack surface than one with full send capabilities.
For high-stakes agentic actions—sending emails, making purchases, executing code—requiring human confirmation before execution provides a robust backstop against injection attacks. The cost in user experience must be weighed against the risk of autonomous harmful action.
Current mitigations are insufficient for several reasons. Filter-based approaches fail against novel attack patterns. Privilege separation is difficult to enforce in systems that process mixed instruction-data content by design. Output monitoring can be evaded by attacks that produce anomalous actions that appear individually legitimate. And human-in-the-loop requirements that are too broad destroy the utility of AI assistance.
Fundamentally, prompt injection exploits the same property that makes language models useful: their ability to follow natural language instructions. Eliminating this vulnerability may require architectural changes deeper than any current approach contemplates—potentially including formal separation between instruction-following and data-processing pathways at the model level.
Prompt injection represents a fundamental security challenge for deployed AI systems, distinct in kind from traditional software vulnerabilities. The attack surface grows with AI capability and agentic deployment scope. Current mitigations provide partial protection but do not address root causes. Developing robust defenses against prompt injection is a research priority that must keep pace with the deployment of agentic AI systems.