Security March 2026

Information Poisoning in AI Training and Agent Pipelines: Threats and Defenses

Safe AI for Humanity Foundation · Working Paper · March 2026

Abstract

Information poisoning attacks compromise AI systems by corrupting the data they learn from or the information they access at inference time. Unlike prompt injection attacks that target the inference interface, poisoning attacks operate upstream—in training data, fine-tuning datasets, retrieval knowledge bases, and agent context pipelines—making them harder to detect and potentially more persistent. This paper presents a unified framework for information poisoning threats across the AI deployment lifecycle, organized into four categories: training data poisoning, fine-tuning poisoning, retrieval-augmented generation (RAG) poisoning, and agent pipeline context poisoning. For each category, we analyze attack mechanisms, documented incidents, and available defenses. We propose the Information Poisoning Threat Decomposition (IPTD) framework for systematic threat modeling and introduce detection techniques applicable across poisoning contexts.

data poisoning training poisoning backdoor attacks RAG security agent pipelines supply chain

1. Introduction

In 2021, researchers demonstrated that inserting fewer than 100 carefully crafted examples into a large language model's pretraining corpus could cause the model to reliably produce attacker-specified outputs when presented with a specific trigger phrase—without any change in behavior on normal inputs. This backdoor attack illustrated a disturbing asymmetry: a tiny, carefully placed poison can have outsized, persistent effects on a model trained on billions of tokens.

Information poisoning is broader than backdoor attacks. It encompasses any attack that corrupts the information an AI system learns from or operates on, with the goal of degrading performance, introducing biases, implanting backdoors, or enabling attacker control. As AI systems are trained on increasingly large and unvetted datasets, deployed with retrieval systems that access external knowledge, and embedded in agent pipelines that process environmental information, the attack surface for information poisoning grows substantially.

2. The Information Poisoning Threat Decomposition (IPTD)

The IPTD framework organizes poisoning threats by the stage of the AI deployment lifecycle they target:

IPTD: INFORMATION POISONING THREAT DECOMPOSITION Stage Attack Category Persistence Detection Difficulty ────────────────────────────────────────────────────────────────────────────── Pretraining Training data poison Very High Very High Fine-tuning Dataset poisoning High High RAG / Retrieval Knowledge base poison Medium Medium Agent pipelines Context manipulation Low-Medium Medium ────────────────────────────────────────────────────────────────────────────── Figure 1: IPTD Stage-Attack Matrix

3. Training Data Poisoning

3.1 Backdoor Attacks

Backdoor attacks insert trigger-response pairs into training data that cause the model to exhibit attacker-specified behavior when a trigger is present, while behaving normally otherwise. The trigger can be a specific phrase, token sequence, or even a stylistic feature. Because the backdoor is encoded in model weights, it persists through inference, deployment, and even some forms of fine-tuning.

Modern large-scale pretraining uses data scraped from the public web, creating massive attack surface: any actor who can publish web content can attempt to influence model training. While the scale of pretraining data provides some natural dilution resistance, targeted attacks with high-repetition poisoning have been shown effective at surprisingly low contamination rates.

3.2 Bias Injection

Bias injection poisons training data to systematically skew model outputs along demographic, political, or ideological dimensions. Unlike backdoor attacks that require specific triggers, bias injection produces systematic rather than conditional effects. Detection is more difficult because biased outputs may not be obviously anomalous—they may simply reflect a subtly shifted distribution of beliefs or associations.

3.3 Capability Degradation

Adversaries with access to training pipelines can introduce poisoned data designed to degrade specific capabilities—reducing performance on certain task types, languages, or domains. This attack is particularly relevant for AI systems trained or fine-tuned by organizations using external data suppliers or crowdsourced annotation pipelines.

4. Fine-Tuning Dataset Poisoning

Fine-tuning datasets are typically smaller than pretraining corpora, reducing the dilution protection of scale and making them more susceptible to targeted poisoning. RLHF preference datasets are particularly attractive targets: a small number of poisoned preference pairs can systematically shift reward model outputs, which in turn shift policy behavior.

Annotation pipeline attacks target the humans who generate fine-tuning data. By creating annotation tasks designed to elicit specific behaviors, attackers can influence training data generation without ever accessing the training pipeline directly. Supply chain attacks target data vendors, annotation platforms, or preprocessing pipelines to insert poison at scale.

5. RAG Knowledge Base Poisoning

Retrieval-augmented generation systems extend language models with external knowledge retrieved at inference time. The knowledge base—a corpus of documents indexed for semantic similarity search—represents a new attack surface: any document that can be inserted into or retrieved from the knowledge base can influence model outputs.

5.1 Document Injection

An attacker who can insert documents into a RAG knowledge base can cause the system to retrieve and incorporate adversarial information in response to targeted queries. This enables targeted disinformation (incorrect answers to specific questions), capability suppression (preventing accurate answers by flooding retrieval results with misleading documents), and indirect prompt injection (embedding adversarial instructions in retrieved documents).

5.2 Adversarial Document Crafting

Beyond inserting new documents, attackers can craft documents designed to score highly on semantic similarity to targeted queries while containing misleading or adversarial content. These documents exploit gaps between the semantic similarity metric used for retrieval and the factual content the metric is intended to proxy.

6. Agent Pipeline Context Poisoning

AI agents that operate in dynamic environments—reading files, browsing the web, querying databases, receiving tool outputs—process information from sources outside the developer's control. This creates persistent context poisoning opportunities: adversarial content in any information source the agent processes can influence agent behavior.

Unlike single-turn injection attacks, agent pipeline poisoning can persist across multiple turns by influencing the agent's accumulated context, memory systems, or planning outputs. An agent that reads a poisoned document early in a task may carry the influence of that document throughout the rest of the task, even after the document is no longer in its immediate context window.

7. Defenses

7.1 Data Provenance and Integrity

Maintaining cryptographic provenance chains for training data, fine-tuning datasets, and RAG knowledge base contents allows detection of unauthorized modifications and attribution of data sources. Provenance tracking is most valuable when combined with access controls that prevent unauthorized data insertion.

7.2 Anomaly Detection in Training Data

Statistical anomaly detection can identify data subsets that are outliers in semantic space, temporal distribution, or stylistic features—potential indicators of injected poison. Targeted detection methods specific to backdoor attacks, including activation pattern analysis and trigger scanning, provide additional coverage.

7.3 Robust Training Methods

Training methods designed for robustness to poisoning include: data sanitization pipelines that filter likely-poisoned examples; differential privacy training that limits the influence of any individual data point; and ensemble methods that reduce susceptibility to coordinated attacks on any single training run.

7.4 RAG Content Validation

Knowledge base content validation—including automated factuality checking, source credibility assessment, and adversarial content detection—reduces the impact of document injection attacks. Citation verification, where retrieved documents are cross-checked against authoritative sources before use, provides additional protection.

7.5 Agent Context Monitoring

Monitoring agent context for anomalous instruction patterns, unexpected information requests, and behavioral deviations from task scope can detect context poisoning attacks. Human-in-the-loop review at critical decision points provides a backstop for high-stakes agentic actions.

8. Conclusion

Information poisoning represents an underappreciated class of AI security threat. Unlike prompt injection attacks that are typically detected and mitigated at inference time, poisoning attacks operate upstream and can produce persistent, systematic effects that survive deployment. As AI systems are trained on larger and less curated datasets and deployed in increasingly agentic contexts, the threat surface for information poisoning expands. Developing comprehensive defenses requires a full lifecycle view of AI system security, from training data curation through deployment monitoring.

References

Bagdasaryan, E. et al. (2020). How to backdoor federated learning. AISTATS 2020.

Carlini, N. & Terzis, A. (2021). Poisoning and backdooring contrastive learning. ICLR 2022.

Chen, X. et al. (2017). Targeted backdoor attacks on deep learning systems using data poisoning. arXiv:1712.05526.

Garg, S. et al. (2020). Can adversarial weight perturbations inject neural backdoors? arXiv:2008.01761.

Greshake, K. et al. (2023). Not what you signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. arXiv:2302.12173.

Perez, E. et al. (2022). Red teaming language models with language models. arXiv:2202.03286.

Schuster, R. et al. (2021). You autocomplete me: Poisoning vulnerabilities in neural code completion. USENIX Security 2021.

Shafahi, A. et al. (2018). Poison frogs! Targeted clean-label poisoning attacks on neural networks. NeurIPS 2018.

Wallace, E. et al. (2021). Concealed data poisoning attacks on NLP models. NAACL 2021.

Wan, A. et al. (2023). Poisoning language models during instruction tuning. ICML 2023.