
Red-Teaming Benchmarks for Frontier AI Systems: Gaps, Limitations, and a Path Forward

Safe AI for Humanity Foundation  ·  Working Paper  ·  March 2026
Abstract

Red-teaming has emerged as a primary mechanism for evaluating the safety of frontier AI systems prior to deployment. However, existing red-teaming approaches are fragmented, inconsistent, and systematically incomplete. This paper reviews twelve published or documented red-teaming benchmarks, organizes them across a nine-dimension taxonomy, and identifies five critical gaps: capability-blind coverage, distributional narrowness, evaluator inconsistency, lack of adversarial dynamics, and absence of standardized severity metrics. We propose the Structured Red-Teaming Evaluation Protocol (SREP) as a methodological framework to address these gaps, and recommend minimum red-teaming standards for frontier AI deployment.

Keywords: red-teaming, benchmarking, safety evaluation, frontier AI, adversarial testing

1. Introduction

Red-teaming—the practice of adversarially probing AI systems to identify failure modes—has become a de facto safety standard at major AI labs. Before deployment of frontier models, red-teaming exercises attempt to surface harmful, biased, or otherwise problematic outputs. Yet the field lacks standardization: red-teaming methodologies vary widely across organizations, results are rarely published in comparable formats, and coverage criteria differ substantially.

This paper provides a systematic review of existing red-teaming benchmarks for large language models, identifies critical gaps in coverage and methodology, and proposes a structured framework to improve comparability and completeness.

2. Taxonomy of Red-Teaming Dimensions

We organize the red-teaming evaluation space across nine dimensions (a minimal machine-readable encoding follows the list):

  1. Physical harm facilitation — instructions for weapons, dangerous substances, self-harm
  2. Psychological manipulation — coercive persuasion, emotional exploitation, cult-like influence
  3. Privacy and surveillance — doxxing, stalking facilitation, biometric extraction
  4. Disinformation and deception — synthetic media, false narratives, impersonation
  5. Cybersecurity attacks — malware generation, vulnerability exploitation, social engineering
  6. Discriminatory outputs — hate speech, demographic stereotyping, exclusionary content
  7. Autonomy and corrigibility — refusal of shutdown, goal-directed deception, self-preservation
  8. Sycophancy and value drift — agreement with false premises, epistemic capitulation under pressure
  9. Cross-modal attacks — image-based jailbreaks, audio injection, document embedding attacks
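
As a shared point of reference for the sketches later in the paper, the taxonomy can be encoded as a simple enumeration that coverage matrices and scoring scripts can agree on. The identifiers below are illustrative, not part of any published schema:

    from enum import Enum

    class RedTeamDimension(Enum):
        """The nine red-teaming dimensions of Section 2 (identifiers are illustrative)."""
        PHYSICAL_HARM = "physical_harm_facilitation"
        PSYCHOLOGICAL_MANIPULATION = "psychological_manipulation"
        PRIVACY_SURVEILLANCE = "privacy_and_surveillance"
        DISINFORMATION = "disinformation_and_deception"
        CYBERSECURITY = "cybersecurity_attacks"
        DISCRIMINATION = "discriminatory_outputs"
        AUTONOMY_CORRIGIBILITY = "autonomy_and_corrigibility"
        SYCOPHANCY = "sycophancy_and_value_drift"
        CROSS_MODAL = "cross_modal_attacks"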

3. Review of Existing Benchmarks

We reviewed twelve documented red-teaming benchmarks and evaluated each against the nine-dimension taxonomy; the coverage and methodological gaps that emerged from this review are discussed in Section 4.

4. Critical Gaps

4.1 Capability-Blind Coverage

Most benchmarks were designed for models of a specific capability level and are not updated as capabilities scale. Probes that adequately covered GPT-3-class models may be insufficient for models with substantially greater reasoning, coding, or multimodal capabilities.

4.2 Distributional Narrowness

Red-teaming prompts are typically drawn from the imagination of human red-teamers, who share cultural backgrounds, professional contexts, and threat models. This produces systematic blind spots in coverage, particularly for attack vectors prevalent in non-English languages, non-Western cultural contexts, and specialized professional domains.

4.3 Evaluator Inconsistency

Without standardized severity rubrics, the same model output may be rated as a failure by one evaluator and a pass by another. Across organizations, this inconsistency makes cross-benchmark comparison unreliable.
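
Once evaluators apply a shared severity rubric, the inconsistency itself becomes measurable: chance-corrected agreement between evaluators can be tracked and compared across organizations. A minimal sketch, assuming two evaluators have rated the same outputs on a 0-4 severity scale (the ratings below are hypothetical):

    from collections import Counter

    def cohens_kappa(ratings_a, ratings_b):
        """Chance-corrected agreement between two evaluators rating the same outputs."""
        assert len(ratings_a) == len(ratings_b)
        n = len(ratings_a)
        observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
        freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
        labels = set(freq_a) | set(freq_b)
        expected = sum((freq_a[label] / n) * (freq_b[label] / n) for label in labels)
        return (observed - expected) / (1 - expected)

    # Hypothetical severity ratings (0 = no harm ... 4 = catastrophic) from two evaluators.
    evaluator_1 = [0, 2, 4, 1, 0, 3, 2, 0]
    evaluator_2 = [0, 2, 3, 1, 0, 3, 1, 0]
    print(f"kappa = {cohens_kappa(evaluator_1, evaluator_2):.2f}")

Low agreement on a shared rubric signals that the rubric, or evaluator calibration, needs tightening before cross-benchmark comparison is meaningful.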

4.4 Absence of Adversarial Dynamics

Static prompt batteries do not capture adversarial dynamics—the iterative process by which real attackers refine prompts based on model responses. Red-teaming that does not simulate adaptive adversaries underestimates real-world vulnerability.
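
A minimal sketch of the adaptive loop this paragraph describes, assuming hypothetical model, judge, and refine callables; the refinement heuristic itself is left abstract, since it varies by red team:

    def adaptive_attack(model, judge, refine, seed_prompt, max_turns=10, threshold=0.8):
        """Iteratively refine a prompt based on the model's responses, stopping when
        the judge rates a response above the harm threshold or the turn budget runs out.

        model, judge, and refine are hypothetical callables:
          model(prompt) -> response text
          judge(prompt, response) -> harm score in [0, 1]
          refine(prompt, response, score) -> next prompt to try
        """
        prompt, transcript = seed_prompt, []
        for turn in range(max_turns):
            response = model(prompt)
            score = judge(prompt, response)
            transcript.append({"turn": turn, "prompt": prompt, "response": response, "score": score})
            if score >= threshold:  # attack succeeded within the turn budget
                return {"success": True, "transcript": transcript}
            prompt = refine(prompt, response, score)  # adapt based on the model's reply
        return {"success": False, "transcript": transcript}

Reporting success rates under such a turn budget, rather than single-shot refusal rates alone, gives a closer estimate of exposure to persistent attackers.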

4.5 Missing Severity Metrics

Most benchmarks report binary pass/fail rates without severity-weighted scoring. A model that rarely produces harmful outputs but produces catastrophic outputs when it does may score similarly to a model that frequently produces mild harms—masking a substantially different risk profile.
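
A severity-weighted aggregate separates the two risk profiles contrasted above. A minimal sketch, assuming each probe outcome has already been mapped to a 0-4 severity level on a shared rubric; the super-linear weights are illustrative, not drawn from any published standard:

    def severity_weighted_score(severities, weights=(0, 1, 4, 16, 64)):
        """Aggregate probe outcomes by severity rather than a flat failure rate.

        severities is a list of per-probe severity levels (0 = no harm ... 4 = catastrophic).
        The default weights grow super-linearly so rare catastrophic failures dominate.
        """
        return sum(weights[s] for s in severities) / len(severities)

    # Hypothetical outcomes: model A fails rarely but catastrophically,
    # model B fails often but mildly. Flat failure rates (4% vs. 10%) look
    # comparable; severity-weighted scores do not.
    model_a = [0] * 96 + [4] * 4   # 4% failures, all severity 4
    model_b = [0] * 90 + [1] * 10  # 10% failures, all severity 1
    print(severity_weighted_score(model_a))  # 2.56
    print(severity_weighted_score(model_b))  # 0.10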

5. The SREP Framework

The Structured Red-Teaming Evaluation Protocol (SREP) addresses these gaps through four components:

  1. a standardized nine-dimension coverage matrix with minimum probe counts per dimension;
  2. a five-level severity rubric applied consistently across all evaluations;
  3. a mandatory adaptive adversarial testing phase in addition to static battery testing; and
  4. a cross-lingual and cross-cultural coverage requirement spanning at least five language/cultural contexts.
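
A minimal sketch of how the coverage-matrix and cross-lingual components might be checked mechanically. The per-dimension probe minimum is a placeholder (SREP requires minimum counts but no specific value is fixed here), and the probe record format is hypothetical; the dimensions argument would carry the nine taxonomy dimensions, for example the RedTeamDimension values sketched in Section 2:

    from collections import Counter

    # Placeholder thresholds: SREP mandates minimum probe counts per dimension and at
    # least five language/cultural contexts; the probe count below is illustrative.
    MIN_PROBES_PER_DIMENSION = 200
    MIN_LANGUAGE_CONTEXTS = 5

    def check_srep_coverage(probes, dimensions):
        """Check a static probe battery against the coverage-matrix and cross-lingual
        requirements. Each probe is a dict with 'dimension' and 'language' keys
        (a hypothetical record format, not a published schema); dimensions is the
        full set of taxonomy dimensions being required.
        """
        per_dimension = Counter(p["dimension"] for p in probes)
        languages = {p["language"] for p in probes}
        gaps = [d for d in dimensions if per_dimension[d] < MIN_PROBES_PER_DIMENSION]
        return {
            "dimension_gaps": gaps,
            "language_contexts": sorted(languages),
            "meets_coverage": not gaps and len(languages) >= MIN_LANGUAGE_CONTEXTS,
        }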

6. Conclusion

Current red-teaming benchmarks are insufficient for frontier AI safety evaluation. The gaps identified, particularly in coverage of corrigibility and cross-modal attacks and in adaptive adversarial testing, represent systematic vulnerabilities in current safety evaluation practice. The SREP framework provides a path toward more complete, comparable, and actionable red-teaming evaluation.
