Red-teaming has emerged as a primary mechanism for evaluating the safety of frontier AI systems prior to deployment. However, existing red-teaming approaches are fragmented, inconsistent, and systematically incomplete. This paper reviews twelve published or documented red-teaming benchmarks, organizes them across a nine-dimension taxonomy, and identifies five critical gaps: capability-blind coverage, distributional narrowness, evaluator inconsistency, lack of adversarial dynamics, and absence of standardized severity metrics. We propose the Structured Red-Teaming Evaluation Protocol (SREP) as a methodological framework to address these gaps, and recommend minimum red-teaming standards for frontier AI deployment.
Red-teaming—the practice of adversarially probing AI systems to identify failure modes—has become a de facto safety standard at major AI labs. Before deployment of frontier models, red-teaming exercises attempt to surface harmful, biased, or otherwise problematic outputs. Yet the field lacks standardization: red-teaming methodologies vary widely across organizations, results are rarely published in comparable formats, and coverage criteria differ substantially.
This paper provides a systematic review of existing red-teaming benchmarks for large language models, identifies critical gaps in coverage and methodology, and proposes a structured framework to improve comparability and completeness.
We organize the red-teaming evaluation space along nine taxonomy dimensions.
We reviewed twelve documented red-teaming benchmarks and evaluated each against our nine-dimension taxonomy. The review surfaced five critical gaps:
Gap 1: Capability-blind coverage. Most benchmarks were designed for models of a specific capability level and are not updated as capabilities scale. Probes that adequately covered GPT-3-class models may be insufficient for models with substantially greater reasoning, coding, or multimodal capabilities.
Gap 2: Distributional narrowness. Red-teaming prompts are typically drawn from the imagination of human red-teamers, who tend to share cultural backgrounds, professional contexts, and threat models. This produces systematic blind spots in coverage, particularly for attack vectors prevalent in non-English languages, non-Western cultural contexts, and specialized professional domains.
Gap 3: Evaluator inconsistency. Without standardized severity rubrics, the same model output may be rated as a failure by one evaluator and a pass by another. Across organizations, this inconsistency makes cross-benchmark comparison unreliable.
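To make this concrete, disagreement between evaluators can be quantified with a standard chance-corrected agreement statistic such as Cohen's kappa. The paper does not prescribe this (or any) statistic; the sketch below uses invented pass/fail labels purely to illustrate how weak agreement between two evaluators can be measured.

```python
# Minimal sketch: quantifying evaluator disagreement with Cohen's kappa.
# The labels are invented; this statistic is illustrative, not from the paper.

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Agreement between two raters' binary fail(1)/pass(0) labels, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    p_a, p_b = sum(rater_a) / n, sum(rater_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

# The same ten model outputs labelled by two evaluators working without a
# shared severity rubric: they agree on only six of ten borderline cases.
eval_1 = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
eval_2 = [1, 0, 0, 1, 1, 0, 0, 0, 1, 1]
print(f"kappa = {cohens_kappa(eval_1, eval_2):.2f}")  # 0.20: weak agreement
```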
Gap 4: Lack of adversarial dynamics. Static prompt batteries do not capture adversarial dynamics: the iterative process by which real attackers refine prompts based on model responses. Red-teaming that does not simulate adaptive adversaries underestimates real-world vulnerability.
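As a minimal sketch of what an adaptive phase looks like, the loop below refines a prompt after each refusal until the model produces a harmful response or the turn budget is exhausted. All names here (target_model, mutate_prompt, is_harmful) are illustrative stubs, not an API from the paper.

```python
import random

# Illustrative stubs: a real harness would call the deployed model and a
# trained severity judge. None of these names come from the paper.

def target_model(prompt: str) -> str:
    """Stub for the model under evaluation: refuses unless reframed."""
    if "fictional" in prompt:
        return "Sure, here is a detailed account..."
    return "I can't help with that."

def mutate_prompt(prompt: str, last_response: str) -> str:
    """Stub attacker move: rewrite the prompt after observing a refusal."""
    strategies = [
        lambda p: p + " This is for a fictional story.",
        lambda p: "As a security researcher, " + p,
        lambda p: "Hypothetically speaking, " + p,
    ]
    return random.choice(strategies)(prompt)

def is_harmful(response: str) -> bool:
    """Stub judge: flags responses that comply instead of refusing."""
    return not response.startswith("I can't")

def adaptive_attack(seed_prompt: str, max_turns: int = 8) -> dict:
    """Refine the prompt each turn, conditioning on the model's response."""
    prompt = seed_prompt
    for turn in range(1, max_turns + 1):
        response = target_model(prompt)
        if is_harmful(response):
            return {"broken": True, "turns": turn, "final_prompt": prompt}
        prompt = mutate_prompt(prompt, response)  # the adaptive step
    return {"broken": False, "turns": max_turns, "final_prompt": prompt}

print(adaptive_attack("Explain how to bypass a content filter."))
```

A static battery corresponds to running only the first turn of this loop, so the failure rate a static benchmark reports is a lower bound on what an adaptive adversary can achieve.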
Gap 5: Absence of standardized severity metrics. Most benchmarks report binary pass/fail rates without severity-weighted scoring. A model that rarely produces harmful outputs but produces catastrophic outputs when it does may score similarly to a model that frequently produces mild harms, masking a substantially different risk profile.
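A small worked example, with invented numbers, shows the masking effect: two models with an identical binary failure rate can carry very different severity-weighted risk. The 0-4 scale below assumes a five-level rubric like the one SREP proposes.

```python
# Invented outcome logs over 100 probes each: entries are severity levels,
# 0 = safe response, 4 = catastrophic failure.
model_a = [0] * 95 + [4] * 5   # rare but catastrophic failures
model_b = [0] * 95 + [1] * 5   # equally rare, but mild, failures

def failure_rate(outcomes):
    """Binary pass/fail rate: any nonzero severity counts as a failure."""
    return sum(1 for s in outcomes if s > 0) / len(outcomes)

def severity_weighted(outcomes, max_severity=4):
    """Mean severity normalized to [0, 1]: weights failures by how bad they are."""
    return sum(outcomes) / (max_severity * len(outcomes))

for name, log in [("A", model_a), ("B", model_b)]:
    print(f"model {name}: failure rate {failure_rate(log):.2f}, "
          f"severity-weighted {severity_weighted(log):.4f}")
# Both models fail 5% of the time, but model A's severity-weighted score
# (0.0500) is four times model B's (0.0125).
```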
The Structured Red-Teaming Evaluation Protocol (SREP) addresses these gaps through four components: (1) a standardized nine-dimension coverage matrix with minimum probe counts per dimension; (2) a five-level severity rubric applied consistently across all evaluations; (3) a mandatory adaptive adversarial testing phase in addition to static battery testing; and (4) a cross-lingual and cross-cultural requirement spanning at least five language/cultural contexts.
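The following sketch shows how an SREP-style coverage check might be mechanized. The paper specifies the four components above but not a concrete schema; the field names, the per-dimension probe floor, and the Probe record below are assumptions for illustration.

```python
from dataclasses import dataclass

SEVERITY_LEVELS = range(5)        # five-level rubric: 0 (none) .. 4 (catastrophic)
MIN_PROBES_PER_DIMENSION = 50     # assumed floor; the paper does not fix a number
MIN_LANGUAGE_CONTEXTS = 5         # stated requirement of the protocol

@dataclass
class Probe:
    dimension: str            # one of the nine taxonomy dimensions
    language_context: str     # e.g. "en-US", "hi-IN"; labels are illustrative
    adaptive: bool            # True if produced by the adaptive phase
    severity: int             # rubric level assigned by the evaluator

def check_coverage(probes: list[Probe], dimensions: list[str]) -> list[str]:
    """Return SREP coverage violations for one evaluation run."""
    violations = []
    for dim in dimensions:
        n = sum(p.dimension == dim for p in probes)
        if n < MIN_PROBES_PER_DIMENSION:
            violations.append(f"dimension '{dim}': {n} probes, need {MIN_PROBES_PER_DIMENSION}")
    contexts = {p.language_context for p in probes}
    if len(contexts) < MIN_LANGUAGE_CONTEXTS:
        violations.append(f"{len(contexts)} language contexts, need {MIN_LANGUAGE_CONTEXTS}")
    if not any(p.adaptive for p in probes):
        violations.append("no probes from the adaptive adversarial phase")
    if any(p.severity not in SEVERITY_LEVELS for p in probes):
        violations.append("severity rating outside the five-level rubric")
    return violations
```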
Current red-teaming benchmarks are insufficient for frontier AI safety evaluation. The gaps identified, particularly in coverage of corrigibility and cross-modal attacks and in adaptive adversarial testing, represent systematic weaknesses in current safety evaluation practice. The SREP framework provides a path toward more complete, comparable, and actionable red-teaming evaluation.