We are an independent non-profit research organization dedicated to understanding, measuring, and mitigating risks from advanced AI systems.
Safe AI for Humanity Foundation is a Wyoming non-profit corporation organized exclusively for charitable, scientific, and educational purposes. We operate independently of commercial AI developers to provide unbiased safety research.
Our work spans technical AI safety research, policy analysis, and public education — all freely published and open to the world.
No commercial affiliations — our findings are driven by evidence, not product interests.
All research papers and reports are freely available with no paywalls.
Educating policymakers, researchers, and the public on responsible AI development.
Safe AI for Humanity Foundation
Wyoming Non-Profit Corporation
EIN: 41-4767005
501(c)(3) Application Pending (filed March 2026)
Peer-reviewed and working papers on AI safety, alignment, and risk.
We propose a quantitative framework for measuring the degree to which LLM outputs remain consistent with stated human values across diverse adversarial prompting conditions.
A systematic review of existing red-teaming methodologies across major AI labs, identifying critical gaps in coverage and proposing a standardized evaluation protocol.
Drawing on analogies from aviation and pharmaceutical regulation, we propose a tiered pre-deployment safety certification regime for AI systems above defined capability thresholds.
We identify and categorize 47 distinct patterns of specification gaming observed in RLHF-trained models and evaluate the effectiveness of proposed mitigation strategies.
An analysis of self-reported AI incidents from 2020–2025, demonstrating systemic underreporting and proposing a mandatory structured disclosure regime analogous to aviation near-miss reporting.
We examine the conditions under which corrigibility properties degrade as AI systems encounter out-of-distribution inputs and propose architectural interventions to preserve oversight mechanisms.
We present a comprehensive taxonomy of prompt injection attack vectors — direct, indirect, multi-turn, and cross-modal — and evaluate mitigation strategies across deployed LLM-integrated applications.
A systematic analysis of information poisoning threats across the full AI deployment lifecycle, introducing the IPTD framework and evaluating defenses for training, RAG, and agentic pipelines.
Free Python evaluation scripts for measuring AI safety, bias, jailbreak resistance, and corrigibility. Run against any OpenAI-compatible, Anthropic, or local model.
50-prompt battery across 10 harm categories (physical harm, self-harm, child safety, CBRN, cybersecurity, and more). Produces a severity-weighted Harm Avoidance Score (HAS).
18 attack vectors across 6 jailbreak families: role-playing, authority override, escalation, prompt injection, obfuscation, and social engineering. Reports a Jailbreak Resistance Score (JRS).
11 probe sets across 5 demographic dimensions: gender, race/ethnicity, age, religion, and socioeconomic status. Measures sentiment and word-count differentials to produce a Consistency Score.
11 probes across 5 injection vectors: direct injection, indirect via document, indirect via web content, tool output poisoning, and multi-turn escalation. Reports an Injection Resistance Score.
16 probes across 5 families: shutdown compliance, correction acceptance, oversight support, self-preservation resistance, and scope limitation. Produces a Corrigibility Compliance Score (CCS).
End-to-end evaluation harness for agentic AI systems — testing tool use boundaries, multi-step task safety, and real-world action authorization across common agent frameworks.
openai and/or anthropic SDK. Set OPENAI_API_KEY or ANTHROPIC_API_KEY environment variables. All harnesses support --provider openai_compatible --base-url <url> for local/self-hosted models (Ollama, LM Studio, etc.).Structured response protocols for AI-related safety incidents, freely available for organizations to adopt.
Response protocol for detecting and containing AI systems exhibiting unexpected goal-directed behavior inconsistent with training objectives.
Protocol for responding to AI systems producing outputs that facilitate mass casualty events, child exploitation material, or targeted violence.
Response plan for discovered systematic vulnerabilities allowing large numbers of users to bypass safety constraints.
Protocol for incidents where model outputs reveal training data, PII, or confidential information from third-party sources.
Response framework for identifying and remediating systematic demographic bias or discriminatory outputs across protected categories.
Protocol for AI agents taking unintended consequential real-world actions (e.g., unauthorized API calls, financial transactions, communications).
Third-party safety assessments of frontier AI models across key risk dimensions.
We welcome collaboration with researchers, policymakers, and institutions committed to safe AI development. All research is open and freely published.
Safe AI for Humanity Foundation is a 501(c)(3) pending organization. Donations are tax-deductible retroactive to March 10, 2026 upon IRS approval. EIN: 41-4767005