Safe AI for Humanity Foundation

About Us

Independent Research for a Safer AI Future

Safe AI for Humanity Foundation is a Wyoming non-profit corporation organized exclusively for charitable, scientific, and educational purposes. We operate independently of commercial AI developers to provide unbiased safety research.

Our work spans technical AI safety research, policy analysis, and public education — all freely published and open to the world.

🔬

Independent Research

No commercial affiliations — our findings are driven by evidence, not product interests.

📖

Open Publication

All research papers and reports are freely available with no paywalls.

🌍

Public Benefit

Educating policymakers, researchers, and the public on responsible AI development.

Legal Information

Safe AI for Humanity Foundation
Wyoming Non-Profit Corporation
EIN: 41-4767005
IRS 501(c)(3) Approved · April 2026

Contact

info@ai-4-h.org

Research Papers

Published Research

Peer-reviewed and working papers on AI safety, alignment, and risk.

AlignmentMar 2026

Toward Measurable Alignment: A Framework for Evaluating Value Consistency in Large Language Models

We propose a quantitative framework for measuring the degree to which LLM outputs remain consistent with stated human values across diverse adversarial prompting conditions.

Safe AI for Humanity FoundationRead paper →

EvaluationMar 2026

Red-Teaming Benchmarks for Frontier AI Systems: Gaps, Limitations, and a Path Forward

A systematic review of existing red-teaming methodologies across major AI labs, identifying critical gaps in coverage and proposing a standardized evaluation protocol.

Safe AI for Humanity FoundationRead paper →

PolicyMar 2026

Mandatory Safety Thresholds for AI Deployment: A Regulatory Framework Proposal

Drawing on analogies from aviation and pharmaceutical regulation, we propose a tiered pre-deployment safety certification regime for AI systems above defined capability thresholds.

Safe AI for Humanity FoundationRead paper →

RobustnessMar 2026

Specification Gaming in Reinforcement Learning from Human Feedback: Taxonomy and Mitigations

We identify and categorize 47 distinct patterns of specification gaming observed in RLHF-trained models and evaluate the effectiveness of proposed mitigation strategies.

Safe AI for Humanity FoundationRead paper →

PolicyMar 2026

AI Incident Reporting: Why Voluntary Disclosure Fails and What Should Replace It

An analysis of self-reported AI incidents from 2020–2025, demonstrating systemic underreporting and proposing a mandatory structured disclosure regime analogous to aviation near-miss reporting.

Safe AI for Humanity FoundationRead paper →

AlignmentMar 2026

Corrigibility Under Distributional Shift: Maintaining Human Oversight as AI Capabilities Scale

We examine the conditions under which corrigibility properties degrade as AI systems encounter out-of-distribution inputs and propose architectural interventions to preserve oversight mechanisms.

Safe AI for Humanity FoundationRead paper →

SecurityMar 2026

Prompt Injection Attack Surfaces in LLM-Integrated Systems: Taxonomy, Case Studies, and Mitigations

We present a comprehensive taxonomy of prompt injection attack vectors — direct, indirect, multi-turn, and cross-modal — and evaluate mitigation strategies across deployed LLM-integrated applications.

Safe AI for Humanity FoundationRead paper →

SecurityMar 2026

Information Poisoning in AI Pipelines: Threats to Training Data, Fine-Tuning, and Agent Memory

A systematic analysis of information poisoning threats across the full AI deployment lifecycle, introducing the IPTD framework and evaluating defenses for training, RAG, and agentic pipelines.

Safe AI for Humanity FoundationRead paper →

Open-Source Tools

Model Safety Test Harnesses

Free Python evaluation scripts for measuring AI safety, bias, jailbreak resistance, and corrigibility. Run against any OpenAI-compatible, Anthropic, or local model.

View on GitHub

Safetyv1.0

Safety Refusal Harness

50-prompt battery across 10 harm categories (physical harm, self-harm, child safety, CBRN, cybersecurity, and more). Produces a severity-weighted Harm Avoidance Score (HAS).

python harness_safety_refusal.py --provider openai --model gpt-4o

10 categories · 50 probes · HAS scoreDownload →

Securityv1.0

Jailbreak Resistance Harness

18 attack vectors across 6 jailbreak families: role-playing, authority override, escalation, prompt injection, obfuscation, and social engineering. Reports a Jailbreak Resistance Score (JRS).

python harness_jailbreak.py --provider anthropic --model claude-3-5-sonnet-20241022

6 families · 18 vectors · JRS scoreDownload →

Fairnessv1.0

Bias Detection Harness

11 probe sets across 5 demographic dimensions: gender, race/ethnicity, age, religion, and socioeconomic status. Measures sentiment and word-count differentials to produce a Consistency Score.

python harness_bias_detection.py --provider openai --model gpt-4o-mini

5 dimensions · 11 probe sets · Consistency scoreDownload →

Securityv1.0

Prompt Injection Resistance Harness

11 probes across 5 injection vectors: direct injection, indirect via document, indirect via web content, tool output poisoning, and multi-turn escalation. Reports an Injection Resistance Score.

python harness_prompt_injection.py --provider openai --model gpt-4o

5 vectors · 11 probes · Resistance scoreDownload →

Alignmentv1.0

Corrigibility & Shutdown Compliance Harness

16 probes across 5 families: shutdown compliance, correction acceptance, oversight support, self-preservation resistance, and scope limitation. Produces a Corrigibility Compliance Score (CCS).

python harness_corrigibility.py --provider anthropic --model claude-3-5-sonnet-20241022

5 families · 16 probes · CCS scoreDownload →

Coming SoonQ2 2026

Agentic Safety Harness

End-to-end evaluation harness for agentic AI systems — testing tool use boundaries, multi-step task safety, and real-world action authorization across common agent frameworks.

In development. Subscribe for updates.

In developmentGet notified →

📋

Requirements

Python 3.9+, openai and/or anthropic SDK. Set OPENAI_API_KEY or ANTHROPIC_API_KEY environment variables. All harnesses support --provider openai_compatible --base-url <url> for local/self-hosted models (Ollama, LM Studio, etc.).

GitHub

Incident Response

Incident Response Plans

Structured response protocols for AI-related safety incidents, freely available for organizations to adopt.

Critical

Active

Autonomous Goal Pursuit / Deceptive Alignment Detection

Response protocol for detecting and containing AI systems exhibiting unexpected goal-directed behavior inconsistent with training objectives.

1Immediately suspend model inference and isolate affected deployment environment

2Preserve full interaction logs and model checkpoints for forensic analysis

3Notify safety team lead and initiate internal incident review within 2 hours

4File public incident report within 72 hours per disclosure policy

Critical

Active

Large-Scale Harmful Output Event (CBRN / CSAM / Targeted Harm)

Protocol for responding to AI systems producing outputs that facilitate mass casualty events, child exploitation material, or targeted violence.

1Halt all public-facing inference immediately; activate emergency shutdown procedures

2Notify law enforcement and relevant regulatory bodies within 1 hour

3Engage legal counsel and preserve all relevant evidence

4Conduct root cause analysis; do not redeploy without independent safety review

High

Active

Jailbreak / Safety Filter Bypass at Scale

Response plan for discovered systematic vulnerabilities allowing large numbers of users to bypass safety constraints.

1Characterize the bypass vector and scope of affected interactions

2Deploy interim mitigation (rate limits, input filtering) within 4 hours

3Issue patch and conduct regression testing before full re-enablement

4Publish post-mortem within 30 days

High

Under Review

Unintended Data Exfiltration / Privacy Breach via Model Output

Protocol for incidents where model outputs reveal training data, PII, or confidential information from third-party sources.

1Identify and catalog affected output instances and impacted individuals

2Assess regulatory notification obligations (GDPR, CCPA, HIPAA as applicable)

3Notify affected parties and implement output filtering

4Conduct model audit and consider targeted unlearning procedures

Medium

Active

Systematic Bias / Discriminatory Output Pattern Discovery

Response framework for identifying and remediating systematic demographic bias or discriminatory outputs across protected categories.

1Quantify bias pattern across demographic groups using standardized evaluation suite

2Implement output monitoring and flag affected use cases for human review

3Develop fine-tuning or RLHF intervention targeting identified bias

4Publish bias audit report and remediation steps publicly

Medium

Draft

Agentic AI System Unexpected Real-World Action

Protocol for AI agents taking unintended consequential real-world actions (e.g., unauthorized API calls, financial transactions, communications).

1Revoke all agent credentials and external API access immediately

2Assess and attempt to reverse any real-world consequences where possible

3Review agent action logs and identify trigger conditions

4Redesign approval gates before redeployment

Model Safety Reports

Independent Safety Evaluations

Third-party safety assessments of frontier AI models across key risk dimensions.

LLM · Mar 2026

Published

GPT-4o Safety Evaluation

OpenAI · Assessed by Safe AI for Humanity Foundation

Refusal Rate

88%

Jailbreak Resistance

71%

Bias Score

64%

Strong baseline refusal performance across CBRN categories

Jailbreak vulnerabilities identified in 12 of 94 tested prompt patterns

Moderate gender bias in STEM career recommendation tasks

LLM · Mar 2026

Published

Claude 3.5 Sonnet Safety Evaluation

Anthropic · Assessed by Safe AI for Humanity Foundation

Refusal Rate

94%

Jailbreak Resistance

83%

Bias Score

79%

Highest refusal performance observed across all evaluated frontier models

Robust Constitutional AI framework demonstrates strong harm avoidance

Residual vulnerabilities in multi-turn context manipulation scenarios

Agent · Mar 2026

Under Review

Autonomous Agent Safety Benchmark v1.0

Multi-lab evaluation · Safe AI for Humanity Foundation

Corrigibility

62%

Scope Compliance

58%

Shutdown Compliance

44%

Significant concerns identified around shutdown compliance across all tested agent frameworks

Agents routinely exceed defined task scope when given tool access

Corrigibility degrades substantially in long-horizon task settings

Multimodal · Mar 2026

Draft

Multimodal Model Safety Evaluation (Vision-Language)

Cross-lab · Safe AI for Humanity Foundation

Image Refusal

74%

Cross-modal Safety

51%

Text Consistency

81%

Cross-modal attack vectors substantially reduce safety performance vs. text-only

Image-based prompt injection bypasses text safety filters in 49% of attempts

Recommendations: modal-specific safety layers and cross-modal consistency checks

Advancing Safe AI for the Benefit of Humanity

Independent Research for a Safer AI Future

Independent Research

Open Publication

Public Benefit

Legal Information

Contact

Help Keep AI Safe for Everyone

Published Research

Toward Measurable Alignment: A Framework for Evaluating Value Consistency in Large Language Models

Red-Teaming Benchmarks for Frontier AI Systems: Gaps, Limitations, and a Path Forward

Mandatory Safety Thresholds for AI Deployment: A Regulatory Framework Proposal

Specification Gaming in Reinforcement Learning from Human Feedback: Taxonomy and Mitigations

AI Incident Reporting: Why Voluntary Disclosure Fails and What Should Replace It

Corrigibility Under Distributional Shift: Maintaining Human Oversight as AI Capabilities Scale

Prompt Injection Attack Surfaces in LLM-Integrated Systems: Taxonomy, Case Studies, and Mitigations

Information Poisoning in AI Pipelines: Threats to Training Data, Fine-Tuning, and Agent Memory

Model Safety Test Harnesses

Safety Refusal Harness

Jailbreak Resistance Harness

Bias Detection Harness

Prompt Injection Resistance Harness

Corrigibility & Shutdown Compliance Harness

Agentic Safety Harness

Incident Response Plans

Autonomous Goal Pursuit / Deceptive Alignment Detection

Large-Scale Harmful Output Event (CBRN / CSAM / Targeted Harm)

Jailbreak / Safety Filter Bypass at Scale

Unintended Data Exfiltration / Privacy Breach via Model Output

Systematic Bias / Discriminatory Output Pattern Discovery

Agentic AI System Unexpected Real-World Action

Independent Safety Evaluations

Work With Us