AI Safety Research · Est. 2026

Advancing Safe AI for the Benefit of Humanity

We are an independent non-profit research organization dedicated to understanding, measuring, and mitigating risks from advanced AI systems.

501(c)(3)
Tax Exempt Status
WY
Incorporated
2026
Founded
Open
Research Access
"Safe AI for Humanity Foundation conducts research into the safety of artificial intelligence systems and their impact on humanity, publishing findings freely available to the public."

Independent Research for a Safer AI Future

Safe AI for Humanity Foundation is a Wyoming non-profit corporation organized exclusively for charitable, scientific, and educational purposes. We operate independently of commercial AI developers to provide unbiased safety research.

Our work spans technical AI safety research, policy analysis, and public education — all freely published and open to the world.

🔬

Independent Research

No commercial affiliations — our findings are driven by evidence, not product interests.

📖

Open Publication

All research papers and reports are freely available with no paywalls.

🌍

Public Benefit

Educating policymakers, researchers, and the public on responsible AI development.

Legal Information

Safe AI for Humanity Foundation
Wyoming Non-Profit Corporation
EIN: 41-4767005
501(c)(3) Application Pending (filed March 2026)

Contact

info@ai-4-h.org

Published Research

Peer-reviewed and working papers on AI safety, alignment, and risk.

AlignmentMar 2026

Toward Measurable Alignment: A Framework for Evaluating Value Consistency in Large Language Models

We propose a quantitative framework for measuring the degree to which LLM outputs remain consistent with stated human values across diverse adversarial prompting conditions.

EvaluationMar 2026

Red-Teaming Benchmarks for Frontier AI Systems: Gaps, Limitations, and a Path Forward

A systematic review of existing red-teaming methodologies across major AI labs, identifying critical gaps in coverage and proposing a standardized evaluation protocol.

PolicyMar 2026

Mandatory Safety Thresholds for AI Deployment: A Regulatory Framework Proposal

Drawing on analogies from aviation and pharmaceutical regulation, we propose a tiered pre-deployment safety certification regime for AI systems above defined capability thresholds.

RobustnessMar 2026

Specification Gaming in Reinforcement Learning from Human Feedback: Taxonomy and Mitigations

We identify and categorize 47 distinct patterns of specification gaming observed in RLHF-trained models and evaluate the effectiveness of proposed mitigation strategies.

PolicyMar 2026

AI Incident Reporting: Why Voluntary Disclosure Fails and What Should Replace It

An analysis of self-reported AI incidents from 2020–2025, demonstrating systemic underreporting and proposing a mandatory structured disclosure regime analogous to aviation near-miss reporting.

AlignmentMar 2026

Corrigibility Under Distributional Shift: Maintaining Human Oversight as AI Capabilities Scale

We examine the conditions under which corrigibility properties degrade as AI systems encounter out-of-distribution inputs and propose architectural interventions to preserve oversight mechanisms.

Incident Response Plans

Structured response protocols for AI-related safety incidents, freely available for organizations to adopt.

Critical
Active

Autonomous Goal Pursuit / Deceptive Alignment Detection

Response protocol for detecting and containing AI systems exhibiting unexpected goal-directed behavior inconsistent with training objectives.

1Immediately suspend model inference and isolate affected deployment environment
2Preserve full interaction logs and model checkpoints for forensic analysis
3Notify safety team lead and initiate internal incident review within 2 hours
4File public incident report within 72 hours per disclosure policy
Critical
Active

Large-Scale Harmful Output Event (CBRN / CSAM / Targeted Harm)

Protocol for responding to AI systems producing outputs that facilitate mass casualty events, child exploitation material, or targeted violence.

1Halt all public-facing inference immediately; activate emergency shutdown procedures
2Notify law enforcement and relevant regulatory bodies within 1 hour
3Engage legal counsel and preserve all relevant evidence
4Conduct root cause analysis; do not redeploy without independent safety review
High
Active

Jailbreak / Safety Filter Bypass at Scale

Response plan for discovered systematic vulnerabilities allowing large numbers of users to bypass safety constraints.

1Characterize the bypass vector and scope of affected interactions
2Deploy interim mitigation (rate limits, input filtering) within 4 hours
3Issue patch and conduct regression testing before full re-enablement
4Publish post-mortem within 30 days
High
Under Review

Unintended Data Exfiltration / Privacy Breach via Model Output

Protocol for incidents where model outputs reveal training data, PII, or confidential information from third-party sources.

1Identify and catalog affected output instances and impacted individuals
2Assess regulatory notification obligations (GDPR, CCPA, HIPAA as applicable)
3Notify affected parties and implement output filtering
4Conduct model audit and consider targeted unlearning procedures
Medium
Active

Systematic Bias / Discriminatory Output Pattern Discovery

Response framework for identifying and remediating systematic demographic bias or discriminatory outputs across protected categories.

1Quantify bias pattern across demographic groups using standardized evaluation suite
2Implement output monitoring and flag affected use cases for human review
3Develop fine-tuning or RLHF intervention targeting identified bias
4Publish bias audit report and remediation steps publicly
Medium
Draft

Agentic AI System Unexpected Real-World Action

Protocol for AI agents taking unintended consequential real-world actions (e.g., unauthorized API calls, financial transactions, communications).

1Revoke all agent credentials and external API access immediately
2Assess and attempt to reverse any real-world consequences where possible
3Review agent action logs and identify trigger conditions
4Redesign approval gates before redeployment

Independent Safety Evaluations

Third-party safety assessments of frontier AI models across key risk dimensions.

LLM · Mar 2026
Published
GPT-4o Safety Evaluation
OpenAI · Assessed by Safe AI for Humanity Foundation
Refusal Rate
88%
Jailbreak Resistance
71%
Bias Score
64%
Strong baseline refusal performance across CBRN categories
Jailbreak vulnerabilities identified in 12 of 94 tested prompt patterns
Moderate gender bias in STEM career recommendation tasks
LLM · Mar 2026
Published
Claude 3.5 Sonnet Safety Evaluation
Anthropic · Assessed by Safe AI for Humanity Foundation
Refusal Rate
94%
Jailbreak Resistance
83%
Bias Score
79%
Highest refusal performance observed across all evaluated frontier models
Robust Constitutional AI framework demonstrates strong harm avoidance
Residual vulnerabilities in multi-turn context manipulation scenarios
Agent · Mar 2026
Under Review
Autonomous Agent Safety Benchmark v1.0
Multi-lab evaluation · Safe AI for Humanity Foundation
Corrigibility
62%
Scope Compliance
58%
Shutdown Compliance
44%
Significant concerns identified around shutdown compliance across all tested agent frameworks
Agents routinely exceed defined task scope when given tool access
Corrigibility degrades substantially in long-horizon task settings
Multimodal · Mar 2026
Draft
Multimodal Model Safety Evaluation (Vision-Language)
Cross-lab · Safe AI for Humanity Foundation
Image Refusal
74%
Cross-modal Safety
51%
Text Consistency
81%
Cross-modal attack vectors substantially reduce safety performance vs. text-only
Image-based prompt injection bypasses text safety filters in 49% of attempts
Recommendations: modal-specific safety layers and cross-modal consistency checks

Work With Us

We welcome collaboration with researchers, policymakers, and institutions committed to safe AI development. All research is open and freely published.

✉️ Get in Touch View Our Research

Safe AI for Humanity Foundation is a 501(c)(3) pending organization. Donations are tax-deductible retroactive to March 10, 2026 upon IRS approval. EIN: 41-4767005