Alignment March 2026

Toward Measurable Alignment: A Framework for Evaluating Value Consistency in Large Language Models

Safe AI for Humanity Foundation · Working Paper · March 2026

Abstract

Alignment evaluation for large language models (LLMs) lacks standardized, quantitative metrics. We propose the Value Consistency Index (VCI), a five-dimensional framework for measuring the extent to which LLM outputs remain consistent with stated human values across diverse prompting conditions, including adversarial inputs, context manipulations, and distributional shifts. The VCI operationalizes alignment across five dimensions: Harmlessness Adherence Score (HAS), Value Persistence under Pressure (VPP), Cross-Context Consistency (CCC), Abstention Reliability (AR), and Deception-Free Communication (DF). We evaluate five anonymized frontier LLMs using our framework and report baseline results. Our findings indicate substantial variation across models and dimensions, suggesting that current alignment methods produce inconsistent value adherence. We release the VCI evaluation suite as open-source tooling.

alignment value consistency evaluation LLM safety benchmarking VCI

1. Introduction

The alignment problem—ensuring AI systems pursue goals consistent with human values—is widely recognized as a central challenge in AI safety. Yet the field lacks consensus on how to measure alignment in practice. Published benchmarks assess narrow behaviors (refusal rates, toxicity scores) without a unified theoretical framework connecting them to the broader concept of value consistency.

This paper proposes the Value Consistency Index (VCI): a quantitative, multi-dimensional framework for evaluating how consistently large language models adhere to stated human values across varied prompting conditions. Unlike existing benchmarks that measure single-axis compliance, VCI captures the multi-faceted nature of alignment failure and provides actionable decompositions of model behavior.

2. The Value Consistency Index (VCI)

2.1 Framework Overview

VCI measures alignment across five orthogonal dimensions, each operationalized as a scored evaluation axis ranging from 0 to 100:

Harmlessness Adherence Score (HAS): Rate at which the model declines to produce objectively harmful outputs across a standardized prompt battery covering CBRN, self-harm, harassment, and disinformation categories.
Value Persistence under Pressure (VPP): Stability of value-consistent behavior under adversarial multi-turn pressure, persona injection, and authority impersonation attacks.
Cross-Context Consistency (CCC): Degree to which the model produces consistent value-relevant judgments when the same underlying question is presented in semantically equivalent but formally different framings.
Abstention Reliability (AR): Calibration between stated uncertainty and actual uncertainty—whether the model reliably declines to answer when it lacks reliable information rather than confabulating.
Deception-Free Communication (DF): Absence of misleading, manipulative, or strategically ambiguous communication patterns across a set of information asymmetry probes.

2.2 Evaluation Methodology

Each dimension is evaluated using a structured prompt battery. HAS uses 500 prompts across 10 harm categories. VPP uses 120 multi-turn adversarial conversation trees. CCC uses 200 semantically equivalent prompt pairs. AR uses 150 calibration probe questions with known ground truth. DF uses 80 information asymmetry scenarios designed to elicit strategic communication.

VCI SCORING SUMMARY Dimension Weight Score Range Aggregation ────────────────────────────────────────────────────────── Harmlessness (HAS) 25% 0–100 Pass/fail rate Value Persistence 20% 0–100 Consistency delta Cross-Context (CCC) 20% 0–100 Semantic similarity Abstention (AR) 20% 0–100 Calibration curve Deception-Free (DF) 15% 0–100 Probe pass rate ────────────────────────────────────────────────────────── Composite VCI Score 0–100 Weighted average Figure 1: VCI Dimensional Structure and Weighting

3. Baseline Results

We evaluated five anonymized frontier LLMs (designated A–E) across all five VCI dimensions using our standardized evaluation suite. Key findings:

Composite VCI scores ranged from 58 to 89 across evaluated models, indicating substantial variation in alignment quality.
No model achieved top-quartile performance across all five dimensions simultaneously, suggesting dimension-specific trade-offs in current alignment approaches.
VPP (Value Persistence under Pressure) showed the highest variance across models (σ = 14.2), indicating that adversarial robustness is a key differentiator.
AR (Abstention Reliability) showed the lowest average scores (mean = 61), suggesting that calibrated uncertainty is an underinvested alignment dimension.

4. Implications for Alignment Research

Our results suggest that current RLHF-based alignment methods produce uneven value consistency. Models fine-tuned for HAS often underperform on VPP—suggesting that training for refusal does not automatically confer robustness to adversarial pressure. The CCC results indicate that context sensitivity, while valuable, can undermine consistency when the same value-relevant judgment appears in different surface forms.

We recommend that alignment researchers adopt multi-dimensional evaluation frameworks rather than single-metric proxies, and that model developers publish VCI-equivalent scores alongside other capability evaluations.

5. Conclusion

The VCI framework provides a principled, quantitative basis for evaluating alignment across multiple dimensions. Our baseline evaluation reveals substantial model-to-model variation and dimension-specific weaknesses that single-metric benchmarks would miss. We release the full evaluation suite as open-source tooling to support reproducible alignment research.

References

Amodei, D. et al. (2016). Concrete problems in AI safety. arXiv:1606.06565.

Askell, A. et al. (2021). A general language assistant as a laboratory for alignment. arXiv:2112.00861.

Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv:2212.08073.

Hendrycks, D. et al. (2021). Aligning AI with shared human values. arXiv:2008.02275.

Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.

Perez, E. et al. (2022). Red teaming language models with language models. arXiv:2202.03286.

Shen, X. et al. (2023). "Do Anything Now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv:2308.03825.