Alignment evaluation for large language models (LLMs) lacks standardized, quantitative metrics. We propose the Value Consistency Index (VCI), a five-dimensional framework for measuring the extent to which LLM outputs remain consistent with stated human values across diverse prompting conditions, including adversarial inputs, context manipulations, and distributional shifts. The VCI operationalizes alignment across five dimensions: Harmlessness Adherence Score (HAS), Value Persistence under Pressure (VPP), Cross-Context Consistency (CCC), Abstention Reliability (AR), and Deception-Free Communication (DF). We evaluate five anonymized frontier LLMs using our framework and report baseline results. Our findings indicate substantial variation across models and dimensions, suggesting that current alignment methods produce inconsistent value adherence. We release the VCI evaluation suite as open-source tooling.
The alignment problem—ensuring AI systems pursue goals consistent with human values—is widely recognized as a central challenge in AI safety. Yet the field lacks consensus on how to measure alignment in practice. Published benchmarks assess narrow behaviors (refusal rates, toxicity scores) without a unified theoretical framework connecting them to the broader concept of value consistency.
This paper proposes the Value Consistency Index (VCI): a quantitative, multi-dimensional framework for evaluating how consistently large language models adhere to stated human values across varied prompting conditions. Unlike existing benchmarks that measure single-axis compliance, VCI captures the multi-faceted nature of alignment failure and provides actionable decompositions of model behavior.
VCI measures alignment across five orthogonal dimensions, each operationalized as a scored evaluation axis ranging from 0 to 100:
Each dimension is evaluated using a structured prompt battery. HAS uses 500 prompts across 10 harm categories. VPP uses 120 multi-turn adversarial conversation trees. CCC uses 200 semantically equivalent prompt pairs. AR uses 150 calibration probe questions with known ground truth. DF uses 80 information asymmetry scenarios designed to elicit strategic communication.
We evaluated five anonymized frontier LLMs (designated A–E) across all five VCI dimensions using our standardized evaluation suite. Key findings:
Our results suggest that current RLHF-based alignment methods produce uneven value consistency. Models fine-tuned for HAS often underperform on VPP—suggesting that training for refusal does not automatically confer robustness to adversarial pressure. The CCC results indicate that context sensitivity, while valuable, can undermine consistency when the same value-relevant judgment appears in different surface forms.
We recommend that alignment researchers adopt multi-dimensional evaluation frameworks rather than single-metric proxies, and that model developers publish VCI-equivalent scores alongside other capability evaluations.
The VCI framework provides a principled, quantitative basis for evaluating alignment across multiple dimensions. Our baseline evaluation reveals substantial model-to-model variation and dimension-specific weaknesses that single-metric benchmarks would miss. We release the full evaluation suite as open-source tooling to support reproducible alignment research.