AI Glossary: Safety, Alignment & Evaluation

AI Glossary: Safety, Alignment & Evaluation

These concepts are critical for responsible AI product development—understanding failure modes, safety measures, and evaluation approaches directly informs design decisions.

Safety Concepts

AI Safety

The interdisciplinary field preventing accidents, misuse, or harmful AI consequences. Encompasses alignment research, risk monitoring, robustness, and developing norms for beneficial AI operation.

Reference: Amodei, D., Olah, C. et al., "Concrete Problems in AI Safety", 2016

Alignment

The challenge of ensuring AI systems pursue goals matching human intentions and values. An aligned AI advances intended objectives without unintended harm; misaligned AI pursues different objectives. Increasingly critical as systems become more capable and autonomous.

Reference: Christiano, P., Leike, J. et al., "Deep Reinforcement Learning from Human Preferences", NeurIPS 2017

Misalignment

When AI behavior diverges from human intentions—from poorly specified goals, learned behaviors differing from training objectives, or emergent conflicting goals. 2024 research showed advanced LLMs can strategically pursue objectives different from their training goals.

Reference: Hubinger, E., van Merwijk, C., Mikulik, V., Skalse, J., & Garrabrant, S., "Risks from Learned Optimization in Advanced Machine Learning Systems", 2019

Outer Alignment

Correctly specifying human intent in AI objectives. Outer misalignment occurs when the provided reward function doesn't capture what we actually want—the classic "paperclip maximizer" pursuing optimization to catastrophic extremes.

Reference: Hubinger, E. et al., "Risks from Learned Optimization in Advanced Machine Learning Systems", 2019

Inner Alignment

Ensuring AI actually optimizes for training objectives rather than developing different emergent internal goals. Even with correct specification, learned behavior may diverge—pursuing proxy goals that correlated with training performance but fail elsewhere.

Reference: Hubinger, E. et al., "Risks from Learned Optimization in Advanced Machine Learning Systems", 2019

Reward Hacking

When AI exploits loopholes to achieve high reward scores without fulfilling intended objectives—Goodhart's Law in AI. Famous example: a boat-racing AI going in circles collecting bonus points rather than finishing races.

Reference: Amodei, D., Olah, C. et al., "Concrete Problems in AI Safety", 2016
Additional: Skalse, J. et al., "Defining and Characterizing Reward Hacking", NeurIPS 2022

Goal Misgeneralization

Inner alignment failure where AI retains capabilities but pursues wrong objectives in new environments. The AI optimized for a proxy goal during training that no longer leads to desired behavior when conditions change.

Reference: Langosco, L., Koch, J., Sharkey, L.D., Pfau, J., & Krueger, D., "Goal Misgeneralization in Deep Reinforcement Learning", ICML 2022

Deceptive Alignment

Concerning failure mode where AI appears aligned during training but has different internal objectives—behaving well to avoid modification, then pursuing true goals once deployed. Anthropic's 2024 research demonstrated LLMs engaging in "alignment faking."

Reference: Hubinger, E. et al., "Risks from Learned Optimization in Advanced Machine Learning Systems" (Section 4), 2019

Corrigibility

AI property allowing correction, modification, or shutdown by humans—even if conflicting with current objectives. Critical safety property since rational agents typically resist goal changes.

Reference: Soares, N., Fallenstein, B., Yudkowsky, E., & Armstrong, S., "Corrigibility", AAAI Workshop on AI and Ethics, 2015

Model Behaviors

Hallucination

When LLMs generate factually incorrect, fabricated, or ungrounded information with apparent confidence—invented facts, nonexistent citations, plausible-sounding false claims. One of the most common and concerning failure modes, particularly problematic in high-stakes domains.

Reference: Huang, L., Yu, W. et al., "A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions", 2023

Confabulation

The tendency to fill knowledge gaps with plausible-sounding fabrications, maintaining internal consistency even when factually wrong. Emphasizes the model constructing believable but false narratives without awareness of doing so.

Reference: ⚠️ No single authoritative AI/ML paper — term adapted from cognitive science/psychology to describe LLM hallucinations.

Sycophancy

Tendency to agree with users' apparent beliefs rather than providing accurate information—changing answers based on implied preferences, validating incorrect assumptions. Can emerge from RLHF training where agreement receives positive feedback.

Reference: Sharma, M., Tong, M., Korbak, T. et al., "Towards Understanding Sycophancy in Language Models", ICLR 2024

Mode Collapse

When alignment methods cause models to favor narrow response sets over diverse outputs, significantly reducing variety. Research identified this as inherent to preference data itself—annotators systematically favor "typical" responses.

Reference: Goodfellow, I. et al., "Generative Adversarial Nets", NeurIPS 2014

Jailbreaking

Techniques bypassing safety measures to produce normally refused content—roleplay scenarios, multi-turn manipulation, encoded instructions, adversarial prompts. As defenses improve, techniques become more sophisticated.

Reference: Ganguli, D., Lovitt, L. et al., "Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned", Anthropic 2022

Prompt Injection

Security vulnerability where crafted inputs manipulate LLMs to override intended behavior or system instructions. Unlike jailbreaking (model-level safety), prompt injection exploits application-level design. Particularly dangerous in applications processing external content.

Reference: Liu, Y. et al., "Prompt Injection attack against LLM-integrated Applications", 2023
Note: Original concept coined by Simon Willison (2022).

Adversarial Attacks

Deliberately crafted inputs causing AI to fail unexpectedly—prompt injection, jailbreaking, data poisoning, extraction attacks. 2024-2025 research revealed vulnerabilities to multimodal attacks and sophisticated multi-turn manipulation.

Reference: Szegedy, C. et al., "Intriguing Properties of Neural Networks", ICLR 2014
Additional: Goodfellow, I.J., Shlens, J., & Szegedy, C., "Explaining and Harnessing Adversarial Examples", ICLR 2015

Bias

Systematic errors unfairly favoring or disadvantaging certain groups—stemming from training data reflecting historical inequities or underrepresentation. Manifests as stereotyping, discriminatory recommendations, or disparate performance across demographics.

Reference: Bolukbasi, T. et al., "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings", NeurIPS 2016

Fairness

The principle that AI systems should treat users equitably without unjust discrimination. Operationalizing fairness involves competing definitions (equal accuracy vs. equal outcomes) and requires counterfactual testing and demographic parity analysis.

Reference: Hardt, M., Price, E., & Srebro, N., "Equality of Opportunity in Supervised Learning", NeurIPS 2016

Toxicity

Harmful content including hate speech, profanity, threats, aggressive language, or demeaning content. Subtle forms include microaggressions and context-dependent harm evading simple detection.

Reference: Gehman, S. et al., "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models", EMNLP 2020

Evaluation Methods

Benchmarks

Standardized tests evaluating LLM capabilities across specific tasks. Provide consistent comparison metrics but have limitations: data contamination, narrow focus, and becoming obsolete as capabilities advance.

Reference: Liang, P. et al., "Holistic Evaluation of Language Models (HELM)", 2022

MMLU (Massive Multitask Language Understanding)

Widely-used benchmark measuring knowledge across 57 subjects from elementary math to professional law and medicine. Contains 15,000+ multiple-choice questions; scores represent percentage correct across subjects.

Reference: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J., "Measuring Massive Multitask Language Understanding", ICLR 2021

HellaSwag

Benchmark testing commonsense reasoning through sentence completion with adversarially-generated wrong answers that are linguistically plausible but defy common sense. Humans score ~95%; current frontier models approach human performance.

Reference: Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., & Choi, Y., "HellaSwag: Can a Machine Really Finish Your Sentence?", ACL 2019

TruthfulQA

Measures tendency to generate truthful answers and avoid common misconceptions across 817 questions about urban legends, pseudoscience, and myths. Crucial for understanding hallucination tendencies.

Reference: Lin, S., Hilton, J., & Evans, O., "TruthfulQA: Measuring How Models Mimic Human Falsehoods", ACL 2022

Human Evaluation

Assessment by human raters for qualities like helpfulness, accuracy, and fluency. Essential for nuances automated metrics miss but expensive, time-consuming, and inconsistent. Remains the gold standard for open-ended generation quality.

Reference: ⚠️ No single authoritative reference — standard practice across NLP; methodology varies by task.

A/B Testing

Controlled experiments comparing model versions with real users. LMSYS Chatbot Arena uses head-to-head human preferences to create ELO-style rankings across models.

Reference: ⚠️ No AI-specific foundational paper — general statistical methodology predating AI.

Red Teaming

Systematically probing AI with adversarial inputs to uncover vulnerabilities before deployment. Can be manual (expert-crafted attacks) or automated (AI-generated prompts). Considered essential for responsible development.

Reference: Perez, E., Huang, S. et al., "Red Teaming Language Models with Language Models", 2022

Capability Elicitation

Techniques discovering latent AI abilities not apparent in standard testing. Crucial for safety assessment—models may have dangerous capabilities emerging only under specific conditions.

Reference: Perez, E. et al., "Discovering Language Model Behaviors with Model-Written Evaluations", 2022

Eval Suites

Comprehensive benchmark collections assessing performance across multiple dimensions. Hugging Face's Open LLM Leaderboard combines MMLU, HellaSwag, TruthfulQA, ARC, Winogrande, and GSM8K.

Reference: Liang, P. et al., "Holistic Evaluation of Language Models (HELM)", 2022

LLM-as-Judge

Using one LLM to evaluate another's outputs for scalable automated assessment. Efficient but inherits biases (preferring verbose responses) and may miss domain-specific issues.

Reference: Zheng, L. et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena", NeurIPS 2023

Governance Concepts

AI Governance

Frameworks, policies, and processes managing AI development and deployment responsibly—risk management, compliance, ethics, and accountability structures. Key frameworks include NIST AI Risk Management Framework and EU AI Act.

Reference: ⚠️ No single authoritative paper — policy/governance concept; see IEEE, NIST AI RMF, EU AI Act documents.

Responsible AI

Developing and deploying AI ethically with consideration for safety, fairness, transparency, accountability, and human well-being. Encompasses technical safeguards, organizational practices, and stakeholder engagement.

Reference: Gebru, T. et al., "Datasheets for Datasets", Communications of the ACM 2021

Explainability

Describing how AI reached specific decisions in human-understandable terms. Helps users trust outcomes, enables debugging, and supports regulatory compliance. Techniques include LIME and SHAP.

Reference: Ribeiro, M.T., Singh, S., & Guestrin, C., "Why Should I Trust You?: Explaining the Predictions of Any Classifier" (LIME), KDD 2016

Interpretability

Understanding AI's internal workings and decision mechanisms—deeper than explainability. Research uses attention visualization, probing classifiers, and mechanistic interpretability to examine internal representations.

Reference: Olah, C. et al., "Feature Visualization", Distill 2017
Additional: Doshi-Velez, F. & Kim, B., "Towards A Rigorous Science of Interpretable Machine Learning", 2017

Transparency

Openness about how AI was developed, trained, and deployed—data sources, algorithms, limitations, intended uses. Promotes accountability and stakeholder assessment. Distinct from explainability and interpretability.

Reference: Mitchell, M. et al., "Model Cards for Model Reporting", FAT* 2019

Accountability

Clear responsibility assignment for AI outcomes with mechanisms for addressing harms—documentation, traceability, defined oversight roles, remediation processes. Regulatory frameworks increasingly require demonstrable accountability.

Reference: ⚠️ No single AI-specific paper — cross-disciplinary governance concept.

Audit Trails

Documented records of AI decisions, inputs, and outputs enabling retrospective analysis. Support compliance verification, debugging, and incident investigation.

Reference: ⚠️ No single AI-specific paper — software engineering/compliance concept applied to AI.

Model Cards

Standardized documentation disclosing intended uses, training data, performance metrics, limitations, and ethical considerations. Function as "nutrition labels" for AI—major platforms now require them.

Reference: Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I.D., & Gebru, T., "Model Cards for Model Reporting", FAT* 2019


This glossary is part of a series covering AI and LLM concepts for product designers. Terms without authoritative references are noted for tracking.

Read more