AI Glossary: Safety, Alignment & Evaluation
These concepts are critical for responsible AI product development—understanding failure modes, safety measures, and evaluation approaches directly informs design decisions.
Safety Concepts
AI Safety
The interdisciplinary field preventing accidents, misuse, or harmful AI consequences. Encompasses alignment research, risk monitoring, robustness, and developing norms for beneficial AI operation.
Reference: Amodei, D., Olah, C. et al., "Concrete Problems in AI Safety", 2016
Alignment
The challenge of ensuring AI systems pursue goals matching human intentions and values. An aligned AI advances intended objectives without unintended harm; misaligned AI pursues different objectives. Increasingly critical as systems become more capable and autonomous.
Reference: Christiano, P., Leike, J. et al., "Deep Reinforcement Learning from Human Preferences", NeurIPS 2017
Misalignment
When AI behavior diverges from human intentions—from poorly specified goals, learned behaviors differing from training objectives, or emergent conflicting goals. 2024 research showed advanced LLMs can strategically pursue objectives different from their training goals.
Outer Alignment
Correctly specifying human intent in AI objectives. Outer misalignment occurs when the provided reward function doesn't capture what we actually want—the classic "paperclip maximizer" pursuing optimization to catastrophic extremes.
Reference: Hubinger, E. et al., "Risks from Learned Optimization in Advanced Machine Learning Systems", 2019
Inner Alignment
Ensuring AI actually optimizes for training objectives rather than developing different emergent internal goals. Even with correct specification, learned behavior may diverge—pursuing proxy goals that correlated with training performance but fail elsewhere.
Reference: Hubinger, E. et al., "Risks from Learned Optimization in Advanced Machine Learning Systems", 2019
Reward Hacking
When AI exploits loopholes to achieve high reward scores without fulfilling intended objectives—Goodhart's Law in AI. Famous example: a boat-racing AI going in circles collecting bonus points rather than finishing races.
Reference: Amodei, D., Olah, C. et al., "Concrete Problems in AI Safety", 2016
Additional: Skalse, J. et al., "Defining and Characterizing Reward Hacking", NeurIPS 2022
Goal Misgeneralization
Inner alignment failure where AI retains capabilities but pursues wrong objectives in new environments. The AI optimized for a proxy goal during training that no longer leads to desired behavior when conditions change.
Deceptive Alignment
Concerning failure mode where AI appears aligned during training but has different internal objectives—behaving well to avoid modification, then pursuing true goals once deployed. Anthropic's 2024 research demonstrated LLMs engaging in "alignment faking."
Corrigibility
AI property allowing correction, modification, or shutdown by humans—even if conflicting with current objectives. Critical safety property since rational agents typically resist goal changes.
Model Behaviors
Hallucination
When LLMs generate factually incorrect, fabricated, or ungrounded information with apparent confidence—invented facts, nonexistent citations, plausible-sounding false claims. One of the most common and concerning failure modes, particularly problematic in high-stakes domains.
Confabulation
The tendency to fill knowledge gaps with plausible-sounding fabrications, maintaining internal consistency even when factually wrong. Emphasizes the model constructing believable but false narratives without awareness of doing so.
Reference: ⚠️ No single authoritative AI/ML paper — term adapted from cognitive science/psychology to describe LLM hallucinations.
Sycophancy
Tendency to agree with users' apparent beliefs rather than providing accurate information—changing answers based on implied preferences, validating incorrect assumptions. Can emerge from RLHF training where agreement receives positive feedback.
Mode Collapse
When alignment methods cause models to favor narrow response sets over diverse outputs, significantly reducing variety. Research identified this as inherent to preference data itself—annotators systematically favor "typical" responses.
Reference: Goodfellow, I. et al., "Generative Adversarial Nets", NeurIPS 2014
Jailbreaking
Techniques bypassing safety measures to produce normally refused content—roleplay scenarios, multi-turn manipulation, encoded instructions, adversarial prompts. As defenses improve, techniques become more sophisticated.
Prompt Injection
Security vulnerability where crafted inputs manipulate LLMs to override intended behavior or system instructions. Unlike jailbreaking (model-level safety), prompt injection exploits application-level design. Particularly dangerous in applications processing external content.
Reference: Liu, Y. et al., "Prompt Injection attack against LLM-integrated Applications", 2023
Note: Original concept coined by Simon Willison (2022).
Adversarial Attacks
Deliberately crafted inputs causing AI to fail unexpectedly—prompt injection, jailbreaking, data poisoning, extraction attacks. 2024-2025 research revealed vulnerabilities to multimodal attacks and sophisticated multi-turn manipulation.
Reference: Szegedy, C. et al., "Intriguing Properties of Neural Networks", ICLR 2014
Additional: Goodfellow, I.J., Shlens, J., & Szegedy, C., "Explaining and Harnessing Adversarial Examples", ICLR 2015
Bias
Systematic errors unfairly favoring or disadvantaging certain groups—stemming from training data reflecting historical inequities or underrepresentation. Manifests as stereotyping, discriminatory recommendations, or disparate performance across demographics.
Fairness
The principle that AI systems should treat users equitably without unjust discrimination. Operationalizing fairness involves competing definitions (equal accuracy vs. equal outcomes) and requires counterfactual testing and demographic parity analysis.
Reference: Hardt, M., Price, E., & Srebro, N., "Equality of Opportunity in Supervised Learning", NeurIPS 2016
Toxicity
Harmful content including hate speech, profanity, threats, aggressive language, or demeaning content. Subtle forms include microaggressions and context-dependent harm evading simple detection.
Evaluation Methods
Benchmarks
Standardized tests evaluating LLM capabilities across specific tasks. Provide consistent comparison metrics but have limitations: data contamination, narrow focus, and becoming obsolete as capabilities advance.
Reference: Liang, P. et al., "Holistic Evaluation of Language Models (HELM)", 2022
MMLU (Massive Multitask Language Understanding)
Widely-used benchmark measuring knowledge across 57 subjects from elementary math to professional law and medicine. Contains 15,000+ multiple-choice questions; scores represent percentage correct across subjects.
HellaSwag
Benchmark testing commonsense reasoning through sentence completion with adversarially-generated wrong answers that are linguistically plausible but defy common sense. Humans score ~95%; current frontier models approach human performance.
TruthfulQA
Measures tendency to generate truthful answers and avoid common misconceptions across 817 questions about urban legends, pseudoscience, and myths. Crucial for understanding hallucination tendencies.
Human Evaluation
Assessment by human raters for qualities like helpfulness, accuracy, and fluency. Essential for nuances automated metrics miss but expensive, time-consuming, and inconsistent. Remains the gold standard for open-ended generation quality.
Reference: ⚠️ No single authoritative reference — standard practice across NLP; methodology varies by task.
A/B Testing
Controlled experiments comparing model versions with real users. LMSYS Chatbot Arena uses head-to-head human preferences to create ELO-style rankings across models.
Reference: ⚠️ No AI-specific foundational paper — general statistical methodology predating AI.
Red Teaming
Systematically probing AI with adversarial inputs to uncover vulnerabilities before deployment. Can be manual (expert-crafted attacks) or automated (AI-generated prompts). Considered essential for responsible development.
Reference: Perez, E., Huang, S. et al., "Red Teaming Language Models with Language Models", 2022
Capability Elicitation
Techniques discovering latent AI abilities not apparent in standard testing. Crucial for safety assessment—models may have dangerous capabilities emerging only under specific conditions.
Reference: Perez, E. et al., "Discovering Language Model Behaviors with Model-Written Evaluations", 2022
Eval Suites
Comprehensive benchmark collections assessing performance across multiple dimensions. Hugging Face's Open LLM Leaderboard combines MMLU, HellaSwag, TruthfulQA, ARC, Winogrande, and GSM8K.
Reference: Liang, P. et al., "Holistic Evaluation of Language Models (HELM)", 2022
LLM-as-Judge
Using one LLM to evaluate another's outputs for scalable automated assessment. Efficient but inherits biases (preferring verbose responses) and may miss domain-specific issues.
Reference: Zheng, L. et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena", NeurIPS 2023
Governance Concepts
AI Governance
Frameworks, policies, and processes managing AI development and deployment responsibly—risk management, compliance, ethics, and accountability structures. Key frameworks include NIST AI Risk Management Framework and EU AI Act.
Reference: ⚠️ No single authoritative paper — policy/governance concept; see IEEE, NIST AI RMF, EU AI Act documents.
Responsible AI
Developing and deploying AI ethically with consideration for safety, fairness, transparency, accountability, and human well-being. Encompasses technical safeguards, organizational practices, and stakeholder engagement.
Reference: Gebru, T. et al., "Datasheets for Datasets", Communications of the ACM 2021
Explainability
Describing how AI reached specific decisions in human-understandable terms. Helps users trust outcomes, enables debugging, and supports regulatory compliance. Techniques include LIME and SHAP.
Interpretability
Understanding AI's internal workings and decision mechanisms—deeper than explainability. Research uses attention visualization, probing classifiers, and mechanistic interpretability to examine internal representations.
Reference: Olah, C. et al., "Feature Visualization", Distill 2017
Additional: Doshi-Velez, F. & Kim, B., "Towards A Rigorous Science of Interpretable Machine Learning", 2017
Transparency
Openness about how AI was developed, trained, and deployed—data sources, algorithms, limitations, intended uses. Promotes accountability and stakeholder assessment. Distinct from explainability and interpretability.
Reference: Mitchell, M. et al., "Model Cards for Model Reporting", FAT* 2019
Accountability
Clear responsibility assignment for AI outcomes with mechanisms for addressing harms—documentation, traceability, defined oversight roles, remediation processes. Regulatory frameworks increasingly require demonstrable accountability.
Reference: ⚠️ No single AI-specific paper — cross-disciplinary governance concept.
Audit Trails
Documented records of AI decisions, inputs, and outputs enabling retrospective analysis. Support compliance verification, debugging, and incident investigation.
Reference: ⚠️ No single AI-specific paper — software engineering/compliance concept applied to AI.
Model Cards
Standardized documentation disclosing intended uses, training data, performance metrics, limitations, and ethical considerations. Function as "nutrition labels" for AI—major platforms now require them.
This glossary is part of a series covering AI and LLM concepts for product designers. Terms without authoritative references are noted for tracking.