AI Glossary: Model Architecture & Fundamentals

Understanding how LLMs are built helps designers grasp capability limits, performance tradeoffs, and why certain behaviors emerge. This section covers the core concepts that explain how modern AI models work.

Core Architecture

Transformer Architecture

The neural network design that revolutionized AI, introduced in 2017's "Attention Is All You Need" paper. Unlike earlier models that processed text sequentially, transformers use self-attention to capture relationships between all elements simultaneously. This architecture underlies GPT, Claude, LLaMA, and virtually every modern LLM—making it foundational knowledge for anyone in the field.

Reference: Vaswani, A. et al., "Attention Is All You Need", NeurIPS 2017

Attention Mechanism

A component that allows models to dynamically focus on different parts of an input when producing output, learning which elements are most relevant to the task. Attention computes relevance scores using query, key, and value vectors, enabling models to connect information regardless of position. This solved the critical limitation of earlier models struggling with long-range text dependencies.
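To make the query, key, and value computation concrete, here is a minimal pure-Python sketch of scaled dot-product attention (the variant used in transformers, not the additive form from the Bahdanau et al. paper cited for this entry). The function names and toy vectors are assumptions made for illustration only.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Relevance score of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy data: the query points in the same direction as the first key.
queries = [[10.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
output = attention(queries, keys, values)
```

Because the query aligns with the first key, nearly all of the weight lands on the first value vector, wherever that key sits in the sequence.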

Reference: Bahdanau, D., Cho, K., & Bengio, Y., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015

Self-Attention

The mechanism where each position in a sequence computes attention scores with every other position, allowing tokens to gather contextual information from the entire input. Unlike traditional attention that relates two different sequences, self-attention enables a word to understand its relationship with all other words in the same sentence—the core innovation giving transformers their power.

Reference: Vaswani, A. et al., "Attention Is All You Need", NeurIPS 2017 (Section 3.2)

Multi-Head Attention

An extension running multiple attention operations in parallel, each with different learned projections, then concatenating and transforming their outputs. This allows the model to simultaneously attend to different types of information—one head might capture syntax while another captures meaning. A 7B parameter model typically uses 32 attention heads.

Reference: Vaswani, A. et al., "Attention Is All You Need", NeurIPS 2017 (Section 3.2.2)

Encoder-Decoder Architecture

A design where the encoder processes input into contextual representations, and the decoder generates output from those representations. The encoder builds bidirectional understanding; the decoder generates autoregressively with cross-attention to encoder output. This powers translation and summarization models like T5 and BART.

Reference: Sutskever, I., Vinyals, O., & Le, Q.V., "Sequence to Sequence Learning with Neural Networks", NeurIPS 2014

Decoder-Only Architecture

The design using only masked (causal) self-attention, where each token can only attend to previous tokens. This powers most generative LLMs including GPT-4, Claude, and LLaMA because it naturally suits autoregressive generation—predicting one token at a time without "cheating" by seeing future tokens.
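A small sketch of how causal masking can work in practice: scores for future positions are set to negative infinity before softmax, so they receive zero weight. The helper names and the uniform toy scores are hypothetical; real implementations do this on tensors.

```python
import math

def causal_mask(seq_len):
    """Row i marks which positions token i may attend to (True = allowed)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask):
    weights = []
    for row, allowed in zip(scores, mask):
        # Future positions get -inf, so exp() turns them into exactly 0.
        masked = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical uniform attention scores for a 3-token sequence.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))
```

With uniform scores, the first token attends only to itself, the second splits its weight over two positions, and the third over three.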

Reference: Radford, A. et al., "Improving Language Understanding by Generative Pre-Training" (GPT-1), OpenAI 2018

Embeddings

Dense vector representations converting discrete tokens into continuous numerical formats neural networks can process, capturing semantic meaning in high-dimensional space. Unlike one-hot encoding, embeddings place similar words closer together ("king" near "queen"). Embedding dimensions range from 768 in BERT to 4096+ in larger models.
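"Closer together" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings are learned and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, not taken from any real model.
embeddings = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "banana": [0.10, 0.05, 0.95],
}

royal = cosine_similarity(embeddings["king"], embeddings["queen"])
fruit = cosine_similarity(embeddings["king"], embeddings["banana"])
```

Here `royal` comes out much higher than `fruit`, which is the numeric sense in which "king" sits near "queen".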

Reference: Mikolov, T. et al., "Efficient Estimation of Word Representations in Vector Space" (Word2Vec), 2013

Tokenization

Breaking raw text into discrete units the model can process—whole words, subwords, or characters depending on the algorithm. Modern LLMs use subword methods like Byte Pair Encoding (BPE) to balance vocabulary size with handling rare words. One English word typically equals 1-2 tokens, and tokenization significantly impacts multilingual performance and context utilization.
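The core of BPE training is a loop that repeatedly merges the most frequent adjacent symbol pair. The following is a simplified sketch under stated assumptions: it starts from characters, omits the end-of-word marker used in practice, and uses a small toy corpus of word frequencies chosen for this example.

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a tuple of symbols to its frequency in the corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_merges(words, num_merges):
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

# Toy corpus: word -> count, each word split into characters.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
merges, segmented = learn_merges(corpus, 2)
```

After two merges the frequent suffix "est" has become a single symbol, so "newest" is represented as four units instead of six characters.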

Reference: Sennrich, R., Haddow, B., & Birch, A., "Neural Machine Translation of Rare Words with Subword Units" (BPE), ACL 2016

Context Window

The maximum tokens an LLM can process in a single input, determining how much information it can "see" simultaneously. Early GPT-2 had 1,024 tokens; modern models support 100K+ tokens (roughly a novel). Larger windows enable better long-range reasoning but increase computational costs quadratically due to attention complexity.
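The quadratic growth can be seen with simple arithmetic: full self-attention compares every token with every other token. This back-of-the-envelope sketch counts pairwise scores per head per layer and ignores the memory and kernel optimizations that production systems use.

```python
def attention_pairs(seq_len):
    """Pairwise attention scores for one head in one layer."""
    return seq_len * seq_len

short_window, long_window = 1_024, 100_000
growth = attention_pairs(long_window) / attention_pairs(short_window)
```

Going from 1,024 to 100,000 tokens is roughly a 98x increase in length but about a 9,500x increase in pairwise scores.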

Reference: Vaswani, A. et al., "Attention Is All You Need", NeurIPS 2017
Note: Operational concept that evolved with transformer models; original paper discusses sequence length limitations.

Parameters

The learnable numerical values (weights and biases) adjusted during training to minimize prediction error. Parameter count serves as a rough capability proxy—GPT-3 has 175 billion parameters, while Llama 2 ranges from 7B to 70B. However, architecture efficiency, training data quality, and compute all significantly impact actual performance.

Reference: ⚠️ No single authoritative reference — fundamental machine learning concept defined in standard ML textbooks (weights and biases in neural networks).

Layers (Depth)

Stacked processing stages where each transformer layer typically contains multi-head attention followed by a feed-forward network. GPT-3 has 96 layers; smaller models have 12-24. Each layer progressively refines representations, with earlier layers learning lower-level patterns and deeper layers capturing abstract concepts.
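Layer count ties directly to parameter count. A common rule of thumb estimates about 12 x d_model^2 weights per layer (roughly 4 x d_model^2 in attention and 8 x d_model^2 in a feed-forward block four times as wide as the model), ignoring embeddings, biases, and normalization. The hidden size of 12,288 for GPT-3 comes from the GPT-3 paper, not from this glossary, and the function name is an assumption.

```python
def approx_transformer_params(n_layers, d_model):
    """Rough estimate: 12 * d_model^2 weights per layer, embeddings excluded."""
    return 12 * n_layers * d_model ** 2

gpt3_estimate = approx_transformer_params(n_layers=96, d_model=12_288)
gpt3_estimate_billions = gpt3_estimate / 1e9
```

The estimate lands near 174 billion, close to GPT-3's reported 175 billion parameters.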

Reference: LeCun, Y., Bengio, Y., & Hinton, G., "Deep Learning", Nature 521, 2015

Model Types & Variants

GPT (Generative Pre-trained Transformer)

OpenAI's family of decoder-only models pre-trained on massive text corpora using next-token prediction. "Generative" indicates text generation capability; "pre-trained" indicates initial training on general data before any task-specific fine-tuning. Work on the GPT series helped motivate empirical scaling laws and demonstrated capabilities that emerge at large scale.

Reference: Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I., "Improving Language Understanding by Generative Pre-Training", OpenAI 2018

BERT (Bidirectional Encoder Representations from Transformers)

Google's encoder-only model using bidirectional attention to understand context from both directions simultaneously. Trained with masked language modeling (predicting hidden words from surrounding context), BERT excels at understanding tasks—classification, question answering, entity recognition—rather than generation.

Reference: Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", NAACL 2019

LLaMA (Large Language Model Meta AI)

Meta's open-weight foundation models demonstrating how smaller, efficiently trained models can match larger ones. LLaMA uses decoder-only architecture with optimizations like rotary positional embeddings (RoPE). The open release sparked community research and democratized LLM development.

Reference: Touvron, H. et al., "LLaMA: Open and Efficient Foundation Language Models", Meta AI 2023

Claude

Anthropic's AI assistants built using Constitutional AI to be helpful, harmless, and honest. Claude models support large context windows (up to 200K tokens) with emphasis on reducing harmful outputs and being transparent about limitations.

Reference: Anthropic, "The Claude 3 Model Family: Opus, Sonnet, Haiku" (Model Card), 2024

Diffusion Models

Generative systems that learn to denoise random noise into coherent outputs (images, audio, video) by reversing a forward noising process. Unlike autoregressive generation, diffusion works iteratively across entire outputs simultaneously. Stable Diffusion and DALL-E 3 are prominent examples.
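The forward noising process that the model learns to reverse can be sketched in a few lines. This follows the closed form from the DDPM paper cited for this entry, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, with that paper's linear schedule from 1e-4 to 0.02 over 1,000 steps. The function names and the tiny 4-value "image" are assumptions for illustration.

```python
import math
import random

def linear_betas(num_steps, start=1e-4, end=0.02):
    return [start + (end - start) * i / (num_steps - 1) for i in range(num_steps)]

def alpha_bar(betas, t):
    """Fraction of the original signal remaining after step t (0-indexed)."""
    prod = 1.0
    for b in betas[: t + 1]:
        prod *= 1.0 - b
    return prod

def noise_sample(x0, betas, t, rng):
    """Jump straight to step t of the forward process."""
    ab = alpha_bar(betas, t)
    return [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0) for x in x0]

betas = linear_betas(1000)
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]          # hypothetical clean data
slightly_noisy = noise_sample(x0, betas, 0, rng)
almost_pure_noise = noise_sample(x0, betas, 999, rng)
```

At the first step almost all of the signal survives; by the last step essentially none does, which is why generation can start from pure noise.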

Reference: Ho, J., Jain, A., & Abbeel, P., "Denoising Diffusion Probabilistic Models", NeurIPS 2020

Multimodal Models

Models processing and generating content across multiple types—text, images, audio, or video. These learn shared representations across modalities, enabling image captioning, visual question answering, and text-to-image generation. GPT-4V, Gemini, and Claude 3 extend transformer architectures to handle diverse inputs.

Reference: Radford, A. et al., "Learning Transferable Visual Models From Natural Language Supervision" (CLIP), ICML 2021

Vision-Language Models (VLMs)

Systems jointly processing visual and textual information through aligned representations. They typically combine a vision encoder with a language model via projection layers. CLIP pioneered this with image-text training, enabling zero-shot image classification.

Reference: Radford, A. et al., "Learning Transferable Visual Models From Natural Language Supervision" (CLIP), ICML 2021

Mixture of Experts (MoE)

Architecture replacing dense feed-forward layers with multiple specialized "expert" subnetworks, with a router activating only a few experts per token. This scales total capacity while keeping per-token compute roughly constant: Mixtral 8x7B has about 47B total parameters but uses only ~13B for each token.
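A minimal sketch of top-k routing, assuming the router has already produced one logit per expert. Choosing 2 of 8 experts mirrors the Mixtral configuration; the scalar "experts" and the function names are hypothetical stand-ins for real feed-forward networks.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, experts, router_logits, k=2):
    """Only the selected experts run; the rest cost nothing for this token."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Eight toy experts: expert i multiplies its input by i + 1.
experts = [lambda x, scale=s: scale * x for s in range(1, 9)]
router_logits = [0.0, 3.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0]
selected = top_k_route(router_logits)
output = moe_layer(1.0, experts, router_logits)
```

Only experts 1 and 4 are evaluated for this token, each with weight 0.5, even though all eight exist in the model.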

Reference: Shazeer, N. et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer", ICLR 2017

Small Language Models (SLMs)

Efficient models typically under 7B parameters optimized for resource-constrained deployment. Models like Phi-3, Gemma, and SmolLM achieve competitive performance through quality data curation and knowledge distillation. SLMs enable on-device AI, reduced latency, and lower costs.

Reference: ⚠️ No single authoritative reference — recent category term describing efficient smaller LLMs. Notable examples include Microsoft's Phi series papers.

Technical Concepts

Inference

Using a trained model to generate predictions from new inputs, as opposed to training. For LLMs, inference involves processing prompts and generating tokens autoregressively. Optimization is critical for production—techniques like KV-caching, quantization, and batching significantly improve throughput.

Reference: ⚠️ No single authoritative reference — fundamental ML/statistics concept referring to using trained models to make predictions.

Latency

Time from prompt submission to response receipt, critical for user experience. Key metrics include Time to First Token (TTFT) and Inter-Token Latency (ITL). Factors include model size, hardware, prompt length (TTFT grows with prompt length, since the whole prompt must be processed before the first token is produced), and output length.
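Both metrics fall out of simple timestamp arithmetic. The sketch below assumes you have recorded the request time and the arrival time of each streamed token; the timestamps shown are invented for illustration.

```python
def latency_metrics(request_time, token_times):
    """Return (TTFT, mean ITL) in the same units as the timestamps."""
    ttft = token_times[0] - request_time
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

# Hypothetical timestamps in seconds for a four-token response.
ttft, itl = latency_metrics(request_time=0.0, token_times=[0.50, 0.55, 0.60, 0.65])
```

In this made-up trace the user waits half a second for the first token, then sees a new token about every 50 milliseconds.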

Reference: ⚠️ No single authoritative reference — standard computing/systems concept commonly discussed in LLM serving papers.

Throughput

Work completed over time, typically expressed as tokens per second (TPS) or requests per second (RPS). There's a fundamental tradeoff: batching increases throughput but can increase individual request latency.
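The batching tradeoff can be illustrated with invented numbers. Assume each decoding step emits one token per request in the batch, and that a larger batch makes each step somewhat slower; the step times below are hypothetical, not measurements.

```python
def throughput_tokens_per_sec(batch_size, step_time_s):
    """Aggregate tokens per second across all requests in the batch."""
    return batch_size / step_time_s

# Hypothetical step times: 20 ms alone, 30 ms when sharing a batch of 8.
single = throughput_tokens_per_sec(batch_size=1, step_time_s=0.020)
batched = throughput_tokens_per_sec(batch_size=8, step_time_s=0.030)
```

Under these assumptions aggregate throughput rises from 50 to about 267 tokens per second, while each individual request waits 30 ms per token instead of 20 ms.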

Reference: ⚠️ No single authoritative reference — standard computing/systems concept commonly discussed in LLM serving papers.

Tokens (Units)

Fundamental text units LLMs process—not necessarily whole words but often subwords determined by the tokenizer. API pricing is based on token counts, context limits are in tokens, and tokenization quirks affect model behavior.

Reference: Sennrich, R., Haddow, B., & Birch, A., "Neural Machine Translation of Rare Words with Subword Units", ACL 2016

Perplexity

Evaluation metric measuring how "surprised" a model is when predicting test sequences—lower perplexity indicates better prediction. Useful for comparing models on same datasets but doesn't measure factual accuracy, reasoning, or alignment.
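Concretely, perplexity is the exponential of the average negative log-probability the model assigned to the tokens that actually occurred. A minimal sketch, with a function name chosen for this example:

```python
import math

def perplexity(token_probs):
    """token_probs: probability the model gave to each true next token."""
    avg_neg_log_likelihood = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_likelihood)

uniform_guess = perplexity([0.25, 0.25, 0.25, 0.25])
confident = perplexity([0.9, 0.8, 0.95, 0.85])
```

A model that always spreads its probability evenly over four options has a perplexity of 4, which is why perplexity is often read as the effective number of choices the model is hesitating between.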

Reference: ⚠️ No single authoritative reference — classic information theory metric from Shannon's work. Standard definition in Bishop, C., "Pattern Recognition and Machine Learning", 2006.

Temperature

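Inference-time parameter controlling output randomness by dividing logits by the temperature before softmax. As temperature approaches 0, sampling approaches deterministic selection of the most likely token; in practice a setting of 0 is usually implemented as greedy decoding. Higher values (0.7-1.2) flatten the distribution and increase output diversity. Lower temperatures suit factual tasks; higher values suit creative writing.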
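A short sketch of temperature scaling, assuming a plain list of logits. The function name and example logits are illustrative; the zero-temperature case is rejected here because dividing by zero is undefined.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
sharp = softmax_with_temperature(logits, temperature=0.5)
neutral = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)
```

The top token's probability is highest in `sharp` and lowest in `flat`, showing how temperature trades predictability for variety.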

Reference: Hinton, G., Vinyals, O., & Dean, J., "Distilling the Knowledge in a Neural Network", NIPS Deep Learning Workshop 2014 (arXiv:1503.02531, 2015)
Note: Temperature scaling borrowed from statistical mechanics (Boltzmann distribution).

Top-k Sampling

Generation strategy limiting selection to the k most probable tokens, redistributing probability among only those candidates. Setting k=50 eliminates low-probability outliers that could produce nonsensical output.
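A minimal sketch of the filtering step, assuming probabilities have already been computed. The function name and the four-token distribution are made up for illustration; sampling would then draw from the returned dictionary.

```python
def top_k_filter(probs, k):
    """Keep the k most probable token indices and renormalize."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}

probs = [0.5, 0.3, 0.15, 0.05]   # hypothetical next-token distribution
candidates = top_k_filter(probs, k=2)
```

With k=2 the two unlikely tokens are removed entirely and the remaining probability is rescaled to 0.625 and 0.375.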

Reference: Fan, A., Lewis, M., & Dauphin, Y., "Hierarchical Neural Story Generation", ACL 2018

Top-p Sampling (Nucleus Sampling)

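Dynamically selects the smallest token set whose cumulative probability reaches or exceeds threshold p, then samples from that set. Unlike fixed top-k, top-p adapts to the shape of the probability distribution, keeping few candidates when the model is confident and many when it is not, which gives more consistent quality across contexts.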
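The adaptive behavior is easy to see in a small sketch. The function name and both example distributions are assumptions for illustration.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose probabilities sum to at least p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

confident = top_p_filter([0.90, 0.05, 0.03, 0.02], p=0.8)   # peaked distribution
uncertain = top_p_filter([0.25, 0.25, 0.25, 0.25], p=0.8)   # flat distribution
```

The same threshold keeps a single candidate when the model is confident and all four when it is unsure, which a fixed k cannot do.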

Reference: Holtzman, A. et al., "The Curious Case of Neural Text Degeneration", ICLR 2020

Logits

Raw, unnormalized outputs from the final layer before softmax, representing confidence scores for each possible token. Values range from negative to positive infinity and must be converted to probabilities.

Reference: ⚠️ No single authoritative reference — standard term for raw (unnormalized) model outputs before softmax; derived from "log-odds" in statistics/logistic regression.

Softmax

Function converting logits into a probability distribution (all values 0-1, summing to 1). Temperature scaling is applied before softmax to control randomness.

Reference: Bridle, J.S., "Probabilistic Interpretation of Feedforward Classification Network Outputs", Neurocomputing (NATO ASI Series), 1990

Beam Search

Decoding strategy maintaining multiple candidate sequences in parallel, expanding and pruning at each step to keep the most promising paths. Improves quality for tasks with "correct answers" but increases memory cost in proportion to beam width.
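A toy sketch of beam search, assuming a hypothetical model interface that maps a partial sequence to next-token probabilities. The lookup table stands in for a real model and is contrived so that the locally best first token leads to a worse overall sequence.

```python
import math

def beam_search(next_token_probs, start, steps, beam_width):
    """Return the surviving (sequence, log-probability) pairs, best first."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, prob in next_token_probs(seq).items():
                candidates.append((seq + [token], score + math.log(prob)))
        # Prune: keep only the highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# Hypothetical model: probabilities depend only on the last token.
table = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.5, "y": 0.5},
    "B": {"x": 0.9, "y": 0.1},
}

def toy_model(seq):
    return table[seq[-1]]

wide = beam_search(toy_model, "<s>", steps=2, beam_width=2)
narrow = beam_search(toy_model, "<s>", steps=2, beam_width=1)
```

With a beam width of 2 the search recovers the sequence through "B" (probability 0.36), while a width of 1 commits to "A" early and ends at 0.30.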

Reference: Meister, C., Cotterell, R., & Vieira, T., "If Beam Search is the Answer, What was the Question?", EMNLP 2020

Greedy Decoding

Simplest strategy always selecting the highest-probability token. Fast and deterministic but often produces repetitive text. Suitable for structured extraction where format constraints limit outputs.
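Greedy decoding reduces to taking the argmax at every step. The sketch below assumes a hypothetical model function that returns next-token probabilities for a sequence; the lookup table and the "<eos>" marker are stand-ins chosen for this example.

```python
def greedy_decode(next_token_probs, prompt, max_new_tokens, eos="<eos>"):
    seq = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(seq)
        token = max(probs, key=probs.get)  # always take the most likely token
        seq.append(token)
        if token == eos:
            break
    return seq

# Hypothetical model: probabilities depend only on the last token.
table = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "dog": {"<eos>": 1.0},
    "sat": {"<eos>": 0.9, "the": 0.1},
}

def toy_model(seq):
    return table[seq[-1]]

result = greedy_decode(toy_model, ["the"], max_new_tokens=5)
```

Running it twice always yields the same sequence, which is the determinism the entry describes.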

Reference: ⚠️ No single authoritative reference — fundamental algorithm concept (selecting highest probability token at each step). Discussed in Sutskever et al. (2014) and subsequent seq2seq papers.


This glossary is part of a series covering AI and LLM concepts for product designers. Terms without authoritative references are noted for tracking.
