AI Glossary: Infrastructure & Deployment
Understanding infrastructure helps designers appreciate performance constraints, cost considerations, and deployment options that affect product decisions.
Hardware & Serving
GPU (Graphics Processing Unit)
Specialized processor for parallel calculations and the dominant hardware for AI. GPUs excel at running the thousands of simultaneous calculations that neural networks require. NVIDIA's A100 and H100 are widely used data-center GPUs for training and serving commercial LLMs.
Reference: NVIDIA, "CUDA Toolkit Documentation"
TPU (Tensor Processing Unit)
Google's custom AI accelerator optimized for tensor operations, achieving higher throughput per dollar for certain workloads. Powers Google's services and is available through Google Cloud. Often more energy-efficient than GPUs for those workloads, but available only within Google's ecosystem.
Inference Server
Specialized infrastructure for running models in production, optimized for throughput and latency through techniques such as continuous batching and KV-cache optimization. vLLM, TensorRT-LLM, and similar servers can improve tokens-per-dollar by up to 3x over unoptimized serving, depending on the workload.
Reference: TensorFlow Authors, "TensorFlow Serving"
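To show why KV-cache optimization matters, here is a rough sketch of the standard estimate for how much memory the cache needs per request. The model dimensions below are hypothetical, chosen only to make the arithmetic easy to follow; real servers add their own overheads and optimizations.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Estimate KV-cache size: one key and one value vector per layer, head, and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = key + value
    return per_token * seq_len * batch_size


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 16-bit values.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
print(f"{size / 2**30:.1f} GiB for one 8192-token request")
```

Because the cache grows linearly with both context length and the number of concurrent requests, how a server packs it into GPU memory largely determines how many users one GPU can handle.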
Model Serving
Making models available for real-time predictions—load balancing, request routing, autoscaling, health monitoring. Efficient serving is crucial since inference costs typically dominate total production costs.
Reference: MLflow Authors, "MLflow Model Serving"
API Endpoint
Programmatic interface allowing applications to send requests and receive responses. Abstracts infrastructure complexity, enabling AI integration via HTTP requests. Handles authentication, rate limiting, and usage tracking.
Reference: Microsoft, "Web API Design Best Practices - Azure Architecture Center"
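As an illustration of what "AI integration via HTTP requests" looks like in practice, the sketch below builds (but does not send) a request using only the Python standard library. The URL, the JSON field names, and the bearer-token scheme are assumptions for illustration; each provider defines its own.

```python
import json
import urllib.request

API_URL = "https://api.example.com/v1/generate"  # hypothetical endpoint


def build_request(prompt: str, api_key: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request carrying the prompt as JSON and the key as a bearer token."""
    payload = {"prompt": prompt, "max_tokens": max_tokens}  # hypothetical field names
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sending it would look like this (not run here, since the URL is a placeholder):
# with urllib.request.urlopen(build_request("Hello", "my-key"), timeout=30) as resp:
#     print(json.load(resp))
```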
Rate Limiting
Usage controls that prevent abuse and ensure fair allocation, expressed as limits on requests per minute, tokens per minute, or tokens per day. Requests over a limit are rejected, typically with an HTTP 429 error, so applications need retries or queuing to avoid a degraded experience.
Reference: Amazon Web Services, "Throttle requests to your REST APIs for better throughput in API Gateway"
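One common way to enforce such limits is a token bucket, sketched below. This is an illustrative implementation, not how any particular provider does it; the class name and parameters are made up for this example.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` units, refilled at `rate` units per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable so the bucket can be tested without sleeping
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` units if available, else False (caller should reject)."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# 60 requests per minute with bursts of up to 10.
bucket = TokenBucket(rate=1.0, capacity=10)
print(bucket.allow())
```

The same structure covers tokens-per-minute limits: pass the request's token count as `cost` instead of 1.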
Token Limits / Context Length Limits
Maximum number of tokens a model can process in one request, typically counting both the prompt and the generated output. Determines how much information a model can "remember" at once: GPT-4 Turbo supports 128K tokens; Gemini 1.5 Pro handles up to 2 million tokens. Affects application design: longer contexts increase costs and latency.
Reference: Wilsdon, T., "LLM Context Limits Repository", 2024
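A typical design consequence is that chat history has to be trimmed to fit the limit. The sketch below keeps the most recent messages that fit a budget. The four-characters-per-token estimate is a rough assumption for English text; a real application would count tokens with the tokenizer of the model it uses.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def fit_to_budget(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the newest messages whose estimated total stays within max_tokens."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):      # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > max_tokens:
            break                           # everything older is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))             # restore chronological order


history = ["a" * 40, "b" * 40, "c" * 40]    # about 10 tokens each under the heuristic
print(len(fit_to_budget(history, max_tokens=25)))
```

Dropping the oldest messages is only one policy; summarizing older turns is a common alternative when early context matters.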
Edge Deployment
Running models on local devices rather than the cloud—reduced latency, enhanced privacy, offline capability. Challenges include limited memory and compute. Enabling techniques: quantization, pruning, optimized frameworks like llama.cpp.
Reference: Google, "LiteRT (formerly TensorFlow Lite) Overview"
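The effect of quantization on memory can be estimated with simple arithmetic, sketched below for a hypothetical 3-billion-parameter model. The figures cover weights only; activations, the KV cache, and any quantization metadata add to the total.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical 3B-parameter model at different precisions.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(3e9, bits):.1f} GB")
```

Going from 16-bit to 4-bit weights cuts the footprint to a quarter, which is often the difference between fitting on a phone or laptop and not fitting at all.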
On-Device AI
Processing entirely on user devices like smartphones. Meta's Llama 3.2 (1B/3B) and Google's Gemini Nano are designed for this. Benefits: instant responses, complete privacy, offline functionality. Trade-offs: reduced capability, battery consumption.
Reference: Apple, "Core ML Overview"
Deployment Practices
Model Versioning
Tracking different model iterations throughout their lifecycle—hyperparameters, training data, metrics, artifacts. Enables reproducibility and rollback. Tools: MLflow, Weights & Biases, cloud model registries.
Reference: MLflow Authors, "MLflow Model Registry"
A/B Testing Models
Comparing model variants in production by routing user segments to each version. Validates that new models actually improve business metrics before full deployment.
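Routing user segments is often done by hashing a user ID so that each user consistently sees the same variant. The sketch below shows one way to do this; the function name, variant labels, and experiment key are illustrative assumptions.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'treatment' or 'control' for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # first 32 bits scaled into [0, 1)
    return "treatment" if bucket < treatment_share else "control"


print(assign_variant("user-123", "new-model-rollout"))
```

Including the experiment name in the hash means a user who lands in the treatment group of one experiment is not automatically in the treatment group of the next.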
Canary Deployment
Gradually rolling out new versions to small traffic percentages before full deployment. If problems emerge, rollback is quick with minimal impact.
Reference: Microsoft, "Use a canary deployment strategy for Kubernetes - Azure Pipelines"
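The decision logic behind a canary rollout can be sketched as a small function: widen traffic step by step while the new version stays healthy, and drop to zero if it does not. The step sizes and the error-rate tolerance below are assumptions for illustration; real rollouts usually watch several metrics, not just errors.

```python
def next_canary_percent(current_percent: int,
                        canary_error_rate: float,
                        baseline_error_rate: float,
                        steps: tuple[int, ...] = (1, 5, 25, 50, 100),
                        tolerance: float = 0.01) -> int:
    """Return the next traffic percentage for the new version, or 0 to roll back."""
    if canary_error_rate > baseline_error_rate + tolerance:
        return 0                      # new version is measurably worse: roll back
    for step in steps:
        if step > current_percent:
            return step               # healthy: advance to the next stage
    return 100                        # already fully rolled out


print(next_canary_percent(5, canary_error_rate=0.02, baseline_error_rate=0.02))
```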
Model Monitoring
Continuously observing models in production: operational metrics (latency, throughput, errors), quality metrics (such as accuracy), and input data characteristics. Closes the feedback loop between deployment and retraining.
Reference: Google Cloud, "MLOps: Continuous delivery and automation pipelines in machine learning"
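The operational metrics above are usually reported as percentiles and rates instead of averages, since a few slow requests can hide behind a healthy mean. A minimal sketch, assuming latencies are collected in milliseconds and using the nearest-rank percentile definition:

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms: list[float], error_count: int) -> dict[str, float]:
    """Summarize one monitoring window into median latency, tail latency, and error rate."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "error_rate": error_count / len(latencies_ms),
    }


print(summarize([120, 135, 150, 160, 900], error_count=1))
```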
Drift Detection
Identifying when input data statistics or input-output relationships change post-deployment, causing performance degradation. When detected, models may need retraining.
Reference: Lu, J. et al., "Learning under Concept Drift: A Review", IEEE TKDE, 2018
Practical: Evidently AI, "What is concept drift in ML"
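One widely used check for drift in input data statistics is the Population Stability Index (PSI), which compares how a feature is distributed in a reference window and a current window. The sketch below is a simplified pure-Python version with equal-width bins. It detects shifts in the inputs only; changes in input-output relationships need labeled outcomes to detect.

```python
import math


def psi(reference: list[float], current: list[float], bins: int = 10) -> float:
    """Population Stability Index between two samples, binned on the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0           # fall back to 1.0 if all reference values are equal

    def shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1   # clamp out-of-range values to edge bins
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


baseline = [i / 100 for i in range(1000)]
shifted = [x + 5 for x in baseline]
print(round(psi(baseline, baseline), 3), psi(baseline, shifted) > 0.25)
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but the right threshold depends on the feature and the cost of a false alarm.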
Prompt Management / Versioning
Systematically managing and optimizing production prompts. Small prompt changes can cause unexpected behavior changes, so prompts need version control, A/B testing, and tracking of which versions perform best.
Reference: LangChain/LangSmith, "Prompt engineering concepts"
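A minimal sketch of prompt version control, assuming an in-memory registry that identifies each version by a hash of its text. The class and method names are hypothetical; dedicated tools add storage, evaluation results, and access control on top of the same idea.

```python
import hashlib


class PromptRegistry:
    """Keep an ordered history of prompt versions per name, identified by content hash."""

    def __init__(self) -> None:
        self._history: dict[str, list[tuple[str, str]]] = {}   # name -> [(version_id, text)]

    def register(self, name: str, text: str) -> str:
        """Store a new version unless the text is unchanged; return its version id."""
        version_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        history = self._history.setdefault(name, [])
        if not history or history[-1][0] != version_id:
            history.append((version_id, text))
        return version_id

    def latest(self, name: str) -> tuple[str, str]:
        """Return (version_id, text) of the version currently in use."""
        return self._history[name][-1]

    def rollback(self, name: str) -> tuple[str, str]:
        """Discard the newest version (if an older one exists) and return the active one."""
        history = self._history[name]
        if len(history) > 1:
            history.pop()
        return history[-1]


registry = PromptRegistry()
v1 = registry.register("summarize", "Summarize the text: {text}")
v2 = registry.register("summarize", "Summarize the text in two sentences: {text}")
print(v1 != v2, registry.latest("summarize")[0] == v2)
```

Logging the version id alongside each model response makes it possible to tell later which prompt produced which behavior.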
This glossary is part of a series covering AI and LLM concepts for product designers.