AI Glossary: Infrastructure & Deployment

Understanding infrastructure helps designers appreciate performance constraints, cost considerations, and deployment options that affect product decisions.

Hardware & Serving

GPU (Graphics Processing Unit)

Specialized processor for parallel calculations and the dominant hardware for AI. GPUs excel at running thousands of calculations simultaneously, which neural networks depend on. NVIDIA's A100 and H100 are widely used data-center GPUs for training and serving commercial LLMs.

Reference: NVIDIA, "CUDA Toolkit Documentation"

TPU (Tensor Processing Unit)

Google's custom AI accelerator, optimized for tensor operations, which can achieve higher throughput per dollar than GPUs for certain workloads. Powers Google's own services and is available to customers through Google Cloud. Often more energy-efficient for those workloads, but available only within Google's ecosystem.

Reference: Jouppi, N. et al. (Google), "In-Datacenter Performance Analysis of a Tensor Processing Unit", ISCA 2017

Inference Server

Specialized infrastructure for running models in production, optimized for throughput and latency through continuous batching and KV-cache optimization. vLLM, TensorRT-LLM, and similar servers can improve tokens-per-dollar by up to 3x.
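
A rough back-of-the-envelope estimate shows why KV-cache optimization matters. The sketch below assumes a standard transformer that caches one key vector and one value vector per layer, per attention head, per token. The model dimensions (32 layers, 32 KV heads, head size 128, 16-bit values) are illustrative assumptions, not figures taken from any particular server or vendor.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Estimate KV-cache memory in bytes.

    The leading factor 2 accounts for storing both keys and values
    for every layer, head, and token.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Illustrative (assumed) model shape: 32 layers, 32 KV heads, head_dim 128, fp16.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
per_request = kv_cache_bytes(32, 32, 128, seq_len=4096)

print(per_token / 2**20, "MiB per token")
print(per_request / 2**30, "GiB per 4096-token request")
```

Under these assumptions, every concurrent 4,096-token request ties up about 2 GiB of accelerator memory, which is why inference servers work hard to pack and reuse this cache.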

Reference: TensorFlow Authors, "TensorFlow Serving"

Model Serving

Making models available for real-time predictions—load balancing, request routing, autoscaling, health monitoring. Efficient serving is crucial since inference costs typically dominate total production costs.
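
As a minimal sketch of the arithmetic behind autoscaling, the function below estimates how many replicas a service needs. It assumes Little's law (requests in flight = arrival rate x average latency) and a fixed number of concurrent requests per replica. The function name, parameters, and the 70% utilization target are hypothetical choices for illustration, not the behavior of any specific autoscaler.

```python
import math


def replicas_needed(requests_per_sec: float, avg_latency_sec: float,
                    concurrency_per_replica: int,
                    target_utilization: float = 0.7,
                    min_replicas: int = 1) -> int:
    """Estimate the replica count needed to serve a steady request rate."""
    # Little's law: average number of requests in flight at any moment.
    in_flight = requests_per_sec * avg_latency_sec
    # Leave headroom so replicas are not run at 100% of their capacity.
    usable_capacity = concurrency_per_replica * target_utilization
    return max(min_replicas, math.ceil(in_flight / usable_capacity))


# 40 requests/s at 2 s each, 8 concurrent requests per replica, 50% target.
print(replicas_needed(40, 2.0, 8, target_utilization=0.5))
```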

Reference: MLflow Authors, "MLflow Model Serving"

API Endpoint

Programmatic interface allowing applications to send requests and receive responses. Abstracts infrastructure complexity, enabling AI integration via HTTP requests. Handles authentication, rate limiting, and usage tracking.
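
To make this concrete, here is a sketch of what a request to a text-generation endpoint might look like using only Python's standard library. The URL, the payload field names, and the API key are all hypothetical placeholders; each real provider documents its own. The code only builds the request and does not send it.

```python
import json
import urllib.request


def build_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to a hypothetical endpoint."""
    payload = {"prompt": prompt, "max_tokens": 100}  # assumed field names
    return urllib.request.Request(
        url="https://api.example.com/v1/generate",  # placeholder URL
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # authentication
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request("sk-example", "Summarize this support ticket.")
# Sending it would look like: urllib.request.urlopen(request, timeout=30)
```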

Reference: Microsoft, "Web API Design Best Practices - Azure Architecture Center"

Rate Limiting

Usage controls that prevent abuse and ensure fair allocation, typically expressed as limits on requests per minute, tokens per minute, or tokens per day. Understanding these limits is essential: requests beyond them are rejected (commonly with an HTTP 429 "Too Many Requests" error), which degrades the user experience.
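
One common way to enforce such limits is a token bucket, sketched below. This is an illustration of the general technique, not the algorithm of any particular provider. The class name and parameters are hypothetical, and time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical limiter: up to `capacity` requests in a burst,
    refilled at `refill_per_sec` requests per second."""
    capacity: float
    refill_per_sec: float
    tokens: float = 0.0
    last_refill: float = 0.0

    def __post_init__(self) -> None:
        # Start with a full bucket so an initial burst is allowed.
        self.tokens = self.capacity

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Return True if a request costing `cost` may proceed at time `now` (seconds)."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller would typically respond with HTTP 429


# 60 requests per minute: a bucket of 60, refilled at one request per second.
bucket = TokenBucket(capacity=60, refill_per_sec=1.0)
burst = [bucket.allow(now=0.0) for _ in range(61)]
print(burst.count(True), "allowed,", burst.count(False), "rejected")
```

The same structure works for tokens-per-minute limits by passing the request's token count as `cost`.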

Reference: Amazon Web Services, "Throttle requests to your REST APIs for better throughput in API Gateway"

Token Limits / Context Length Limits

Maximum number of tokens a model can process in a single request. Determines how much information a model can "remember" within one request: GPT-4 Turbo supports 128K tokens; Gemini 1.5 Pro handles up to 2 million tokens. Affects application design, since longer contexts increase cost and latency.
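
The sketch below shows one simple way an application might stay within a context limit: keep the most recent messages that fit and drop the oldest. It assumes the limit covers input and output together, and it estimates tokens at roughly 4 characters each, a crude heuristic for English text. A real application would use the model's own tokenizer. All function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough assumption: about 4 characters per token, rounded up."""
    return max(1, (len(text) + 3) // 4)


def fit_to_context(messages: list[str], context_limit: int,
                   reserved_for_output: int) -> list[str]:
    """Keep the most recent messages whose estimated tokens fit the input budget."""
    budget = context_limit - reserved_for_output
    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 tokens each
print(len(fit_to_context(history, context_limit=30, reserved_for_output=8)))
```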

Reference: Wilsdon, T., "LLM Context Limits Repository", 2024

Edge Deployment

Running models on local devices rather than the cloud—reduced latency, enhanced privacy, offline capability. Challenges include limited memory and compute. Enabling techniques: quantization, pruning, optimized frameworks like llama.cpp.
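
To illustrate what quantization does, the sketch below maps floating-point weights to 8-bit integers using a single scale factor (symmetric per-tensor quantization). This is a deliberately simplified assumption; production frameworks use more elaborate schemes, such as per-block scales. The function names are hypothetical.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
print(quantized)
```

Each weight now needs 1 byte instead of the 4 bytes of a 32-bit float, at the price of a small rounding error of at most half the scale per weight.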

Reference: Google, "LiteRT (formerly TensorFlow Lite) Overview"

On-Device AI

Processing entirely on user devices like smartphones. Meta's Llama 3.2 (1B/3B) and Google's Gemini Nano are designed for this. Benefits: low-latency responses, stronger privacy because data stays on the device, offline functionality. Trade-offs: reduced capability, battery consumption.

Reference: Apple, "Core ML Overview"

Deployment Practices

Model Versioning

Tracking different model iterations throughout their lifecycle—hyperparameters, training data, metrics, artifacts. Enables reproducibility and rollback. Tools: MLflow, Weights & Biases, cloud model registries.

Reference: MLflow Authors, "MLflow Model Registry"

A/B Testing Models

Comparing model variants in production by routing user segments to each version. Validates that new models actually improve business metrics before full deployment.
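
Routing user segments is often done by hashing a user ID so that each user consistently sees the same variant. The sketch below shows that idea under stated assumptions: the variant names, the experiment label, and the function itself are hypothetical, and a real system would also log assignments alongside the business metric being compared.

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to 'model_a' (control) or 'model_b'."""
    # Including the experiment name gives each experiment an independent split.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "model_b" if bucket < treatment_share else "model_a"


# The same user always lands in the same group for a given experiment.
print(assign_variant("user-42", "ranker-v2") == assign_variant("user-42", "ranker-v2"))
```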

Reference: Amazon Web Services, "Dynamic A/B testing for machine learning models with Amazon SageMaker MLOps projects", 2022

Canary Deployment

Gradually rolling out new versions to small traffic percentages before full deployment. If problems emerge, rollback is quick with minimal impact.
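
The rollout logic can be sketched as a small decision function: hold while evidence is thin, roll back if the canary's error rate is clearly worse than the baseline, otherwise move to the next traffic stage. The stage percentages, the tolerance, and the minimum request count are illustrative assumptions, not recommended values.

```python
STAGES = (1, 5, 25, 50, 100)  # hypothetical traffic percentages


def next_canary_percent(current: int, canary_errors: int, canary_requests: int,
                        baseline_error_rate: float, tolerance: float = 0.01,
                        min_requests: int = 500) -> int:
    """Return the traffic percentage the canary should receive next.

    0 means roll back; an unchanged value means hold at the current stage.
    """
    if canary_requests < min_requests:
        return current  # not enough evidence yet
    canary_error_rate = canary_errors / canary_requests
    if canary_error_rate > baseline_error_rate + tolerance:
        return 0  # roll back: canary is clearly worse than baseline
    later_stages = [stage for stage in STAGES if stage > current]
    return later_stages[0] if later_stages else current


print(next_canary_percent(5, canary_errors=10, canary_requests=1000,
                          baseline_error_rate=0.02))
```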

Reference: Microsoft, "Use a canary deployment strategy for Kubernetes - Azure Pipelines"

Model Monitoring

Continuous production observation—operational metrics (latency, throughput, errors), quality metrics (accuracy), data characteristics. Closes the feedback loop between deployment and retraining.
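
As a small illustration of the operational side, the sketch below summarizes a batch of request records into median latency, 95th-percentile latency, and error rate. The record fields (`latency_ms`, `status`) and the choice to count HTTP 5xx responses as errors are assumptions made for the example.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests: list[dict]) -> dict:
    """Compute latency percentiles and error rate for a batch of request records."""
    latencies = [r["latency_ms"] for r in requests]
    errors = sum(1 for r in requests if r["status"] >= 500)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "error_rate": errors / len(requests),
    }


# Synthetic records: latencies of 1..100 ms, with the five slowest failing.
records = [{"latency_ms": i, "status": 500 if i > 95 else 200} for i in range(1, 101)]
summary = summarize(records)
print(summary)
```

Percentiles are usually more informative than averages here, because a small share of very slow requests can dominate how the product feels.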

Reference: Google Cloud, "MLOps: Continuous delivery and automation pipelines in machine learning"

Drift Detection

Identifying when the statistics of input data (data drift) or the relationship between inputs and outputs (concept drift) change after deployment, causing performance degradation. When drift is detected, models may need retraining.
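
One widely used check for shifts in a single numeric input is the Population Stability Index (PSI), sketched below in plain Python. It compares how a reference sample (for example, training data) and a current sample fall into the same bins. The bin count and the small floor that avoids taking the logarithm of zero are assumptions of this sketch. A common rule of thumb treats values above roughly 0.25 as significant drift, though thresholds vary by team.

```python
import math


def psi(reference: list[float], current: list[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of one numeric feature."""
    low, high = min(reference), max(reference)
    width = (high - low) / bins
    if width == 0:
        width = 1.0  # all reference values identical; avoid dividing by zero

    def bin_shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - low) / width)
            # Values outside the reference range fall into the edge bins.
            counts[min(max(index, 0), bins - 1)] += 1
        # Floor each share so empty bins do not produce log(0).
        return [max(count / len(values), 1e-6) for count in counts]

    ref_shares = bin_shares(reference)
    cur_shares = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


reference = [i / 100 for i in range(1000)]  # values spread over 0..10
shifted = [value + 5 for value in reference]  # same shape, shifted upward
print(psi(reference, reference), psi(reference, shifted) > 0.25)
```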

Reference: Lu, J. et al., "Learning under Concept Drift: A Review", IEEE TKDE, 2018
Practical: Evidently AI, "What is concept drift in ML"

Prompt Management / Versioning

Systematically managing and optimizing production prompts. Small prompt changes can cause unexpected changes in behavior, so prompts need version control, A/B testing, and tracking of which versions perform best.
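
A minimal sketch of prompt versioning follows: each template is stored under a name and identified by a short hash of its content, so any wording change produces a new version ID that can be logged with outputs and compared later. The class and method names are hypothetical and do not reflect the API of LangSmith or any other tool.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    """Hypothetical in-memory store of prompt versions, keyed by prompt name."""
    versions: dict[str, list[tuple[str, str]]] = field(default_factory=dict)

    def register(self, name: str, template: str) -> str:
        """Store a template and return its content-derived version ID."""
        version_id = hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]
        history = self.versions.setdefault(name, [])
        # Re-registering the current template does not create a new version.
        if not history or history[-1][0] != version_id:
            history.append((version_id, template))
        return version_id

    def latest(self, name: str) -> tuple[str, str]:
        """Return (version_id, template) for the newest version of a prompt."""
        return self.versions[name][-1]


registry = PromptRegistry()
v1 = registry.register("summarize", "Summarize: {text}")
v2 = registry.register("summarize", "Summarize briefly: {text}")
print(v1 != v2, len(registry.versions["summarize"]))
```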

Reference: LangChain/LangSmith, "Prompt engineering concepts"


This glossary is part of a series covering AI and LLM concepts for product designers.
