What Happens When You Deploy an AI Model? The Infrastructure Explained
When you deploy an AI model to production, you are not just uploading a file to a server. You are building and operating a multi-layered infrastructure system that packages the model, serves it through an API, distributes traffic across GPU servers, scales capacity with demand, monitors performance and accuracy, and manages costs that can reach six figures per month.
The model is the product. The infrastructure is everything that makes the product work at scale. And the infrastructure is where most of the engineering effort and most of the hiring actually goes.
This article walks through the complete journey from trained model to production service. Each stage is a real engineering discipline with real career demand. If you are interested in how AI actually works in practice, not just in research papers, this is how.
Stage 1: Training - where the model comes from
Before deployment, the model must be trained. Training is the computationally expensive process of teaching a model to recognise patterns in data.
What training looks like in infrastructure terms
Training a large AI model requires a cluster of GPU servers working together. For a frontier language model, this means:
| Resource | Typical Scale |
|---|---|
| GPUs | Hundreds to thousands of NVIDIA H100s or A100s |
| GPU memory | 80GB per GPU, hundreds of terabytes across the cluster |
| Networking | InfiniBand (400Gbps) for GPU-to-GPU communication |
| Storage | Petabytes of training data on fast NVMe storage |
| Duration | Days to months, depending on model size |
| Cost | $1 million to $100 million+ for frontier models |
Training runs are managed as batch jobs. Infrastructure engineers build the systems that:
- Provision GPU clusters on demand using cloud providers or dedicated hardware
- Schedule training jobs across available GPU resources
- Handle failures: when a GPU dies mid-training (and they do), the system checkpoints progress and resumes automatically
- Track experiments using tools like MLflow or Weights & Biases so researchers can compare different training approaches
- Manage data pipelines that feed cleaned, preprocessed data to the training cluster
The training cluster is temporary infrastructure: it runs for the duration of the training job and then scales down. But building and managing it requires deep cloud and Kubernetes expertise.
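The checkpoint-and-resume pattern from the failure-handling bullet above can be sketched in a few lines. This is a minimal illustration, not production code: the file layout, the `CHECKPOINT_EVERY` interval, and the simulated "GPU failure" are all invented for the example, and real systems write multi-gigabyte optimiser state to object storage rather than JSON.

```python
import json
import os
import tempfile

CHECKPOINT_EVERY = 100  # steps between checkpoints (illustrative value)

def save_checkpoint(path, step, state):
    # Write atomically: a crash mid-write must not corrupt the last good checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    if os.path.exists(path):
        with open(path) as f:
            ckpt = json.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, {"loss": None}  # no checkpoint: start a fresh run

def train(path, total_steps, fail_at=None):
    step, state = load_checkpoint(path)  # resume if a checkpoint exists
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("GPU failure")  # simulated mid-run crash
        state = {"loss": 1.0 / (step + 1)}    # stand-in for a real training step
        step += 1
        if step % CHECKPOINT_EVERY == 0:
            save_checkpoint(path, step, state)
    save_checkpoint(path, step, state)
    return step

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(path, total_steps=500, fail_at=250)  # dies at step 250...
except RuntimeError:
    pass
final_step = train(path, total_steps=500)      # ...resumes from step 200
print(final_step)  # 500
```

The key design choice is that a failure only loses the work since the last checkpoint (here at most 100 steps), which is what makes multi-week training runs on failure-prone hardware economically viable.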
Stage 2: Model packaging - making it deployable
A trained model is a collection of numerical weights: parameters that define the model's behaviour. These weights need to be packaged into a format that a serving system can load and use.
The packaging process
1. Export the model from the training framework (PyTorch, TensorFlow, JAX) into a portable format. Common formats include ONNX (Open Neural Network Exchange), TorchScript, or the framework's native format.
2. Optimise the model for inference. Training and inference have different computational profiles. Optimisation techniques include:
   - Quantisation: reducing the precision of model weights from 32-bit to 16-bit or 8-bit. This halves or quarters the memory needed and speeds up inference, with minimal accuracy loss.
   - Pruning: removing unnecessary connections in the neural network to reduce computation.
   - Distillation: training a smaller model to mimic the larger one, producing a faster version.
3. Containerise the model using Docker. The model weights, the serving code, and all dependencies are packaged into a Docker container image. This ensures the model runs identically in development, staging, and production. A typical model container includes:
   - The model weights file (can be gigabytes to hundreds of gigabytes)
   - A model serving framework (vLLM, Triton Inference Server, TorchServe)
   - Python dependencies
   - Configuration for GPU resource allocation
   - Health check endpoints
4. Push the container to a container registry (Amazon ECR, Google Artifact Registry, Docker Hub) where it can be pulled by production servers.
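The quantisation technique described above is easy to demonstrate without any ML framework. This sketch implements symmetric int8 quantisation on a list of floats standing in for model weights; real pipelines use framework tooling (PyTorch, TensorRT, and similar) and per-channel scales, but the core idea is exactly this mapping.

```python
import random

def quantise_int8(weights):
    """Symmetric int8 quantisation: map floats in [-max, max] to ints in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]  # each value now fits in 1 byte, not 4
    return q, scale

def dequantise(q, scale):
    return [v * scale for v in q]

random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(1000)]  # stand-in for model weights

q, scale = quantise_int8(weights)
restored = dequantise(q, scale)

max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"max absolute error: {max_err:.4f}")  # at most half a quantisation step
print(f"memory: {4 * len(weights)} bytes -> {len(weights)} bytes")  # 4x smaller
```

The worst-case rounding error is half a quantisation step (scale / 2), which is why dropping from 32-bit to 8-bit typically costs so little accuracy while cutting memory and bandwidth by 4x.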
Infrastructure skills used: Docker, container registries, model optimisation pipelines, CI/CD for model builds.
Stage 3: API serving layer - making the model accessible
A model in a container is not yet useful. It needs an API: a way for other applications to send it requests and receive predictions.
Model serving frameworks
The serving layer sits between incoming requests and the model. Popular frameworks include:
| Framework | Strengths | Common Use Cases |
|---|---|---|
| vLLM | Optimised for large language models, high throughput | Chat APIs, text generation |
| Triton Inference Server | Multi-framework support, GPU-optimised batching | Enterprise AI, mixed model types |
| TorchServe | PyTorch-native, easy to set up | PyTorch models, research-to-production |
| TensorFlow Serving | TensorFlow-native, production-hardened | TensorFlow models, Google ecosystem |
| Ray Serve | Flexible, supports complex inference pipelines | Multi-model pipelines, custom logic |
What the serving layer handles
The serving framework does more than just run the model. It manages:
- Request batching: grouping multiple incoming requests together so the GPU processes them simultaneously. A GPU processing one request at a time wastes compute. Batching increases throughput dramatically.
- Tokenisation: converting text input into the numerical tokens the model understands, and converting output tokens back to text.
- KV caching: storing intermediate computation from a conversation so the model does not recompute earlier tokens.
- Memory management: loading model weights into GPU memory, managing allocation across multiple concurrent requests.
- Health monitoring: reporting whether the model is ready to serve, and how much capacity it has.
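The request batching behaviour described in the first bullet can be sketched as a simple queue-draining loop. This is a toy model of the idea (the batch size cap is an invented number, and a real server such as vLLM or Triton also flushes partial batches on a timer, which is not simulated here):

```python
from collections import deque

MAX_BATCH = 8  # assumed GPU batch-size cap for this sketch

def form_batches(queue, max_batch=MAX_BATCH):
    """Drain a request queue into GPU-sized batches."""
    batches = []
    while queue:
        batch = [queue.popleft() for _ in range(min(max_batch, len(queue)))]
        batches.append(batch)
    return batches

requests = deque(f"req-{i}" for i in range(19))
batches = form_batches(requests)
print([len(b) for b in batches])  # [8, 8, 3]: three GPU passes instead of 19
```

Nineteen requests become three GPU passes instead of nineteen, which is where the dramatic throughput gain comes from: the fixed cost of a forward pass is amortised across the whole batch.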
The API endpoint
The serving framework exposes an HTTP or gRPC endpoint. Applications send requests to this endpoint and receive predictions:
    POST /v1/predict
    {
      "prompt": "Explain quantum computing",
      "max_tokens": 500
    }
The API returns the model's response, typically with metadata about inference time, tokens used, and model version. This API is what every application (chatbot, search engine, content generator) calls to use the AI model.
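A minimal end-to-end version of such an endpoint fits in one file using only the standard library. Everything model-specific here is faked (the canned completion, the "demo-0.1" version tag, and the `/v1/predict` route are assumptions for illustration; production systems use a serving framework behind a proper web server), but the request/response shape mirrors the example above.

```python
import json
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/predict":
            self.send_error(404)
            return
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        start = time.perf_counter()
        # Stand-in for real inference: echo a canned completion.
        completion = f"(model output for: {body['prompt'][:40]})"
        resp = json.dumps({
            "completion": completion,
            "model_version": "demo-0.1",  # hypothetical version tag
            "inference_ms": (time.perf_counter() - start) * 1000,
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(resp)))
        self.end_headers()
        self.wfile.write(resp)

    def log_message(self, *args):  # silence per-request logging
        pass

# Serve on an OS-assigned port in a background thread, then call it as a client.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/v1/predict",
    data=json.dumps({"prompt": "Explain quantum computing", "max_tokens": 500}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as r:
    result = json.loads(r.read())
print(result["completion"])
server.shutdown()
```

Note that the response carries metadata (`model_version`, `inference_ms`) alongside the completion; clients and monitoring systems depend on that metadata as much as on the prediction itself.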
Infrastructure skills used: API design, container deployment, GPU resource configuration, Kubernetes service configuration.
Stage 4: Load balancing - distributing traffic
A single model server cannot handle production traffic. Multiple replicas of the model run across multiple GPU servers, and a load balancer distributes incoming requests across them.
Why AI load balancing is harder than web load balancing
Traditional web load balancing uses simple algorithms like round-robin: send each request to the next server in line. AI inference load balancing is more complex because:
- Requests are not equal. A short prompt takes 200 milliseconds. A long conversation with a 4,000-token response takes 30 seconds. Round-robin would overload servers with long requests.
- GPU memory is finite. Each server can only hold one large model (or a few smaller ones) in GPU memory. The load balancer must track which servers have the model loaded.
- Warm-up matters. Loading a large model into GPU memory takes 30-120 seconds. Cold starts are expensive. The load balancer should prefer servers that already have the model loaded.
AI load balancers typically use algorithms based on:
- Queue depth: route to the server with the fewest pending requests
- GPU utilisation: route to the server with the most available compute
- Request characteristics: estimate inference time based on prompt length and route accordingly
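The queue-depth strategy from the first bullet is the simplest to express in code. A minimal sketch, with invented server names and pending-request counts:

```python
def route(servers):
    """Least-queue-depth routing: pick the replica with the fewest pending requests."""
    return min(servers, key=lambda name: servers[name])

# Pending-request counts per GPU replica (illustrative numbers).
servers = {"gpu-node-a": 4, "gpu-node-b": 1, "gpu-node-c": 7}

chosen = route(servers)
servers[chosen] += 1  # the routed request joins that server's queue
print(chosen)  # gpu-node-b
```

Unlike round-robin, this naturally compensates for unequal requests: a server stuck on a 30-second generation accumulates queue depth and stops receiving new traffic until it drains.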
Infrastructure skills used: Load balancer configuration (NGINX, HAProxy, cloud-native), health checks, traffic routing policies.
Stage 5: Auto-scaling - matching capacity to demand
AI products have variable demand. A chatbot might receive 100 requests per minute at 3 AM and 10,000 requests per minute during business hours. The infrastructure must scale accordingly.
How auto-scaling works for AI
- Monitoring: metrics trigger scaling decisions. The auto-scaler watches GPU utilisation, request queue depth, and inference latency.
- Scale-up: When metrics exceed thresholds (for example, average GPU utilisation above 80% for 5 minutes), the auto-scaler provisions new GPU nodes.
- Model loading: New nodes pull the model container image and load model weights into GPU memory. This takes 1-5 minutes depending on model size.
- Traffic routing: The load balancer detects new healthy nodes and begins routing traffic to them.
- Scale-down: When demand drops, the auto-scaler removes nodes to reduce costs. This must be done carefully: removing a node that is mid-inference kills active requests.
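The scale-up and scale-down decisions above reduce to one proportional rule. This sketch uses the same shape as the Kubernetes Horizontal Pod Autoscaler formula (desired = ceil(current x observed / target), clamped to a replica range); the target utilisation and replica bounds are illustrative defaults, not recommendations.

```python
import math

def desired_replicas(current, gpu_util, target_util=0.6,
                     min_replicas=2, max_replicas=20):
    """Proportional scaling: grow when observed utilisation exceeds the target,
    shrink when it falls below, never leaving the [min, max] range."""
    desired = math.ceil(current * gpu_util / target_util)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(current=5, gpu_util=0.9))  # 8: scale up under load
print(desired_replicas(current=5, gpu_util=0.3))  # 3: scale down when idle
```

The clamp matters as much as the formula: the minimum keeps warm capacity so cold starts (that 1-5 minute model load) never hit users, and the maximum caps runaway GPU spend during a traffic anomaly.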
The cost calculus
Auto-scaling for AI is a direct trade-off between cost and user experience:
| Strategy | Cost | Latency | Risk |
|---|---|---|---|
| Always run maximum capacity | Very high | Very low | No scaling delays |
| Scale reactively on demand | Medium | Medium (scale-up delay) | Latency spikes during scaling |
| Predictive scaling (time-based) | Medium-low | Low | Requires accurate demand forecasting |
| Minimum capacity + aggressive scaling | Low | Variable | Possible degradation during traffic spikes |
Most production systems use a combination: a baseline capacity that handles normal traffic, predictive scaling for known patterns (weekday mornings, product launches), and reactive scaling for unexpected spikes.
Infrastructure skills used: Kubernetes Horizontal Pod Autoscaler, cluster auto-scaling, cloud capacity management, cost optimisation.
Stage 6: Monitoring - knowing what is happening
Deploying a model is not the end. In many ways, it is the beginning. Production models require continuous monitoring across multiple dimensions.
Performance monitoring
| Metric | What It Tells You | Alert Threshold (typical) |
|---|---|---|
| Inference latency (p50) | Typical (median) user experience | > 2 seconds |
| Inference latency (p99) | Worst-case user experience | > 10 seconds |
| Throughput (requests/second) | System capacity | Approaching max capacity |
| GPU utilisation | Resource efficiency | < 30% (wasting money) or > 90% (at risk) |
| GPU memory usage | Headroom for traffic spikes | > 85% of available |
| Error rate | System reliability | > 0.1% of requests |
| Token generation speed | Model output performance | < 20 tokens/second |
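The table tracks p50 and p99 separately because latency distributions for inference are heavily skewed: the median tells you almost nothing about the tail. A small sketch (with simulated latencies drawn from an exponential distribution, an assumption chosen only to produce a realistic heavy tail; `percentile` here is a simple nearest-rank implementation, whereas Prometheus computes quantiles from histogram buckets):

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile: the value below which p% of samples fall."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

random.seed(1)
# Simulated inference latencies in ms: mostly fast, with a heavy tail.
latencies = [random.expovariate(1 / 400) for _ in range(10_000)]

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"p50={p50:.0f}ms  p99={p99:.0f}ms")  # the tail is several times the median
```

This is why the alert thresholds differ by a factor of five: a healthy p50 can coexist with a p99 that is already violating user expectations, and only monitoring both catches it.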
Model quality monitoring
Performance metrics tell you the system is running. Quality metrics tell you the model is working correctly:
- Output quality drift: is the model producing lower-quality responses over time? This can happen when the real-world data distribution shifts away from the training data.
- Hallucination rate: for language models, how often does the model generate factually incorrect information? This requires automated evaluation pipelines.
- Toxicity and safety: is the model producing harmful or inappropriate content? Automated guardrails and human review pipelines monitor this continuously.
- User feedback signals: are users rating responses poorly, regenerating responses, or abandoning conversations? These signals feed back into the model improvement cycle.
The monitoring stack
A typical AI monitoring stack includes:
- Prometheus for collecting time-series metrics from model servers
- Grafana for building dashboards and visualisations
- Custom alerting via PagerDuty or Opsgenie for on-call teams
- DCGM Exporter for NVIDIA GPU-specific metrics (temperature, utilisation, memory, power)
- Distributed tracing (Jaeger, Zipkin) for tracking requests across all infrastructure layers
- Log aggregation (ELK stack, Grafana Loki) for debugging issues
When a problem occurs (latency spikes, error rates increase, GPU temperature rises), the monitoring system detects it automatically and alerts the on-call infrastructure engineer. The response might involve scaling up capacity, rolling back a model version, or cordoning off a failing GPU node.
Infrastructure skills used: Prometheus, Grafana, alerting rules, distributed tracing, incident response, on-call practices.
Stage 7: Cost management - the business reality
AI infrastructure is expensive. A mid-size AI company can easily spend $100,000-500,000 per month on GPU compute. At this scale, cost management is a critical engineering discipline, not an afterthought.
Where the money goes
| Cost Category | Percentage of Total (typical) | Optimisation Lever |
|---|---|---|
| GPU compute (inference) | 60-75% | Quantisation, batching, right-sizing |
| GPU compute (training) | 10-20% | Spot instances, scheduling, checkpointing |
| Networking | 5-10% | CDN caching, regional deployment |
| Storage | 3-5% | Tiered storage, lifecycle policies |
| Monitoring and tooling | 2-3% | Sampling, retention policies |
Cost optimisation strategies
Infrastructure engineers use several strategies to reduce costs without degrading performance:
- Model quantisation: running the model in 8-bit instead of 16-bit reduces GPU memory by 50% and increases throughput, often with less than 1% accuracy loss.
- Request batching: processing multiple requests simultaneously increases GPU efficiency from 30-40% to 70-80%.
- Spot/preemptible instances: using discounted cloud instances for non-critical workloads (training, batch processing) saves 60-90% versus on-demand pricing.
- Right-sizing GPU instances: not every model needs an H100. Smaller models run efficiently on T4s or A10s at a fraction of the cost.
- Caching: storing common responses or intermediate computations reduces the number of GPU inference passes needed.
- Scheduling training runs: running training jobs during off-peak hours when cloud pricing is lower.
An infrastructure engineer who reduces GPU costs by 20% on a $300,000/month bill saves $720,000 per year, far more than their salary. This is why cost-aware infrastructure engineers are so highly valued at AI companies.
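The arithmetic behind that claim, using the article's own figures:

```python
monthly_bill = 300_000  # USD/month GPU spend (figure from the article)
reduction = 0.20        # 20% cost reduction from the optimisations above

annual_savings = monthly_bill * reduction * 12
print(f"${annual_savings:,.0f} per year")  # $720,000 per year
```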
The team that makes it all work
None of this infrastructure builds or manages itself. Behind every AI product is an infrastructure team that is typically 3-5 times larger than the research team that built the model.
Roles in AI infrastructure
- DevOps engineers build and maintain CI/CD pipelines for model deployment, manage Kubernetes clusters, and write Terraform for cloud infrastructure
- SREs (Site Reliability Engineers) define reliability targets, build monitoring systems, manage incident response, and ensure the service stays up
- Platform engineers build internal tools and self-service platforms that ML engineers use to deploy and manage models
- MLOps engineers specialise in the ML-specific parts of the pipeline: model versioning, experiment tracking, data pipelines, and model monitoring
- Cloud architects design the overall infrastructure architecture across regions, providers, and services
- GPU infrastructure specialists optimise GPU scheduling, manage driver compatibility, and tune GPU-specific performance
Every one of these roles uses cloud and DevOps skills as a foundation: Linux, Docker, Kubernetes, Terraform, CI/CD, monitoring, Python. The AI context adds domain-specific knowledge on top, but the core skill set is the same one that runs all modern infrastructure.
For a detailed look at why these roles dominate AI company hiring, see why every AI company is hiring DevOps engineers. For the complete picture of AI infrastructure, read our guide to AI infrastructure explained.
Why DevOps skills are essential for AI deployment
Every stage of the deployment pipeline from training cluster provisioning to production monitoring is an infrastructure engineering problem. The tools are familiar: Docker, Kubernetes, Terraform, Prometheus, Grafana, GitHub Actions, AWS or Azure or GCP.
The workloads are different from traditional web applications. GPUs instead of CPUs. Model weights instead of application code. Inference latency instead of page load time. But the fundamental skills transfer directly.
If you understand how to deploy, scale, and monitor a containerised web application on Kubernetes, you understand 80% of what is needed to deploy, scale, and monitor a containerised AI model on Kubernetes. The remaining 20% (GPU scheduling, model serving frameworks, inference optimisation) is learnable on the job or through focused study.
This is the career opportunity. The AI industry is growing faster than the infrastructure talent supply. Companies need engineers who can build and operate this infrastructure, and the foundational skills are the same cloud and DevOps skills that have been in demand for a decade, just applied to the most exciting workloads in tech.
Ola
Founder, CloudPros
Building the most hands-on DevOps bootcamp for the AI era. 16 weeks of real infrastructure, real projects, real career outcomes.
