The Infrastructure Behind ChatGPT: What It Actually Takes
ChatGPT serves over 800 million users per week. Every time you type a prompt and receive a response, an extraordinary amount of cloud infrastructure is working behind the scenes. The models get the headlines. The infrastructure makes them work.
This article breaks down exactly what that infrastructure looks like, from the GPU clusters that run inference to the monitoring systems that keep everything reliable. It's a technical but accessible guide to the most expensive, most complex, and most career-relevant infrastructure in tech right now.
The request journey: what happens when you send a prompt
When you type a prompt into ChatGPT, your request travels through at least six infrastructure layers before you see a response. Each layer is a cloud engineering problem.
Layer 1: API Gateway
Your HTTPS request first hits an API gateway. This handles:
- Authentication: verifying your API key or session token
- Rate limiting: preventing abuse and managing capacity
- Request routing: directing traffic to the correct model and version
- TLS termination: decrypting the secure connection
API gateways at this scale handle billions of requests per day. They need to be highly available (99.99%+ uptime), globally distributed, and fast enough to add negligible latency.
Infrastructure skill: Load balancer configuration, TLS management, API gateway design (Kong, AWS API Gateway, custom solutions).
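As a sketch of the rate-limiting piece, a token bucket per API key is one common approach. The class and parameters below are illustrative, not OpenAI's actual gateway logic:

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter; a gateway keeps one per API key."""
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec        # tokens refilled per second
        self.capacity = burst           # maximum burst size
        self.tokens = float(burst)      # bucket starts full
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller would respond with HTTP 429

bucket = TokenBucket(rate_per_sec=5, burst=10)
print(bucket.allow())  # True: the bucket starts full
```

Real gateways implement this in a shared store (e.g. Redis) so the limit holds across many gateway instances, not per process.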
Layer 2: Load Balancer
After the gateway, a load balancer distributes requests across inference servers. This isn't simple round-robin balancing; it's intelligent routing based on:
- GPU memory availability: is the model loaded on this server?
- Current queue depth: how many requests is each server processing?
- Geographic proximity: route to the nearest data centre
- Model version: different users might be served different model versions (A/B testing)
Load balancing for AI inference is harder than traditional web traffic because each request consumes significant GPU compute and takes longer to process (seconds, not milliseconds).
Infrastructure skill: Load balancer configuration, health checks, traffic routing, capacity planning.
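A minimal sketch of that routing logic, with hypothetical server fields standing in for real health-check and telemetry data:

```python
from dataclasses import dataclass

@dataclass
class InferenceServer:
    name: str
    model_loaded: bool   # are the weights already resident in GPU memory?
    queue_depth: int     # requests currently waiting on this server

def route(servers):
    """Prefer servers with the model already loaded (avoids a multi-minute
    weight load); among those, pick the shortest queue."""
    candidates = [s for s in servers if s.model_loaded] or servers
    return min(candidates, key=lambda s: s.queue_depth)

fleet = [
    InferenceServer("gpu-a", model_loaded=True, queue_depth=4),
    InferenceServer("gpu-b", model_loaded=True, queue_depth=1),
    InferenceServer("gpu-c", model_loaded=False, queue_depth=0),
]
print(route(fleet).name)  # gpu-b: shortest queue among servers with weights loaded
```

Note that gpu-c loses despite an empty queue: loading hundreds of gigabytes of weights would cost far more than waiting behind one request.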
Layer 3: Inference Cluster (the GPU layer)
This is where the actual computation happens. ChatGPT's inference cluster consists of thousands of NVIDIA GPUs, primarily A100s and H100s, distributed across multiple data centres.
What each inference server looks like:
| Component | Specification |
|---|---|
| GPUs | 4-8 NVIDIA H100s per node |
| GPU memory | 80GB per GPU (320-640GB per node) |
| RAM | 512GB-2TB system memory |
| Networking | InfiniBand for GPU-to-GPU, 100Gbps Ethernet |
| Storage | NVMe SSDs for model weight loading |
How inference works:
- Model weights (hundreds of gigabytes) are pre-loaded into GPU memory
- Your prompt is tokenised and sent to the inference server
- The model generates tokens one at a time, each requiring a forward pass through the neural network
- Tokens stream back to you in real time (that's why you see the response appear word by word)
A single inference request for a long conversation can consume a full GPU for several seconds. Multiply that by hundreds of millions of weekly users, and you understand why the GPU cluster is so massive.
Infrastructure skill: Kubernetes GPU scheduling, NVIDIA device plugin, node affinity, resource limits, cluster autoscaling.
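The token-by-token loop described above can be sketched as a generator. Here `toy_model` is a stand-in for the real forward pass, which would run on the GPU:

```python
def stream_tokens(prompt_tokens, model_step, max_new_tokens=16, eos=-1):
    """Autoregressive decoding: each new token requires a forward pass over
    the context so far, and tokens are yielded as soon as they exist --
    which is why responses appear word by word."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model_step(context)  # one forward pass on the GPU
        if next_token == eos:
            break
        context.append(next_token)
        yield next_token

# Toy "model": emits incrementing token ids, then an end-of-sequence marker.
def toy_model(context):
    return context[-1] + 1 if context[-1] < 5 else -1

print(list(stream_tokens([1, 2], toy_model)))  # [3, 4, 5]
```

The structure explains the cost: generating 500 tokens means 500 sequential forward passes, each touching most of the model's weights.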
Layer 4: Caching and Optimisation
Not every request needs a full GPU inference pass. Caching layers significantly reduce cost and latency:
- KV cache: stores intermediate computation from earlier tokens in a conversation, so the model doesn't recompute them
- Prompt caching: common system prompts can be pre-computed
- Semantic caching: similar queries can sometimes reuse previous responses
- CDN caching: static assets (the UI, images) served from edge locations
Caching at this scale saves millions of dollars per month. A 10% improvement in cache hit rate can reduce GPU costs by tens of thousands per day.
Infrastructure skill: Redis/Memcached deployment, CDN configuration, cache invalidation strategies.
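An exact-match prompt cache can be sketched in a few lines. In production the dictionary would be Redis or Memcached, and the key scheme here is illustrative:

```python
import hashlib

cache = {}  # in production: Redis or Memcached, shared across servers

def cached_inference(prompt: str, run_model):
    """Exact-match prompt cache: hash the prompt and reuse a prior
    response instead of spending a GPU forward pass."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in cache:
        return cache[key], True           # cache hit: no GPU work
    response = run_model(prompt)          # cache miss: full inference
    cache[key] = response
    return response, False

calls = []  # track how often the "model" actually runs
fake_model = lambda p: calls.append(p) or f"answer to: {p}"
print(cached_inference("What is DNS?", fake_model))  # miss: model runs
print(cached_inference("What is DNS?", fake_model))  # hit: model skipped
```

Semantic caching replaces the exact hash with an embedding similarity lookup, trading some correctness risk for a much higher hit rate.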
Layer 5: Monitoring and Observability
Running thousands of GPUs serving hundreds of millions of users requires comprehensive monitoring:
What gets monitored:
| Metric | Why It Matters |
|---|---|
| Inference latency (p50, p95, p99) | User experience; slow responses lose users |
| GPU utilisation | Cost efficiency; idle GPUs waste money |
| GPU temperature | Hardware health; overheating causes throttling or failure |
| GPU memory usage | Capacity; out-of-memory crashes kill requests |
| Token throughput | System capacity; tokens generated per second |
| Error rate | Reliability; failed requests need immediate attention |
| Queue depth | Scaling signal; growing queues mean more capacity needed |
| Cost per request | Business metric; the bottom line |
The monitoring stack:
- DCGM Exporter sends GPU metrics to Prometheus
- Prometheus stores time-series metrics
- Grafana visualises dashboards
- Custom alerting triggers PagerDuty for on-call engineers
- Distributed tracing tracks requests across all layers
When something goes wrong (a GPU fails, latency spikes, a data centre has a network issue), the monitoring system detects it, alerts the on-call team, and often triggers automated remediation before users notice.
Infrastructure skill: Prometheus, Grafana, alerting rules, distributed tracing, incident response.
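The alerting logic can be sketched as a toy rule evaluator. Real deployments express these as Prometheus alerting rules over scraped metrics; the thresholds and alert names below are illustrative:

```python
def percentile(samples, p):
    """Nearest-rank percentile over a window of latency samples."""
    s = sorted(samples)
    idx = max(0, int(round(p / 100 * len(s))) - 1)
    return s[idx]

def evaluate_alerts(latencies_ms, error_count, request_count,
                    p95_threshold_ms=2000, error_rate_threshold=0.01):
    """Toy evaluation mirroring what Prometheus alerting rules do:
    fire when p95 latency or the error rate crosses a threshold."""
    alerts = []
    if percentile(latencies_ms, 95) > p95_threshold_ms:
        alerts.append("HighInferenceLatency")
    if request_count and error_count / request_count > error_rate_threshold:
        alerts.append("HighErrorRate")
    return alerts

# A window with two slow outliers and a 2.5% error rate: both rules fire.
window = [850, 900, 1200, 950, 2600, 880, 910, 940, 1000, 3100]
print(evaluate_alerts(window, error_count=3, request_count=120))
```

In the real stack, firing alerts route through Alertmanager to PagerDuty, and the same thresholds drive Grafana dashboard panels.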
Layer 6: Networking and Global Distribution
ChatGPT serves users worldwide. The networking layer handles:
- Multi-region deployment: inference clusters in multiple geographic regions
- DNS routing: directing users to the nearest healthy region
- Inter-region replication: keeping model weights synchronised across data centres
- Network security: DDoS protection, firewall rules, encryption in transit
- Bandwidth management: streaming responses to millions of concurrent users
The networking alone is a full-time job for multiple teams. Global low-latency delivery of GPU-computed responses is among the hardest networking problems in tech.
Infrastructure skill: VPC design, DNS management, CDN configuration, network security, multi-region architecture.
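Latency-based DNS routing with health checks can be sketched like this; the regions and latencies are made up:

```python
def pick_region(client_latency_ms, region_healthy):
    """Latency-based routing: direct the user to the nearest region that
    is passing health checks, as a geo-DNS service would."""
    candidates = {r: ms for r, ms in client_latency_ms.items() if region_healthy[r]}
    if not candidates:
        raise RuntimeError("no healthy regions")
    return min(candidates, key=candidates.get)

latency = {"us-east": 20, "eu-west": 95, "ap-south": 210}
healthy = {"us-east": False, "eu-west": True, "ap-south": True}
print(pick_region(latency, healthy))  # eu-west: us-east is nearest but unhealthy
```

The failover behaviour is the point: when a region goes unhealthy, new queries silently resolve to the next-best region without users doing anything.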
The cost: why AI infrastructure is expensive
Running ChatGPT-scale infrastructure is extraordinarily expensive:
| Cost Category | Estimated Daily Cost |
|---|---|
| GPU compute (inference) | $500,000-$700,000+ |
| Cloud networking and bandwidth | $50,000-$100,000 |
| Storage (model weights, logs, caches) | $20,000-$50,000 |
| Monitoring and observability tooling | $10,000-$20,000 |
| Engineering team salaries | Hundreds of engineers |
A single NVIDIA H100 costs approximately $30,000 to purchase or $3-4 per hour to rent from cloud providers. A training run for a frontier model uses 10,000+ GPUs for weeks. The numbers are staggering.
This is exactly why companies pay $150,000-$250,000+ for engineers who can optimise these costs. A 5% improvement in GPU utilisation across a 10,000-GPU cluster saves millions per year.
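That back-of-envelope claim can be checked with simple arithmetic; all figures below are illustrative assumptions, not OpenAI's actual numbers:

```python
# What a 5% utilisation gain is worth on a large rented-GPU cluster.
gpus = 10_000
hourly_rate = 3.5          # USD per GPU-hour (rented H100, mid-range estimate)
hours_per_year = 24 * 365
utilisation_gain = 0.05    # 5% more useful work from the same fleet

annual_spend = gpus * hourly_rate * hours_per_year
savings = annual_spend * utilisation_gain
print(f"Annual GPU spend: ${annual_spend:,.0f}")          # ~$306.6M
print(f"Value of a 5% utilisation gain: ${savings:,.0f}")  # ~$15.3M
```

Even with much more conservative inputs, the savings comfortably clear the fully loaded cost of the engineers doing the optimisation.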
The career opportunity nobody talks about
The media focuses on AI models: who trained the biggest model, which chatbot is best, what AI can generate. But behind every model is an infrastructure team that's typically 3-5x larger than the research team.
Roles this infrastructure creates:
- DevOps Engineers build and maintain the CI/CD pipelines that deploy model updates
- Cloud Architects design the multi-region GPU infrastructure
- SREs keep the system reliable and respond to incidents
- Platform Engineers build internal tools for ML engineers
- MLOps Engineers manage the model lifecycle from training to production
- GPU Infrastructure Specialists optimise GPU scheduling, costs, and performance
None of these roles requires a machine learning PhD. They require cloud and DevOps skills: Kubernetes, Docker, Terraform, monitoring, CI/CD, and cloud platform expertise. The same skills that run traditional web infrastructure, applied to the most demanding workloads in tech.
The demand for these roles is growing faster than any other category in tech, with 28-41% year-over-year increases in job postings. The supply of qualified engineers is not keeping up.
What you'd need to learn
To work on infrastructure like this, here's the skill stack:
- Linux: every server in this architecture runs Linux
- Networking: understanding TCP/IP, DNS, load balancing, VPCs
- Docker: containerising applications and model serving code
- Kubernetes: orchestrating containers, GPU scheduling, auto-scaling
- Cloud platform (AWS/Azure/GCP): compute, storage, networking services
- Terraform: defining infrastructure as code
- CI/CD: automating deployment pipelines
- Monitoring: Prometheus, Grafana, alerting, observability
- Python: automation, scripting, cloud API integrations
- Security: network security, IAM, encryption, compliance
This is the full cloud and DevOps learning path. It takes 4-6 months of focused learning and maps directly to the roles that AI companies are hiring for.
The complete guide to AI infrastructure covers how each of these skills applies specifically to AI workloads.
The bottom line
ChatGPT is not magic. It's cloud infrastructure at an extraordinary scale. API gateways, load balancers, GPU clusters, caching layers, monitoring systems, and global networking: every layer is an engineering discipline with career opportunities.
The models get the attention. The infrastructure gets the salaries. And right now, there are far more open infrastructure roles at AI companies than qualified engineers to fill them.
Ola
Founder, CloudPros
Building the most hands-on DevOps bootcamp for the AI era. 16 weeks of real infrastructure, real projects, real career outcomes.
