AI Infrastructure

The Infrastructure Behind ChatGPT: What It Actually Takes

Kunle · 10 min read

ChatGPT serves over 800 million users per week. Every time you type a prompt and receive a response, an extraordinary amount of cloud infrastructure is working behind the scenes. The models get the headlines. The infrastructure makes them work.

This article breaks down exactly what that infrastructure looks like from the GPU clusters that run inference to the monitoring systems that keep everything reliable. It's a technical but accessible guide to the most expensive, most complex, and most career-relevant infrastructure in tech right now.

The request journey: what happens when you send a prompt

When you type a prompt into ChatGPT, your request travels through at least six infrastructure layers before you see a response. Each layer is a cloud engineering problem.

Layer 1: API Gateway

Your HTTPS request first hits an API gateway. This handles:

  • Authentication: verifying your API key or session token
  • Rate limiting: preventing abuse and managing capacity
  • Request routing: directing traffic to the correct model and version
  • TLS termination: decrypting the secure connection

API gateways at this scale handle billions of requests per day. They need to be highly available (99.99%+ uptime), globally distributed, and fast enough to add negligible latency.
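
To make the rate-limiting step concrete, here is a minimal token-bucket limiter in Python, the kind of algorithm gateways commonly apply per API key. The bucket parameters are hypothetical, not OpenAI's actual limits:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter, as a gateway might apply per API key."""
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec        # sustained requests allowed per second
        self.capacity = burst           # short bursts allowed above the rate
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate_per_sec=1, burst=5)
results = [bucket.allow() for _ in range(7)]  # burst of 5 allowed, then throttled
```

At scale this state lives in a shared store such as Redis rather than in-process memory, but the refill logic is the same.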

Infrastructure skill: Load balancer configuration, TLS management, API gateway design (Kong, AWS API Gateway, custom solutions).

Layer 2: Load Balancer

After the gateway, a load balancer distributes requests across inference servers. This isn't simple round-robin balancing; it's intelligent routing based on:

  • GPU memory availability: is the model loaded on this server?
  • Current queue depth: how many requests is each server processing?
  • Geographic proximity: route to the nearest data centre
  • Model version: different users might be served different model versions (A/B testing)

Load balancing for AI inference is harder than traditional web traffic because each request consumes significant GPU compute and takes longer to process (seconds, not milliseconds).
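
A toy version of that routing decision might look like this in Python. The scoring (filter by model residency, then pick the shallowest queue) is a deliberate simplification of what production routers do, and the server names are invented:

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    model_loaded: bool  # is the requested model already resident in GPU memory?
    queue_depth: int    # requests currently queued on this server

def pick_server(servers: list[Server]) -> Server:
    # Prefer servers that already hold the model (avoids a costly weight load),
    # then choose the one with the shallowest queue.
    candidates = [s for s in servers if s.model_loaded] or servers
    return min(candidates, key=lambda s: s.queue_depth)

servers = [
    Server("eu-1", model_loaded=True, queue_depth=4),
    Server("eu-2", model_loaded=True, queue_depth=1),
    Server("us-1", model_loaded=False, queue_depth=0),  # would need a weight load first
]
chosen = pick_server(servers)  # eu-2: model resident, shortest queue
```

Note that us-1 loses despite an empty queue: loading hundreds of gigabytes of weights costs far more than waiting behind one request.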

Infrastructure skill: Load balancer configuration, health checks, traffic routing, capacity planning.

Layer 3: Inference Cluster (the GPU layer)

This is where the actual computation happens. ChatGPT's inference cluster consists of thousands of NVIDIA GPUs, primarily A100s and H100s, distributed across multiple data centres.

What each inference server looks like:

  • GPUs: 4-8 NVIDIA H100s per node
  • GPU memory: 80GB per GPU (320-640GB per node)
  • RAM: 512GB-2TB system memory
  • Networking: InfiniBand for GPU-to-GPU, 100Gbps Ethernet
  • Storage: NVMe SSDs for model weight loading

How inference works:

  1. Model weights (hundreds of gigabytes) are pre-loaded into GPU memory
  2. Your prompt is tokenised and sent to the inference server
  3. The model generates tokens one at a time, each requiring a forward pass through the neural network
  4. Tokens stream back to you in real time (that's why you see the response appear word by word)

A single inference request for a long conversation can consume a full GPU for several seconds. Multiply that by hundreds of millions of weekly users, and you understand why the GPU cluster is so massive.
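
The token-by-token loop in steps 1-4 can be sketched as follows. Here `toy_model_step` is a stand-in for a real GPU forward pass; everything about it is illustrative:

```python
def toy_model_step(tokens: list[int]) -> int:
    """Stand-in for a forward pass. In production this runs on the GPU
    and is by far the most expensive line in the loop."""
    return (sum(tokens) % 5) + 1  # deterministic toy next-token rule

def generate(prompt_tokens: list[int], max_new_tokens: int = 4, eos_token: int = 0) -> list[int]:
    """Sketch of steps 2-4: tokens are produced one at a time and streamed out."""
    tokens = list(prompt_tokens)
    streamed = []
    for _ in range(max_new_tokens):
        next_tok = toy_model_step(tokens)  # one full forward pass per token
        if next_tok == eos_token:
            break
        tokens.append(next_tok)      # the growing context feeds the next pass
        streamed.append(next_tok)    # in production, flushed to the client immediately
    return streamed

out = generate([1, 2, 3])
```

The key structural point survives the simplification: generation is sequential, each new token depends on all previous ones, and that is why responses stream word by word rather than arriving at once.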

Infrastructure skill: Kubernetes GPU scheduling, NVIDIA device plugin, node affinity, resource limits, cluster autoscaling.

Layer 4: Caching and Optimisation

Not every request needs a full GPU inference pass. Caching layers significantly reduce cost and latency:

  • KV cache: stores intermediate computation from earlier tokens in a conversation, so the model doesn't recompute them
  • Prompt caching: common system prompts can be pre-computed
  • Semantic caching: similar queries can sometimes reuse previous responses
  • CDN caching: static assets (the UI, images) served from edge locations

Caching at this scale saves millions of dollars per month. A 10% improvement in cache hit rate can reduce GPU costs by tens of thousands of dollars per day.
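
A minimal sketch of an exact-match response cache gives the flavour. Real deployments also cache KV state and use embedding similarity for semantic matches, neither of which this shows:

```python
import hashlib

class ResponseCache:
    """Exact-match prompt cache keyed on (model, prompt). Illustrative only."""
    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model: str, prompt: str) -> str:
        # Hash the pair so keys stay fixed-size regardless of prompt length.
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get_or_compute(self, model: str, prompt: str, compute_fn):
        key = self._key(model, prompt)
        if key in self.store:
            self.hits += 1
            return self.store[key]        # skip the GPU entirely
        self.misses += 1
        result = compute_fn(prompt)       # the expensive inference path
        self.store[key] = result
        return result

cache = ResponseCache()
expensive_calls = []

def fake_inference(prompt: str) -> str:
    expensive_calls.append(prompt)  # track how often the "GPU" actually runs
    return prompt.upper()

a = cache.get_or_compute("gpt", "hello", fake_inference)
b = cache.get_or_compute("gpt", "hello", fake_inference)  # hit: no second call
```

Every cache hit is a GPU-second not spent, which is where the dollar figures above come from.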

Infrastructure skill: Redis/Memcached deployment, CDN configuration, cache invalidation strategies.

Layer 5: Monitoring and Observability

Running thousands of GPUs serving hundreds of millions of users requires comprehensive monitoring:

What gets monitored:

  • Inference latency (p50, p95, p99): user experience; slow responses lose users
  • GPU utilisation: cost efficiency; idle GPUs waste money
  • GPU temperature: hardware health; overheating causes throttling or failure
  • GPU memory usage: capacity; out-of-memory crashes kill requests
  • Token throughput: system capacity; tokens generated per second
  • Error rate: reliability; failed requests need immediate attention
  • Queue depth: scaling signal; growing queues mean more capacity needed
  • Cost per request: business metric; the bottom line
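
The p50/p95/p99 figures are percentiles over recent request latencies. The simple nearest-rank method behind those dashboard numbers, run on made-up sample latencies:

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value below which p% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))  # nearest-rank method
    return ordered[rank - 1]

# Simulated per-request inference latencies in seconds (invented data).
latencies = [0.8, 1.1, 0.9, 1.0, 4.2, 0.7, 1.3, 0.95, 1.05, 9.8]
p50 = percentile(latencies, 50)   # typical request
p99 = percentile(latencies, 99)   # tail request
```

The gap between p50 and p99 is the point: the median user sees about a second, while the tail user sees the 9.8s outlier, and it is the tail that pages the on-call engineer.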

The monitoring stack:

  • DCGM Exporter sends GPU metrics to Prometheus
  • Prometheus stores time-series metrics
  • Grafana visualises dashboards
  • Custom alerting triggers PagerDuty for on-call engineers
  • Distributed tracing tracks requests across all layers

When something goes wrong (a GPU fails, latency spikes, a data centre has a network issue), the monitoring system detects it, alerts the on-call team, and often triggers automated remediation before users notice.

Infrastructure skill: Prometheus, Grafana, alerting rules, distributed tracing, incident response.

Layer 6: Networking and Global Distribution

ChatGPT serves users worldwide. The networking layer handles:

  • Multi-region deployment: inference clusters in multiple geographic regions
  • DNS routing: directing users to the nearest healthy region
  • Inter-region replication: keeping model weights synchronised across data centres
  • Network security: DDoS protection, firewall rules, encryption in transit
  • Bandwidth management: streaming responses to millions of concurrent users

The networking alone is a full-time job for multiple teams. Global low-latency delivery of GPU-computed responses is among the hardest networking problems in tech.
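
A highly simplified model of the DNS routing decision is: send the user to the lowest-latency region that is passing health checks. Region names and latencies here are invented for illustration:

```python
def route_request(regions: list[dict], user_latency_ms: dict) -> dict:
    """Pick the lowest-latency healthy region for this user."""
    healthy = [r for r in regions if r["healthy"]]  # failed health checks are excluded
    return min(healthy, key=lambda r: user_latency_ms[r["name"]])

regions = [
    {"name": "eu-west", "healthy": True},
    {"name": "us-east", "healthy": False},   # down: traffic fails over elsewhere
    {"name": "ap-south", "healthy": True},
]
user_latency_ms = {"eu-west": 25, "us-east": 12, "ap-south": 140}
target = route_request(regions, user_latency_ms)  # eu-west, despite us-east being closer
```

Note the failover behaviour: us-east would win on latency, but an unhealthy region is never a candidate, which is exactly what "nearest healthy region" means in the list above.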

Infrastructure skill: VPC design, DNS management, CDN configuration, network security, multi-region architecture.

The cost: why AI infrastructure is expensive

Running ChatGPT-scale infrastructure is extraordinarily expensive:

Estimated daily costs by category:

  • GPU compute (inference): $500,000-$700,000+
  • Cloud networking and bandwidth: $50,000-$100,000
  • Storage (model weights, logs, caches): $20,000-$50,000
  • Monitoring and observability tooling: $10,000-$20,000
  • Engineering team salaries: hundreds of engineers

A single NVIDIA H100 costs approximately $30,000 to purchase or $3-4 per hour to rent from cloud providers. A training run for a frontier model uses 10,000+ GPUs for weeks. The numbers are staggering.

This is exactly why companies pay $150,000-$250,000+ for engineers who can optimise these costs. A 5% improvement in GPU utilisation across a 10,000-GPU cluster saves millions per year.
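
Back-of-the-envelope arithmetic from those figures shows why utilisation work pays for itself. This uses a $3.50/hr rental midpoint and a 10,000-GPU cluster running around the clock; illustrative, not audited numbers:

```python
# Assumed figures: $3.50/GPU-hour rental midpoint, 10,000-GPU cluster, 24/7 usage.
gpus = 10_000
hourly_rate = 3.50              # USD per GPU-hour (midpoint of the $3-4 range above)
hours_per_year = 24 * 365

annual_cost = gpus * hourly_rate * hours_per_year   # ~$306.6M per year
savings_from_5pct = annual_cost * 0.05              # value of a 5% utilisation gain
```

Even at these rough numbers, a single-digit efficiency gain is worth eight figures a year, which is why those salaries are economically rational.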

The career opportunity nobody talks about

The media focuses on AI models: who trained the biggest model, which chatbot is best, what AI can generate. But behind every model is an infrastructure team that's typically 3-5x larger than the research team.

Roles this infrastructure creates:

  • DevOps Engineers: build and maintain the CI/CD pipelines that deploy model updates
  • Cloud Architects: design the multi-region GPU infrastructure
  • SREs: keep the system reliable and respond to incidents
  • Platform Engineers: build internal tools for ML engineers
  • MLOps Engineers: manage the model lifecycle from training to production
  • GPU Infrastructure Specialists: optimise GPU scheduling, costs, and performance

None of these roles requires a machine learning PhD. They require cloud and DevOps skills: Kubernetes, Docker, Terraform, monitoring, CI/CD, and cloud platform expertise. The same skills that run traditional web infrastructure, applied to the most demanding workloads in tech.

The demand for these roles is growing faster than any other category in tech, with 28-41% year-over-year increases in job postings. The supply of qualified engineers is not keeping up.

What you'd need to learn

To work on infrastructure like this, here's the skill stack:

  1. Linux: every server in this architecture runs Linux
  2. Networking: understanding TCP/IP, DNS, load balancing, VPCs
  3. Docker: containerising applications and model serving code
  4. Kubernetes: orchestrating containers, GPU scheduling, auto-scaling
  5. Cloud platform (AWS/Azure/GCP): compute, storage, networking services
  6. Terraform: defining infrastructure as code
  7. CI/CD: automating deployment pipelines
  8. Monitoring: Prometheus, Grafana, alerting, observability
  9. Python: automation, scripting, cloud API integrations
  10. Security: network security, IAM, encryption, compliance

This is the full cloud and DevOps learning path. It takes 4-6 months of focused learning and maps directly to the roles that AI companies are hiring for.

The complete guide to AI infrastructure covers how each of these skills applies specifically to AI workloads.

The bottom line

ChatGPT is not magic. It's cloud infrastructure at an extraordinary scale. API gateways, load balancers, GPU clusters, caching layers, monitoring systems, and global networking every layer is an engineering discipline with career opportunities.

The models get the attention. The infrastructure gets the salaries. And right now, there are far more open infrastructure roles at AI companies than qualified engineers to fill them.