The Infrastructure Behind ChatGPT: What It Actually Takes
ChatGPT serves over 800 million users per week. Every time you type a prompt and receive a response, an extraordinary amount of cloud infrastructure is working behind the scenes. The models get the headlines. The infrastructure makes them work.
This article breaks down exactly what that infrastructure looks like, from the GPU clusters that run inference to the monitoring systems that keep everything reliable. It's a technical but accessible guide to the most expensive, most complex, and most career-relevant infrastructure in tech right now.
The request journey: what happens when you send a prompt
When you type a prompt into ChatGPT, your request travels through at least six infrastructure layers before you see a response. Each layer is a cloud engineering problem.
Layer 1: API Gateway
Your HTTPS request first hits an API gateway. This handles:
- Authentication: verifying your API key or session token
- Rate limiting: preventing abuse and managing capacity
- Request routing: directing traffic to the correct model and version
- TLS termination: decrypting the secure connection
API gateways at this scale handle billions of requests per day. They need to be highly available (99.99%+ uptime), globally distributed, and fast enough to add negligible latency.
Infrastructure skill: Load balancer configuration, TLS management, API gateway design (Kong, AWS API Gateway, custom solutions).
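As a sketch of the rate-limiting piece, a token bucket per API key is one common approach. The class and parameters below are illustrative, not OpenAI's actual gateway logic:

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter; a gateway keeps one per API key."""
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec        # tokens refilled per second
        self.capacity = burst           # maximum burst size
        self.tokens = float(burst)      # bucket starts full
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller would respond with HTTP 429

bucket = TokenBucket(rate_per_sec=5, burst=10)
print(bucket.allow())  # True: the bucket starts full
```

Real gateways implement this in a shared store (e.g. Redis) so the limit holds across many gateway instances, not per process.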
Layer 2: Load Balancer
After the gateway, a load balancer distributes requests across inference servers. This isn't simple round-robin balancing; it's intelligent routing based on:
- GPU memory availability: is the model loaded on this server?
- Current queue depth: how many requests is each server processing?
- Geographic proximity: route to the nearest data centre
- Model version: different users might be served different model versions (A/B testing)
Load balancing for AI inference is harder than traditional web traffic because each request consumes significant GPU compute and takes longer to process (seconds, not milliseconds).
Infrastructure skill: Load balancer configuration, health checks, traffic routing, capacity planning.
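A minimal sketch of that routing logic, with hypothetical server fields standing in for real health-check and telemetry data:

```python
from dataclasses import dataclass

@dataclass
class InferenceServer:
    name: str
    model_loaded: bool   # are the weights already resident in GPU memory?
    queue_depth: int     # requests currently waiting on this server

def route(servers):
    """Prefer servers with the model already loaded (avoids a multi-minute
    weight load); among those, pick the shortest queue."""
    candidates = [s for s in servers if s.model_loaded] or servers
    return min(candidates, key=lambda s: s.queue_depth)

fleet = [
    InferenceServer("gpu-a", model_loaded=True, queue_depth=4),
    InferenceServer("gpu-b", model_loaded=True, queue_depth=1),
    InferenceServer("gpu-c", model_loaded=False, queue_depth=0),
]
print(route(fleet).name)  # gpu-b: shortest queue among servers with weights loaded
```

Note that gpu-c loses despite an empty queue: loading hundreds of gigabytes of weights would cost far more than waiting behind one request.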
Layer 3: Inference Cluster (the GPU layer)
This is where the actual computation happens. ChatGPT's inference cluster consists of thousands of NVIDIA GPUs, primarily A100s and H100s, distributed across multiple data centres.
What each inference server looks like:
| Component | Specification |
|---|---|
| GPUs | 4-8 NVIDIA H100s per node |
| GPU memory | 80GB per GPU (320-640GB per node) |
| RAM | 512GB-2TB system memory |
| Networking | InfiniBand for GPU-to-GPU, 100Gbps Ethernet |
| Storage | NVMe SSDs for model weight loading |
How inference works:
- Model weights (hundreds of gigabytes) are pre-loaded into GPU memory
- Your prompt is tokenised and sent to the inference server
- The model generates tokens one at a time, each requiring a forward pass through the neural network
- Tokens stream back to you in real time (that's why you see the response appear word by word)
A single inference request for a long conversation can consume a full GPU for several seconds. Multiply that by hundreds of millions of weekly users, and you understand why the GPU cluster is so massive.
Infrastructure skill: Kubernetes GPU scheduling, NVIDIA device plugin, node affinity, resource limits, cluster autoscaling.
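The token-by-token loop described above can be sketched as a generator. Here `toy_model` is a stand-in for the real forward pass, which would run on the GPU:

```python
def stream_tokens(prompt_tokens, model_step, max_new_tokens=16, eos=-1):
    """Autoregressive decoding: each new token requires a forward pass over
    the context so far, and tokens are yielded as soon as they exist --
    which is why responses appear word by word."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model_step(context)  # one forward pass on the GPU
        if next_token == eos:
            break
        context.append(next_token)
        yield next_token

# Toy "model": emits incrementing token ids, then an end-of-sequence marker.
def toy_model(context):
    return context[-1] + 1 if context[-1] < 5 else -1

print(list(stream_tokens([1, 2], toy_model)))  # [3, 4, 5]
```

The structure explains the cost: generating 500 tokens means 500 sequential forward passes, each touching most of the model's weights.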
Layer 4: Caching and Optimisation
Not every request needs a full GPU inference pass. Caching layers significantly reduce cost and latency:
- KV cache: stores intermediate computation from earlier tokens in a conversation, so the model doesn't recompute them
- Prompt caching: common system prompts can be pre-computed
- Semantic caching: similar queries can sometimes reuse previous responses
- CDN caching: static assets (the UI, images) served from edge locations
Caching at this scale saves millions of dollars per month. A 10% improvement in cache hit rate can reduce GPU costs by tens of thousands per day.
Infrastructure skill: Redis/Memcached deployment, CDN configuration, cache invalidation strategies.
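An exact-match prompt cache can be sketched in a few lines. In production the dictionary would be Redis or Memcached, and the key scheme here is illustrative:

```python
import hashlib

cache = {}  # in production: Redis or Memcached, shared across servers

def cached_inference(prompt: str, run_model):
    """Exact-match prompt cache: hash the prompt and reuse a prior
    response instead of spending a GPU forward pass."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in cache:
        return cache[key], True           # cache hit: no GPU work
    response = run_model(prompt)          # cache miss: full inference
    cache[key] = response
    return response, False

calls = []  # track how often the "model" actually runs
fake_model = lambda p: calls.append(p) or f"answer to: {p}"
print(cached_inference("What is DNS?", fake_model))  # miss: model runs
print(cached_inference("What is DNS?", fake_model))  # hit: model skipped
```

Semantic caching replaces the exact hash with an embedding similarity lookup, trading some correctness risk for a much higher hit rate.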
Layer 5: Monitoring and Observability
Running thousands of GPUs serving hundreds of millions of users requires comprehensive monitoring:
What gets monitored:
| Metric | Why It Matters |
|---|---|
| Inference latency (p50, p95, p99) | User experience; slow responses lose users |
| GPU utilisation | Cost efficiency; idle GPUs waste money |
| GPU temperature | Hardware health; overheating causes throttling or failure |
| GPU memory usage | Capacity; out-of-memory crashes kill requests |
| Token throughput | System capacity; tokens generated per second |
| Error rate | Reliability; failed requests need immediate attention |
| Queue depth | Scaling signal; growing queues mean more capacity needed |
| Cost per request | Business metric; the bottom line |
The monitoring stack:
- DCGM Exporter sends GPU metrics to Prometheus
- Prometheus stores time-series metrics
- Grafana visualises dashboards
- Custom alerting triggers PagerDuty for on-call engineers
- Distributed tracing tracks requests across all layers
When something goes wrong (a GPU fails, latency spikes, a data centre has a network issue), the monitoring system detects it, alerts the on-call team, and often triggers automated remediation before users notice.
Infrastructure skill: Prometheus, Grafana, alerting rules, distributed tracing, incident response.
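The alerting logic can be sketched as a toy rule evaluator. Real deployments express these as Prometheus alerting rules over scraped metrics; the thresholds and alert names below are illustrative:

```python
def percentile(samples, p):
    """Nearest-rank percentile over a window of latency samples."""
    s = sorted(samples)
    idx = max(0, int(round(p / 100 * len(s))) - 1)
    return s[idx]

def evaluate_alerts(latencies_ms, error_count, request_count,
                    p95_threshold_ms=2000, error_rate_threshold=0.01):
    """Toy evaluation mirroring what Prometheus alerting rules do:
    fire when p95 latency or the error rate crosses a threshold."""
    alerts = []
    if percentile(latencies_ms, 95) > p95_threshold_ms:
        alerts.append("HighInferenceLatency")
    if request_count and error_count / request_count > error_rate_threshold:
        alerts.append("HighErrorRate")
    return alerts

# A window with two slow outliers and a 2.5% error rate: both rules fire.
window = [850, 900, 1200, 950, 2600, 880, 910, 940, 1000, 3100]
print(evaluate_alerts(window, error_count=3, request_count=120))
```

In the real stack, firing alerts route through Alertmanager to PagerDuty, and the same thresholds drive Grafana dashboard panels.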
Layer 6: Networking and Global Distribution
ChatGPT serves users worldwide. The networking layer handles:
- Multi-region deployment: inference clusters in multiple geographic regions
- DNS routing: directing users to the nearest healthy region
- Inter-region replication: keeping model weights synchronised across data centres
- Network security: DDoS protection, firewall rules, encryption in transit
- Bandwidth management: streaming responses to millions of concurrent users
The networking alone is a full-time job for multiple teams. Global low-latency delivery of GPU-computed responses is among the hardest networking problems in tech.
Infrastructure skill: VPC design, DNS management, CDN configuration, network security, multi-region architecture.
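Latency-based DNS routing with health checks can be sketched like this; the regions and latencies are made up:

```python
def pick_region(client_latency_ms, region_healthy):
    """Latency-based routing: direct the user to the nearest region that
    is passing health checks, as a geo-DNS service would."""
    candidates = {r: ms for r, ms in client_latency_ms.items() if region_healthy[r]}
    if not candidates:
        raise RuntimeError("no healthy regions")
    return min(candidates, key=candidates.get)

latency = {"us-east": 20, "eu-west": 95, "ap-south": 210}
healthy = {"us-east": False, "eu-west": True, "ap-south": True}
print(pick_region(latency, healthy))  # eu-west: us-east is nearest but unhealthy
```

The failover behaviour is the point: when a region goes unhealthy, new queries silently resolve to the next-best region without users doing anything.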
The cost: why AI infrastructure is expensive
Running ChatGPT-scale infrastructure is extraordinarily expensive:
| Cost Category | Estimated Daily Cost |
|---|---|
| GPU compute (inference) | $500,000-$700,000+ |
| Cloud networking and bandwidth | $50,000-$100,000 |
| Storage (model weights, logs, caches) | $20,000-$50,000 |
| Monitoring and observability tooling | $10,000-$20,000 |
| Engineering team salaries | Hundreds of engineers |
A single NVIDIA H100 costs approximately $30,000 to purchase or $3-4 per hour to rent from cloud providers. A training run for a frontier model uses 10,000+ GPUs for weeks. The numbers are staggering.
This is exactly why companies pay $150,000-$250,000+ for engineers who can optimise these costs. A 5% improvement in GPU utilisation across a 10,000-GPU cluster saves millions per year.
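That back-of-envelope claim can be checked with simple arithmetic; all figures below are illustrative assumptions, not OpenAI's actual numbers:

```python
# What a 5% utilisation gain is worth on a large rented-GPU cluster.
gpus = 10_000
hourly_rate = 3.5          # USD per GPU-hour (rented H100, mid-range estimate)
hours_per_year = 24 * 365
utilisation_gain = 0.05    # 5% more useful work from the same fleet

annual_spend = gpus * hourly_rate * hours_per_year
savings = annual_spend * utilisation_gain
print(f"Annual GPU spend: ${annual_spend:,.0f}")          # ~$306.6M
print(f"Value of a 5% utilisation gain: ${savings:,.0f}")  # ~$15.3M
```

Even with much more conservative inputs, the savings comfortably clear the fully loaded cost of the engineers doing the optimisation.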
The career opportunity nobody talks about
The media focuses on AI models: who trained the biggest model, which chatbot is best, what AI can generate. But behind every model is an infrastructure team that's typically 3-5x larger than the research team.
Roles this infrastructure creates:
- DevOps Engineers build and maintain the CI/CD pipelines that deploy model updates
- Cloud Architects design the multi-region GPU infrastructure
- SREs keep the system reliable and respond to incidents
- Platform Engineers build internal tools for ML engineers
- MLOps Engineers manage the model lifecycle from training to production
- GPU Infrastructure Specialists optimise GPU scheduling, costs, and performance
None of these roles requires a machine learning PhD. They require cloud and DevOps skills: Kubernetes, Docker, Terraform, monitoring, CI/CD, and cloud platform expertise. The same skills that run traditional web infrastructure, applied to the most demanding workloads in tech.
The demand for these roles is growing faster than any other category in tech, with 28-41% year-over-year increases in job postings. The supply of qualified engineers is not keeping up.
What you'd need to learn
To work on infrastructure like this, here's the skill stack:
- Linux: every server in this architecture runs Linux
- Networking: understanding TCP/IP, DNS, load balancing, VPCs
- Docker: containerising applications and model serving code
- Kubernetes: orchestrating containers, GPU scheduling, auto-scaling
- Cloud platform (AWS/Azure/GCP): compute, storage, networking services
- Terraform: defining infrastructure as code
- CI/CD: automating deployment pipelines
- Monitoring: Prometheus, Grafana, alerting, observability
- Python: automation, scripting, cloud API integrations
- Security: network security, IAM, encryption, compliance
This is the full cloud and DevOps learning path. It takes 4-6 months of focused learning and maps directly to the roles that AI companies are hiring for.
The complete guide to AI infrastructure covers how each of these skills applies specifically to AI workloads.
The bottom line
ChatGPT is not magic. It's cloud infrastructure at an extraordinary scale. API gateways, load balancers, GPU clusters, caching layers, monitoring systems, and global networking: every layer is an engineering discipline with career opportunities.
The models get the attention. The infrastructure gets the salaries. And right now, there are far more open infrastructure roles at AI companies than qualified engineers to fill them.
Ola
Founder, CloudPros
Building the most hands-on DevOps bootcamp for the AI era. 16 weeks of real infrastructure, real projects, real career outcomes.
