In Day 2, I learned how Kubernetes discovers GPUs using Device Plugins.
Once a GPU is exposed as a resource, a Pod can request it just like CPU or memory.
That led me to a new question:
If Kubernetes can schedule GPUs, why is GPU utilization still such a big problem?
The answer is simple:
Most workloads don’t use an entire GPU.
The GPU Utilization Problem
Imagine you have an NVIDIA A100 GPU with 80 GB of memory.
Now suppose your inference workload only needs:
- 10 GB of GPU memory
- 20% GPU utilization
Kubernetes still schedules the Pod like this:
Pod A
↓
1 GPU Requested
↓
1 GPU Allocated
The remaining GPU resources sit idle.
You paid for the entire GPU.
But you’re only using a fraction of it.
This stranded capacity is a form of resource fragmentation: the GPU is fully allocated but only partly used.
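To make the waste concrete, here is a small sketch of the arithmetic for the A100 example above. The class and field names are my own, made up for illustration; they don't come from Kubernetes or any NVIDIA tool.

```python
from dataclasses import dataclass


@dataclass
class GpuUsage:
    """Hypothetical record of what one Pod actually uses on its GPU."""
    total_memory_gb: float
    used_memory_gb: float
    compute_utilization: float  # fraction between 0.0 and 1.0

    def idle_memory_gb(self) -> float:
        return self.total_memory_gb - self.used_memory_gb

    def idle_memory_fraction(self) -> float:
        return self.idle_memory_gb() / self.total_memory_gb

    def idle_compute_fraction(self) -> float:
        return 1.0 - self.compute_utilization


# Numbers from the example: 80 GB A100, 10 GB used, 20% utilization.
usage = GpuUsage(total_memory_gb=80, used_memory_gb=10, compute_utilization=0.20)

print(f"Idle memory:  {usage.idle_memory_gb():.0f} GB ({usage.idle_memory_fraction():.1%})")
print(f"Idle compute: {usage.idle_compute_fraction():.0%}")
```

With these numbers, 70 GB of memory (87.5%) and 80% of the compute sit idle, even though the Pod holds the whole GPU.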
Why CPUs Are Different
With CPUs, multiple Pods can share the same machine.
For example:
Node
├── Pod A (1 CPU)
├── Pod B (2 CPU)
├── Pod C (1 CPU)
└── Pod D (4 CPU)
Kubernetes is very good at packing workloads together.
GPUs are trickier.
By default, Kubernetes allocates GPUs as whole devices: a Pod requests an integer number of GPUs, and each GPU goes exclusively to one Pod.
That means:
GPU
└── Pod A
Pod A holds the whole GPU even if it uses only a small percentage of it.
The Cost Problem
This becomes expensive quickly.
Imagine:
- 8 GPUs in a cluster
- Every workload uses only 25% of a GPU
In theory, the cluster is doing only 2 GPUs' worth of useful work (8 × 25% = 2).
The other 6 GPUs' worth of compute capacity is wasted.
For organizations serving LLMs or running inference at scale, this directly translates into higher infrastructure costs.
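The same back-of-the-envelope calculation as a quick sketch. The function names are hypothetical, and "perfect packing" is an idealization: it ignores memory limits, isolation, and interference between workloads.

```python
import math


def useful_gpu_equivalents(gpu_count: int, avg_utilization: float) -> float:
    """Total useful work, expressed in whole GPUs."""
    return gpu_count * avg_utilization


def gpus_if_perfectly_packed(gpu_count: int, avg_utilization: float) -> int:
    """GPUs needed if every workload could share a GPU with no overhead."""
    # round() guards against floating-point noise before taking the ceiling.
    return math.ceil(round(useful_gpu_equivalents(gpu_count, avg_utilization), 9))


gpu_count = 8
avg_utilization = 0.25

useful = useful_gpu_equivalents(gpu_count, avg_utilization)
packed = gpus_if_perfectly_packed(gpu_count, avg_utilization)

print(f"Useful work:   {useful:.1f} GPUs out of {gpu_count}")
print(f"Idle capacity: {gpu_count - useful:.1f} GPUs")
print(f"GPUs needed with perfect packing: {packed}")
```

Real clusters won't reach the perfect-packing number, but the gap between 8 and 2 shows how much room there is to improve.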
The challenge isn’t getting GPUs.
The challenge is keeping them busy.
How The Industry Solves This
To improve utilization, platforms use techniques such as:
Time Slicing
Multiple workloads take turns using the same GPU.
Think of it like CPU time-sharing.
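The turn-taking idea can be sketched as a toy round-robin simulation. This only models the scheduling concept; it is not how the NVIDIA driver or device plugin implements time slicing, and the workload names and durations are made up.

```python
from collections import deque


def time_slice(workloads: dict[str, int], slice_ms: int) -> list[tuple[str, int]]:
    """Round-robin over workloads.

    workloads maps a name to the milliseconds of GPU work it still needs.
    Returns the order of turns as (name, ms_run) pairs.
    """
    queue = deque(workloads.items())
    schedule = []
    while queue:
        name, remaining = queue.popleft()
        run = min(slice_ms, remaining)
        schedule.append((name, run))
        if remaining > run:
            # Not finished yet: go to the back of the line.
            queue.append((name, remaining - run))
    return schedule


schedule = time_slice({"pod-a": 30, "pod-b": 10, "pod-c": 20}, slice_ms=10)
for name, ms in schedule:
    print(f"{name} runs for {ms} ms")
```

All three Pods make progress on one GPU, but each one waits while the others take their turn.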

MIG (Multi-Instance GPU)
Some NVIDIA GPUs can be partitioned into smaller logical GPUs.
Instead of one large GPU:
80 GB GPU
You can create:
10 GB GPU
10 GB GPU
20 GB GPU
40 GB GPU
Each workload gets its own isolated slice.

For more details:
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/latest/introduction.html
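Here is a rough sketch of matching workloads to the slices from the example above. It only models the "smallest slice that fits" idea. Real MIG instances come from a fixed set of profiles defined by NVIDIA and are created with NVIDIA tooling, not with code like this. The workload names and memory sizes are hypothetical.

```python
def assign_to_slices(slices_gb: list[int], workloads_gb: dict[str, int]) -> dict:
    """Give each workload the smallest free slice that fits its memory need.

    Returns a mapping of workload name to slice size in GB,
    or None if no free slice is large enough.
    """
    free = sorted(slices_gb)
    placement = {}
    # Place the largest workloads first so they are not blocked by small ones.
    for name, need in sorted(workloads_gb.items(), key=lambda kv: kv[1], reverse=True):
        fit = next((s for s in free if s >= need), None)
        if fit is not None:
            free.remove(fit)
        placement[name] = fit
    return placement


slices = [10, 10, 20, 40]  # the partitioning from the example, 80 GB in total
workloads = {"embedder": 8, "reranker": 10, "chat-small": 18, "chat-large": 35}

placement = assign_to_slices(slices, workloads)
for name, slice_gb in placement.items():
    print(f"{name} -> {slice_gb} GB slice")
```

Four workloads now share one physical GPU, each in its own slice, instead of occupying four separate GPUs.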
Model Batching
Inference systems combine multiple requests into a single GPU execution.
Instead of processing requests one by one, they process them together.
This keeps the GPU busier and improves throughput.
We’ll see this later when looking at vLLM.
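As a minimal illustration of the idea, here is simple static batching in plain Python. This is not vLLM's scheduler, and fake_model_forward is a hypothetical stand-in for a real GPU call.

```python
def make_batches(requests: list[str], max_batch_size: int) -> list[list[str]]:
    """Group pending requests into batches of at most max_batch_size."""
    return [
        requests[i:i + max_batch_size]
        for i in range(0, len(requests), max_batch_size)
    ]


def fake_model_forward(batch: list[str]) -> list[int]:
    """Stand-in for one GPU execution; returns one dummy result per request."""
    return [len(prompt) for prompt in batch]


requests = [f"prompt-{i}" for i in range(10)]
batches = make_batches(requests, max_batch_size=4)

results = []
for batch in batches:
    # One "GPU call" serves every request in the batch.
    results.extend(fake_model_forward(batch))

print(f"{len(requests)} requests served in {len(batches)} GPU calls")
print(f"Batch sizes: {[len(b) for b in batches]}")
```

Ten requests are served in three GPU calls (batches of 4, 4, and 2) instead of ten separate calls.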
Why This Matters For AI Platforms
When I first started learning AI infrastructure, I assumed GPUs were the scarce resource.
Now I realize utilization is the scarce resource.
A cluster might have plenty of GPUs.
The real challenge is making sure they aren’t sitting idle.
This is why AI platform teams spend so much time thinking about:
- GPU sharing
- Resource fragmentation
- Scheduling
- Batching
- Capacity planning
All of these exist for one reason:
GPUs are expensive.
Today’s Takeaway
Kubernetes can schedule GPUs.
But scheduling them efficiently is a completely different challenge.
The goal isn’t simply to allocate GPUs.
The goal is to maximize utilization while keeping workloads performant.
Tomorrow I’ll explore another consequence of this problem:
Why AI workloads need different scheduling strategies than traditional applications.