On Day 2, I learned how Kubernetes discovers GPUs using Device Plugins.

Once a GPU is exposed as a resource, a Pod can request it just like CPU or memory.

That led me to a new question:

If Kubernetes can schedule GPUs, why is GPU utilization still such a big problem?

The answer is simple:

Most workloads don’t use an entire GPU.

The GPU Utilization Problem

Imagine you have an NVIDIA A100 GPU with 80 GB of memory.

Now suppose your inference workload only needs:

  • 10 GB of GPU memory
  • 20% GPU utilization

Kubernetes still schedules the Pod like this:

Pod A
↓
1 GPU Requested
↓
1 GPU Allocated

The remaining 70 GB of memory and roughly 80% of the compute sit idle.

You paid for the entire GPU.

But you’re only using a fraction of it.

IMAGE: Single Pod occupying an entire GPU while most GPU resources remain unused

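This is GPU underutilization: the whole device is allocated, but most of its capacity goes unused. It is sometimes described as a form of resource fragmentation.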

Why CPUs Are Different

With CPUs, multiple Pods can share the same machine.

For example:

Node
├── Pod A (1 CPU)
├── Pod B (2 CPU)
├── Pod C (1 CPU)
└── Pod D (4 CPU)

Kubernetes is very good at packing workloads together.

GPUs are trickier.

By default, a GPU is allocated exclusively to a single Pod, and it can only be requested in whole units.

That means:

GPU
└── Pod A

Even if Pod A only uses a small percentage of the GPU.
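To make this concrete, here is roughly what a whole-GPU request looks like, written as a Python dict that mirrors a Pod manifest. This is only a sketch: `nvidia.com/gpu` is the resource name advertised by the NVIDIA device plugin, while the Pod name, container name, and image are placeholders I made up.

```python
import json

# Minimal Pod manifest expressed as a Python dict (mirrors the YAML form).
# "inference-pod", "model-server", and the image are hypothetical placeholders.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "inference-pod"},
    "spec": {
        "containers": [
            {
                "name": "model-server",
                "image": "example.com/model-server:latest",
                "resources": {
                    # GPUs are requested as whole integers under limits;
                    # a fractional value such as 0.25 is not accepted.
                    "limits": {"nvidia.com/gpu": 1},
                },
            }
        ]
    },
}

print(json.dumps(pod, indent=2))
```

There is no field here for "a quarter of a GPU", which is why the Pod ends up holding the whole device.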

The Cost Problem

This becomes expensive quickly.

Imagine:

  • 8 GPUs in a cluster
  • Every workload uses only 25% of a GPU

In theory, the cluster is doing only 2 GPUs’ worth of useful work (8 × 25% = 2).

The other 6 GPUs’ worth of compute capacity is wasted.
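The arithmetic is simple enough to sketch in a few lines of Python. The function name and the returned keys are my own, and the model assumes every GPU runs at the same average utilization, which real clusters rarely do.

```python
import math

def utilization_summary(num_gpus: int, avg_utilization: float) -> dict:
    """Summarize useful vs. idle capacity in whole-GPU equivalents."""
    useful = num_gpus * avg_utilization
    idle = num_gpus - useful
    # Lower bound on GPUs needed if the same work could be packed perfectly.
    # Rounding first avoids floating-point noise pushing ceil() up by one.
    ideal_gpus = math.ceil(round(useful, 9))
    return {"useful_gpus": useful, "idle_gpus": idle, "ideal_gpus": ideal_gpus}

summary = utilization_summary(num_gpus=8, avg_utilization=0.25)
print(summary)
# {'useful_gpus': 2.0, 'idle_gpus': 6.0, 'ideal_gpus': 2}
```

With perfect packing, the same work would fit on 2 GPUs instead of 8.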

For organizations serving LLMs or running inference at scale, this directly translates into higher infrastructure costs.

The challenge isn’t getting GPUs.

The challenge is keeping them busy.

How The Industry Solves This

To improve utilization, platforms use techniques such as:

Time Slicing

Multiple workloads take turns using the same GPU.

Think of it like CPU time-sharing, except that the workloads sharing the GPU get no memory or fault isolation from each other.

IMAGE: Multiple Pods sharing GPU through time slicing

For more details:

https://run-ai-docs.nvidia.com/saas/platform-management/runai-scheduler/resource-optimization/time-slicing
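To build some intuition for "taking turns", here is a toy round-robin simulation in Python. It is not how the GPU driver actually schedules work; the function name, the Pod names, and the millisecond figures are all made up for illustration.

```python
from collections import deque
from typing import Dict, List, Tuple

def time_slice(work_ms: Dict[str, int], slice_ms: int) -> List[Tuple[str, int]]:
    """Round-robin: each workload runs for at most slice_ms, then yields."""
    if slice_ms <= 0:
        raise ValueError("slice_ms must be positive")
    queue = deque(work_ms.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        ran = min(slice_ms, remaining)
        timeline.append((name, ran))
        if remaining > ran:
            # Unfinished work goes to the back of the queue.
            queue.append((name, remaining - ran))
    return timeline

# Hypothetical workloads and their total GPU time in milliseconds.
timeline = time_slice({"pod-a": 30, "pod-b": 10, "pod-c": 20}, slice_ms=10)
for name, ran in timeline:
    print(f"{name} runs for {ran} ms")
```

The GPU stays busy for the full 60 ms, but each Pod waits while the others run, so sharing this way trades per-workload latency for overall utilization.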

MIG (Multi-Instance GPU)

Some NVIDIA data center GPUs, such as the A100 and H100, can be partitioned into smaller, hardware-isolated GPU instances.

Instead of one large GPU:

80 GB GPU

You can create:

10 GB GPU
10 GB GPU
20 GB GPU
40 GB GPU

Each workload gets its own isolated slice.

IMAGE: MIG partitioning a large GPU into multiple smaller GPUs

For more details:

https://docs.nvidia.com/datacenter/tesla/mig-user-guide/latest/introduction.html
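Once a GPU is split up, something has to decide which workload lands on which slice. Below is a small best-fit sketch in Python using the 10/10/20/40 GB layout from above. The workload names and their memory needs are invented, and real MIG only offers a fixed set of profiles, so treat this as an illustration of the packing idea rather than of the MIG tooling.

```python
from typing import Dict, List, Optional

def assign_slices(slices_gb: List[int], workloads_gb: Dict[str, int]) -> Dict[str, Optional[int]]:
    """Best-fit: give each workload the smallest free slice that can hold it."""
    free = sorted(slices_gb)
    assignment: Dict[str, Optional[int]] = {}
    # Place the largest workloads first so they are not starved of big slices.
    for name, need in sorted(workloads_gb.items(), key=lambda kv: kv[1], reverse=True):
        fit = next((s for s in free if s >= need), None)
        assignment[name] = fit  # None means no free slice is large enough
        if fit is not None:
            free.remove(fit)
    return assignment

layout = [10, 10, 20, 40]  # slice sizes in GB, from the example above
# Hypothetical workloads and their GPU memory needs in GB.
workloads = {"pod-a": 8, "pod-b": 10, "pod-c": 18, "pod-d": 35}
placement = assign_slices(layout, workloads)
print(placement)
# {'pod-d': 40, 'pod-c': 20, 'pod-b': 10, 'pod-a': 10}
```

Four workloads now share one physical GPU, each with its own slice, instead of occupying four separate GPUs.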

Model Batching

Inference systems combine multiple requests into a single GPU execution.

Instead of processing requests one by one, they process them together.

This keeps the GPU busier and improves throughput.

We’ll see this later when looking at vLLM.
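Until then, here is a minimal sketch of the idea in Python. It shows simple fixed-size batching only; inference engines such as vLLM use more dynamic schemes. `run_batch` is a stand-in for a single GPU forward pass, and every name in this snippet is hypothetical.

```python
from typing import List

def make_batches(requests: List[str], max_batch_size: int) -> List[List[str]]:
    """Group queued requests into batches of at most max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        requests[i:i + max_batch_size]
        for i in range(0, len(requests), max_batch_size)
    ]

def run_batch(batch: List[str]) -> List[str]:
    # Stand-in for one GPU execution over the whole batch (hypothetical).
    return [f"response to {prompt}" for prompt in batch]

prompts = [f"prompt-{i}" for i in range(7)]
batches = make_batches(prompts, max_batch_size=4)

responses = []
for batch in batches:
    responses.extend(run_batch(batch))

print(f"{len(prompts)} requests served in {len(batches)} GPU executions")
# 7 requests served in 2 GPU executions
```

Processing the same 7 requests one by one would take 7 GPU executions, each leaving most of the GPU idle.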

Why This Matters For AI Platforms

When I first started learning AI infrastructure, I assumed GPUs were the scarce resource.

Now I realize utilization is the scarce resource.

A cluster might have plenty of GPUs.

The real challenge is making sure they aren’t sitting idle.

This is why AI platform teams spend so much time thinking about:

  • GPU sharing
  • Resource fragmentation
  • Scheduling
  • Batching
  • Capacity planning

All of these exist for one reason:

GPUs are expensive.

Today’s Takeaway

Kubernetes can schedule GPUs.

But scheduling them efficiently is a completely different challenge.

The goal isn’t simply to allocate GPUs.

The goal is to maximize utilization while keeping workloads performant.

Tomorrow I’ll explore another consequence of this problem:

Why AI workloads need different scheduling strategies than traditional applications.