How Go's Scheduler Actually Works: A Deep Dive into the GMP Model


You’ve probably heard that goroutines are “lightweight threads.” But what does that actually mean? How does Go run millions of goroutines on just a handful of OS threads? The answer is the GMP scheduler — one of the most elegant parts of Go’s runtime.

The Problem Go Had to Solve

In languages like C, or Java with its classic platform threads, each thread maps directly to an OS thread. The kernel handles switching between them. This works, but the costs add up quickly.

OS threads require large fixed stacks, often megabytes of reserved virtual memory per thread. Spawn 10,000 threads and you're reserving gigabytes of address space just for stacks. Thread creation involves system calls and kernel setup, which costs microseconds per thread. Context switching means entering the kernel, saving and restoring registers, and often evicting useful data from CPU caches.

Goroutines flip this entirely. They start with tiny stacks (usually a few KB, implementation-dependent) and grow or shrink dynamically as needed. Creation is dramatically cheaper than creating OS threads and is designed to support massive concurrency. Scheduling decisions for goroutines happen mostly in user-space, avoiding the overhead of managing one OS thread per concurrent task.

Go wanted developers to spawn millions of concurrent tasks cheaply.
The solution: don't map goroutines to OS threads 1:1. Instead, multiplex many goroutines onto a small pool of OS threads using a user-space scheduler built into the runtime.
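To make the cost claim concrete, here is a minimal sketch that starts a large number of goroutines and waits for all of them. The helper name spawnMany and the count of 100,000 are illustrative assumptions, not figures from the runtime.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// spawnMany starts n goroutines that each increment a shared counter,
// waits for all of them, and returns the final count.
// (Hypothetical helper, used only for illustration.)
func spawnMany(n int) int64 {
	var wg sync.WaitGroup
	var done int64
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			atomic.AddInt64(&done, 1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&done)
}

func main() {
	const n = 100000 // arbitrary illustrative count
	fmt.Println("goroutines completed:", spawnMany(n))
}
```

Trying the same count with one OS thread per task would typically hit memory or thread limits on many systems, which is the gap the runtime scheduler is designed to close.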

Go Scheduler (The GMP Model)

The Go scheduler is the part of the Go runtime that efficiently executes goroutines by mapping them onto OS threads and CPU cores.

G: Goroutine. A lightweight unit of execution. Each goroutine has its own stack (starting at about 2KB and growing dynamically), program counter, and state. Goroutines are cheap enough that you can spawn millions of them.

M: Machine (OS thread). An M is an actual kernel thread, the thing the operating system schedules onto a CPU core. Ms are expensive because OS threads require kernel-managed stacks (often megabytes of reserved virtual memory, depending on platform), kernel scheduling metadata, and more expensive creation and context switching. Go keeps the number of Ms low.

P: Processor. P is the scheduler's key innovation: a logical processor that holds a local run queue of goroutines and memory allocator caches. The number of Ps is set by GOMAXPROCS, which defaults to the number of logical CPUs available to the process.

The key relationship: M must hold a P to run goroutines.
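The number of Ps can be inspected and changed from inside a program. The sketch below uses the standard runtime.GOMAXPROCS and runtime.NumCPU functions; the helper name currentPs is an assumption made for this example.

```go
package main

import (
	"fmt"
	"runtime"
)

// currentPs reports the number of Ps without changing it:
// runtime.GOMAXPROCS(n) only updates the setting when n > 0.
func currentPs() int {
	return runtime.GOMAXPROCS(0)
}

func main() {
	fmt.Println("logical CPUs:", runtime.NumCPU())
	fmt.Println("Ps (GOMAXPROCS):", currentPs())

	// Temporarily run with two Ps, then restore the previous value.
	prev := runtime.GOMAXPROCS(2)
	fmt.Println("Ps after change:", currentPs())
	runtime.GOMAXPROCS(prev)
}
```

The same setting can be supplied from outside the program through the GOMAXPROCS environment variable.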

How They Connect

G, M, and P are not independent — the scheduler constantly coordinates all three to keep every CPU core doing useful work. Here's how they interact at runtime.

Each P holds a Local Run Queue (LRQ) of goroutines. When you write go f(), the new goroutine is placed on the local queue of the P that created it. The M (OS thread) bound to that P picks goroutines from this queue one by one and executes them. One goroutine runs at a time on each M, while the rest wait in the queue for their turn. The local queue is lock-free: the owning P adds and removes goroutines using atomic operations rather than a mutex, and other Ps touch the queue only when they steal from it. This is the fast path that keeps scheduling overhead low.

The Global Run Queue (GRQ) acts as a shared fallback. Not every goroutine lives in a local queue. When a P's local queue is full (256 slots in the current runtime implementation), the runtime moves half of its goroutines, together with the new one, into the GRQ. Goroutines returning from blocking syscalls with no available P also land here. Every P periodically checks the GRQ to pick up waiting goroutines (in the current implementation, on every 61st scheduling tick, even when its local queue is not empty), which prevents any goroutine from being stuck in the global queue forever.
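The overflow rule can be sketched with a toy model. This is not the runtime's code: the types G, P, and Sched and the method put are hypothetical, slices stand in for the runtime's fixed ring buffer, and locking of the global queue is omitted.

```go
package main

import "fmt"

const localCap = 256 // assumption: mirrors the local queue size quoted above

// G is a stand-in for a goroutine; only an ID is modeled.
type G struct{ ID int }

// P is a toy processor with a bounded local run queue.
type P struct{ local []G }

// Sched holds the shared global run queue.
// (A real scheduler would guard it with a lock.)
type Sched struct{ global []G }

// put enqueues g on p. If the local queue is full, the older half of the
// local queue plus g is moved to the global queue instead.
func (s *Sched) put(p *P, g G) {
	if len(p.local) < localCap {
		p.local = append(p.local, g)
		return
	}
	half := len(p.local) / 2
	s.global = append(s.global, p.local[:half]...)
	s.global = append(s.global, g)
	p.local = append([]G(nil), p.local[half:]...)
}

func main() {
	s, p := &Sched{}, &P{}
	for i := 1; i <= localCap+1; i++ {
		s.put(p, G{ID: i})
	}
	fmt.Println("local:", len(p.local), "global:", len(s.global))
	// prints "local: 128 global: 129"
}
```

Moving a batch at once means the P pays the cost of touching the shared queue once per overflow rather than once per goroutine.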

Work stealing keeps the processors evenly loaded. If one P finishes all its local work while another P has a backlog of goroutines queued up, the idle P doesn't just park its thread and waste a CPU core. Instead, it reaches into the busy P's local queue and takes roughly half of its goroutines. This load balancing happens automatically: you never write code to trigger it and no configuration is needed. The runtime tries to ensure that as long as runnable goroutines exist somewhere in the system, a P does not sit idle for long.
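As a rough illustration of the "take about half" rule, here is a toy steal function. The types and the name stealHalf are hypothetical, and the choice to take from the front of the victim's queue is an assumption of this sketch rather than a statement about the runtime.

```go
package main

import "fmt"

// G is a stand-in for a goroutine; only an ID is modeled.
type G struct{ ID int }

// P is a toy processor with a local run queue.
type P struct {
	Name  string
	local []G
}

// stealHalf moves about half of victim's queued goroutines (rounding up)
// to thief and returns how many were taken.
func stealHalf(thief, victim *P) int {
	n := len(victim.local) - len(victim.local)/2
	if n == 0 {
		return 0 // nothing to steal
	}
	thief.local = append(thief.local, victim.local[:n]...)
	victim.local = victim.local[n:]
	return n
}

func main() {
	busy := &P{Name: "P1"}
	for i := 1; i <= 6; i++ {
		busy.local = append(busy.local, G{ID: i})
	}
	idle := &P{Name: "P2"}

	n := stealHalf(idle, busy)
	fmt.Println("stolen:", n, "P1 left:", len(busy.local), "P2 now:", len(idle.local))
	// prints "stolen: 3 P1 left: 3 P2 now: 3"
}
```

Taking a batch rather than a single goroutine keeps the thief busy for a while, so it does not need to come back and steal again immediately.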

Blocking syscalls trigger a P handoff. This is the most critical interaction. When a goroutine makes a blocking syscall, like reading a file, the OS thread executing it gets stuck in the kernel. If the scheduler did nothing, the P attached to that thread would also be stuck, wasting a CPU slot. So when a syscall blocks, the runtime detaches the P from the blocked thread and hands it to an idle thread (or creates a new one). In current implementations this is not always instantaneous: for syscalls that return quickly the thread simply keeps its P, and the P is taken away only once the call has stayed in the kernel for a short while. The P then continues executing other goroutines. The blocked thread waits alone in the kernel with no P, so it cannot run anything else. When the syscall returns, the thread tries to reclaim a P. If none is free, the goroutine goes to the GRQ and the thread parks itself.

The result: whenever runnable goroutines exist, each P is either executing one or looking for one (checking its local queue, checking the GRQ, or stealing from another P). A P is not left stuck behind a blocked thread, and no goroutine is forgotten in a queue. This is what makes the go statement so powerful: behind that one keyword, the entire GMP machinery coordinates to keep your code running on every available core.

Full Scheduling Flow

Now let's trace a real program through the scheduler. Consider this code with GOMAXPROCS=2, where main then waits for the three tasks with wg.Wait():

go task1()
go task2()
go task3()

Step 1: main spawns G1, G2, G3
main runs on P1/M1. Each go statement creates a goroutine and pushes it into P1's local queue. After all three calls, P1 holds G1, G2, and G3. P2's queue is empty — M2 is idle. All three goroutines landed on P1 because new goroutines always go to the spawning P's local queue first.

Step 2: P2 steals work from P1
P2 detects that P1 has a backlog while it has nothing. M2 steals roughly half — G2 and G3. P2 immediately starts executing G2. G3 waits in P2's local queue. P1 is left with G1 and starts executing it once main blocks on wg.Wait().

Step 3: True parallel execution
Both processors are now running simultaneously on separate CPU cores. P1/M1 executes G1, P2/M2 executes G2, G3 waits in P2's queue. This is real parallelism — two goroutines running at the exact same time on two cores.

Step 4: G2 finishes, G3 takes over
G2 completes. P2 immediately picks up G3 from its local queue and starts running it — no delay, no kernel involvement, entirely user-space. P1 is still running G1. Both cores stay busy.

Step 5: Threads park
G1 finishes. P1 checks the GRQ (empty), tries to steal from P2 (nothing to steal). M1 parks — sleeps cheaply until new work appears. G3 finishes, M2 parks. wg.Wait() unblocks, main returns, program exits.
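The five steps above assume a complete program around the three go statements. A minimal runnable version might look like the following sketch. The body of task is a hypothetical stand-in for task1, task2, and task3, and the helper name runTrace is an assumption.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

var completed int64

// task is a hypothetical stand-in for task1/task2/task3:
// it does a little CPU work and records that it finished.
func task(id int, wg *sync.WaitGroup) {
	defer wg.Done()
	sum := 0
	for i := 0; i < 1000000; i++ {
		sum += i
	}
	_ = sum
	atomic.AddInt64(&completed, 1)
	fmt.Printf("task%d done\n", id)
}

// runTrace runs the three tasks on two Ps and returns how many completed.
func runTrace() int64 {
	prev := runtime.GOMAXPROCS(2) // two Ps, as in the walkthrough
	defer runtime.GOMAXPROCS(prev)

	atomic.StoreInt64(&completed, 0)
	var wg sync.WaitGroup
	wg.Add(3)
	go task(1, &wg)
	go task(2, &wg)
	go task(3, &wg)
	wg.Wait() // main blocks here until all three goroutines finish
	return atomic.LoadInt64(&completed)
}

func main() {
	fmt.Println("completed:", runTrace())
}
```

The order of the "done" lines can differ between runs, since which P ends up running which goroutine is up to the scheduler. Running a program with GODEBUG=schedtrace=1000 set in the environment makes the runtime print a periodic summary of its threads and run queues, which is one way to watch this happen.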

What Happens When a Goroutine Makes a Blocking Syscall?

You've seen how the scheduler runs goroutines normally — P picks a G from its local queue, hands it to M, M executes it on a CPU core. Clean and fast.

But what happens when the goroutine needs to do something the Go runtime has no control over — like reading a file from disk? This is a blocking syscall. The goroutine asks the operating system kernel to perform an operation, and the kernel says "wait, I'll get back to you when the disk responds." The OS thread (M) that made this call is now frozen — it cannot do anything else until the kernel returns.

This creates a serious problem. Remember the golden rule: M must hold a P to run goroutines. If M is frozen in the kernel, the P attached to it is also stuck. That P has other goroutines in its local queue waiting to run. If the scheduler did nothing, an entire CPU core would sit idle just because one goroutine decided to read a file.

Go solves this with a technique called P handoff.

Now let's say task2() hits a file.Read() — a blocking syscall. This is where the scheduler gets really clever.

Phase 1 — Before the syscall. G2 is running task2() on P2/M2. G3 is waiting in P2's local queue. Everything is normal.

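Phase 2: The moment the syscall starts. Right before M2 enters the kernel, the runtime calls entersyscall(), which marks P2 as being in a syscall so that it can be taken away from M2. If the call does not return almost immediately, P2 is detached from M2. In current implementations the sysmon thread does this after a short delay, while calls the runtime knows will block hand the P off right away. P2 keeps its local queue, including G3, and binds to an idle thread M3, or to a newly created one if none exists. P2 is now fully operational again, running G3 on M3. Meanwhile, M2 sits in the kernel alone. It has no P, so it cannot execute any other goroutine. G2 stays frozen on it, waiting for the disk to respond.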

Phase 3 — The syscall returns. The kernel finishes the file read and wakes M2. The goroutine G2 is ready to resume, but M2 has no P. Without a P, it cannot execute anything. So it tries three things in order:

  1. Check whether P2 is free and grab it back (the fast path).

  2. Otherwise, check whether any other P is idle and take it.

  3. If every P is busy, place G2 on the global run queue and park M2.
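The three-step choice can be written down as a small decision function. This is only a toy model of the logic described above: the function name, its parameters, and the returned strings are all hypothetical.

```go
package main

import "fmt"

// resumeAfterSyscall models what a thread does when its syscall returns.
// oldPFree says whether the P it held before is still available;
// idlePs is the number of other idle Ps.
func resumeAfterSyscall(oldPFree bool, idlePs int) string {
	switch {
	case oldPFree:
		return "reacquire old P and keep running G"
	case idlePs > 0:
		return "acquire an idle P and keep running G"
	default:
		return "put G on the global run queue and park M"
	}
}

func main() {
	fmt.Println(resumeAfterSyscall(true, 0))
	fmt.Println(resumeAfterSyscall(false, 1))
	fmt.Println(resumeAfterSyscall(false, 0))
}
```

The ordering favors the old P first because the goroutine's recently used data is more likely to still be in that core's caches.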

Why this is brilliant: in a one-thread-per-task design, a blocking syscall leaves an entire thread, with its large reserved stack, sitting idle. In Go, the thread still blocks, but the P does not stay blocked with it. The P is released once the syscall is seen to block and continues doing useful work on a different thread, so the remaining goroutines keep running on all cores while the syscall waits in the background.
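The effect can be observed with a single P. In the sketch below, which assumes a Unix-like system, one goroutine blocks in a raw read on a pipe created with syscall.Pipe (bypassing the os package so that the read really is a blocking syscall). With only one P, the calling goroutine can make progress only if the P is taken away from the blocked thread. The helper name blockingReadDemo is an assumption.

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
)

// blockingReadDemo runs with a single P. One goroutine blocks in a raw
// read on a pipe; the caller must still get to run in order to write.
func blockingReadDemo() (string, error) {
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev)

	var fds [2]int
	if err := syscall.Pipe(fds[:]); err != nil {
		return "", err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])

	type result struct {
		data string
		err  error
	}
	out := make(chan result, 1)
	go func() {
		buf := make([]byte, 16)
		for {
			n, err := syscall.Read(fds[0], buf) // blocks in the kernel
			if err == syscall.EINTR {
				continue // interrupted by a signal, retry
			}
			if err != nil {
				out <- result{err: err}
				return
			}
			out <- result{data: string(buf[:n])}
			return
		}
	}()

	runtime.Gosched() // give the reader a chance to enter the syscall first
	if _, err := syscall.Write(fds[1], []byte("done")); err != nil {
		return "", err
	}
	r := <-out
	return r.data, r.err
}

func main() {
	data, err := blockingReadDemo()
	fmt.Println(data, err)
}
```

If the P stayed attached to the thread stuck in read, the write would never happen and the program would hang. Instead the caller resumes on another thread and unblocks the reader.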

Network I/O: The Netpoller

Network calls work completely differently. Go wraps them using a netpoller: epoll on Linux, kqueue on macOS.

Go keeps network sockets in non-blocking mode and registers them with the netpoller. When a goroutine tries to read from a socket and no data is ready, the runtime parks the goroutine: it goes off-CPU, but the OS thread is NOT blocked. The M picks up the next goroutine and keeps running. When data arrives, the netpoller marks the goroutine runnable and it is put back on a run queue.

This is why Go can handle hundreds of thousands of concurrent connections without thousands of threads.
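A small loopback experiment gives a feel for this. The sketch below opens a number of TCP connections to a local listener, with one server-side goroutine per connection waiting in Read, and then prints how many OS threads the runtime has created according to the threadcreate profile. The helper name parkedReaders and the count of 100 are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net"
	"runtime/pprof"
)

// parkedReaders opens n loopback connections. Each server-side goroutine
// waits in Read until its client sends one byte, then echoes it back.
// It returns the number of successful echoes.
func parkedReaders(n int) (int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()

	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func(c net.Conn) {
				defer c.Close()
				buf := make([]byte, 1)
				if _, err := c.Read(buf); err == nil {
					c.Write(buf)
				}
			}(c)
		}
	}()

	conns := make([]net.Conn, 0, n)
	defer func() {
		for _, c := range conns {
			c.Close()
		}
	}()
	for i := 0; i < n; i++ {
		c, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			return 0, err
		}
		conns = append(conns, c)
	}

	// Server-side goroutines that have been accepted are now waiting in Read.
	echoed := 0
	buf := make([]byte, 1)
	for _, c := range conns {
		if _, err := c.Write([]byte{'x'}); err != nil {
			return echoed, err
		}
		if _, err := c.Read(buf); err != nil {
			return echoed, err
		}
		echoed++
	}
	return echoed, nil
}

func main() {
	n, err := parkedReaders(100)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("connections echoed:", n)
	fmt.Println("OS threads created so far:", pprof.Lookup("threadcreate").Count())
}
```

The exact thread count varies by machine and Go version, but it should stay far below the number of waiting goroutines, because a goroutine parked on a socket does not hold a thread.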

Preemption: Stopping Greedy Goroutines

Every scheduling system has a fundamental problem: what happens when a goroutine refuses to stop?

Consider this:

func greedy() {
    for {
        // tight loop
        // no function calls
        // no I/O
    }
}

If the scheduler has no way to interrupt this goroutine, it holds the P forever. Every other goroutine on that P starves. On a single-core machine, the entire program freezes.

Go solves this with two layers of preemption.

  1. Layer 1 — Cooperative Preemption (historically)
    Earlier versions of Go relied primarily on cooperative preemption. Goroutines could yield at safe points such as function calls, stack growth checks, channel operations, allocations, and certain runtime interactions.

    This worked well for most real-world programs because normal code naturally hits these points frequently.

    But it had one major weakness: tight CPU-bound loops with no function calls or runtime interactions could run for too long without yielding, causing scheduler unfairness.

  2. Layer 2 — Asynchronous Preemption (Go 1.14+)
    Go 1.14 introduced asynchronous preemption to solve this problem.

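    A background runtime thread called sysmon continuously monitors running goroutines. If a goroutine runs for too long (roughly 10ms) without reaching a natural yield point, the runtime requests preemption. On Unix-like systems it does so by sending a signal (SIGURG) to the thread running the goroutine, and the signal handler suspends the goroutine if it was interrupted at a point the runtime considers safe.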

    The goroutine is paused, its execution state is saved, and the P is returned to the scheduler so other goroutines can run.

    This prevents CPU-heavy goroutines from monopolizing a processor and ensures scheduler fairness even for tight loops.
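The difference is easy to observe with a single P. In this sketch, which assumes Go 1.14 or later on a platform that supports asynchronous preemption, a goroutine spins in a loop whose body is only an atomic load, yet the main goroutine still wakes from its sleep. The helper name preemptDemo and the 20ms sleep are assumptions.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// preemptDemo starts a spinning goroutine on a single P and reports how
// long the caller took to regain control after asking to sleep for 20ms.
func preemptDemo() time.Duration {
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev)

	var stop int32
	done := make(chan struct{})
	go func() {
		defer close(done)
		for atomic.LoadInt32(&stop) == 0 {
			// tight loop: no I/O, no channel operations
		}
	}()

	start := time.Now()
	time.Sleep(20 * time.Millisecond) // the spinner takes over the only P
	elapsed := time.Since(start)

	atomic.StoreInt32(&stop, 1) // let the spinner exit
	<-done
	return elapsed
}

func main() {
	fmt.Println("main resumed after", preemptDemo())
}
```

With asynchronous preemption disabled (for example by setting GODEBUG=asyncpreemptoff=1), the spinning goroutine may never give up the P and the program can hang, which is the pre-1.14 behavior described under Layer 1.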

What is sysmon?
sysmon runs on a dedicated runtime thread that does not require a P, so it never competes with goroutines for a processor slot. It is the runtime's watchdog, responsible for preemption enforcement, netpoller polling, retaking Ps from threads stuck in syscalls, and assisting garbage collection. It runs in a continuous loop, sleeping between checks, and intervenes only when the scheduler needs help.

Wrapping Up

Everything we covered — local queues, the global fallback, work stealing, P handoff — exists to enforce one rule: no processor should ever sit idle while goroutines are waiting to run.

The next time you write go func(), you'll know exactly what happens underneath. A goroutine is created, placed in a queue, picked up by a thread, and executed on a core. If that thread blocks, the processor escapes. If a queue empties, work gets stolen. No wasted cores, no manual thread management.

Simple on the surface. Deeply engineered underneath. That's the GMP model.

All diagrams used in this post are available on Excalidraw — feel free to use or modify them.