
What actually happens when you launch a coroutine? A look inside the CoroutineScheduler, how Dispatchers.Default and Dispatchers.IO share one thread pool, why work stealing uses time-based affinity, and the subtle difference between Main and Main.immediate.
Prerequisites and assumptions
Note: This article reflects how the kotlinx.coroutines scheduler works as of February 2026. Internal implementation details can change between releases.
You have written Kotlin coroutines before. You know what launch, async, withContext, and suspend do at a basic level. You’ve used Dispatchers.Default, Dispatchers.IO, and Dispatchers.Main in at least one project. This article is about the machinery underneath those abstractions.
Further reading
- kotlinx.coroutines source code - everything described here comes from reading this
- CoroutineScheduler.kt - the scheduler implementation
- Official coroutine context and dispatchers documentation
- Dispatcher.kt - where DefaultScheduler and DefaultIoScheduler live
One scheduler, not two
Most people get this wrong: Dispatchers.Default and Dispatchers.IO are not backed by separate thread pools. They share a single CoroutineScheduler instance. The same worker threads execute tasks from both dispatchers.
Open Dispatcher.kt in the kotlinx.coroutines source and you’ll find this:
internal object DefaultScheduler : SchedulerCoroutineDispatcher(
CORE_POOL_SIZE, MAX_POOL_SIZE,
IDLE_WORKER_KEEP_ALIVE_NS, DEFAULT_SCHEDULER_NAME
)
And Dispatchers.IO is defined as:
private val default = UnlimitedIoScheduler.limitedParallelism(
systemProp(IO_PARALLELISM_PROPERTY_NAME, 64.coerceAtLeast(AVAILABLE_PROCESSORS))
)
UnlimitedIoScheduler dispatches every task into the same DefaultScheduler but marks it with a blocking context flag, literally a BlockingContext boolean. That flag is the entire distinction between an IO task and a CPU task inside the scheduler.
This design exists because creating and destroying threads is expensive. Sharing one pool removes the overhead of maintaining separate pools and cuts down on thread context switches when a coroutine moves from withContext(Dispatchers.Default) to withContext(Dispatchers.IO). The scheduler tries to keep execution on the same physical thread when possible.
Blocking versus non-blocking tasks
The scheduler classifies every task with a single boolean: is it blocking or not?
internal typealias TaskContext = Boolean
internal const val NonBlockingContext: TaskContext = false
internal const val BlockingContext: TaskContext = true
That’s literally it. A Boolean. When you dispatch work on Dispatchers.Default, tasks get NonBlockingContext. When you dispatch on Dispatchers.IO, tasks get BlockingContext.
Why does this matter? The scheduler treats these two categories very differently.
Non-blocking (CPU) tasks are expected to actively use the processor (compute, transform, serialize) and yield quickly. They should not call Thread.sleep(), perform file I/O, make network calls, or wait on locks. The scheduler gives them a limited number of threads and expects them to cooperate.
Blocking (IO) tasks use the CPU too, but they spend most of their time waiting on external resources: a database response, a file read, a network socket. The thread is alive but sitting idle while the disk or network does its thing. If these tasks ran on the limited CPU pool, they’d hold threads hostage doing nothing, starving actual compute work. The scheduler compensates by letting blocking tasks release their CPU permit and spinning up additional threads so CPU work doesn’t stall.
flowchart TD
subgraph Your Code
A["launch(Default) { cpuWork() }"]
B["launch(IO) { networkCall() }"]
end
A -->|"tagged: non-blocking"| S
B -->|"tagged: blocking"| S
subgraph S[Shared CoroutineScheduler]
direction TB
CPU["CPU Workers\nlimited to core count\nhold CPU permits"]
BLK["Blocking Workers\nscale up on demand\nno CPU permit needed"]
end
S --> NOTE["Same thread pool.\nDifferent rules per tag."]
CPU permits: how the scheduler protects your cores
The scheduler uses a concept called CPU permits to limit how many threads can run CPU-bound work simultaneously. The number of permits equals corePoolSize, which defaults to the number of available processors (with a minimum of 2).
The key invariant from the source code:
“scheduler always has at least min(pending CPU tasks, core pool size) and at most core pool size threads to execute CPU tasks”
When a worker picks up a blocking task, it releases its CPU permit back to the pool. This tells the scheduler “I’m going to be stuck waiting, let someone else do CPU work.” The scheduler can then wake up or create another worker to grab that freed permit.
private fun executeTask(task: Task) {
if (task.isBlocking) {
if (tryReleaseCpu(WorkerState.BLOCKING)) {
signalCpuWork()
}
runSafely(task)
decrementBlockingTasks()
} else {
runSafely(task)
}
}
This is why you should never perform blocking operations on Dispatchers.Default. If you call Thread.sleep(5000) on a Default-dispatched coroutine, you’re holding a CPU permit hostage while the thread sits idle. The scheduler has no idea you’re waiting on something external. It thinks your thread is actively computing and won’t compensate by creating new workers. On a 4-core machine, you’ve just starved 25% of your CPU capacity for five seconds.
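The permit dance can be sketched with a plain java.util.concurrent.Semaphore. This is an illustration, not the scheduler's API: the class and method names (PermitPool, execute) are invented, and the real scheduler's bookkeeping is more involved.

```kotlin
import java.util.concurrent.Semaphore

// Sketch: one CPU permit per core. A worker must hold a permit to compute;
// when it picks up a blocking task, it gives the permit back first so
// another worker can take over CPU work (mirroring tryReleaseCpu).
class PermitPool(cores: Int) {
    private val cpuPermits = Semaphore(cores)

    fun availableCpuPermits(): Int = cpuPermits.availablePermits()

    fun execute(isBlocking: Boolean, task: () -> Unit) {
        cpuPermits.acquire() // worker starts out holding a permit
        if (isBlocking) {
            cpuPermits.release() // blocking task: free the permit up front
            task()               // thread waits on I/O without starving CPU work
        } else {
            try { task() } finally { cpuPermits.release() }
        }
    }
}
```

In this model, Thread.sleep() inside a non-blocking task is exactly the failure mode described above: the permit stays acquired for the entire sleep, and nothing in the sketch (or the real scheduler) compensates.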
Parallelism limits: Default versus IO
This is where the numbers get interesting.
| Property | Dispatchers.Default | Dispatchers.IO |
|---|---|---|
| Thread pool | Shared CoroutineScheduler | Shared CoroutineScheduler |
| Task context | NonBlockingContext | BlockingContext |
| Parallelism cap | Number of CPU cores (min 2) | max(64, number of CPU cores) |
| Can exceed cap | No | Yes, via limitedParallelism() |
| Thread behavior | Bounded by CPU permits, no new threads on blocking | Releases CPU permit, triggers thread compensation |
| System property | kotlinx.coroutines.scheduler.core.pool.size | kotlinx.coroutines.io.parallelism |
On a typical 8-core machine, Dispatchers.Default allows 8 concurrent CPU tasks. Dispatchers.IO defaults to 64 concurrent blocking tasks. But the underlying scheduler can have far more threads alive simultaneously, because each blocking task triggers thread compensation.
The IO dispatcher also has a unique elasticity property. When you call Dispatchers.IO.limitedParallelism(100), you get a limited parallelism view that is not capped by the IO dispatcher’s 64-thread limit. These views are additive:
val dbDispatcher = Dispatchers.IO.limitedParallelism(100)
val fileDispatcher = Dispatchers.IO.limitedParallelism(60)
During peak load, the system can have up to 64 + 100 + 60 = 224 concurrent blocking tasks across these views, subject to the scheduler’s max pool size and actual demand. During idle periods, all these views share the same small set of threads.
Compare this to Dispatchers.Default.limitedParallelism(n). If n >= corePoolSize, it just returns Dispatchers.Default itself. There’s no elasticity. The CPU pool is a hard ceiling.
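Mechanically, a limited view owns no threads of its own. Here is a rough sketch of the idea with invented names and a plain Executor standing in for the shared pool; the real LimitedDispatcher handles races and fairness more carefully than this does.

```kotlin
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.Executor
import java.util.concurrent.atomic.AtomicInteger

// Sketch of a limitedParallelism view: it only caps how many of *its* tasks
// are in flight on the shared backing pool at any moment.
class LimitedViewSketch(private val backing: Executor, private val limit: Int) : Executor {
    private val queue = ConcurrentLinkedQueue<Runnable>()
    private val inFlight = AtomicInteger(0)

    override fun execute(task: Runnable) {
        queue.add(task)
        tryDispatch()
    }

    private fun tryDispatch() {
        while (true) {
            val running = inFlight.get()
            if (running >= limit) return // view is saturated; task stays queued
            if (!inFlight.compareAndSet(running, running + 1)) continue
            val next = queue.poll()
            if (next == null) { // another dispatch won the race for the task
                inFlight.decrementAndGet()
                if (queue.isEmpty()) return else continue
            }
            backing.execute {
                try { next.run() } finally {
                    inFlight.decrementAndGet()
                    tryDispatch() // a slot freed up; pull the next queued task
                }
            }
            return
        }
    }
}
```

Two such views over the same backing executor are independent, which is why their limits add up under load: each one counts only its own in-flight tasks.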
Workers and their queues
So far we’ve talked about dispatchers, permits, and parallelism limits: the rules of the system. Now let’s look at who actually does the work.
Every thread in the CoroutineScheduler is a Worker. Worker is an inner class that extends Thread. A worker is not something that runs on a thread; it is the thread, with extra state bolted on: a local task queue, a CPU permit flag, a parking state. When you see “worker” and “thread” in this article, they mean the same thing inside the scheduler. The distinction only matters if you’re reading the source: Worker is the type, Thread is the superclass.
Each worker owns a local work queue, and the scheduler maintains two global queues shared across all workers. Here’s how everything fits together:
flowchart TD
subgraph SCHED["CoroutineScheduler (single shared instance)"]
direction TB
subgraph GLOBAL["Global Queues (shared)"]
GC["Global CPU Queue\n(tasks from outside)"]
GB["Global Blocking Queue\n(tasks from outside)"]
end
subgraph WORKERS["Workers (each one IS a thread)"]
direction LR
subgraph W1["Worker 1"]
LST1["lastScheduledTask"]
RB1["Ring Buffer\n128 slots"]
end
subgraph W2["Worker 2"]
LST2["lastScheduledTask"]
RB2["Ring Buffer\n128 slots"]
end
subgraph WN["Worker N"]
LSTN["lastScheduledTask"]
RBN["Ring Buffer\n128 slots"]
end
end
PS["Parked Workers Stack\n(idle workers waiting)"]
end
DD["Dispatchers.Default"] -->|"tags: non-blocking"| SCHED
DI["Dispatchers.IO"] -->|"tags: blocking"| SCHED
subgraph MAIN["Dispatchers.Main (separate)"]
direction LR
MQ["Handler → MessageQueue\n(single UI thread)"]
end
Dispatchers.Default and IO feed into the same scheduler. The scheduler distributes tasks to workers. Each worker has its own local queue. Dispatchers.Main is entirely separate. It’s a single thread with an event loop, not part of the CoroutineScheduler at all.
Local work queue
A worker’s local queue is built on a ring buffer. Think of it as a fixed-size array (128 slots) where the end connects back to the beginning, forming a circle. Two counters track state: a write counter (where new tasks go in) and a read counter (where tasks come out). Both counters only go up and never reset. To find the actual array slot, the scheduler masks the counter with index & 127, which maps any counter value back into the 0–127 range. This is the “ring” part: counters climb forever, but array access wraps around automatically. The buffer never grows or shrinks. It reuses the same slots over and over without allocating new memory.
flowchart LR
subgraph RB["Ring Buffer (128 slots, wraps around)"]
direction LR
S0["task A"] ~~~ S1["task B"] ~~~ S2["task C"] ~~~ S3["···"] ~~~ S4["empty"] ~~~ S5["empty"]
end
CI["read counter ➜"] -.-> S0
PI["write counter ➜"] -.-> S4
style S0 fill:#7C3AED,color:#fff,stroke:#7C3AED
style S1 fill:#7C3AED,color:#fff,stroke:#7C3AED
style S2 fill:#7C3AED,color:#fff,stroke:#7C3AED
Tasks are read from the left (oldest first) and written at the right. When task A is picked up, the read counter advances. When a new task is added, it goes at the write position and the write counter advances. Both counters keep climbing, and the modulo mapping keeps them cycling through the same 128 slots.
Only the owning worker writes to this buffer. But any worker in the pool can read from it, and that’s stealing. Since there’s one writer and many readers, the readers use CAS operations to safely claim tasks without locks.
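The masked-counter indexing is easy to see in a few lines. This sketch is single-threaded and simplified: the real WorkQueue uses atomic counters and CAS so that stealers can read concurrently with the owner's writes.

```kotlin
// Single-threaded sketch of the ring buffer: counters only ever increase,
// and `counter and 127` maps them back into the 128 physical slots.
class RingBufferSketch<T : Any> {
    private val buffer = arrayOfNulls<Any>(128)
    private var readCounter = 0L   // advances on poll, never resets
    private var writeCounter = 0L  // advances on add, never resets

    fun add(task: T): Boolean {
        if (writeCounter - readCounter == 128L) return false // buffer full
        buffer[(writeCounter and 127L).toInt()] = task       // wraps via masking
        writeCounter++
        return true
    }

    @Suppress("UNCHECKED_CAST")
    fun poll(): T? {
        if (readCounter == writeCounter) return null // empty
        val task = buffer[(readCounter and 127L).toInt()] as T
        readCounter++
        return task
    }
}
```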
The lastScheduledTask slot
On top of the ring buffer sits one special slot called lastScheduledTask. Let’s trace what happens when a coroutine dispatches new work:
1. You call launch(Dispatchers.Default) { doSomething() } from inside a worker thread.
2. The new task goes into the worker's lastScheduledTask slot, the fast lane.
3. If the slot was already holding a previous task, that previous task gets pushed into the ring buffer (at the write end).
4. When the worker finishes its current work and looks for the next task, it checks lastScheduledTask first. The newest task runs next.
Why check lastScheduledTask before the ring buffer? The assumption baked into the scheduler is that the most recently dispatched coroutine is likely communicating with the currently running one. If coroutine A sends data to coroutine B through a channel, B’s task lands in lastScheduledTask. By checking this slot first, the worker picks up B immediately: same thread, same CPU cache, no scheduling delay. The ring buffer holds older tasks that are less likely to be part of an active back-and-forth. This is the semi-LIFO policy: newest first for latency, oldest preserved in FIFO order so nothing starves.
The actual poll() implementation is one line:
fun poll(): Task? = lastScheduledTask.getAndSet(null) ?: pollBuffer()
Grab lastScheduledTask atomically (clearing it in the process). If it was empty, fall back to the ring buffer.
The lastScheduledTask slot is also the one protected by the 100μs time gate during work stealing. Other workers can’t steal a task from this slot until it’s been sitting there for at least 100 microseconds. Tasks that have already been pushed into the ring buffer can be stolen immediately.
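The whole semi-LIFO policy fits in a small sketch. Names are illustrative, and a plain ArrayDeque stands in for the ring buffer; only the fast-lane slot is atomic here, whereas the real queue makes everything steal-safe.

```kotlin
import java.util.concurrent.atomic.AtomicReference

// Sketch of the semi-LIFO policy: newest task sits in the fast-lane slot,
// the task it displaces falls back into a FIFO buffer.
class SemiLifoQueue<T : Any> {
    private val lastScheduledTask = AtomicReference<T?>(null)
    private val buffer = ArrayDeque<T>() // stand-in for the ring buffer

    fun add(task: T) {
        val previous = lastScheduledTask.getAndSet(task)
        if (previous != null) buffer.addLast(previous) // older tasks keep FIFO order
    }

    // Mirrors the real poll(): fast lane first, then the buffer.
    fun poll(): T? = lastScheduledTask.getAndSet(null) ?: buffer.removeFirstOrNull()
}
```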
Global queues
There are two global queues shared across all workers: one for CPU tasks and one for blocking tasks. When you dispatch a coroutine from outside the scheduler (a callback, the main thread, or any non-worker thread), the task lands in the appropriate global queue. When a worker dispatches a new coroutine from within its own execution, the task goes into that worker’s local queue instead, keeping it close for cache locality.
This setup, local queues per worker plus shared global queues, is what enables the work stealing algorithm.
The work stealing algorithm
When a worker finishes its tasks and its local queue is empty, it doesn’t just park immediately. It tries to steal work from other workers.
flowchart TD
A[Worker's local queue empty] --> B{Try acquire CPU permit?}
B -->|Yes| C[Search local queue → Global CPU queue → Global Blocking queue → Steal from others]
B -->|No| D[Search local queue for blocking → Global Blocking queue → Steal blocking only]
C --> E{Task found?}
D --> E
E -->|Yes| F[Execute task]
E -->|No| G{Stealable task exists but too fresh?}
G -->|Yes| H[Park for delay then retry]
G -->|No| I[Add self to parked workers stack]
The stealing loop is random. Each worker picks a random starting index and scans through all other workers’ queues:
var currentIndex = nextInt(created)
repeat(created) {
++currentIndex
if (currentIndex > created) currentIndex = 1
val worker = workers[currentIndex]
if (worker !== null && worker !== this) {
val stealResult = worker.localQueue.trySteal(stealingMode, stolenTask)
// ...
}
}
To pick which worker to steal from, the scheduler needs a random number. It uses a lightweight algorithm called xorshift: just a few bitwise operations, no heavy math. The standard ThreadLocalRandom wasn’t an option because it’s unavailable on Android, and wrapping Random in a ThreadLocal turned out to be up to 15% slower in Ktor benchmarks. So the coroutines team rolled their own.
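A 32-bit Marsaglia xorshift really is just a few shifts and XORs. The sketch below shows the general technique; the exact shift constants and bounding logic in the scheduler's source may differ.

```kotlin
// Sketch of a Marsaglia xorshift PRNG: three shift-and-XOR steps per value.
// State must never be zero, or the generator gets stuck at zero forever.
class XorShiftRandom(seed: Int) {
    private var state = if (seed != 0) seed else 1

    fun nextInt(): Int {
        state = state xor (state shl 13)
        state = state xor (state ushr 17)
        state = state xor (state shl 5)
        return state
    }

    // Value in [0, bound); a power-of-two bound reduces to a cheap mask.
    fun nextInt(bound: Int): Int {
        val mask = bound - 1
        if (mask and bound == 0) return nextInt() and mask
        return (nextInt() and Int.MAX_VALUE) % bound
    }
}
```

No allocation, no thread-local lookup, no synchronization: each worker just keeps one Int of state, which is exactly why it beat the ThreadLocal-wrapped alternatives in benchmarks.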
Time-based affinity
We mentioned the 100μs time gate earlier: the lastScheduledTask slot is protected from stealers until its task has aged past a threshold. Here’s the actual constant:

internal val WORK_STEALING_TIME_RESOLUTION_NS = systemProp(
"kotlinx.coroutines.scheduler.resolution.ns", 100000L
)
Every task gets a submissionTime timestamp when created. If a task was submitted less than 100μs ago, other workers won’t steal it from the head slot. Older tasks in the ring buffer are fair game immediately. This is a deliberate trade-off: it keeps communicating coroutines on the same thread. If coroutine A sends a message to coroutine B via a channel, you want B to run on the same worker as A. Stealing it to a different core would mean a cache miss and cross-core memory traffic.
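The staleness check itself is a one-liner. The helper names below are invented for illustration; the constant matches the default shown above.

```kotlin
// Sketch of the time gate: the fast-lane task may only be stolen once it
// has been waiting at least 100 microseconds.
const val STEAL_RESOLUTION_NS = 100_000L // 100μs, the default resolution

class TimedTask(val name: String, val submissionTime: Long)

fun canStealFromHeadSlot(task: TimedTask, nowNs: Long): Boolean =
    nowNs - task.submissionTime >= STEAL_RESOLUTION_NS
```

A task younger than the threshold stays pinned to its owner's fast lane, giving the owning worker a 100μs head start to run it with a warm cache.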
The scheduler comment says this approach “shows outstanding results when coroutines are cooperative” but acknowledges a downside: “the scheduler now depends on a high-resolution global clock, which may limit scalability on NUMA machines.”
Scheduling policy: not FIFO, not LIFO, but both
We covered above how the lastScheduledTask slot creates semi-LIFO ordering: newest task runs next, previous task gets pushed to the ring buffer. This policy is inspired by the Go runtime scheduler (credited to Dmitry Vyukov in the source comments). It couples communicating coroutines together, reducing scheduling latency. But bumping the old task into the ring buffer preserves some FIFO ordering, preventing the queue from degenerating into a pure stack.
Tasks dispatched from external threads (outside the scheduler) go into the global queue, which is strictly FIFO.
Anti-starvation: balancing local and global queues
Every worker has a local queue, and the scheduler has two global queues (CPU and blocking). If workers only checked their local queues, externally submitted tasks could starve.
The scheduler solves this probabilistically:
val globalFirst = nextInt(2 * corePoolSize) == 0
On an 8-core machine, there’s a 1-in-16 chance a worker checks global queues before its own local queue. Statistically, external tasks always make progress while the common case (local queue first) stays fast.
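The coin flip translates directly into a polling order. This sketch uses invented helper names and plain deques for the queues; only the probability logic mirrors the source line above.

```kotlin
import kotlin.random.Random

// Sketch of the anti-starvation flip: with probability 1 / (2 * corePoolSize),
// poll the global queue before the worker's own local queue.
fun pollNext(
    corePoolSize: Int,
    local: ArrayDeque<String>,
    global: ArrayDeque<String>,
    random: Random = Random.Default
): String? {
    val globalFirst = random.nextInt(2 * corePoolSize) == 0
    return if (globalFirst) {
        global.removeFirstOrNull() ?: local.removeFirstOrNull()
    } else {
        local.removeFirstOrNull() ?: global.removeFirstOrNull()
    }
}
```

Either branch falls through to the other queue, so the flip only changes priority, never whether a task is found.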
The parked workers stack
When a worker has nothing to do, it doesn’t just sleep. It pushes itself onto a lock-free Treiber stack of idle workers. The stack is intrusive: each Worker object acts as a stack node via its nextParkedWorker field.
Why a stack instead of a queue? The source code explains:
“The stack is better than a queue (even with the contention on top) because it unparks threads in most-recently used order, improving both performance and locality. Moreover, it decreases threads thrashing, if the pool has n threads when only n / 2 is required, the latter half will never be unparked and will terminate itself after [IDLE_WORKER_KEEP_ALIVE_NS].”
Thread thrashing here means threads being repeatedly woken up and put back to sleep with no useful work done. The OS burns CPU cycles on scheduling overhead instead of actual computation. A stack avoids this because only the most recently parked worker gets woken first. If demand is low, the older workers at the bottom of the stack stay asleep indefinitely.
Workers above corePoolSize that haven’t been needed for 60 seconds (configurable via kotlinx.coroutines.scheduler.keep.alive.sec) terminate themselves. Core workers stick around. This keeps the pool lean during quiet periods without losing the baseline capacity.
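The core push/pop of a Treiber stack is short enough to show in full. This sketch is non-intrusive (separate node objects rather than the Worker itself) and omits the version counter the real parked-workers stack adds for ABA protection.

```kotlin
import java.util.concurrent.atomic.AtomicReference

// Minimal lock-free Treiber stack: push and pop both retry a CAS on `top`
// until they win; no locks anywhere.
class TreiberStack<T> {
    private class Node<T>(val value: T, val next: Node<T>?)
    private val top = AtomicReference<Node<T>?>(null)

    fun push(value: T) {
        while (true) {
            val current = top.get()
            if (top.compareAndSet(current, Node(value, current))) return
        }
    }

    fun pop(): T? {
        while (true) {
            val current = top.get() ?: return null // stack empty
            if (top.compareAndSet(current, current.next)) return current.value
        }
    }
}
```

The LIFO behavior is what delivers the most-recently-used unparking order the source comment describes: the last worker to park is the first one woken.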
Main versus Main.immediate
This is an Android-centric distinction, but it matters anywhere Dispatchers.Main is available (JavaFX, Swing). That was surprising to me. I always assumed Main was an Android-only concept, but any platform with a UI event loop can provide one.
It’s easy to think that Dispatchers.Main and Dispatchers.Default function the same way, since both are non-blocking in nature and you should never do blocking work on either. But they serve very different purposes. Default is a pool of threads designed to parallelize CPU-bound computation across cores. Main is a single thread, the one thread that owns the UI. Android’s View hierarchy is not thread-safe, so only the main thread is allowed to touch it. Default exists to crunch numbers in parallel; Main exists because someone has to own the screen.
Dispatchers.Main always dispatches through the platform’s event loop. On Android, that’s Handler.post(). Even if you’re already on the main thread, launch(Dispatchers.Main) posts to the message queue and your code runs on the next loop iteration.
Dispatchers.Main.immediate skips the dispatch if you’re already on the correct thread. Your code runs synchronously, right now, without going through the event queue.
Dispatchers.Main: always posts to the message queue
flowchart LR
L["launch\n(Dispatchers.Main)"] -->|"post"| Q
subgraph Q["Main Thread Message Queue"]
direction LR
A["Touch\nEvent"] --> B["Animation\nCallback"] --> C["Other\nRunnable"] --> D["✦ Your\nCoroutine"] --> E["Draw\nFrame"]
end
style D fill:#7C3AED,color:#fff,stroke:#7C3AED
Your coroutine waits behind everything already in the queue. Expect at least some delay before execution.
Dispatchers.Main.immediate: skips the queue when already on main thread
flowchart LR
L["launch\n(Main.immediate)"] -->|"already on\nmain thread?"| CHECK{" "}
CHECK -->|"Yes"| NOW["✦ Runs inline\nright now"]
CHECK -->|"No"| QUEUE["Posts to queue\n(same as Main)"]
style NOW fill:#7C3AED,color:#fff,stroke:#7C3AED
style CHECK fill:#7C3AED,color:#fff,stroke:#7C3AED
Here’s where it gets subtle. Consider this code:
// Running on Main thread
println("1")
launch(Dispatchers.Main) {
println("2")
}
println("3")
Output: 1, 3, 2. The launch posts to the queue, so “2” runs after “3”.
// Running on Main thread
println("1")
launch(Dispatchers.Main.immediate) {
println("2")
}
println("3")
Output: 1, 2, 3. Since we’re already on the main thread, immediate runs the coroutine body inline before continuing.
When does this matter in practice?
Say you’re on the main thread and a network response comes back. With Dispatchers.Main, the UI update gets queued behind whatever else is pending: touch events, animations, other posted runnables. That delay varies depending on what’s in the queue. With Main.immediate, the update happens right now, synchronously in the current call stack.
ViewModels in Android commonly use viewModelScope (from AndroidX Lifecycle), which defaults to Dispatchers.Main.immediate for exactly this reason. When a StateFlow emits a new value and a collector running on Main.immediate picks it up, the UI update happens in the same frame rather than being deferred to the next one.
The immediate dispatcher also forms an event loop when you have nested withContext(Dispatchers.Main.immediate) calls, preventing stack overflows. It shares this event loop with Dispatchers.Unconfined.
Small things worth knowing
The core pool minimum is 2. Even on a single-core machine, Dispatchers.Default allocates at least two threads. The comment in the source says this exists “to give us chances that multi-threading problems get reproduced even on a single-core machine.” Debugging concurrency bugs on a single-threaded pool would be near impossible. (The system property can force it as low as 1, but the default floor is 2.)
The maximum pool size is roughly 2 million threads. The scheduler uses 21 bits for thread count in its control state, giving a theoretical max of (1 << 21) - 2 = 2,097,150. You’ll run out of memory long before hitting that.
Dispatchers.IO shares threads with Dispatchers.Default. The documentation says: “withContext(Dispatchers.IO) { ... } when already running on the Default dispatcher typically does not lead to an actual switching to another thread.” The scheduler tries to keep execution on the same thread on a best-effort basis.
Workers use daemon threads. All scheduler worker threads have isDaemon = true. They won’t prevent JVM shutdown.
The Treiber stack uses versioned CAS. The parked workers stack encodes both a version counter and the top-of-stack index in a single Long. Version bits protect against ABA problems during concurrent push/pop operations.
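The version-plus-index packing can be sketched with an AtomicLong. The field layout below (21 index bits, version in the upper bits) is chosen for illustration; treat the class and method names as invented.

```kotlin
import java.util.concurrent.atomic.AtomicLong

// Sketch of versioned CAS: index and version share one atomic Long, and the
// version bumps on every update, so an A -> B -> A sequence of indices still
// changes the word and a stale CAS fails.
class VersionedTopSketch {
    private val state = AtomicLong(0L)

    fun index(): Int = (state.get() and INDEX_MASK).toInt()
    fun version(): Long = state.get() ushr INDEX_BITS

    // One CAS attempt: replace the index and bump the version atomically.
    fun tryUpdateTop(newIndex: Int): Boolean {
        val current = state.get()
        val nextVersion = (current ushr INDEX_BITS) + 1
        val next = (nextVersion shl INDEX_BITS) or newIndex.toLong()
        return state.compareAndSet(current, next)
    }

    private companion object {
        const val INDEX_BITS = 21
        const val INDEX_MASK = (1L shl INDEX_BITS) - 1
    }
}
```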
yield() uses fair dispatching. When you call yield(), the coroutine goes to the tail of the queue (FIFO) instead of the head (LIFO). This prevents while (true) { yield() } from starving other coroutines. The dispatchYield method exists specifically for this.
limitedParallelism() on IO creates independent views. Each view from Dispatchers.IO.limitedParallelism(n) bypasses the IO parallelism cap entirely. They’re views of an unlimited backing dispatcher, not of Dispatchers.IO itself. The IO cap of 64 is just the default view.
Closing note
The coroutines scheduler is a well-thought-out piece of engineering. A single thread pool handles both CPU and IO work through a permit system, and work stealing with time-based affinity keeps communicating coroutines on the same core. The whole thing runs on lock-free data structures and atomic operations in the hot path. Thread creation and termination do use synchronized, but that’s off the per-task execution path.
Knowing this changes how you write coroutine code. You stop treating dispatchers as magic labels and start thinking about what the scheduler actually does with your tasks. You understand why Thread.sleep() on Dispatchers.Default is destructive. You see why Main.immediate exists and when it matters.
Read the source. CoroutineScheduler.kt is around 1000 lines and well-commented. I’d argue it’s some of the most readable concurrent code in the Kotlin ecosystem. You don’t need to understand every CAS loop, but spending an hour with it will change how you think about thread pools.
Glossary
Coroutine Scheduler
The internal thread pool implementation in kotlinx.coroutines that backs both Dispatchers.Default and Dispatchers.IO. Manages worker threads, CPU permits, work stealing, and thread lifecycle. Located in kotlinx-coroutines-core/jvm/src/scheduling/CoroutineScheduler.kt.
Core Pool Size
The number of threads reserved for CPU-bound work. Defaults to the number of available processors with a minimum of 2. Configurable via the kotlinx.coroutines.scheduler.core.pool.size system property.
CPU Permit
A token managed by the scheduler that grants a worker thread the right to execute CPU-bound (non-blocking) tasks. The total number of permits equals corePoolSize. Workers must acquire a permit to run CPU tasks and release it when switching to blocking work.
Work Stealing
A load-balancing technique where idle worker threads take tasks from busy workers’ local queues. The coroutines scheduler adds time-based affinity: the most recently submitted task (head slot) must be at least 100μs old before it can be stolen, keeping communicating coroutines on the same core. Older tasks in the ring buffer can be stolen immediately.
Treiber Stack
A lock-free concurrent stack data structure used by the scheduler to manage parked (idle) worker threads. Uses compare-and-swap operations for thread-safe push and pop without locks.
Blocking Context
A boolean flag (true) attached to tasks dispatched via Dispatchers.IO. Signals the scheduler that this task will block its thread, triggering thread compensation so CPU work continues uninterrupted.
Limited Parallelism
A mechanism to create views of a dispatcher with a constrained concurrency level. On Dispatchers.IO, these views are elastic and not subject to the default 64-thread cap. On Dispatchers.Default, requesting parallelism equal to or greater than the core pool size returns the dispatcher itself.
Dispatchers.Main.immediate
A variant of Dispatchers.Main that skips re-dispatching when already executing on the main thread. Coroutine body runs synchronously instead of being posted to the platform event queue. Used by viewModelScope and recommended for UI updates triggered from the main thread.
Thread Thrashing
A situation where threads are repeatedly woken up and put back to sleep without doing useful work. The OS spends CPU time on scheduling overhead (context switches, cache invalidation, memory barriers) instead of running your code.
CAS (Compare-And-Swap)
An atomic CPU instruction that updates a memory location only if it currently holds an expected value. The foundation of lock-free programming. If another thread changed the value first, the CAS fails and the caller retries. Avoids the overhead and deadlock risks of traditional locks.
ABA Problem
A subtle bug in lock-free algorithms using CAS. Thread 1 reads value A, gets preempted. Thread 2 changes A → B → A. Thread 1 resumes, sees A, and the CAS succeeds, but the state has changed underneath. Solved by adding a version counter to detect that the value was modified even if it looks the same.
NUMA (Non-Uniform Memory Access)
A hardware architecture where memory access time depends on which CPU socket the memory is attached to. A core reading memory from its local bank is fast; reading from a remote socket’s bank is slower. Relevant because a global clock (used for work stealing timestamps) can become a bottleneck when many cores across sockets contend on it.
Cache Locality
The performance benefit of keeping related work on the same CPU core, so data stays in that core’s L1/L2 cache. When a task gets stolen to a different core, the new core must fetch the task’s data from main memory or a remote cache (a cache miss), which is significantly slower than a cache hit.