This article assumes you already understand three things: how the kernel schedules threads (context switches, run queues, the cost of saving and restoring registers); why blocking I/O is a problem (a read() that blocks puts the entire thread to sleep, wasting a kernel-scheduled resource while it waits for data); and how I/O multiplexing solves it (epoll lets one thread monitor thousands of file descriptors without blocking on any of them). If any of these are unfamiliar, stop here and go learn them first. Not because you can't learn the mechanics of async Rust without them, but because you will never understand why any of this exists.
If you understand those three things, then a coroutine runtime is just all of them moved to userspace, but cheaper. How much cheaper, and why — that's what this article is about.
The M:N threading model makes this gap existential. M coroutines share N OS threads, where M is potentially hundreds of thousands and N is the number of CPU cores. A single worker thread carries thousands of tasks. If that thread blocks on a syscall, every task on it stalls — not one sleeps, thousands do, and the throughput of an entire core drops to zero. This is the central constraint of every coroutine runtime: the worker thread must never block. Every design decision in Tokio follows from this single rule.
With that framing, here's the program we'll trace through the entire article:
```rust
use tokio::time::{sleep, Duration};

async fn delayed_sum(a: u64, b: u64) -> u64 {
    sleep(Duration::from_secs(1)).await;
    a + b
}

#[tokio::main]
async fn main() {
    let handle = tokio::spawn(delayed_sum(3, 4));
    let result = handle.await.unwrap();
    println!("{result}");
}
```
One async function, one .await point, one spawned task. delayed_sum waits one second, then returns 3 + 4. Simple enough that you can hold the entire program in your head — but every mechanism in Tokio is exercised: the compiler transforms delayed_sum into a state machine, the scheduler polls it, the timer registers a waker, the task suspends, the timer fires, the waker pushes the task back, the scheduler polls again, and the result flows through the JoinHandle.
The round trip
Step 1: Compile. The compiler sees async fn delayed_sum and transforms it into a state machine struct that implements the Future trait. Each .await point becomes a state. Local variables that need to survive across the await are moved from the stack into the struct. The function body becomes the trait's poll() method — a single function that can be called repeatedly, resuming where it left off each time. This is the contract between your code and the runtime: Tokio doesn't know what your future does internally, it only knows how to call poll().
Step 2: Spawn. tokio::spawn(delayed_sum(3, 4)) does three things: calls delayed_sum(3, 4) which creates the state machine struct (state set to Start, nothing executes yet); wraps it into a task object; puts the task into the scheduler's queue.
Step 3: Pick and poll. A worker thread takes the task from its queue and calls poll() on it. The state machine starts running: it creates the Sleep future, advances to WaitingSleep, and immediately asks the sleep future "are you done?"
Step 4: Not yet. The sleep future checks the deadline — one second hasn't passed. It saves a callback ("when the time comes, wake me up"), and returns Pending. The delayed_sum state machine sees Pending from the inner future, so it also returns Pending. The worker moves on to the next task.
Step 5: Disappear. The task is now in no queue — not a wait queue, not a ready queue. It simply sits on the heap. The only thing keeping it alive is the callback it left behind in step 4, which holds a reference to it.
Step 6: Wake. One second passes. Tokio's timer fires, finds the callback from step 4, and calls it. The callback does one thing: push the task back into a worker's queue.
Step 7: Poll again. A worker picks up the task and calls poll() again. The state machine resumes at WaitingSleep, asks the sleep future again — this time the deadline has passed, it returns Ready. The state machine continues: computes 3 + 4, returns Ready(7). The result flows through the JoinHandle back to main, which prints 7.
That's the complete lifecycle. Every piece I'll unpack in the rest of this article — the state machine, the callback mechanism, the timer, the scheduler — serves exactly one of these seven steps.
What the compiler actually produces
Calling delayed_sum(3, 4) executes nothing. It returns a struct. The compiler found one .await point, so it creates two live states (Start and WaitingSleep) plus a terminal Done, and promotes a and b into struct fields because they're used after the await.
```rust
enum DelayedSumState {
    Start,
    WaitingSleep,
    Done,
}

struct DelayedSumFuture {
    state: DelayedSumState,
    a: u64,
    b: u64,
    sleep_future: Option<Sleep>,
}
```
The poll() method is a loop wrapping a match:
```rust
impl Future for DelayedSumFuture {
    type Output = u64;

    // Simplified: the real desugaring also handles pinning of the inner future.
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u64> {
        loop {
            match self.state {
                DelayedSumState::Start => {
                    self.sleep_future = Some(sleep(Duration::from_secs(1)));
                    self.state = DelayedSumState::WaitingSleep;
                    // no return — loop continues, immediately polls sleep
                }
                DelayedSumState::WaitingSleep => {
                    match self.sleep_future.as_mut().unwrap().poll(cx) {
                        Poll::Ready(()) => {
                            self.sleep_future = None;
                            self.state = DelayedSumState::Done;
                            return Poll::Ready(self.a + self.b);
                        }
                        Poll::Pending => return Poll::Pending, // this is the yield
                    }
                }
                DelayedSumState::Done => panic!("polled after completion"),
            }
        }
    }
}
```
The loop matters. It's not the scheduler's loop — it's this future's internal state advancement. When Start executes, it doesn't return — the loop immediately enters WaitingSleep and polls the sleep future. Only when the inner future returns Pending does execution leave this function. If the sleep had already expired (say, a zero-duration sleep), the entire delayed_sum would complete in a single poll() call with no round-trip to the scheduler. That's what makes .await cheap when things are already ready.
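You can watch that fast path directly. A sketch using only std: pin! pins the future on the stack, a no-op Waker built from std::task::Wake satisfies the poll contract, and std::future::ready stands in for an already-expired sleep. The names already_ready and poll_once are mine:

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {} // never called: nothing here returns Pending
}

async fn already_ready() -> u64 {
    std::future::ready(()).await; // inner future is Ready on the first ask
    3 + 4
}

fn poll_once() -> Poll<u64> {
    let mut fut = pin!(already_ready()); // lazy: nothing has run yet
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    fut.as_mut().poll(&mut cx)
}

fn main() {
    // the whole async fn completes in a single poll: no scheduler round trip
    assert_eq!(poll_once(), Poll::Ready(7));
    println!("done in one poll");
}
```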
Why a struct and not a stack
This is the fundamental design decision that separates Rust's async model from Go's.
The problem: a coroutine suspends, but its local variables need to survive. A normal function return destroys the stack frame, destroying every local variable with it. But poll() has to return Pending (to yield) while keeping the locals alive for next time.
Go gives every goroutine its own stack — initially 2KB, dynamically growable. When a goroutine suspends, the stack stays in memory, locals undisturbed. The scheduler swaps the stack pointer register to another goroutine's stack. The cost: a million goroutines means gigabytes of pre-allocated stack memory, plus the runtime complexity of stack growth (copying stacks, updating internal pointers).
Rust takes the other path. The compiler analyzes which variables cross await points, promotes exactly those into a heap-allocated struct, and leaves everything else on the regular thread stack. When poll() returns, the stack frame disappears, but the struct survives on the heap. Next poll() call reads from the struct and continues. The struct might be a few dozen bytes. Every future on the same worker thread shares the same thread stack, taking turns. There's no stack management at runtime — no growth, no copies, no register save/restore. poll() is a function call, return is a function return.
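A quick way to see the compiler's liveness analysis at work: compare the sizes of two futures, one where a large buffer dies before the await and one where it survives across it. Exact sizes are compiler-dependent, so the inequality is the point, not the numbers:

```rust
async fn not_held_across_await() {
    {
        let buf = [0u8; 1024]; // dropped before the await: lives on the thread stack
        std::hint::black_box(&buf);
    }
    std::future::ready(()).await;
}

async fn held_across_await() {
    let buf = [0u8; 1024]; // live across the await: promoted into the future's struct
    std::future::ready(()).await;
    std::hint::black_box(&buf);
}

fn main() {
    let small = std::mem::size_of_val(&not_held_across_await());
    let large = std::mem::size_of_val(&held_across_await());
    println!("not held: {small} bytes, held: {large} bytes");
    assert!(large >= 1024); // the buffer is inside the future
    assert!(small < large); // the scoped buffer is not
}
```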
What Tokio wraps around the kernel
The standard library's std::net::TcpStream is blocking — its read() holds the entire thread hostage until data arrives. You can't use it in an async runtime. Tokio provides tokio::net::TcpStream, which wraps the same socket but sets it to O_NONBLOCK mode and registers it with Tokio's epoll instance. The difference: when there's no data, read() doesn't block — it immediately returns EWOULDBLOCK, and the thread is free to do other work.
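The std types can demonstrate this directly: set_nonblocking(true) flips the same O_NONBLOCK flag, and a read on a socket with no pending data returns WouldBlock immediately instead of parking the thread. (read_without_blocking is a name I made up for this demo.)

```rust
use std::io::{ErrorKind, Read};
use std::net::{TcpListener, TcpStream};

fn read_without_blocking() -> std::io::Result<&'static str> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let mut client = TcpStream::connect(listener.local_addr()?)?;
    let _server_side = listener.accept()?; // keep the peer alive so read isn't EOF
    client.set_nonblocking(true)?; // the same O_NONBLOCK flag Tokio sets
    let mut buf = [0u8; 16];
    match client.read(&mut buf) {
        // no data has been sent, so the kernel answers immediately
        Err(e) if e.kind() == ErrorKind::WouldBlock => Ok("would block"),
        Ok(n) => Ok(if n == 0 { "eof" } else { "data" }),
        Err(e) => Err(e),
    }
}

fn main() {
    println!("{}", read_without_blocking().unwrap()); // prints "would block"
}
```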
When you call stream.read(&mut buf) on a Tokio TcpStream, the return type is Read<'_, TcpStream> — a struct defined in tokio::io::util::read that implements Future. It's a thin one-shot adapter: no internal state machine, it resolves in exactly one successful poll_read. Here's the actual implementation, slightly reformatted:
```rust
impl<R> Future for Read<'_, R>
where
    R: AsyncRead + Unpin + ?Sized,
{
    type Output = io::Result<usize>;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<usize>> {
        let me = self.project();
        let mut buf = ReadBuf::new(me.buf);
        ready!(Pin::new(me.reader).poll_read(cx, &mut buf))?;
        Poll::Ready(Ok(buf.filled().len()))
    }
}
```
poll_read is where the real work happens — for TcpStream, it issues the non-blocking read() syscall. Data there? The ready! macro lets execution continue, and the future returns Ready(Ok(n)). Not there? The kernel returns EWOULDBLOCK, poll_read stores the waker with Tokio's I/O driver and returns Pending, and ready! propagates that Pending outward. That's it — the difference between blocking a thread and yielding to the scheduler lives inside poll_read.
In our delayed_sum example, the inner future is Sleep, not a socket read. But the protocol is the same. Sleep.poll() checks the deadline against a timing wheel: passed? Ready(()). Not passed? Register waker with the timer driver, return Pending. The poll chain is always layered:
```
scheduler calls DelayedSumFuture.poll()
└── state is WaitingSleep → calls Sleep.poll()
    ├── deadline passed → Ready(()) → state machine computes a + b, returns Ready(7)
    └── deadline not reached → register Waker with timer, return Pending → bubbles up
```
Every layer follows the same protocol: try to make progress, return Ready if done, register a Waker and return Pending if not. The contract is recursive — outer futures see Pending from inner futures and propagate it upward.
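The ready! macro that drives this propagation (available in std as std::task::ready!) is nothing more than a match with an early return. A sketch with illustrative names (my_ready, outer):

```rust
use std::task::Poll;

// conceptual expansion of ready!: unwrap Ready, or early-return Pending
macro_rules! my_ready {
    ($e:expr) => {
        match $e {
            Poll::Ready(v) => v,
            Poll::Pending => return Poll::Pending,
        }
    };
}

// an "outer future" layered over an inner poll result
fn outer(inner: Poll<u32>) -> Poll<u32> {
    let n = my_ready!(inner); // Pending from the inner layer bubbles up from here
    Poll::Ready(n + 1)
}

fn main() {
    assert_eq!(outer(Poll::Pending), Poll::Pending); // Pending propagates
    assert_eq!(outer(Poll::Ready(6)), Poll::Ready(7)); // Ready continues
    println!("ok");
}
```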
Tokio applies this pattern across every I/O type it supports. Network I/O (tokio::net) uses epoll to watch file descriptors — this is the true non-blocking path. Timers (tokio::time) use a timing wheel to track deadlines. File I/O (tokio::fs) is a special case: Linux doesn't actually support non-blocking reads on regular files (epoll reports them as always ready), so Tokio offloads file operations to a spawn_blocking thread pool, using real threads to simulate async. Signals, process management, stdin/stdout — all wrapped with the same Future interface, each using whatever underlying mechanism makes sense.
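To make the timer half concrete, here is a timing wheel in miniature. This toy has a single level and stores string labels where Tokio stores Wakers; Tokio's real wheel is hierarchical, but the slot arithmetic is the same idea:

```rust
// toy single-level timing wheel: one slot per tick, wrapping around
struct TimerWheel {
    slots: Vec<Vec<&'static str>>, // each entry stands in for a stored Waker
    tick: usize,
}

impl TimerWheel {
    fn new(size: usize) -> Self {
        TimerWheel { slots: vec![Vec::new(); size], tick: 0 }
    }

    // registering a deadline is an O(1) push into the right slot
    fn insert(&mut self, delay_ticks: usize, waker: &'static str) {
        let slot = (self.tick + delay_ticks) % self.slots.len();
        self.slots[slot].push(waker);
    }

    // advance one tick, draining every waker whose deadline arrived
    fn advance(&mut self) -> Vec<&'static str> {
        self.tick = (self.tick + 1) % self.slots.len();
        std::mem::take(&mut self.slots[self.tick])
    }
}

fn main() {
    let mut wheel = TimerWheel::new(8);
    wheel.insert(2, "wake delayed_sum");
    assert!(wheel.advance().is_empty()); // tick 1: nothing due
    assert_eq!(wheel.advance(), vec!["wake delayed_sum"]); // tick 2: fires
    println!("ok");
}
```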
From file descriptor to Waker: the slab trick
This is the mechanism that connects "kernel says fd 42 is readable" to "wake up Task A."
When a tokio::net::TcpStream is created, Tokio does two things. First, it calls epoll_ctl(EPOLL_CTL_ADD, fd, event) to register the fd with epoll. The event struct has a data field — a user-defined value that epoll doesn't interpret, just returns as-is when the fd becomes ready. Tokio stores a token (an integer index) in this field.
Second, Tokio's I/O driver maintains a slab — essentially a Vec where elements are accessed by index. The token is the slab index. At that index sits a ScheduledIo struct with a slot for a Waker.
```
slab:
  [0]: ScheduledIo { waker: None }
  [1]: ScheduledIo { waker: Waker A }   ← fd 42
  [2]: ScheduledIo { waker: Waker C }   ← fd 63
  [3]: ScheduledIo { waker: Waker B }   ← fd 57

epoll registration:
  fd 42 → event.data = 1
  fd 57 → event.data = 3
  fd 63 → event.data = 2
```
Just before the Read future's poll() returns Pending, the register_waker(cx.waker()) call inside poll_read clones the Waker and stores it in the slab slot for this fd.
When epoll_wait returns "fd 42 is ready, event.data = 1," Tokio indexes into slab[1], pulls out Waker A, calls wake(). O(1) lookup, no scanning, no hash map.
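Under the hood a slab is just a Vec with a free list: insert hands back a stable integer token, remove recycles it, and lookup is a bounds-checked index. A minimal generic sketch (not Tokio's actual implementation, which also handles concurrent access):

```rust
// Vec-backed slab: index-as-token, O(1) insert/remove/get
struct Slab<T> {
    entries: Vec<Option<T>>,
    free: Vec<usize>, // indices of vacated slots, reused before growing
}

impl<T> Slab<T> {
    fn new() -> Self {
        Slab { entries: Vec::new(), free: Vec::new() }
    }

    fn insert(&mut self, value: T) -> usize {
        match self.free.pop() {
            Some(i) => { self.entries[i] = Some(value); i }
            None => { self.entries.push(Some(value)); self.entries.len() - 1 }
        }
    }

    fn remove(&mut self, i: usize) -> Option<T> {
        let v = self.entries[i].take();
        if v.is_some() { self.free.push(i); }
        v
    }

    fn get(&self, i: usize) -> Option<&T> {
        self.entries.get(i).and_then(|e| e.as_ref())
    }
}

fn main() {
    let mut slab = Slab::new();
    let token_a = slab.insert("waker for fd 42");
    let _token_b = slab.insert("waker for fd 57");
    // epoll returns event.data = token; lookup is a plain index, no scanning
    assert_eq!(slab.get(token_a), Some(&"waker for fd 42"));
    slab.remove(token_a);
    let token_c = slab.insert("waker for fd 63"); // reuses the freed slot
    assert_eq!(token_c, token_a);
    println!("ok");
}
```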
Where a suspended task lives
After poll() returns Pending, the task is not in any queue. Not a wait queue, not a ready queue. It's gone from the scheduler's perspective.
But it's not freed, because the Waker stored in the slab slot holds an Arc<Task>. That reference count keeps the task alive on the heap.
For our delayed_sum example, the timer driver (not the slab — the slab is for epoll-backed I/O) holds the waker. But the principle is identical:
```
timer wheel entry (deadline: now + 1s) = {
    waker: Waker → Arc<Task> → heap-allocated Task {
        future: DelayedSumFuture {
            state: WaitingSleep,
            a: 3, b: 4,
            sleep_future: Sleep { deadline: ... },
        }
    }
}
```
The run queue only contains ready tasks. Tasks enter the run queue in exactly two ways: freshly spawned, or woken by Waker.wake(). There is no separate "blocked queue." Suspended tasks simply exist on the heap, anchored by the Arc reference inside a Waker that's sitting in a slab slot somewhere.
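The anchoring mechanism is ordinary Arc reference counting, which a few lines can illustrate; the String here stands in for the heap-allocated task:

```rust
use std::sync::Arc;

fn main() {
    let task = Arc::new(String::from("task state")); // the heap-allocated task
    let waker_ref = Arc::clone(&task); // the clone a Waker would hold
    drop(task); // the scheduler has no reference to the task now...
    // ...but the waker's Arc keeps it alive on the heap
    assert_eq!(Arc::strong_count(&waker_ref), 1);
    assert_eq!(&*waker_ref, "task state");
    drop(waker_ref); // last reference gone: the allocation is freed
    println!("ok");
}
```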
If the I/O never completes and the task is aborted through its JoinHandle, the stored Waker gets cleaned up, the Arc reference count hits zero, and the task is freed. That's cancellation. (Merely dropping the JoinHandle, by contrast, detaches the task: it keeps running in the background.)
The scheduler: work-stealing across cores
Tokio spawns one worker thread per CPU core by default. Each worker has a local run queue. There's also a shared inject queue.
tokio::spawn on a worker thread puts the new task in that worker's local queue. tokio::spawn from outside a worker puts it in the inject queue.
A single worker's loop, simplified:
```rust
fn worker_loop(
    local_queue: &mut RunQueue,
    inject_queue: &SharedQueue,
    other_workers: &[RunQueue],
    io_driver: &mut IoDriver,
) {
    loop {
        let task = local_queue.pop()
            .or_else(|| inject_queue.pop())
            .or_else(|| steal_from(other_workers))
            .unwrap_or_else(|| {
                io_driver.park(); // epoll_wait — block until I/O or timer event
                local_queue.pop().expect("woke up with no task")
            });

        let waker = task.build_waker(); // holds Arc<Task>
        let cx = &mut Context::from_waker(&waker);
        match task.future.poll(cx) {
            Poll::Ready(output) => task.complete(output),
            Poll::Pending => {} // task is gone — waker keeps it alive
        }
    }
}
```
Work-stealing is the load balancing strategy: each worker drains its own queue first, then steals from the tail of another worker's queue. This avoids a single global queue with lock contention on every task.
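The pop-local, then inject, then steal order can be sketched with plain VecDeques. This is a single-threaded toy that ignores the lock-free deques and batched stealing of the real scheduler; next_task is an illustrative name:

```rust
use std::collections::VecDeque;

fn next_task(
    local: &mut VecDeque<u32>,
    inject: &mut VecDeque<u32>,
    others: &mut [VecDeque<u32>],
) -> Option<u32> {
    local
        .pop_front() // 1. own queue first
        .or_else(|| inject.pop_front()) // 2. then the shared inject queue
        .or_else(|| others.iter_mut().find_map(|q| q.pop_back())) // 3. then steal from a peer's tail
}

fn main() {
    let mut local = VecDeque::new();
    let mut inject = VecDeque::from([10]);
    let mut others = vec![VecDeque::from([20, 21])];
    assert_eq!(next_task(&mut local, &mut inject, &mut others), Some(10)); // inject before stealing
    assert_eq!(next_task(&mut local, &mut inject, &mut others), Some(21)); // stolen from a peer's tail
    assert_eq!(next_task(&mut local, &mut inject, &mut others), Some(20));
    assert_eq!(next_task(&mut local, &mut inject, &mut others), None); // nothing left: park
    println!("ok");
}
```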
The parallel with Linux kernel scheduling
I find this comparison clarifying, because Tokio's async scheduling and Linux's thread scheduling are structurally almost identical — the difference is in what gets saved and how the switch happens.
| | Linux kernel | Tokio |
|---|---|---|
| Scheduling unit | thread (task_struct) | task (Arc<Task>) |
| Saved context | kernel stack + all registers | Future struct on the heap |
| Ready queue | per-CPU runqueue | per-worker local run queue |
| Yielding | schedule() | return Poll::Pending |
| Wait mechanism | wait_queue_entry | Waker in slab ScheduledIo |
| Wake-up | wake_up() | waker.wake() pushes to run queue |
| I/O notification | interrupt → softirq → protocol stack | epoll_wait returns ready fds |
| Context switch cost | save/restore registers, switch page tables | function return + function call |
The last row is the punchline. A kernel context switch saves and restores every register, flushes TLB entries, switches page tables. Tokio's "switch" is poll() returning, then calling poll() on a different task. An ordinary function return followed by an ordinary function call, entirely in userspace. Under the hood, Tokio still relies on the kernel's epoll for I/O notifications — it's a userspace layer that translates "which fds are ready" into "which futures should be polled."
Why blocking I/O inside async is poison
Remember the constraint from the opening: in the M:N model, a single worker thread carries thousands of tasks, so the worker thread must never block. This is where that rule becomes concrete.
If you call std::fs::read_to_string or std::net::TcpStream::read inside an async task, you block the worker thread. While that thread is stuck in a syscall, every other task in its local run queue sits idle. With Tokio defaulting to one worker per core, a few blocked workers can freeze the entire runtime.
```rust
// blocks a worker thread — every other task on this thread starves
let data = std::fs::read_to_string("big_file.txt").unwrap();

// correct: Tokio's async version (internally uses spawn_blocking for files)
let data = tokio::fs::read_to_string("big_file.txt").await.unwrap();

// correct: isolate blocking work in the blocking thread pool
let data = tokio::task::spawn_blocking(|| {
    std::fs::read_to_string("big_file.txt").unwrap()
}).await.unwrap();
```
There's no magic
async fn compiles to a struct. .await compiles to a match arm. Waker is a callback. The slab is a Vec. epoll_wait is a syscall. wake() pushes an Arc into a queue.
A kernel thread yields by calling schedule(). The thread is put to sleep. Its time slice is forfeit. Its registers are saved to the kernel stack. Its cache lines go cold. It will not run again until the scheduler decides to restore all of that, at full cost.
A Tokio task yields by returning Poll::Pending. No thread sleeps. No registers are saved. No cache is flushed. The worker thread is still running — it calls poll() on the next task, in the same loop, on the same stack, in the same warm cache. All of this machinery, every piece in this article, exists to replace the first paragraph with the second.
No magic.