Async I/O, event loops, coroutines, schedulers—these terms haunted me for years. They appeared together so often that I assumed they were the same thing, or at least variations of the same idea. Every time I thought I understood one, another would appear in a context that made me doubt everything.
Then I finally saw them as layers, not synonyms. And everything clicked.
The Hierarchy I Wish Someone Had Shown Me Earlier
My confusion came from a simple mistake: I was treating these concepts as alternatives or variations when they're actually layers stacked on top of each other. Once I drew this picture, years of confusion evaporated:
┌─────────────────────────────────────────┐
│ Programming Model │ ← What your code looks like
│ async/await, callbacks, ... │
├─────────────────────────────────────────┤
│ Scheduler │ ← Who decides what runs next
│ coroutines, green threads, tasks │
├─────────────────────────────────────────┤
│ Event Loop │ ← The wait-and-dispatch cycle
├─────────────────────────────────────────┤
│ I/O Multiplexing │ ← OS-level capability
│ epoll, kqueue, io_uring │
├─────────────────────────────────────────┤
│ Async I/O │ ← The goal: non-blocking I/O
└─────────────────────────────────────────┘
From bottom to top:
Async I/O is the goal: initiate an I/O operation without blocking, get notified later. The opposite is synchronous I/O: call read(), thread freezes, get result, continue. (A short sketch of this contrast follows the list.)
I/O Multiplexing is the OS tool that makes it possible—one thread watching many file descriptors, handling whichever becomes ready. On Linux it's epoll, on macOS it's kqueue.
Event Loop is the program structure that uses this tool—a while loop that waits for events and dispatches handlers.
Scheduler decides who runs when—when multiple coroutines or callbacks want to run, who goes first?
Programming Model is what your code looks like—callbacks, Promises, async/await. This is syntax and code organization.
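To make the bottom layer concrete, here's a minimal sketch of that contrast in Python (example.com is just a placeholder endpoint): the blocking version freezes the whole thread on one recv(), while the non-blocking version returns immediately and leaves "when do I try again?" to the layers above.
import socket
# Synchronous, blocking: the entire thread freezes on recv() until data arrives.
sock = socket.create_connection(("example.com", 80))
sock.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
data = sock.recv(4096)              # nothing else can happen on this thread meanwhile
# Async I/O as a goal: a non-blocking socket returns immediately; if nothing has
# arrived yet you get BlockingIOError and are free to do other work.
sock2 = socket.create_connection(("example.com", 80))
sock2.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
sock2.setblocking(False)
try:
    data2 = sock2.recv(4096)
except BlockingIOError:
    pass                            # no data yet; go do something else for now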
The reason these get conflated is that frameworks bundle them together. When someone says "Node.js uses async I/O," they're actually describing five layers at once. No wonder I was confused.
The Misconception That Held Me Back
Here's what I got wrong for years: I thought epoll_wait was like polling—constantly checking if something happened. It's not. Understanding this distinction was my breakthrough moment.
There are two different kinds of "waiting" in async I/O, and I kept conflating them.
The first is waiting for a specific I/O operation. Traditional synchronous blocking I/O works this way: you call read(socket), and your entire thread freezes on that one line, waiting for that one socket's data. Until data arrives, you can't do anything else. Async I/O eliminates this: you can issue multiple I/O requests and go do other work.
The second is waiting in the event loop. After the event loop has processed all ready tasks, if there's nothing left to do, it calls epoll_wait. This isn't waiting for a specific I/O—it's waiting for "any event I care about to happen." And crucially, this wait is active: the scheduler deliberately calls epoll_wait, sets a timeout, and lets the thread sleep. If an event arrives during that time, the thread wakes up to handle it. If the timeout expires, it wakes up to check timers.
So epoll_wait is not "the OS will knock on your door when something happens." It's you actively going to check, but you can check for many things at once, set a maximum wait time, and your thread sleeps during the wait consuming zero CPU. If you never call epoll_wait, you never get notified. There's no mechanism where the OS interrupts your running code to say "hey, I/O is ready." All I/O notifications come through this "go and collect" action.
To summarize: if you ask whether the process needs to stop and wait for a specific I/O (like a blocking read call), the answer is absolutely not. But if you ask whether the process waits during epoll_wait to collect notifications, yes—that's the process's sleep time, the essence of the "wait for events" phase in an event loop.
This distinction sounds subtle, but it unlocked everything else for me.
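Here's a minimal sketch of that "go and collect" motion, using Python's selectors module (a thin wrapper over epoll/kqueue); the socketpair is just a stand-in for real network sockets:
import selectors
import socket
sel = selectors.DefaultSelector()      # wraps epoll on Linux, kqueue on macOS
a, b = socket.socketpair()             # stand-in for real sockets
sel.register(a, selectors.EVENT_READ)  # "this is a thing I care about"
# The "go and collect" action: the thread sleeps inside select() for up to one
# second. Nothing interrupts your running code to announce I/O; if you never
# call select(), you never hear about any event.
events = sel.select(timeout=1.0)
if events:
    print("something I registered became readable")
else:
    print("timeout expired with no events; time to check timers instead")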
I/O Multiplexing vs True Async I/O
While I was at it, I discovered another distinction I'd been missing.
"Multiplexing" means one wait can watch many things. Without it, to wait on 100 sockets, you'd need 100 threads, each watching one. With multiplexing (epoll/kqueue), one thread can tell the kernel: "These 100 sockets—wake me if any of them has activity." One thread, 100 concurrent waits.
But here's where it gets interesting. The difference between I/O multiplexing and "true" async I/O is who moves the data.
With I/O multiplexing (epoll model): when epoll_wait returns, it tells you "these file descriptors are ready," but the data is still in the kernel buffer. You have to call read() to copy data into user space. This read() is fast (data is already in kernel buffer), but it's still a synchronous call.
With true async I/O (io_uring / IOCP model): you tell the kernel upfront "when data arrives, put it directly into this buffer." The kernel not only receives data but also copies it to your specified user-space buffer. When you get the completion notification, data is already there—no read() call needed.
| Model | Who waits? | How many? | Who moves data? |
|---|---|---|---|
| Sync blocking | You block | One | You |
| Sync non-blocking | You poll | One | You |
| I/O multiplexing | You actively wait | Many | You |
| True async I/O | You actively wait | Many | The OS |
Most event loop libraries (libuv, tokio, asyncio) use epoll on Linux—that's I/O multiplexing. Because io_uring is relatively new (2019), many production environments haven't adopted it yet. So a more accurate statement is: these event loops rely on I/O multiplexing to achieve the effect of "async I/O"—not blocking a thread per I/O—but it's not true async I/O in the narrow sense.
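Here's the readiness model in miniature, again with Python's selectors module: the wait only reports which descriptor is ready, and the read that actually moves the data is still a call you make yourself.
import selectors
import socket
sel = selectors.DefaultSelector()
a, b = socket.socketpair()
sel.register(a, selectors.EVENT_READ)
b.sendall(b"hello")                    # make `a` readable
for key, _ in sel.select():            # readiness: "these fds are ready"
    data = key.fileobj.recv(4096)      # the data is still in the kernel buffer,
    print(data)                        # so *you* copy it out with recv()
# Under a completion model (io_uring / IOCP) you would hand the kernel a buffer
# up front and only be notified once the data is already sitting in it.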
This article focuses on layers three and four, counting from the bottom: the event loop and the scheduler. Let me walk through how different runtimes implement them, starting from the simplest.
JavaScript: Where It All Becomes Visible
JavaScript was where I finally started to understand, precisely because it's so constrained. One thread. Everything visible. The scheduling logic completely transparent.
The browser's event loop is slightly more complex than Node's because it has to handle rendering. Here's the complete cycle:
┌─────────────────────────────────────────────────────────────────┐
│ Browser Event Loop (One Iteration) │
│ │
│ ┌──────────────────┐ │
│ │ 1. Macrotask (1) │ ← setTimeout, I/O, user events... │
│ └────────┬─────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ 2. Microtasks │ ← Promise.then, queueMicrotask │
│ │ (all of them) │ │
│ └────────┬─────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ 3. Render check │ ← Time to render? (usually 16.67ms) │
│ └────────┬─────────┘ │
│ │ │
│ ├── No render needed ──→ Back to step 1 │
│ │ │
│ ▼ Render needed │
│ ┌──────────────────┐ │
│ │ 4. requestAnimationFrame callbacks (all) │
│ └────────┬─────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ 5. Layout │ │
│ └────────┬─────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ 6. Paint │ │
│ └────────┬─────────┘ │
│ ▼ │
│ Back to step 1 │
└─────────────────────────────────────────────────────────────────┘
Here's the pseudocode that made it crystal clear for me:
// Browser event loop main cycle
while (true) {
// ========== 1. Execute one macrotask ==========
if (macrotaskQueue.isNotEmpty()) {
task = macrotaskQueue.dequeue()
execute(task)
}
// ========== 2. Drain the microtask queue ==========
while (microtaskQueue.isNotEmpty()) {
microtask = microtaskQueue.dequeue()
execute(microtask)
// Note: executing microtasks may produce new microtasks
    // They get added to the queue, and this inner while loop keeps draining them
}
// ========== 3. Check if rendering needed ==========
if (shouldRender()) {
// ========== 4. Execute all rAF callbacks ==========
callbacks = rafQueue.snapshot()
rafQueue.clear()
for (callback in callbacks) {
execute(callback)
}
// ========== 5-6. Layout and paint ==========
recalculateStyles()
layout()
paint()
composite()
}
}
// ========== How various APIs enqueue tasks ==========
function setTimeout(callback, delay) {
  startTimer(delay, () => {
    macrotaskQueue.enqueue(callback)
  })
}
Promise.prototype.then = function (callback) {
  // (simplified: real Promises also handle chaining and settled state)
  microtaskQueue.enqueue(callback)
}
function requestAnimationFrame(callback) {
  rafQueue.enqueue(callback)
}
The key insight that clicked for me: cooperative scheduling. In JavaScript, a task decides when to yield. The scheduler can only switch when a task voluntarily finishes or hits an await. If a task doesn't yield, the scheduler just waits. The entire system's responsiveness depends on every task being "polite" and not hogging the thread for too long. One uncooperative task, and the whole system freezes.
This is both JavaScript's strength and weakness. Strength: simple, no concurrency issues—only one piece of code runs at a time, no race conditions, no locks. Weakness: one blocking task freezes everything, can't use multiple cores.
Python asyncio: Making the Implicit Explicit
Python's asyncio model is essentially the same as JavaScript's: single-threaded, cooperative, event-loop-based. The key difference is that Python makes the coroutine object visible and manipulable.
In JavaScript, calling an async function returns a Promise. The Promise wraps "the future result," but you can't see "where execution paused."
In Python, calling an async def function returns a coroutine object. This object represents "a paused execution"—it knows where it stopped and can be resumed.
import asyncio
async def my_task():
print("start")
await asyncio.sleep(1)
print("end")
# Calling it returns a coroutine object, nothing executes yet
coro = my_task()
print(type(coro)) # <class 'coroutine'>
# Must give it to the event loop to run
asyncio.run(coro)
The pseudocode for asyncio's event loop clarified the mechanics for me:
class EventLoop:
def __init__(self):
self.ready_queue = [] # Tasks ready to run immediately
self.sleeping = [] # Tasks waiting on timers
self.waiting_io = {} # Tasks waiting on I/O {fd: task}
def run(self, main_coro):
# Wrap entry coroutine as Task, add to ready queue
main_task = Task(main_coro)
self.ready_queue.append(main_task)
while self.has_work():
# ========== 1. Execute all ready tasks ==========
while self.ready_queue:
task = self.ready_queue.pop(0)
# Key: drive coroutine to next await
result = task.step()
if result.type == 'FINISHED':
pass # Task complete
elif result.type == 'SLEEP':
wake_time = now() + result.duration
self.sleeping.append((wake_time, task))
elif result.type == 'WAIT_IO':
self.waiting_io[result.fd] = task
# ========== 2. Wait for events ==========
timeout = self.nearest_wake_time()
events = epoll_wait(self.waiting_io.keys(), timeout)
# ========== 3. Move ready tasks back to queue ==========
for fd in events:
task = self.waiting_io.pop(fd)
self.ready_queue.append(task)
            for wake_time, task in list(self.sleeping):  # iterate a copy: we mutate below
                if wake_time <= now():
                    self.sleeping.remove((wake_time, task))
                    self.ready_queue.append(task)
class Task:
def __init__(self, coro):
self.coro = coro
def step(self):
# Drive coroutine to next await
try:
what_it_waits = self.coro.send(None)
return what_it_waits
except StopIteration as e:
return Result(type='FINISHED', value=e.value)
The crucial part is task.step(): each call runs the coroutine until the next await, then pauses. It returns what the coroutine is waiting for. The event loop uses this information to decide which queue to put the task in.
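You can feel this mechanism directly by driving a coroutine by hand. A minimal, hand-rolled sketch (not asyncio internals; the Sleep class here is a made-up awaitable whose only job is to hand a "what I'm waiting for" object up to whoever calls send()):
class Sleep:
    def __init__(self, duration):
        self.duration = duration
    def __await__(self):
        yield self                 # pause here; the yielded value pops out of send()
async def demo_task():
    print("start")
    await Sleep(1)
    print("end")
coro = demo_task()
request = coro.send(None)          # run the coroutine up to its first await
print(request.duration)            # 1: the coroutine wants to sleep for 1 second
try:
    coro.send(None)                # resume past the await; prints "end", then finishes
except StopIteration:
    print("task finished")
This is exactly the information the event loop's step() hands back: the object a coroutine yields describes what it's waiting for, and the loop files the task in the matching queue.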
And here's where my earlier insight about epoll_wait becomes concrete:
timeout = self.nearest_wake_time()
events = epoll_wait(self.waiting_io.keys(), timeout)
This call blocks. The process genuinely stops here, doing nothing, until an I/O is ready or timeout expires. This isn't polling.
Think of two ways to wait for food delivery. Polling: open the door every 30 seconds to check. You're constantly expending effort. Blocking wait: tell the delivery person "ring the bell when you arrive," then go to sleep. The bell wakes you up.
epoll_wait is the latter. You tell the OS "I care about these file descriptors, wake me if anything happens," then the process is suspended, removed from the run queue, consuming zero CPU. When the network card receives data, it triggers a hardware interrupt, the kernel wakes your process, and epoll_wait returns.
| Time | Your program | OS / hardware |
|---|---|---|
| t=0 | call epoll_wait(), process sleeps | |
| t=1ms | (sleeping) | CPU runs other programs |
| ... | (sleeping) | |
| t=50ms | (sleeping) | NIC receives packet; hardware interrupt; kernel wakes you |
| t=50.01ms | epoll_wait() returns, execution continues | |
From t=0 to t=50ms, your process isn't executing any code at all. It's frozen on that epoll_wait() line. This is why single-threaded event loops can efficiently handle massive concurrency—during waiting, they consume almost no CPU.
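You can watch this timeline with a few lines of Python (selectors standing in for the raw epoll_wait call; a background timer plays the part of the network card delivering a packet at roughly t=50ms):
import selectors
import socket
import threading
import time
sel = selectors.DefaultSelector()
a, b = socket.socketpair()
sel.register(a, selectors.EVENT_READ)
threading.Timer(0.05, lambda: b.sendall(b"x")).start()  # "packet" arrives ~50ms in
start = time.monotonic()
events = sel.select(timeout=5.0)   # the thread sleeps here, using (almost) no CPU
elapsed_ms = (time.monotonic() - start) * 1000
print(f"woke after {elapsed_ms:.1f} ms with {len(events)} event(s)")
# Prints roughly "woke after 50.x ms with 1 event(s)": the 5-second timeout never
# fires, because the arriving data wakes the sleeping thread early.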
How does timeout get decided? The event loop calculates: when does my nearest timer expire?
def nearest_wake_time(self):
if self.sleeping:
nearest = min(wake_time for wake_time, task in self.sleeping)
return nearest - now()
else:
return infinity # No timers, sleep until I/O arrives
This way, timers never get missed, and there's no unnecessary polling. If there are only I/O tasks, timeout is infinite—sleep until I/O arrives. If there's a timer in 100ms, sleep for 100ms or get woken early by I/O.
Tokio: Breaking the Single-Thread Barrier
Python asyncio and JavaScript share the same fundamental limitation: one thread, one CPU core. What if you need more?
Tokio's core innovation: multiple threads, each running an event loop, sharing a task pool.
┌─────────────────────────────────────────────────────────────────┐
│ Tokio Runtime │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Shared Task Queue │ │
│ │ [task_a] [task_b] [task_c] [task_d] [task_e] ... │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ▲ ▲ ▲ ▲ │
│ │ │ │ │ │
│ ┌─────┴───┐ ┌─────┴───┐ ┌─────┴───┐ ┌─────┴───┐ │
│ │ Worker │ │ Worker │ │ Worker │ │ Worker │ │
│ │ Thread 0│ │ Thread 1│ │ Thread 2│ │ Thread 3│ │
│ │ │ │ │ │ │ │ │ │
│ │ local Q │ │ local Q │ │ local Q │ │ local Q │ │
│ │ epoll │ │ epoll │ │ epoll │ │ epoll │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────────────┘
At any moment, multiple tasks genuinely execute in parallel on multiple CPU cores.
But this creates a new problem: how do multiple executors share tasks? In a single-threaded model it's simple: take one from the queue, run it, take another. With multiple threads, new questions appear: grabbing from one shared queue requires locking, and lock contention becomes a bottleneck; task execution times vary, so some threads are overwhelmed while others sit idle; and tasks spawn subtasks, which have to go somewhere.
Tokio's answer: Work Stealing.
Every worker thread has its own local queue, plus there's a global queue:
class Worker:
def __init__(self, worker_id, runtime):
self.local_queue = deque() # Local queue
self.runtime = runtime
def run(self):
while self.runtime.is_running():
task = self.find_task()
if task:
task.step()
else:
self.park() # No tasks, sleep
def find_task(self):
# Priority 1: Own local queue
if self.local_queue:
return self.local_queue.pop()
# Priority 2: Global queue
task = self.runtime.global_queue.try_pop()
if task:
return task
# Priority 3: Steal from other workers
for other_worker in self.runtime.workers:
if other_worker is not self:
stolen = other_worker.local_queue.try_steal_half()
if stolen:
self.local_queue.extend(stolen[1:])
return stolen[0]
return None
Local queue first: subtasks spawned by a task go into its worker's local queue. Grabbing from your own queue needs no locks (or only lightweight atomic operations), since you're the only one adding to it.
Stealing is the last resort: you only look at the global queue or steal from other workers when you have nothing else to do. Stealing requires coordination (locks or atomics), but since it only happens when a worker is idle, contention stays low.
Steal half: take not one task but half of the victim's queue, which reduces how often stealing has to happen.
Why is this efficient?
Locality: tasks tend to stay on the thread that spawned them. Parent and child tasks often access the same data, benefiting from CPU cache.
Load balancing: busy threads spawn tasks into their local queue; idle threads actively steal. Balance emerges naturally.
Low contention: most of the time you're only operating on your own queue, no locks needed.
But here's the key: it's still cooperative. Within each worker, an await is the only switch point. The difference is the blast radius: in a single-threaded model, one blocking task freezes the whole system; in a multi-threaded model, it only freezes one worker while the others keep running.
Go GMP: True Preemption
Go takes things further. Go's design goal is to make concurrent programming as simple as sequential code while ensuring system robustness. To achieve this, Go needs preemptive scheduling—if scheduling were cooperative, one poorly-written goroutine (like an infinite loop) would freeze the entire system. Go wants programmers not to worry about "should I yield here?"—the runtime handles it.
And to preempt, you must be able to pause and resume execution at arbitrary points, which requires saving complete execution context (local variables, call stack, registers). Therefore, Go must use stackful coroutines. This is a clear causal chain: design goal (robust concurrency) → needs preemption → needs arbitrary-point pausing → must have stacks.
Go has three core concepts:
G = Goroutine (task)
M = Machine (OS thread)
P = Processor (logical processor, execution permit)
The number of Ps usually equals the number of CPU cores and represents how many goroutines can run simultaneously. An M must hold a P to execute a G. Go also uses work stealing, but with one key difference: Go has preemption.
What does preemption mean? In JS, Python, Tokio, if a task doesn't yield voluntarily, it never switches. Go is different:
func badGoroutine() {
    x := 0
    for {
        x++ // tight loop with no function calls: no cooperative yield point
    }
}
In cooperative models, this freezes one executor. In Go, the runtime forcibly interrupts it.
Go 1.14+ uses signal-based asynchronous preemption: the runtime sends a signal (SIGURG) to the thread, forcibly interrupting execution, saving state, switching to another goroutine. Even if you write an infinite loop, Go can pause it to let others run.
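For contrast, here's what that same loop does to a cooperative, single-threaded runtime. A minimal asyncio sketch (the run call is commented out because it would genuinely hang): the ticker task never gets a turn, because the busy coroutine never reaches an await.
import asyncio
async def busy():
    while True:
        pass                       # never awaits: a cooperative scheduler can't switch away
async def ticker():
    while True:
        print("tick")              # starved: this line never prints
        await asyncio.sleep(1)
async def main():
    asyncio.create_task(ticker())
    await busy()
# asyncio.run(main())  # would spin forever with no output; this is the failure
#                      # mode Go's preemptive scheduler is designed to prevent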
The Spectrum of Scheduling
Looking at all four models together:
| Model | Threads | Preemption | Multi-core | Scheduling Complexity |
|---|---|---|---|---|
| JavaScript | 1 | No | No | Low |
| Python asyncio | 1 | No | No | Low |
| Tokio | N | No | Yes | Medium |
| Go GMP | M:N | Yes | Yes | High |
Go's stackful coroutines mean each goroutine has its own call stack and can be preempted anywhere. Rust/Tokio's stackless coroutines are compiled into state machines; they can only switch at await points, but each switch is cheaper.
Complexity increasing →
┌──────────────┬──────────────┬──────────────┬──────────────┐
│ JavaScript │ Python │ Tokio │ Go │
│ │ asyncio │ (Rust) │ GMP │
├──────────────┼──────────────┼──────────────┼──────────────┤
│ Single thread│ Single thread│ Multi-thread │ M:N │
│ Cooperative │ Cooperative │ Cooperative │ Preemptive │
│ Task queues │ Coroutine obj│ Work stealing│ Work stealing│
│ │ │ │ + preemption│
├──────────────┼──────────────┼──────────────┼──────────────┤
│ Simple, │ Simple, │ Uses all │ Uses all │
│ no locks │ no locks │ cores │ cores │
│ Can't use │ Can't use │ Low lock │ Truly robust │
│ multi-core │ multi-core │ contention │ Handles │
│ One blocks │ One blocks │ One blocks │ infinite │
│ all │ all │ one worker │ loops │
└──────────────┴──────────────┴──────────────┴──────────────┘
From left to right, complexity increases, but so does capability. The choice depends on your scenario:
If it's I/O-intensive with simple logic, JavaScript or Python asyncio is enough—simplicity is the advantage.
If you need multi-core but it's still mostly I/O-intensive, Tokio is a great choice.
If you have CPU-intensive tasks mixed in or need ultimate robustness, Go's GMP model is most powerful.
What This Journey Taught Me
What started as confusion about terminology became an exploration of fundamental trade-offs in systems design. The reason these concepts get bundled together isn't sloppiness—it's that they genuinely build on each other. You can't have a scheduler without an event loop, can't have an event loop without I/O multiplexing, can't have I/O multiplexing without async I/O as the goal.
The hierarchy I drew at the beginning isn't just a taxonomy. It's a map of dependencies, each layer enabling the one above it.
And understanding epoll_wait—really understanding that it's not polling, that the thread genuinely sleeps, that the OS genuinely wakes it through hardware interrupts—that was the key that unlocked everything else. Once I understood that one syscall, the rest of the architecture made sense.
If you're where I was—drowning in async terminology, unsure how coroutines relate to event loops relate to schedulers—I hope this helps. Start from the bottom. Understand what epoll_wait actually does. Then work your way up. The layers aren't arbitrary. They're the only way to build this kind of system, and seeing why is more valuable than memorizing the taxonomy.