It took me a long time to truly understand epoll. Not because it's complicated, but because I had the wrong mental picture: I imagined epoll as some kind of advanced polling, with the kernel constantly checking all file descriptors in the background. That's not what happens at all. To explain what epoll actually does, I need to start with the fd itself.
Everything Is an fd
Unix has to manage many kinds of I/O resources: files on disk, TCP sockets, pipes, terminal devices, block devices, even epoll instances themselves. These things have completely different underlying implementations, but the kernel doesn't want userspace programs to deal with those differences. So from the very beginning, Unix made a design decision: use a single integer handle to represent any resource you can do I/O on. That's the fd (file descriptor).
Each process has a file descriptor table in the kernel, where each entry points to a struct file object. This struct has a key field f_op — a function pointer table (struct file_operations) containing pointers to read, write, poll, release, and others. Different resource types fill in different implementations. When you call read(fd, buf, count), the kernel looks up the fd in the table, finds the struct file, and calls file->f_op->read(). You don't need to know whether it's a socket or a pipe behind the scenes — the kernel dispatches to the right implementation. It's essentially a layer of polymorphism.
This explains why every concept in this article revolves around fd: the first argument to read(fd, ...) is an fd because the kernel needs it to locate the kernel object and its function pointer table; fcntl(fd, F_SETFL, O_NONBLOCK) targets an fd because non-blocking is a flag on struct file; epoll_ctl() monitors an fd because epoll needs to hook callbacks onto the kernel object behind it. The fd is the sole entry point for userspace programs to interact with kernel I/O resources.
When You Block, the Kernel Sleeps for You
When you call read() on an fd, execution drops from userspace into kernel space. The kernel checks whether the fd's kernel buffer has any data. If it doesn't, the kernel changes your process's state from running to sleeping, removes it from the CPU's run queue, and places it on the fd's wait queue. The process completely yields the CPU and is no longer scheduled. Only when data arrives — the kernel copies it from the device to the kernel buffer, then from the kernel buffer to the userspace buffer, then moves the process back to the run queue — does read() return.
From your code's perspective, the read() call just hangs there. Nothing after it executes.
This model is simple, intuitive, and the most natural to write. It's still widely used today. The problem is: if you need to handle many connections simultaneously, you need many threads, each blocking on its own fd. Once the thread count grows large, context switching and memory overhead become unsustainable.
Non-blocking Just Hands the Waiting Back to You
After setting an fd to non-blocking mode with the O_NONBLOCK flag, calling read() when the kernel buffer is empty doesn't suspend your thread. Instead, the kernel immediately returns -1 and sets errno to EAGAIN. Your thread keeps its CPU time slice and continues executing the code that follows.
The difference between blocking and non-blocking is fundamentally a process scheduling perspective: when data isn't ready, does the kernel wait for you (suspend the thread), or does it immediately tell you "not ready" and let you decide what to do next? Blocking means the kernel does the waiting on your behalf. Non-blocking means the kernel hands control back — whether to wait, and how to wait, is now your decision.
But non-blocking I/O on its own has an obvious problem: if you want to know when data arrives, you have to call read() in a loop to check — that's busy polling, wasting CPU for nothing. So non-blocking I/O is almost never used alone. It's paired with a multiplexing mechanism.
One Thread, Ten Thousand fds
If blocking one thread per fd is wasteful, and busy-polling all fds is stupid, can one thread efficiently wait on ten thousand fds at once? That's the problem multiplexing solves.
Using epoll has two phases: register, then wait.
In the registration phase, you call epoll_create() to have the kernel create an epoll instance (which is itself an fd), then use epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event) to add the fds and event types you care about (e.g., EPOLLIN for readability). Behind the scenes, the kernel maintains two data structures: a red-black tree storing all monitored fds, and a ready list storing fds that already have events. Each time epoll_ctl registers an fd, the kernel inserts a node into the red-black tree and hooks a callback function onto that fd's kernel object — this callback will be the key to understanding epoll, but I'll come back to it shortly.
In the waiting phase, you call epoll_wait(). Here's a crucial insight: epoll_wait() itself is blocking. If the ready list is empty, the calling thread gets put to sleep by the kernel — moved from the run queue to a wait queue, exactly like a blocking read(). It is not busy-polling. When the thread wakes up, it gets back an array of ready fds and can call read() or write() on each one. Since the kernel has already confirmed these fds have data, read() typically returns immediately.
In real server programs, epoll_wait() sits inside a while(true) loop — web servers, proxies, and message queues that handle concurrent I/O long-term naturally need a persistent loop to receive and process events. But don't confuse this loop with busy polling: each epoll_wait() call puts the thread to sleep, yielding the CPU. That's fundamentally different from calling read() in a loop and checking for EAGAIN. Event loop frameworks (libuv, Tokio's reactor) typically compute epoll_wait's timeout parameter dynamically — for example, checking how long until the nearest timer fires and using that delta as the timeout, so timers fire on time without the loop spinning idle.
I Thought epoll Was Traversing — It Was Just Glancing
This is where I got stuck the longest when trying to understand epoll.
It's natural to assume that when epoll_wait() is called, the kernel traverses every fd in the red-black tree, checks each one's buffer for data, and picks out the ones that are ready. That's exactly what I thought at first. And the older select/poll mechanisms actually do work this way — every call checks every registered fd, so 10,000 connections means 10,000 checks.
But epoll doesn't do that. When epoll_wait() is called, the kernel just glances at the ready list. If it's not empty, it copies the fds to a userspace array and returns. If it's empty, it puts the thread to sleep. No traversal, no buffer checks. So who puts fds into the ready list? Not epoll_wait() itself — that happens outside of epoll_wait(), in callback functions that fire the moment data arrives.
The Hitchhiking Callback
To understand this callback mechanism, you need to see something more fundamental: the wait queue.
Every I/O resource in the kernel comes with its own wait queue — a TCP socket's is sock->sk_wq, a pipe's is pipe_inode_info->wait. It's just a linked list (wait_queue_head_t) where each node (wait_queue_entry) holds a function pointer and a next pointer. In a world without epoll, when you do a blocking read() on a socket, the kernel hangs your thread on this very wait queue. When data arrives, the socket calls wake_up() to walk the queue and wake everyone waiting. This is the socket's own built-in logic — it has nothing to do with epoll.
What epoll does is hook its own callback onto that same wait queue — hitchhiking.
Specifically, when epoll_ctl() registers an fd, the kernel calls that fd's file->f_op->poll() method. This method does two things: first, it checks the current state — for example, a TCP socket's implementation (tcp_poll()) checks whether the receive buffer is empty, and if data is already there, returns POLLIN so epoll can immediately add the fd to the ready list; second, it registers a callback — it creates a wait_queue_entry, sets its function pointer to ep_poll_callback, and inserts it at the tail of the resource's wait queue. Just a linked list insertion.
ep_poll_callback does one simple thing: add the fd to epoll's ready list, then wake up the thread blocking on epoll_wait(). Every type of fd registers this exact same callback.
After the NIC Receives Data
Now that the callbacks are hooked up, let's see what happens when data actually arrives.
Take a TCP socket. The NIC receives a packet and fires a hardware interrupt. The interrupt handler moves the data from the NIC's hardware buffer into an sk_buff struct in memory, then triggers a soft interrupt. When the soft interrupt is scheduled, the data enters the kernel's protocol stack for layer-by-layer processing — the link layer strips the frame header, the IP layer resolves routing, and the TCP layer finds the matching socket by its four-tuple and runs the state machine logic. Once the TCP layer confirms the data is valid, it hangs the sk_buff on the socket's receive queue, then calls sock_def_readable(), which internally executes wake_up(sk->sk_wq):
/* simplified sketch — the real wake_up() is a macro over __wake_up() */
void wake_up(wait_queue_head_t *wq) {
    struct wait_queue_entry *entry;
    list_for_each_entry(entry, &wq->head, entry) {
        entry->func(entry);
    }
}
It walks the wait queue from the head, pulling out and calling each node's function pointer. If a blocking read() thread is on the queue, it gets woken up to read data. If epoll's ep_poll_callback is on the queue, the fd gets pushed into the ready list and the thread blocking on epoll_wait() gets woken up. The socket itself has no idea who it's waking — it just calls wake_up() as it always does. Epoll is merely hitchhiking on this ride.
This design works across all fd types because the interface the kernel exposes to epoll is uniform: every resource's poll method returns an event mask indicating its current state and accepts a poll_table for registering callbacks. Epoll doesn't need to know whether the underlying resource is a socket, a pipe, or a timerfd — it only talks to this unified poll interface.
One Sleep, All fds
In a world without epoll, waiting on 10,000 connections means 10,000 threads, each blocking on its own fd — 10,000 sleeps for 10,000 fds.
With epoll, each of those 10,000 fds has a callback on its wait queue, and all callbacks point to the same ready list. Only one thread sleeps, blocking on epoll_wait(). When data arrives on any fd, the resource's own wake_up() fires epoll's callback, the callback pushes the fd into the ready list and wakes that one thread. The thread wakes up and processes all ready fds in one pass.
One sleep, all fds. That's what epoll is.