Asynchronous I/O in Windows for Unix Programmers (2011) (tinyclouds.org)
73 points by luu on June 20, 2015 | 15 comments


Crazy that IOCPs were in NT 3.5 in 1994[1], but it took FreeBSD until 2000 to add kqueue[2], and Linux until 2002 to add epoll[3][4].

I guess it just must not have mattered that much? Is it only recently that machines became primarily I/O bound?

  [1] https://en.wikipedia.org/wiki/Windows_NT_3.5
  [2] https://en.wikipedia.org/wiki/History_of_FreeBSD
  [3] https://en.wikipedia.org/wiki/Epoll
  [4] https://lwn.net/Articles/13197/


I think being I/O bound is orthogonal to whether you need kqueue or epoll.

You could be I/O bound on a small number of simultaneous sockets, in which case select() would work fine. Or you could be I/O bound on a lot of them, in which case select() would take too much CPU by scanning the descriptor table repeatedly.

http://stackoverflow.com/questions/970979/what-are-the-diffe...
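To make that concrete, here's a rough sketch of the select() pattern (event_loop and the socks array are placeholders I made up, error handling omitted). Both the fd_set rebuild and the FD_ISSET scan touch every descriptor on every iteration:

  #include <sys/select.h>
  #include <unistd.h>

  void event_loop(int *socks, int nsocks)
  {
      for (;;) {
          fd_set readfds;
          int maxfd = -1;

          FD_ZERO(&readfds);                  /* rebuilt from scratch each time: O(n) */
          for (int i = 0; i < nsocks; i++) {
              FD_SET(socks[i], &readfds);
              if (socks[i] > maxfd)
                  maxfd = socks[i];
          }

          if (select(maxfd + 1, &readfds, NULL, NULL, NULL) <= 0)
              continue;

          for (int i = 0; i < nsocks; i++) {  /* scan every fd to find the ready ones: O(n) */
              if (FD_ISSET(socks[i], &readfds)) {
                  char buf[4096];
                  read(socks[i], buf, sizeof buf);  /* handle the data somehow */
              }
          }
      }
  }

(select() also caps out at FD_SETSIZE descriptors, typically 1024, which is another reason it falls over with lots of connections.)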


Oh, that makes sense.

Older servers couldn't handle enough clients for connection-related limits to matter.


IOCP is just the user-facing artifact; the interesting piece is the I/O request packet, or IRP, at the kernel level, which dates back to VMS.


Actually, there is quite a bit of context here. IOCP had been a headache for Unix programmers for quite a while. A few examples:

- libevent 2.0 was created precisely to support IOCP: https://blog.torproject.org/blog/some-notes-progress-iocp-an...

- Apache 2.0, again, was created to support the Windows threading model and IOCP

- this particular article was written in the context of libev not supporting IOCP, which was needed for decent node.js performance on Windows: https://nodejs.org/codeconf.pdf

(correct me if I'm wrong!)


Porting Unix applications that used select/poll/epoll in a performant way was a common enough problem that Microsoft added WSAPoll to make things easier: http://blogs.msdn.com/b/wndp/archive/2006/10/26/wsapoll.aspx

The biggest abstraction headaches are usually non-blocking file I/O and DNS lookups, as many OSes don't have any support for them; you have to fall back to a thread pool.
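A rough sketch of that fallback, with made-up names (file_req, submit_read, notify_fd) and one thread per request rather than a real pool: a worker does the blocking read, then pokes the event loop through a pipe so the result can be picked up with the same poll/epoll machinery used for sockets.

  #include <pthread.h>
  #include <sys/types.h>
  #include <unistd.h>

  struct file_req {
      int fd;            /* file to read, opened with a plain blocking open() */
      off_t offset;
      char buf[4096];
      ssize_t result;
      int notify_fd;     /* write end of a pipe the event loop is watching */
  };

  static void *worker(void *arg)
  {
      struct file_req *req = arg;
      /* blocks here, on the worker thread, instead of in the event loop */
      req->result = pread(req->fd, req->buf, sizeof req->buf, req->offset);
      /* wake the loop by writing the request pointer down the pipe */
      write(req->notify_fd, &req, sizeof req);
      return NULL;
  }

  /* Queue one read; the event loop learns about completion via notify_fd. */
  static void submit_read(struct file_req *req)
  {
      pthread_t t;
      pthread_create(&t, NULL, worker, req);   /* a real pool would reuse threads */
      pthread_detach(t);
  }

This is roughly the shape of what libuv ended up doing for its filesystem operations, just with a reusable worker pool instead of a thread per request.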


Correct - this was the original design doc for what became https://github.com/libuv/libuv


> regular file system files do not support non-blocking operations. (Disturbingly no man page mentions this rather important fact.)

This is the most disturbing thing I've read about UNIX file I/O in a long time. Think about what the article is claiming:

- Non-blocking filesystem calls don't work on files!

- This fact isn't documented!

- The situation has persisted for a long time (I kind of assume it's been 10+ years since non-blocking I/O was introduced in popular UNIXes)

Can any gurus confirm these allegations the article makes?


* Non-blocking filesystem calls do not in fact work on files. If you think about it for a minute this makes sense. Network data flows in to you asynchronously; it will show up whether or not you do anything (and trigger poll). But for a file, you have to request a read before the OS will make the disk go get the data. This is why the async read model works for disk files where the poll/read model doesn't: as far as poll is concerned a disk file is always readable, because the data is notionally already there, so poll tells you nothing useful. In Linux in particular the kernel does not (I think, I may be getting the explanation wrong) have a wait queue for page faults; it can only block the faulting process, so you can't do async (non-blocking) cached file I/O at all. (There's a small demonstration sketched after these points.)

* It is not well documented in man pages. It's a relatively well-known problem if you google for it.

* The situation for traditional non-blocking reads of disk files has persisted because it is unsolvable in general; poll/read just doesn't work, for the reasons described above. You need async read/write, which Linux in particular does not have good support for (see above regarding cached reads). I think the situation on the BSDs may be better, but I'm not sure. There are a number of articles in the kernel section on LWN discussing async I/O that may be enlightening if you want to know more about what's been done here. (http://lwn.net/Kernel/Index/#Asynchronous_IO)
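If you want to see the first point for yourself, a tiny test along these lines (the filename is just a placeholder) shows poll() claiming a regular file is always readable even though the read() can still stall on the disk, and O_NONBLOCK being silently ignored:

  #include <fcntl.h>
  #include <poll.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      int fd = open("somefile", O_RDONLY | O_NONBLOCK);
      if (fd < 0) return 1;

      struct pollfd p = { .fd = fd, .events = POLLIN };
      int n = poll(&p, 1, 0);    /* zero timeout: don't actually wait */

      /* Prints "ready: 1 revents: POLLIN" even if nothing is cached;
       * the read() below may still block while the disk seeks. */
      printf("ready: %d revents: %s\n", n, (p.revents & POLLIN) ? "POLLIN" : "none");

      char buf[4096];
      ssize_t r = read(fd, buf, sizeof buf);
      printf("read returned %zd\n", r);
      return 0;
  }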


> unlike the modern constant-time multiplexers like epoll

I'm curious as to how exactly this works. I mean, if you have 10000 file descriptors and you need to process all of them, there's no way around doing that processing 10000 times. To me it appears O(n) is a lower bound, and what epoll does is just move the selection loop from the application into the kernel.


Setting up the wait for 10k file descriptors will still be O(n), but the actual wait is O(1). And presumably you're waiting many times, and not changing which 10k fds you're polling between waits. Each wait is O(1) because the kernel just calls you back when your wait is satisfied: it associates a list of waiters with each file descriptor.


The O(1) part is:

* It's O(1) for user mode to know which fd in the set is signaled after returning from the wait syscall.

With select() or poll() the caller needs to traverse the whole list to know which fds are ready. With epoll, kqueue, or IOCP, you are draining a queue of events; each event is removed from the queue in O(1), without re-scanning the entire table.

* It's O(1) to add an fd to the set of watched fds, or to re-enter the "blocked" state.

Because select() and poll() are passed the entire list of watched fds on every call, the kernel side ends up copying and scanning the whole list at every wait step [O(n)], whereas epoll/kqueue/IOCP just leave the list of fds in kernel memory across calls.
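For comparison with the select() sketch upthread, the epoll side of the same loop looks roughly like this (epoll_loop and the socks array are made-up names, error handling omitted). The fds are registered once, and each epoll_wait() returns only the ready events, so the per-wait work is proportional to what's ready rather than to what's watched:

  #include <sys/epoll.h>
  #include <unistd.h>

  void epoll_loop(int *socks, int nsocks)
  {
      int ep = epoll_create1(0);

      for (int i = 0; i < nsocks; i++) {          /* one-time registration, kept in kernel memory */
          struct epoll_event ev = { .events = EPOLLIN, .data.fd = socks[i] };
          epoll_ctl(ep, EPOLL_CTL_ADD, socks[i], &ev);
      }

      for (;;) {
          struct epoll_event events[64];
          int n = epoll_wait(ep, events, 64, -1); /* returns only ready fds, no rescan of the set */

          for (int i = 0; i < n; i++) {           /* loop over ready fds only */
              char buf[4096];
              read(events[i].data.fd, buf, sizeof buf);  /* handle the data */
          }
      }
  }

The up-front epoll_ctl registration is where the "setup is still O(n)" caveat from the parent comment comes in; the steady-state loop never touches the full set again.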


I was curious enough about the O(1) claims to look at the source code[1], and I don't think this is true:

> It's O(1) to add an fd to the set of watched fds

It uses a red-black tree to keep track of the set of watched fds associated with each epoll fd, which makes insertion O(log n). However, waiting and notification can be done in O(1).

[1] http://lxr.free-electrons.com/source/fs/eventpoll.c


OK, so I was being a bit fast and loose with my description of insert. What I was going for is that it doesn't copy and scan the entire list every time.

It's also not too hard to imagine a constant-time implementation; the API would allow for it. Poll, by contrast, is broken at the API level: it implies O(n) work at every call and there's no good way to get around it.


Imagine you're the kernel, and you get notified that some new data is available on a connection. The basic operations you have to do are:

1) Figure out the associated file descriptor (hashmap lookup perhaps)

2) Copy the new data to the socket's buffer

3) Wake up any threads waiting to read from the socket, or polling with a read interest

Obviously, that ignores a lot of protocol work, but that's the general idea. There isn't any need to loop over all of the open sockets.
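As a purely illustrative sketch (user-space C, not actual kernel code, all names made up), the per-socket bookkeeping might look something like this; step 1's fd-to-socket lookup is assumed to have already happened before deliver() is called:

  #include <stddef.h>
  #include <string.h>

  struct waiter {
      struct waiter *next;
      void (*wake)(void *ctx);   /* e.g. mark an epoll event ready, or unblock a thread */
      void *ctx;
  };

  struct socket_buf {
      char data[65536];
      size_t len;
      struct waiter *waiters;    /* only the threads/pollers interested in THIS socket */
  };

  /* Called when the network stack has new bytes for one connection. */
  void deliver(struct socket_buf *s, const char *bytes, size_t n)
  {
      memcpy(s->data + s->len, bytes, n);    /* 2) copy into this socket's buffer (no bounds check; just a sketch) */
      s->len += n;

      for (struct waiter *w = s->waiters; w; w = w->next)
          w->wake(w->ctx);                   /* 3) wake only this socket's waiters */
  }

The point being that delivery only ever touches the one socket and its own waiter list, never the whole descriptor table.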




