It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web is a big place now.
And computers are big, too. You can buy a 500MHz machine with 1 gigabyte of RAM and six 100Mbit/sec Ethernet cards for $1500 or so. Let's see - at 10000 clients, that's 50KHz, 100Kbytes, and 60Kbits/sec per client. It shouldn't take any more horsepower than that to take four kilobytes from the disk and send them to the network once a second for each of ten thousand clients. (That works out to $0.15 per client, by the way. Those $100/client licensing fees some operating systems charge are starting to look a little heavy!) So hardware is no longer the bottleneck.
One of the busiest ftp sites, cdrom.com, actually can handle 10000 clients simultaneously through a Gigabit Ethernet pipe to its ISP. (Here's a somewhat dated page about their configuration.) Pipes this fast aren't common yet, but technology is improving rapidly.
There's interest in benchmarking this kind of configuration; see the discussion on June 28th 1999 on linux-kernel, in which people propose setting up a testbed that can be used for ongoing benchmarks.
There appears to be interest in this direction from the NSF, too; see the Web100 project.
With that in mind, here are a few notes on how to configure operating systems and write code to support thousands of clients. The discussion centers around Unix-like operating systems, for obvious reasons.
If you haven't read it already, go out and get a copy of Unix Network Programming: Networking APIs: Sockets and XTI (Volume 1) by the late W. Richard Stevens. It describes many of the I/O strategies and pitfalls related to writing high-performance servers. It even talks about the 'thundering herd' problem.
There seem to be four ways of writing a fast web server to handle many clients:
1. Serve many clients with each thread: set nonblocking mode on all network handles, and use select() or poll() to tell which network handle has data waiting. This is the traditional favorite. It's still being improved; see e.g. Niels Provos' benchmarks with hinting poll and thttpd.
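For concreteness, here's a minimal sketch of that kind of single-threaded poll() loop in C. Treat it as an outline only: error handling, removal of closed connections, and the actual request handling are omitted, and the helper names are made up for the example.

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <poll.h>
    #include <fcntl.h>
    #include <unistd.h>

    static void set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

    /* One thread, many clients: poll() says which handles have data waiting. */
    static void event_loop(int listen_fd, struct pollfd *fds, int maxfds)
    {
        int nfds = 1;
        fds[0].fd = listen_fd;
        fds[0].events = POLLIN;

        for (;;) {
            int i;
            poll(fds, nfds, -1);               /* block until something is ready */

            if ((fds[0].revents & POLLIN) && nfds < maxfds) {
                int client = accept(listen_fd, NULL, NULL);   /* new connection */
                if (client >= 0) {
                    set_nonblocking(client);
                    fds[nfds].fd = client;
                    fds[nfds].events = POLLIN;
                    nfds++;
                }
            }
            for (i = 1; i < nfds; i++) {
                if (fds[i].revents & POLLIN) {
                    char buf[4096];
                    ssize_t n = read(fds[i].fd, buf, sizeof buf);
                    /* n > 0: parse and serve the request; n == 0: client closed;
                       n < 0 with errno == EAGAIN: nothing there after all. */
                    (void) n;
                }
            }
        }
    }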
An important bottleneck in this method is that read() or sendfile()
from disk blocks if the page is not in core at the moment;
setting nonblocking mode on a disk file handle has no effect.
Same thing goes for memory-mapped disk files.
The first time a server needs disk I/O, its process blocks,
all clients must wait, and that raw nonthreaded performance goes to waste.
Worker threads or processes that do the disk I/O can get around this
bottleneck. One approach is to use memory-mapped files,
and if mincore() indicates I/O is needed, ask a worker to do the I/O,
and continue handling network traffic. Jef Poskanzer mentions that
Pai, Druschel, and Zwaenepoel's Flash web server uses this trick; they gave a talk at
Usenix '99 on it.
It looks like mincore() is available in BSD-derived Unixes
like FreeBSD
and Solaris, but is not part
of the Single Unix Specification.
It's available as part of Linux as of kernel 2.3.51,
thanks to Chuck Lever.
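Here's a rough sketch of the mincore() check behind that trick. It assumes the file has already been mmap()ed, page_is_resident() is a made-up helper name, and the exact prototype of mincore() varies between systems (the vector is unsigned char * on Linux, char * on the BSDs), hence the cast.

    #include <sys/mman.h>
    #include <unistd.h>

    /* Ask the VM system whether the first page of a mapped file region is
       resident.  If it isn't, touching it would block on disk I/O, so a
       nonblocking server would hand the request to a worker instead. */
    static int page_is_resident(void *page_aligned_addr)
    {
        unsigned char vec[1];
        size_t pagesize = (size_t) sysconf(_SC_PAGESIZE);
        if (mincore(page_aligned_addr, pagesize, (void *) vec) < 0)
            return 0;              /* be pessimistic on error */
        return vec[0] & 1;         /* low bit set => page is in core */
    }

If it returns 0, the main loop queues the request for a worker thread or process and goes right back to servicing the network.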
There are several ways for a single thread to tell which of a set of nonblocking sockets are ready for I/O:
The idea behind /dev/poll is to take advantage of the fact that often poll() is called many times with the same arguments. With /dev/poll, you get an open handle to /dev/poll, and tell the OS just once what files you're interested in by writing to that handle; from then on, you just read the set of currently ready file descriptors from that handle. It looks like this is lower overhead than even the realtime signal approach described below.
It appeared quietly in Solaris 7 (see patchid 106541) but its first public appearance was in Solaris 8; according to Sun, at 750 clients, this has 10% of the overhead of poll().
Linux is also starting to support /dev/poll, although it's not in the main kernel as of 2.4.4. See N. Provos, C. Lever, "Scalable Network I/O in Linux," May, 2000. [FREENIX track, Proc. USENIX 2000, San Diego, California (June, 2000).] A patch implementing /dev/poll for the 2.2.14 Linux kernel is available from www.citi.umich.edu. Note: Niels reports that under heavy load it can drop some readiness events. Kevin Clark updated it for the 2.4.3 kernel (see www.cs.unh.edu/~kdc/dev-poll-patch-linux-2.4.3), and is looking at the load problem (as of 16 May 2001)...
On 11 July 2001, Davide Libenzi posted a reimplementation of /dev/poll for Linux; see www.xmailserver.org/linux-patches/nio-improve.html for his results. Unlike the previous implementation of /dev/poll for Linux, it is said to be completely free of linear scans, and scales better to thousands of connections.
Solaris's /dev/poll support may be the first practical example of the kind of API proposed in [Banga, Mogul, Druschel '99], although Winsock's WSAAsyncSelect might qualify, too.
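Here's what using /dev/poll looks like in practice, sketched from the Solaris 8 interface; check <sys/devpoll.h> and the man page on your system for the exact structure and ioctl names, and note that handle_io() is a stand-in for the application's per-connection code.

    #include <sys/devpoll.h>     /* struct dvpoll, DP_POLL (Solaris) */
    #include <sys/ioctl.h>
    #include <poll.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define MAXEVENTS 10000
    extern void handle_io(int fd, short revents);   /* application-supplied */

    /* Open the polling handle once. */
    static int devpoll_open(void)
    {
        return open("/dev/poll", O_RDWR);
    }

    /* Tell the kernel once, by writing a pollfd, which fd we care about. */
    static void devpoll_watch(int dpfd, int fd)
    {
        struct pollfd pfd;
        pfd.fd = fd;
        pfd.events = POLLIN;
        pfd.revents = 0;
        write(dpfd, &pfd, sizeof pfd);
    }

    /* Each time around the loop, just ask which descriptors are ready. */
    static void devpoll_dispatch(int dpfd)
    {
        struct pollfd ready[MAXEVENTS];
        struct dvpoll dvp;
        int i, n;

        dvp.dp_fds = ready;
        dvp.dp_nfds = MAXEVENTS;
        dvp.dp_timeout = -1;        /* block until something happens */
        n = ioctl(dpfd, DP_POLL, &dvp);
        for (i = 0; i < n; i++)
            handle_io(ready[i].fd, ready[i].revents);
    }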
FreeBSD 5.0 (and possibly 4.1) supports a generalized alternative to poll() called kqueue()/kevent(). (See also Jonathan Lemon's page.) Like /dev/poll, you allocate a listening object, but rather than opening the file /dev/poll, you call kqueue() to allocate one. To change the events you are listening for, or to get the list of current events, you call kevent() on the descriptor returned by kqueue(). It can listen not just for socket readiness, but also for plain file readiness, signals, and even for I/O completion.
Note: as of October 2000, the threading library on FreeBSD does not interact well with kqueue(); evidently, when kqueue() blocks, the entire process blocks, not just the calling thread.
Several examples and libraries using kqueue() are available.
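Here's a minimal sketch of the kqueue()/kevent() pattern (handle_read() is a stand-in for the application's per-connection code):

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>

    extern void handle_read(int fd);            /* application-supplied */

    /* kqueue() allocates the listening object (instead of opening /dev/poll). */
    static int kq_setup(void)
    {
        return kqueue();
    }

    /* kevent() with a changelist registers interest in readability of fd. */
    static void kq_watch(int kq, int fd)
    {
        struct kevent change;
        EV_SET(&change, fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
        kevent(kq, &change, 1, NULL, 0, NULL);
    }

    /* kevent() with an eventlist fetches whatever became ready. */
    static void kq_dispatch(int kq)
    {
        struct kevent events[64];
        int i;
        int n = kevent(kq, NULL, 0, events, 64, NULL);   /* NULL timeout = block */
        for (i = 0; i < n; i++)
            handle_read((int) events[i].ident);
    }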
The 2.3.21 Linux kernel introduced a way of using realtime signals to replace poll().
fcntl(fd, F_SETSIG, signum) associates a signal with each file descriptor, and raises that signal
when I/O becomes possible on it, i.e. when a normal read() or write() would succeed.
To use this, you choose a realtime signal number, mask that signal and SIGIO with sigprocmask(),
use fcntl(fd, F_SETSIG, signum) on each fd, write a normal poll() outer loop, and
inside it, after you've handled all the fd's noticed by poll(), you
loop calling
sigwaitinfo().
If sigwaitinfo or sigtimedwait returns your realtime signal, siginfo.si_fd and siginfo.si_band give the
same information as pollfd.fd and pollfd.revents would after a call to poll(),
so you handle the I/O, and continue calling sigwaitinfo().
If sigwaitinfo returns a traditional SIGIO, the signal queue overflowed,
so you flush the signal queue by temporarily changing the signal handler to SIG_DFL,
and break back to the outer poll() loop.
You can support older kernels by surrounding
the sigwaitinfo code with #if defined(LINUX_VERSION_CODE) && (LINUX_VERSION_CODE >= KERNEL_VERSION(2,3,21)).
See Zach Brown's phhttpd for example code that uses this feature, and deals with some of the gotchas,
e.g. events queued before an fd is dealt with or closed, but arriving afterwards.
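Here's a rough sketch of that recipe. It follows the description above rather than any particular server, so treat it as an outline: the choice of realtime signal, the handle_io() callback, and the overflow recovery are placeholders, and F_SETSIG and siginfo.si_fd are Linux-specific (they need _GNU_SOURCE).

    #define _GNU_SOURCE            /* F_SETSIG, O_ASYNC, siginfo_t.si_fd */
    #include <fcntl.h>
    #include <signal.h>
    #include <poll.h>
    #include <unistd.h>

    extern void handle_io(int fd, long band);   /* application-supplied */

    /* Per-descriptor setup: nonblocking, owned by us, signal-driven. */
    static void setup_fd(int fd, int rtsig)
    {
        fcntl(fd, F_SETOWN, getpid());
        fcntl(fd, F_SETSIG, rtsig);
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK | O_ASYNC);
    }

    static void event_loop(struct pollfd *fds, int nfds, int rtsig)
    {
        sigset_t sigs;
        siginfo_t info;

        sigemptyset(&sigs);
        sigaddset(&sigs, rtsig);                /* e.g. SIGRTMIN + 1 */
        sigaddset(&sigs, SIGIO);
        sigprocmask(SIG_BLOCK, &sigs, NULL);

        for (;;) {
            /* Outer loop: an ordinary poll() picks up anything already pending. */
            poll(fds, nfds, 0);
            /* ... handle every fd whose revents field is set ... */

            /* Inner loop: pull readiness events off the realtime signal queue. */
            for (;;) {
                if (sigwaitinfo(&sigs, &info) < 0)
                    continue;
                if (info.si_signo == SIGIO)
                    /* The realtime signal queue overflowed; flush it as
                       described above, then fall back to the outer poll(). */
                    break;
                handle_io(info.si_fd, info.si_band);
            }
        }
    }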
[Provos, Lever, and Tweedie 2000] describes a recent benchmark of phhttpd using a variant of sigtimedwait(), sigtimedwait4(), that lets you retrieve multiple signals with one call. Interestingly, the chief benefit of sigtimedwait4() for them seemed to be it allowed the app to gauge system overload (so it could behave appropriately). (Note that poll() provides the same measure of system overload.)
Signal-per-fd: a report by Chandra and Mosberger (www.hpl.hp.com/techreports/2000/HPL-2000-174.html) proposes a modification called "signal-per-fd" which gets rid of the possibility of realtime signal queue overflow, and compares the performance of this scheme with select() and /dev/poll. Read it!
A patch implementing the scheme proposed there was
posted to linux-kernel on 18 May 2001 by Vitaly Luban,
and lives at www.luban.org/GPL/gpl.html.
2. Serve one client with each server thread, and let read() and write() block. This has the disadvantage of using a whole stack frame for each client, which costs memory. Many OSes also have trouble handling more than a few hundred threads.
3. Use asynchronous I/O. This has not yet become popular, probably because few operating systems support asynchronous I/O, and possibly because it's hard to use. Under Unix, asynchronous I/O is done with the aio_ interface (scroll down from that link to "Asynchronous input and output"), which associates a signal and value with each I/O operation. Signals and their values are queued and delivered efficiently to the user process. This is from the POSIX 1003.1b realtime extensions, and is also in the Single Unix Specification, version 2, and in glibc 2.1. The generic glibc 2.1 implementation may have been written for standards compliance rather than performance.
SGI has implemented high-speed AIO with kernel support. As of version 1.1, it's said to work well with both disk I/O and sockets. It seems to use kernel threads.
Various people appear to be working on a different implementation that does not use kernel threads and should scale better; it probably won't be available until kernel 2.5, though.
The O'Reilly book POSIX.4: Programming for the Real World is said to include a good introduction to aio.
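For flavor, here's a rough sketch of the aio_ interface with signal notification; start_read() and finish_read() are made-up helper names, and on Linux/glibc you link with -lrt.

    #include <aio.h>
    #include <signal.h>
    #include <string.h>
    #include <sys/types.h>

    /* Start an asynchronous read; completion is reported by queueing the
       given realtime signal with the aiocb pointer as the signal's value. */
    static int start_read(struct aiocb *cb, int fd, void *buf, size_t len,
                          off_t off, int signo)
    {
        memset(cb, 0, sizeof *cb);
        cb->aio_fildes = fd;
        cb->aio_buf = buf;
        cb->aio_nbytes = len;
        cb->aio_offset = off;
        cb->aio_sigevent.sigev_notify = SIGEV_SIGNAL;
        cb->aio_sigevent.sigev_signo = signo;
        cb->aio_sigevent.sigev_value.sival_ptr = cb;
        return aio_read(cb);
    }

    /* Later, when that signal is pulled off the queue (e.g. by sigwaitinfo()): */
    static ssize_t finish_read(siginfo_t *info)
    {
        struct aiocb *cb = info->si_value.sival_ptr;
        if (aio_error(cb) != 0)
            return -1;             /* still in progress, or failed */
        return aio_return(cb);     /* byte count; call exactly once per request */
    }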
4. Build the server code into the kernel. Novell and Microsoft are both said to have done this at various times,
at least one NFS implementation does this,
khttpd does this for Linux
and static web pages, and
"TUX" (Threaded linUX webserver)
is a blindingly fast and flexible kernel-space HTTP server by Ingo Molnar for Linux.
Ingo's September 1, 2000 announcement
says an alpha version of TUX can be downloaded from
ftp://ftp.redhat.com/pub/redhat/tux,
and explains how to join a mailing list for more info.
The linux-kernel list has been discussing the pros and cons of this
approach, and the consensus seems to be that instead of moving web servers
into the kernel, the kernel should have the smallest possible hooks added
to improve web server performance. That way, other kinds of servers
can benefit. See e.g.
Zach Brown's remarks
about userland vs. kernel http servers.
Richard Gooch has written a paper discussing I/O options.
The Apache mailing lists have some interesting posts about why they prefer not to use select() (basically, they think that makes plugins harder). Still, they're planning to use select()/poll()/sendfile() for static requests in Apache 2.0.
Mark Russinovich wrote an editorial and an article discussing I/O strategy issues in the 2.2 Linux kernel. Worth reading, even though he seems misinformed on some points. In particular, he seems to think that Linux 2.2's asynchronous I/O (see F_SETSIG above) doesn't notify the user process when data is ready, only when new connections arrive. This seems like a bizarre misunderstanding. See also comments on an earlier draft, a rebuttal from Mingo, Russinovich's comments of 2 May 1999, a rebuttal from Alan Cox, and various posts to linux-kernel. I suspect he was trying to say that Linux doesn't support asynchronous disk I/O, which used to be true, but now that SGI has implemented KAIO, it's not so true anymore.
See these pages at sysinternals.com and MSDN for information on "completion ports", which he said were unique to NT; in a nutshell, win32's "overlapped I/O" turned out to be too low level to be convenient, and a "completion port" is a wrapper that provides a queue of completion events. Compare this to Linux's F_SETSIG and queued signals feature.
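For comparison, here's a rough sketch of the completion-port pattern in win32 C (watch_handle() and worker() are made-up names, and the overlapped ReadFile/WSARecv calls that actually feed the port are omitted):

    #include <windows.h>

    /* Create an empty completion port. */
    static HANDLE make_port(void)
    {
        return CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    }

    /* Associate a socket or file handle with the port; 'key' is an
       application-chosen per-connection tag. */
    static void watch_handle(HANDLE port, HANDLE h, ULONG_PTR key)
    {
        CreateIoCompletionPort(h, port, key, 0);
    }

    /* A small pool of threads blocks here; each completed overlapped
       read or write pops out of the queue as one entry. */
    static DWORD WINAPI worker(LPVOID arg)
    {
        HANDLE port = (HANDLE) arg;
        DWORD nbytes;
        ULONG_PTR key;
        OVERLAPPED *ov;

        for (;;) {
            if (GetQueuedCompletionStatus(port, &nbytes, &key, &ov, INFINITE)) {
                /* handle the completed I/O for connection 'key' here */
            }
        }
    }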
There was an interesting discussion on linux-kernel in September 1999 titled "> 15,000 Simultaneous Connections". (second week, third week) Highlights:
Interesting reading!
On FreeBSD, add

    set kern.maxfiles=XXXX

to /boot/loader.conf, where XXXX is the desired system limit on file descriptors, and reboot. Thanks to an anonymous reader, who wrote in to say he'd achieved far more than 10000 connections on FreeBSD 4.3, and says
"FWIW: You can't actually tune the maximum number of connections in FreeBSD trivially, via sysctl.... You have to do it in the /boot/loader.conf file.
The reason for this is that the zalloci() calls for initializing the sockets and tcpcb structures zones occurs very early in system startup, in order that the zone be both type stable and that it be swappable.
You will also need to set the number of mbufs much higher, since you will (on an unmodified kernel) chew up one mbuf per connection for tcptempl structures, which are used to implement keepalive."
On Linux,

    echo 32768 > /proc/sys/fs/file-max
    echo 65536 > /proc/sys/fs/inode-max

increases the system limits on open files and inodes, and

    ulimit -n 32768

increases the current process' limit. I verified that a process on Red Hat 6.0 (2.2.5 or so plus patches) can open at least 31000 file descriptors this way. Another fellow has verified that a process on 2.2.12 can open at least 90000 file descriptors this way (with appropriate limits). The upper bound seems to be available memory.
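A server can also raise its own descriptor limit at startup (up to whatever hard limit the steps above allow) rather than relying on the shell; a minimal sketch:

    #include <sys/resource.h>

    /* Raise this process's file descriptor limit to the hard limit. */
    static int raise_fd_limit(void)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
            return -1;
        rl.rlim_cur = rl.rlim_max;
        return setrlimit(RLIMIT_NOFILE, &rl);
    }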
On any architecture, you may need to reduce the amount of stack space allocated for each thread to avoid running out of virtual memory. If you're using pthreads, you can set this at runtime with pthread_attr_init() and pthread_attr_setstacksize().
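A minimal sketch; the 64KB figure is just an example, so pick whatever your per-client code actually needs:

    #include <pthread.h>

    extern void *client_main(void *arg);        /* per-client thread body */

    /* Spawn a client-handling thread with a small, explicit stack. */
    static int spawn_client(pthread_t *tid, void *arg)
    {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 64 * 1024);
        return pthread_create(tid, &attr, client_main, arg);
    }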
See also Volano's detailed instructions for raising file, thread, and FD_SET limits in the 2.2 kernel. Wow. This stunning little document steps you through a lot of stuff that would be hard to figure out yourself. This is a must-read, even though some of its advice is already out of date.
Up through JDK 1.3, Java's standard networking libraries mostly offered the one-thread-per-client model. There was a way to do nonblocking reads, but no way to do nonblocking writes.
In May 2001, the JDK 1.4 beta introduced the package java.nio to provide full support for nonblocking I/O (and some other goodies). See the release notes for some caveats. Try it out and give Sun feedback!
HP's java also includes a Thread Polling API.
In 2000, Matt Welsh implemented nonblocking sockets for Java; his performance benchmarks show that they have advantages over blocking sockets in servers handling many (up to 10000) connections. His class library is called java-nbio; it's part of the Sandstorm project. Benchmarks showing performance with 10000 connections are available.
See also Dean Gaudet's essay on the subject of Java, network I/O, and threads, and the paper by Matt Welsh on events vs. worker threads.
There are several proposals for improving Java's networking APIs.
A zero-copy implementation of sendfile() is on its way for the 2.4 kernel. See LWN Jan 25 2001.
One developer using sendfile() with FreeBSD reports that using POLLWRBAND instead of POLLOUT makes a big difference.
See LWN Jan 25 2001 for a summary of some very interesting discussions on linux-kernel about TCP_CORK and a possible alternative MSG_MORE.
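Here's a rough sketch of how TCP_CORK and sendfile() fit together on Linux when serving a static file; send_response() is a made-up helper, and a real server on nonblocking sockets would also have to cope with partial writes and EAGAIN.

    #include <netinet/in.h>
    #include <netinet/tcp.h>      /* TCP_CORK (Linux) */
    #include <sys/sendfile.h>     /* sendfile (Linux) */
    #include <sys/socket.h>
    #include <unistd.h>

    /* Send an HTTP header plus a file body in as few packets as possible:
       cork the socket, write the header, sendfile() the body, uncork. */
    static int send_response(int sock, const char *hdr, size_t hdrlen,
                             int filefd, off_t filesize)
    {
        int one = 1, zero = 0;
        off_t off = 0;

        setsockopt(sock, IPPROTO_TCP, TCP_CORK, &one, sizeof one);
        write(sock, hdr, hdrlen);
        while (off < filesize)
            if (sendfile(sock, filefd, &off, filesize - off) <= 0)
                break;
        setsockopt(sock, IPPROTO_TCP, TCP_CORK, &zero, sizeof zero);
        return off == filesize ? 0 : -1;
    }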
"I've compared the raw performance of a select-based server with a multiple-process server on both FreeBSD and Solaris/x86. On microbenchmarks, there's only a marginal difference in performance stemming from the software architecture. The big performance win for select-based servers stems from doing application-level caching. While multiple-process servers can do it at a higher cost, it's harder to get the same benefits on real workloads (vs microbenchmarks). I'll be presenting those measurements as part of a paper that'll appear at the next Usenix conference. If you've got postscript, the paper is available at http://www.cs.rice.edu/~vivek/flash99/"
For Linux, it looks like kernel bottlenecks are being fixed constantly. See my Mindcraft Redux page, Kernelnotes.org, Kernel Traffic, and the Linux-Kernel mailing list (example interesting posts: one from a user asking how to tune, and one from Dean Gaudet).
In March 1999, Microsoft sponsored a benchmark comparing NT to Linux at serving large numbers of http and smb clients, in which they failed to see good results from Linux. See also my article on Mindcraft's April 1999 Benchmarks for more info.
See also The Linux Scalability Project. They're doing interesting work, including Niels Provos' hinting poll patch, and some work on the thundering herd problem.
See also Mike Jagdis' work on improving select() and poll(); here's Mike's post about it.
Two tests in particular are simple, interesting, and hard:
Jef Poskanzer has published benchmarks comparing many web servers. See http://www.acme.com/software/thttpd/benchmarks.html for his results.
I also have a few old notes about comparing thttpd to Apache that may be of interest to beginners.
Chuck Lever keeps reminding us about Banga and Druschel's paper on web server benchmarking. It's worth a read.
IBM has an excellent paper titled Java server benchmarks [Baylor et al, 2000]. It's worth a read.
Copyright 1999-2001 Dan Kegel
dank@alumni.caltech.edu
Last updated: 11 July 2001