The C10K problem

It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web is a big place now.

And computers are big, too. You can buy a 500MHz machine with 1 gigabyte of RAM and six 100Mbit/sec Ethernet cards for $1500 or so. Let's see - at 10000 clients, that's 50KHz, 100Kbytes, and 60Kbits/sec per client. It shouldn't take any more horsepower than that to take four kilobytes from the disk and send them to the network once a second for each of ten thousand clients. (That works out to $0.15 per client, by the way. Those $100/client licensing fees some operating systems charge are starting to look a little heavy!) So hardware is no longer the bottleneck.

One of the busiest ftp sites, cdrom.com, can actually handle 10000 clients simultaneously through a Gigabit Ethernet pipe to its ISP. (Here's a somewhat dated page about their configuration.) Pipes this fast aren't common yet, but technology is improving rapidly.

There's interest in benchmarking this kind of configuration; see the discussion on June 28th 1999 on linux-kernel, in which people propose setting up a testbed that can be used for ongoing benchmarks.

There appears to be interest in this direction from the NSF, too; see the Web100 project.

With that in mind, here are a few notes on how to configure operating systems and write code to support thousands of clients. The discussion centers around Unix-like operating systems, for obvious reasons.


Book to Read First

If you haven't read it already, go out and get a copy of Unix Network Programming: Networking APIs: Sockets and XTI (Volume 1) by the late W. Richard Stevens. It describes many of the I/O strategies and pitfalls related to writing high-performance servers. It even talks about the 'thundering herd' problem.

I/O Strategies

There seem to be four ways of writing a fast web server to handle many clients:

  1. Serve many clients with each server thread, and use nonblocking I/O
  2. Serve one client with each server thread
  3. Serve many clients with each server thread, and use asynchronous I/O
  4. Build the server code into the kernel
plus there exist I/O frameworks that provide a higher level of abstraction; potentially these can pick the appropriate I/O method for whatever platform you're compiling for, providing easy porting and high performance.

I/O frameworks

1. Serve many clients with each server thread, and use nonblocking I/O

... set nonblocking mode on all network handles, and use select() or poll() to tell which network handle has data waiting. This is the traditional favorite. It's still being improved; see e.g. Niels Provos' benchmarks with hinting poll and thttpd.
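To make the idea concrete, here is a minimal sketch of such a loop in C. It is illustrative only: the function and variable names are mine, most error handling is omitted, and a real server would also track write readiness and partial writes.

    /* Minimal sketch of a select()-based server loop; illustrative only.
       Note that select() itself tops out at FD_SETSIZE descriptors
       (often 1024), one reason poll() and friends matter at this scale. */
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <fcntl.h>
    #include <unistd.h>

    static void set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

    void event_loop(int listen_fd)
    {
        int client[FD_SETSIZE];
        int nclients = 0, maxfd = listen_fd;

        set_nonblocking(listen_fd);
        for (;;) {
            fd_set readable;
            FD_ZERO(&readable);
            FD_SET(listen_fd, &readable);
            for (int i = 0; i < nclients; i++)
                FD_SET(client[i], &readable);

            /* Sleep until a connection arrives or some client has data. */
            if (select(maxfd + 1, &readable, NULL, NULL, NULL) <= 0)
                continue;

            if (FD_ISSET(listen_fd, &readable)) {
                int fd = accept(listen_fd, NULL, NULL);
                if (fd >= 0 && nclients < FD_SETSIZE) {
                    set_nonblocking(fd);
                    client[nclients++] = fd;
                    if (fd > maxfd)
                        maxfd = fd;
                } else if (fd >= 0) {
                    close(fd);
                }
            }

            for (int i = 0; i < nclients; i++) {
                if (!FD_ISSET(client[i], &readable))
                    continue;
                char buf[4096];
                ssize_t n = read(client[i], buf, sizeof buf);
                if (n <= 0) {                      /* EOF or error */
                    close(client[i]);
                    client[i] = client[--nclients];
                    i--;                           /* re-check the moved slot */
                } else {
                    /* hand buf to the protocol code; this sketch just echoes */
                    write(client[i], buf, n);
                }
            }
        }
    }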
An important bottleneck in this method is that read() or sendfile() from disk blocks if the page is not in core at the moment; setting nonblocking mode on a disk file handle has no effect. Same thing goes for memory-mapped disk files. The first time a server needs disk I/O, its process blocks, all clients must wait, and that raw nonthreaded performance goes to waste.
Worker threads or processes that do the disk I/O can get around this bottleneck. One approach is to use memory-mapped files, and if mincore() indicates I/O is needed, ask a worker to do the I/O, and continue handling network traffic. Jef Poskanzer mentions that Pai, Druschel, and Zwaenepoel's Flash web server uses this trick; they gave a talk at Usenix '99 on it. It looks like mincore() is available in BSD-derived Unixes like FreeBSD and Solaris, but is not part of the Single Unix Specification. It's available as part of Linux as of kernel 2.3.51, thanks to Chuck Lever.
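A rough sketch of the mincore() test at the heart of that trick is below. It assumes the Linux-style prototype (an unsigned char status vector); what to do when the pages aren't resident, e.g. queueing the request to a worker, is left as a comment, since that part is server-specific, and the function name is mine.

    /* Sketch: is every page of this mmap()ed region already in core?
       addr must be page-aligned (e.g. the start of the mapping). */
    #include <sys/mman.h>
    #include <unistd.h>

    int region_in_core(void *addr, size_t len)
    {
        size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
        size_t npages = (len + pagesize - 1) / pagesize;
        unsigned char vec[npages];          /* one status byte per page */

        if (mincore(addr, len, vec) != 0)
            return 0;                       /* be pessimistic on error */
        for (size_t i = 0; i < npages; i++)
            if (!(vec[i] & 1))              /* low bit clear: not resident */
                return 0;
        return 1;
    }

    /* In the main loop:
       if (!region_in_core(p, len)) { hand the request to a worker thread }
       else                         { serve the data right away }          */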

There are several ways for a single thread to tell which of a set of nonblocking sockets are ready for I/O:

2. Serve one client with each server thread

... and let read() and write() block. Has the disadvantage of using a whole stack for each client, which costs memory. Many OSes also have trouble handling more than a few hundred threads.
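For comparison with option 1, here is a minimal sketch of this model in C (the names are mine and error handling is minimal); each accepted connection gets a thread that simply blocks in read() and write().

    /* Sketch of one-thread-per-client: the per-thread stack is where the
       memory cost comes from, and thread scheduling is where OS limits bite. */
    #include <pthread.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void *serve_client(void *arg)
    {
        int fd = (int)(long)arg;
        char buf[4096];
        ssize_t n;

        /* read() and write() simply block; no readiness notification needed. */
        while ((n = read(fd, buf, sizeof buf)) > 0)
            write(fd, buf, n);              /* this sketch just echoes */
        close(fd);
        return NULL;
    }

    void accept_loop(int listen_fd)
    {
        for (;;) {
            int fd = accept(listen_fd, NULL, NULL);
            if (fd < 0)
                continue;
            pthread_t tid;
            if (pthread_create(&tid, NULL, serve_client, (void *)(long)fd) != 0)
                close(fd);                  /* couldn't spawn another thread */
            else
                pthread_detach(tid);
        }
    }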

3. Serve many clients with each server thread, and use asynchronous I/O

This has not yet become popular, probably because few operating systems support asynchronous I/O, and possibly also because it's hard to use. Under Unix, this is done with the aio_ interface (scroll down from that link to "Asynchronous input and output"), which associates a signal and value with each I/O operation. Signals and their values are queued and delivered efficiently to the user process. This is from the POSIX 1003.1b realtime extensions, and is also in the Single Unix Specification, version 2, and in glibc 2.1. The generic glibc 2.1 implementation may have been written for standards compliance rather than performance.
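As a rough illustration of that interface, the sketch below issues an aio_read() whose completion is reported via a queued realtime signal carrying a pointer back to the request. The struct and names are mine; a real server would also install a SA_SIGINFO handler (or sit in sigwaitinfo()) to collect completions.

    /* Sketch of POSIX aio_read() with signal-based completion notification. */
    #include <sys/types.h>
    #include <aio.h>
    #include <signal.h>
    #include <string.h>

    #define IO_DONE_SIG (SIGRTMIN + 1)      /* any free realtime signal */

    struct request {
        struct aiocb cb;
        char buf[4096];
    };

    int start_read(struct request *req, int fd, off_t offset)
    {
        memset(&req->cb, 0, sizeof req->cb);
        req->cb.aio_fildes = fd;
        req->cb.aio_buf    = req->buf;
        req->cb.aio_nbytes = sizeof req->buf;
        req->cb.aio_offset = offset;

        /* Associate a signal and a value (here, the request pointer) with the I/O. */
        req->cb.aio_sigevent.sigev_notify          = SIGEV_SIGNAL;
        req->cb.aio_sigevent.sigev_signo           = IO_DONE_SIG;
        req->cb.aio_sigevent.sigev_value.sival_ptr = req;

        return aio_read(&req->cb);   /* returns at once; I/O proceeds behind the scenes */
    }

    /* A SA_SIGINFO handler (or a sigwaitinfo() loop) then recovers the request
       from si_value.sival_ptr and calls aio_error()/aio_return() for the result. */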

SGI has implemented high-speed AIO with kernel support. As of version 1.1, it's said to work well with both disk I/O and sockets. It seems to use kernel threads.

Various people appear to be working on a different implementation that does not use kernel threads, and should scale better. It probably won't be available until kernel 2.5, though.

The O'Reilly book POSIX.4: Programming for the Real World is said to include a good introduction to aio.

4. Build the server code into the kernel

Novell and Microsoft are both said to have done this at various times, at least one NFS implementation does this, khttpd does this for Linux and static web pages, and "TUX" (Threaded linUX webserver) is a blindingly fast and flexible kernel-space HTTP server by Ingo Molnar for Linux. Ingo's September 1, 2000 announcement says an alpha version of TUX can be downloaded from ftp://ftp.redhat.com/pub/redhat/tux, and explains how to join a mailing list for more info.
The linux-kernel list has been discussing the pros and cons of this approach, and the consensus seems to be that instead of moving web servers into the kernel, the kernel should have the smallest possible hooks added to improve web server performance. That way, other kinds of servers can benefit. See e.g. Zach Brown's remarks about userland vs. kernel http servers.

Comments

Richard Gooch has written a paper discussing I/O options.

The Apache mailing lists have some interesting posts about why they prefer not to use select() (basically, they think that makes plugins harder). Still, they're planning to use select()/poll()/sendfile() for static requests in Apache 2.0.

Mark Russinovich wrote an editorial and an article discussing I/O strategy issues in the 2.2 Linux kernel. Worth reading, even though he seems misinformed on some points. In particular, he seems to think that Linux 2.2's asynchronous I/O (see F_SETSIG above) doesn't notify the user process when data is ready, only when new connections arrive. This seems like a bizarre misunderstanding. See also comments on an earlier draft, a rebuttal from Mingo, Russinovich's comments of 2 May 1999, a rebuttal from Alan Cox, and various posts to linux-kernel. I suspect he was trying to say that Linux doesn't support asynchronous disk I/O, which used to be true, but now that SGI has implemented KAIO, it's not so true anymore.

See these pages at sysinternals.com and MSDN for information on "completion ports", which he said were unique to NT; in a nutshell, win32's "overlapped I/O" turned out to be too low level to be convenient, and a "completion port" is a wrapper that provides a queue of completion events. Compare this to Linux's F_SETSIG and queued signals feature.
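For readers who haven't seen F_SETSIG, here is a minimal sketch of arming a socket so that readiness is reported via a queued realtime signal. This uses Linux-specific fcntl() extensions; the function name and signal choice are mine.

    /* Sketch: ask Linux to queue a realtime signal whenever fd becomes ready.
       F_SETSIG, si_fd and si_band are Linux extensions (needs _GNU_SOURCE). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <signal.h>
    #include <unistd.h>

    #define READY_SIG (SIGRTMIN + 2)

    int arm_fd(int fd)
    {
        if (fcntl(fd, F_SETOWN, getpid()) < 0)   /* deliver the signals to this process */
            return -1;
        if (fcntl(fd, F_SETSIG, READY_SIG) < 0)  /* use a queued RT signal, not SIGIO */
            return -1;
        int flags = fcntl(fd, F_GETFL, 0);
        return fcntl(fd, F_SETFL, flags | O_ASYNC | O_NONBLOCK);
    }

    /* A SA_SIGINFO handler (or sigwaitinfo() loop) then reads si_fd and si_band
       to learn which descriptor is ready and whether it's readable or writable. */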

There was an interesting discussion on linux-kernel in September 1999 titled "> 15,000 Simultaneous Connections". (second week, third week) Highlights:

Interesting reading!

Limits on open filehandles

Limits on threads

On any architecture, you may need to reduce the amount of stack space allocated for each thread to avoid running out of virtual memory. You can set this at runtime with pthread_attr_init() and pthread_attr_setstacksize() if you're using pthreads.
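A minimal sketch of what that looks like follows; the helper name and the 64KB figure are just for illustration, so pick a size comfortably larger than your deepest call chain.

    /* Sketch: create threads with a small, explicit stack so thousands of them
       fit in virtual memory. */
    #include <pthread.h>
    #include <limits.h>

    int spawn_small_stack_thread(void *(*fn)(void *), void *arg)
    {
        pthread_attr_t attr;
        pthread_t tid;
        size_t stack_size = 64 * 1024;      /* illustrative; tune for your code */

        if (stack_size < PTHREAD_STACK_MIN)
            stack_size = PTHREAD_STACK_MIN;

        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, stack_size);
        int rc = pthread_create(&tid, &attr, fn, arg);
        pthread_attr_destroy(&attr);
        return rc;
    }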

Java issues

Up through JDK 1.3, Java's standard networking libraries mostly offered the one-thread-per-client model. There was a way to do nonblocking reads, but no way to do nonblocking writes.

In May 2001, JDK 1.4 introduced the package java.nio to provide full support for nonblocking I/O (and some other goodies). See the release notes for some caveats. Try it out and give Sun feedback!

HP's java also includes a Thread Polling API.

In 2000, Matt Welsh implemented nonblocking sockets for Java; his performance benchmarks show that they have advantages over blocking sockets in servers handling many (up to 10000) connections. His class library is called java-nbio; it's part of the Sandstorm project. Benchmarks showing performance with 10000 connections are available.

See also Dean Gaudet's essay on the subject of Java, network I/O, and threads, and the paper by Matt Welsh on events vs. worker threads.

There are several proposals for improving Java's networking APIs:

Other tips

Other limits

Kernel Issues

For Linux, it looks like kernel bottlenecks are being fixed constantly. See my Mindcraft Redux page, Kernelnotes.org, Kernel Traffic, and the Linux-Kernel mailing list (example interesting posts: one by a user asking how to tune, and one by Dean Gaudet).

In March 1999, Microsoft sponsored a benchmark comparing NT to Linux at serving large numbers of http and smb clients, in which they failed to see good results from Linux. See also my article on Mindcraft's April 1999 Benchmarks for more info.

See also The Linux Scalability Project. They're doing interesting work, including Niels Provos' hinting poll patch, and some work on the thundering herd problem.

See also Mike Jagdis' work on improving select() and poll(); here's Mike's post about it.

Mohit Aron (aron@cs.rice.edu) writes that rate-based clocking in TCP can improve HTTP response time over 'slow' connections by 80%.

Measuring Server Performance

Two tests in particular are simple, interesting, and hard:

  1. raw connections per second (how many 512 byte files per second can you serve?)
  2. total transfer rate on large files with many slow clients (how many 28.8k modem clients can simultaneously download from your server before performance goes to pot?)

Jef Poskanzer has published benchmarks comparing many web servers. See http://www.acme.com/software/thttpd/benchmarks.html for his results.

I also have a few old notes about comparing thttpd to Apache that may be of interest to beginners.

Chuck Lever keeps reminding us about Banga and Druschel's paper on web server benchmarking. It's worth a read.

IBM has an excellent paper titled Java server benchmarks [Baylor et al, 2000]. It's worth a read.

Examples

Interesting select()-based servers

Interesting /dev/poll-based servers

Interesting kqueue()-based servers

Interesting realtime signal-based servers

Interesting thread-based servers

Interesting in-kernel servers

Other interesting links


Copyright 1999-2001 Dan Kegel
dank@alumni.caltech.edu
Last updated: 11 July 2001