It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web is a big place now.
And computers are big, too. You can buy a 500MHz machine with 1 gigabyte of RAM and six 100Mbit/sec Ethernet cards for $1500 or so. Let's see - at 10000 clients, that's 50KHz, 100Kbytes, and 60Kbits/sec per client. It shouldn't take any more horsepower than that to take four kilobytes from the disk and send them to the network once a second for each of ten thousand clients. (That works out to $0.15 per client, by the way. Those $100/client licensing fees some operating systems charge are starting to look a little heavy!) So hardware is no longer the bottleneck.
One of the busiest ftp sites, cdrom.com, actually can handle 10000 clients simultaneously through a Gigabit Ethernet pipe to its ISP. (Here's a somewhat dated page about their configuration.) Pipes this fast aren't common yet, but technology is improving rapidly.
There's interest in benchmarking this kind of configuration; see the discussion on June 28th 1999 on linux-kernel, in which people propose setting up a testbed that can be used for ongoing benchmarks.
There appears to be interest in this direction from the NSF, too; see the Web100 project.
With that in mind, here are a few notes on how to configure operating systems and write code to support thousands of clients. The discussion centers around Unix-like operating systems, for obvious reasons.
If you haven't read it already, go out and get a copy of Unix Network Programming: Networking APIs: Sockets and XTI (Volume 1) by the late W. Richard Stevens. It describes many of the I/O strategies and pitfalls related to writing high-performance servers. It even talks about the 'thundering herd' problem.
There seem to be four ways of writing a fast web server to handle many clients:
1. Serve many clients with each thread: set nonblocking mode on all network handles, and use select() or poll() to tell which network handle has data waiting. This is the traditional favorite. It's still being improved; see e.g. Niels Provos' benchmarks with hinting poll and thttpd.
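For concreteness, here's a minimal sketch of that kind of single-threaded poll() loop in C. Treat it as an outline only: error handling, removal of closed connections, and the actual request handling are omitted, and the helper names are made up for the example.

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <poll.h>
    #include <fcntl.h>
    #include <unistd.h>

    static void set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

    /* One thread, many clients: poll() says which handles have data waiting. */
    static void event_loop(int listen_fd, struct pollfd *fds, int maxfds)
    {
        int nfds = 1;
        fds[0].fd = listen_fd;
        fds[0].events = POLLIN;

        for (;;) {
            int i;
            poll(fds, nfds, -1);               /* block until something is ready */

            if ((fds[0].revents & POLLIN) && nfds < maxfds) {
                int client = accept(listen_fd, NULL, NULL);   /* new connection */
                if (client >= 0) {
                    set_nonblocking(client);
                    fds[nfds].fd = client;
                    fds[nfds].events = POLLIN;
                    nfds++;
                }
            }
            for (i = 1; i < nfds; i++) {
                if (fds[i].revents & POLLIN) {
                    char buf[4096];
                    ssize_t n = read(fds[i].fd, buf, sizeof buf);
                    /* n > 0: parse and serve the request; n == 0: client closed;
                       n < 0 with errno == EAGAIN: nothing there after all. */
                    (void) n;
                }
            }
        }
    }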
An important bottleneck in this method is that read() or sendfile()
from disk blocks if the page is not in core at the moment;
setting nonblocking mode on a disk file handle has no effect.
Same thing goes for memory-mapped disk files.
The first time a server needs disk I/O, its process blocks,
all clients must wait, and that raw nonthreaded performance goes to waste.
Worker threads or processes that do the disk I/O can get around this
bottleneck. One approach is to use memory-mapped files,
and if mincore() indicates I/O is needed, ask a worker to do the I/O,
and continue handling network traffic. Jef Poskanzer mentions that
Pai, Druschel, and Zwaenepoel's Flash web server uses this trick; they gave a talk at
Usenix '99 on it.
It looks like mincore() is available in BSD-derived Unixes
like FreeBSD
and Solaris, but is not part
of the Single Unix Specification.
It's available as part of Linux as of kernel 2.3.51,
thanks to Chuck Lever.
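Here's a rough sketch of the mincore() check behind that trick. It assumes the file has already been mmap()ed, page_is_resident() is a made-up helper name, and the exact prototype of mincore() varies between systems (the vector is unsigned char * on Linux, char * on the BSDs), hence the cast.

    #include <sys/mman.h>
    #include <unistd.h>

    /* Ask the VM system whether the first page of a mapped file region is
       resident.  If it isn't, touching it would block on disk I/O, so a
       nonblocking server would hand the request to a worker instead. */
    static int page_is_resident(void *page_aligned_addr)
    {
        unsigned char vec[1];
        size_t pagesize = (size_t) sysconf(_SC_PAGESIZE);
        if (mincore(page_aligned_addr, pagesize, (void *) vec) < 0)
            return 0;              /* be pessimistic on error */
        return vec[0] & 1;         /* low bit set => page is in core */
    }

If it returns 0, the main loop queues the request for a worker thread or process and goes right back to servicing the network.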
There are several ways for a single thread to tell which of a set of nonblocking sockets are ready for I/O:
The idea behind /dev/poll is to take advantage of the fact that often poll() is called many times with the same arguments. With /dev/poll, you get an open handle to /dev/poll, and tell the OS just once what files you're interested in by writing to that handle; from then on, you just read the set of currently ready file descriptors from that handle. It looks like this is lower overhead than even the realtime signal approach described below.
It appeared quietly in Solaris 7 (see patchid 106541) but its first public appearance was in Solaris 8; according to Sun, at 750 clients, this has 10% of the overhead of poll().
Linux is also starting to support /dev/poll, although it's not in the main kernel as of 2.4.4. See N. Provos, C. Lever, "Scalable Network I/O in Linux," May, 2000. [FREENIX track, Proc. USENIX 2000, San Diego, California (June, 2000).] A patch implementing /dev/poll for the 2.2.14 Linux kernel is available from www.citi.umich.edu. Note: Niels reports that under heavy load it can drop some readiness events. Kevin Clark updated it for the 2.4.3 kernel (see www.cs.unh.edu/~kdc/dev-poll-patch-linux-2.4.3), and is looking at the load problem (as of 16 May 2001)...
On 11 July 2001, Davide Libenzi posted a reimplementation of /dev/poll for Linux; see www.xmailserver.org/linux-patches/nio-improve.html for his results. Unlike the previous implementation of /dev/poll for Linux, it is said to be completely free of linear scans, and scales better to thousands of connections.
Solaris's /dev/poll support may be the first practical example of the kind of API proposed in [Banga, Mogul, Druschel '99], although Winsock's WSAAsyncSelect might qualify, too.
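Here's what using /dev/poll looks like in practice, sketched from the Solaris 8 interface; check <sys/devpoll.h> and the man page on your system for the exact structure and ioctl names, and note that handle_io() is a stand-in for the application's per-connection code.

    #include <sys/devpoll.h>     /* struct dvpoll, DP_POLL (Solaris) */
    #include <sys/ioctl.h>
    #include <poll.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define MAXEVENTS 10000
    extern void handle_io(int fd, short revents);   /* application-supplied */

    /* Open the polling handle once. */
    static int devpoll_open(void)
    {
        return open("/dev/poll", O_RDWR);
    }

    /* Tell the kernel once, by writing a pollfd, which fd we care about. */
    static void devpoll_watch(int dpfd, int fd)
    {
        struct pollfd pfd;
        pfd.fd = fd;
        pfd.events = POLLIN;
        pfd.revents = 0;
        write(dpfd, &pfd, sizeof pfd);
    }

    /* Each time around the loop, just ask which descriptors are ready. */
    static void devpoll_dispatch(int dpfd)
    {
        struct pollfd ready[MAXEVENTS];
        struct dvpoll dvp;
        int i, n;

        dvp.dp_fds = ready;
        dvp.dp_nfds = MAXEVENTS;
        dvp.dp_timeout = -1;        /* block until something happens */
        n = ioctl(dpfd, DP_POLL, &dvp);
        for (i = 0; i < n; i++)
            handle_io(ready[i].fd, ready[i].revents);
    }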
FreeBSD 5.0 (and possibly 4.1) supports a generalized alternative to poll() called kqueue()/kevent(). (See also Jonathan Lemon's page.) Like /dev/poll, you allocate a listening object, but rather than opening the file /dev/poll, you call kqueue() to allocate one. To change the events you are listening for, or to get the list of current events, you call kevent() on the descriptor returned by kqueue(). It can listen not just for socket readiness, but also for plain file readiness, signals, and even for I/O completion.
Note: as of October 2000, the threading library on FreeBSD does not interact well with kqueue(); evidently, when kqueue() blocks, the entire process blocks, not just the calling thread.
Several examples and libraries using kqueue() are available.
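Here's a minimal sketch of the kqueue()/kevent() pattern (handle_read() is a stand-in for the application's per-connection code):

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>

    extern void handle_read(int fd);            /* application-supplied */

    /* kqueue() allocates the listening object (instead of opening /dev/poll). */
    static int kq_setup(void)
    {
        return kqueue();
    }

    /* kevent() with a changelist registers interest in readability of fd. */
    static void kq_watch(int kq, int fd)
    {
        struct kevent change;
        EV_SET(&change, fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
        kevent(kq, &change, 1, NULL, 0, NULL);
    }

    /* kevent() with an eventlist fetches whatever became ready. */
    static void kq_dispatch(int kq)
    {
        struct kevent events[64];
        int i;
        int n = kevent(kq, NULL, 0, events, 64, NULL);   /* NULL timeout = block */
        for (i = 0; i < n; i++)
            handle_read((int) events[i].ident);
    }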
The 2.3.21 Linux kernel introduced a way of using realtime signals to replace poll().
fcntl(fd, F_SETSIG, signum) associates a signal with each file descriptor, and raises that signal
when I/O becomes possible on it, i.e. when a normal read() or write() would succeed.
To use this, you choose a realtime signal number, mask that signal and SIGIO with sigprocmask(),
use fcntl(fd, F_SETSIG, signum) on each fd, write a normal poll() outer loop, and
inside it, after you've handled all the fd's noticed by poll(), you
loop calling
sigwaitinfo().
If sigwaitinfo or sigtimedwait returns your realtime signal, siginfo.si_fd and siginfo.si_band give the
same information as pollfd.fd and pollfd.revents would after a call to poll(),
so you handle the I/O, and continue calling sigwaitinfo().
If sigwaitinfo returns a traditional SIGIO, the signal queue overflowed,
so you flush the signal queue by temporarily changing the signal handler to SIG_DFL,
and break back to the outer poll() loop.
You can support older kernels by surrounding
the sigwaitinfo code with #if defined(LINUX_VERSION_CODE) && (LINUX_VERSION_CODE >= KERNEL_VERSION(2,3,21)).
See Zach Brown's phhttpd for example code that uses this feature, and deals with some of the gotchas,
e.g. events queued before an fd is dealt with or closed, but arriving afterwards.
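Here's a rough sketch of that recipe. It follows the description above rather than any particular server, so treat it as an outline: the choice of realtime signal, the handle_io() callback, and the overflow recovery are placeholders, and F_SETSIG and siginfo.si_fd are Linux-specific (they need _GNU_SOURCE).

    #define _GNU_SOURCE            /* F_SETSIG, O_ASYNC, siginfo_t.si_fd */
    #include <fcntl.h>
    #include <signal.h>
    #include <poll.h>
    #include <unistd.h>

    extern void handle_io(int fd, long band);   /* application-supplied */

    /* Per-descriptor setup: nonblocking, owned by us, signal-driven. */
    static void setup_fd(int fd, int rtsig)
    {
        fcntl(fd, F_SETOWN, getpid());
        fcntl(fd, F_SETSIG, rtsig);
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK | O_ASYNC);
    }

    static void event_loop(struct pollfd *fds, int nfds, int rtsig)
    {
        sigset_t sigs;
        siginfo_t info;

        sigemptyset(&sigs);
        sigaddset(&sigs, rtsig);                /* e.g. SIGRTMIN + 1 */
        sigaddset(&sigs, SIGIO);
        sigprocmask(SIG_BLOCK, &sigs, NULL);

        for (;;) {
            /* Outer loop: an ordinary poll() picks up anything already pending. */
            poll(fds, nfds, 0);
            /* ... handle every fd whose revents field is set ... */

            /* Inner loop: pull readiness events off the realtime signal queue. */
            for (;;) {
                if (sigwaitinfo(&sigs, &info) < 0)
                    continue;
                if (info.si_signo == SIGIO)
                    /* The realtime signal queue overflowed; flush it as
                       described above, then fall back to the outer poll(). */
                    break;
                handle_io(info.si_fd, info.si_band);
            }
        }
    }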
[Provos, Lever, and Tweedie 2000] describes a recent benchmark of phhttpd using a variant of sigtimedwait(), sigtimedwait4(), that lets you retrieve multiple signals with one call. Interestingly, the chief benefit of sigtimedwait4() for them seemed to be it allowed the app to gauge system overload (so it could behave appropriately). (Note that poll() provides the same measure of system overload.)
Signal-per-fd: a report by Chandra and Mosberger (www.hpl.hp.com/techreports/2000/HPL-2000-174.html) proposes a modification called "signal-per-fd" which gets rid of the possibility of realtime signal queue overflow, and compares the performance of this scheme with select() and /dev/poll. Read it!
A patch implementing the scheme proposed there was
posted to linux-kernel on 18 May 2001 by Vitaly Luban,
and lives at www.luban.org/GPL/gpl.html.
2. Serve one client with each server thread, and let read() and write() block. This has the disadvantage of using a whole stack frame for each client, which costs memory. Many OSes also have trouble handling more than a few hundred threads.
3. Use asynchronous I/O. This has not yet become popular, probably because few operating systems support asynchronous I/O, and possibly because it's hard to use. Under Unix, asynchronous I/O is done with the aio_ interface (scroll down from that link to "Asynchronous input and output"), which associates a signal and value with each I/O operation. Signals and their values are queued and delivered efficiently to the user process. This is from the POSIX 1003.1b realtime extensions, and is also in the Single Unix Specification, version 2, and in glibc 2.1. The generic glibc 2.1 implementation may have been written for standards compliance rather than performance.
SGI has implemented high-speed AIO with kernel support. As of version 1.1, it's said to work well with both disk I/O and sockets. It seems to use kernel threads.
Various people appear to be working on a different implementation that does not use kernel threads and should scale better; it probably won't be available until kernel 2.5, though.
The O'Reilly book POSIX.4: Programming for the Real World is said to include a good introduction to aio.
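For flavor, here's a rough sketch of the aio_ interface with signal notification; start_read() and finish_read() are made-up helper names, and on Linux/glibc you link with -lrt.

    #include <aio.h>
    #include <signal.h>
    #include <string.h>
    #include <sys/types.h>

    /* Start an asynchronous read; completion is reported by queueing the
       given realtime signal with the aiocb pointer as the signal's value. */
    static int start_read(struct aiocb *cb, int fd, void *buf, size_t len,
                          off_t off, int signo)
    {
        memset(cb, 0, sizeof *cb);
        cb->aio_fildes = fd;
        cb->aio_buf = buf;
        cb->aio_nbytes = len;
        cb->aio_offset = off;
        cb->aio_sigevent.sigev_notify = SIGEV_SIGNAL;
        cb->aio_sigevent.sigev_signo = signo;
        cb->aio_sigevent.sigev_value.sival_ptr = cb;
        return aio_read(cb);
    }

    /* Later, when that signal is pulled off the queue (e.g. by sigwaitinfo()): */
    static ssize_t finish_read(siginfo_t *info)
    {
        struct aiocb *cb = info->si_value.sival_ptr;
        if (aio_error(cb) != 0)
            return -1;             /* still in progress, or failed */
        return aio_return(cb);     /* byte count; call exactly once per request */
    }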
4. Build the server code into the kernel. Novell and Microsoft are both said to have done this at various times,
at least one NFS implementation does this,
khttpd does this for Linux
and static web pages, and
"TUX" (Threaded linUX webserver)
is a blindingly fast and flexible kernel-space HTTP server by Ingo Molnar for Linux.
Ingo's September 1, 2000 announcement
says an alpha version of TUX can be downloaded from
ftp://ftp.redhat.com/pub/redhat/tux,
and explains how to join a mailing list for more info.
The linux-kernel list has been discussing the pros and cons of this
approach, and the consensus seems to be that instead of moving web servers
into the kernel, the kernel should have the smallest possible hooks added
to improve web server performance. That way, other kinds of servers
can benefit. See e.g.
Zach Brown's remarks
about userland vs. kernel http servers.
Richard Gooch has written a paper discussing I/O options.
The Apache mailing lists have some interesting posts about why they prefer not to use select() (basically, they think that makes plugins harder). Still, they're planning to use select()/poll()/sendfile() for static requests in Apache 2.0.
Mark Russinovich wrote an editorial and an article discussing I/O strategy issues in the 2.2 Linux kernel. Worth reading, even though he seems misinformed on some points. In particular, he seems to think that Linux 2.2's asynchronous I/O (see F_SETSIG above) doesn't notify the user process when data is ready, only when new connections arrive. This seems like a bizarre misunderstanding. See also comments on an earlier draft, a rebuttal from Mingo, Russinovich's comments of 2 May 1999, a rebuttal from Alan Cox, and various posts to linux-kernel. I suspect he was trying to say that Linux doesn't support asynchronous disk I/O, which used to be true, but now that SGI has implemented KAIO, it's not so true anymore.
See these pages at sysinternals.com and MSDN for information on "completion ports", which he said were unique to NT; in a nutshell, win32's "overlapped I/O" turned out to be too low level to be convenient, and a "completion port" is a wrapper that provides a queue of completion events. Compare this to Linux's F_SETSIG and queued signals feature.
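For comparison, here's a rough sketch of the completion-port pattern in win32 C (watch_handle() and worker() are made-up names, and the overlapped ReadFile/WSARecv calls that actually feed the port are omitted):

    #include <windows.h>

    /* Create an empty completion port. */
    static HANDLE make_port(void)
    {
        return CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    }

    /* Associate a socket or file handle with the port; 'key' is an
       application-chosen per-connection tag. */
    static void watch_handle(HANDLE port, HANDLE h, ULONG_PTR key)
    {
        CreateIoCompletionPort(h, port, key, 0);
    }

    /* A small pool of threads blocks here; each completed overlapped
       read or write pops out of the queue as one entry. */
    static DWORD WINAPI worker(LPVOID arg)
    {
        HANDLE port = (HANDLE) arg;
        DWORD nbytes;
        ULONG_PTR key;
        OVERLAPPED *ov;

        for (;;) {
            if (GetQueuedCompletionStatus(port, &nbytes, &key, &ov, INFINITE)) {
                /* handle the completed I/O for connection 'key' here */
            }
        }
    }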
There was an interesting discussion on linux-kernel in September 1999 titled "> 15,000 Simultaneous Connections". (second week, third week) Highlights:
Interesting reading!
On FreeBSD, add

    set kern.maxfiles=XXXX

to /boot/loader.conf, where XXXX is the desired system limit on file descriptors, and reboot. Thanks to an anonymous reader, who wrote in to say he'd achieved far more than 10000 connections on FreeBSD 4.3, and says
"FWIW: You can't actually tune the maximum number of connections in FreeBSD trivially, via sysctl.... You have to do it in the /boot/loader.conf file.
The reason for this is that the zalloci() calls for initializing the sockets and tcpcb structures zones occurs very early in system startup, in order that the zone be both type stable and that it be swappable.
You will also need to set the number of mbufs much higher, since you will (on an unmodified kernel) chew up one mbuf per connection for tcptempl structures, which are used to implement keepalive."
On Linux,

    echo 32768 > /proc/sys/fs/file-max
    echo 65536 > /proc/sys/fs/inode-max

increases the system limits on open files and inodes, and

    ulimit -n 32768

increases the current process' limit. I verified that a process on Red Hat 6.0 (2.2.5 or so plus patches) can open at least 31000 file descriptors this way. Another fellow has verified that a process on 2.2.12 can open at least 90000 file descriptors this way (with appropriate limits). The upper bound seems to be available memory.
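A server can also raise its own descriptor limit at startup (up to whatever hard limit the steps above allow) rather than relying on the shell; a minimal sketch:

    #include <sys/resource.h>

    /* Raise this process's file descriptor limit to the hard limit. */
    static int raise_fd_limit(void)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
            return -1;
        rl.rlim_cur = rl.rlim_max;
        return setrlimit(RLIMIT_NOFILE, &rl);
    }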
On any architecture, you may need to reduce the amount of stack space allocated for each thread to avoid running out of virtual memory. If you're using pthreads, you can set this at runtime with pthread_attr_init() and pthread_attr_setstacksize().
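A minimal sketch; the 64KB figure is just an example, so pick whatever your per-client code actually needs:

    #include <pthread.h>

    extern void *client_main(void *arg);        /* per-client thread body */

    /* Spawn a client-handling thread with a small, explicit stack. */
    static int spawn_client(pthread_t *tid, void *arg)
    {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 64 * 1024);
        return pthread_create(tid, &attr, client_main, arg);
    }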
See also Volano's detailed instructions for raising file, thread, and FD_SET limits in the 2.2 kernel. Wow. This stunning little document steps you through a lot of stuff that would be hard to figure out yourself. This is a must-read, even though some of its advice is already out of date.
Up through JDK 1.3, Java's standard networking libraries mostly offered the one-thread-per-client model. There was a way to do nonblocking reads, but no way to do nonblocking writes.
In May 2001, the JDK 1.4 beta introduced the package java.nio to provide full support for nonblocking I/O (and some other goodies). See the release notes for some caveats. Try it out and give Sun feedback!
HP's java also includes a Thread Polling API.
In 2000, Matt Welsh implemented nonblocking sockets for Java; his performance benchmarks show that they have advantages over blocking sockets in servers handling many (up to 10000) connections. His class library is called java-nbio; it's part of the Sandstorm project. Benchmarks showing performance with 10000 connections are available.
See also Dean Gaudet's essay on the subject of Java, network I/O, and threads, and the paper by Matt Welsh on events vs. worker threads.
There are several proposals for improving Java's networking APIs.
A zero-copy implementation of sendfile() is on its way for the 2.4 kernel. See LWN Jan 25 2001.
One developer using sendfile() with FreeBSD reports that using POLLWRBAND instead of POLLOUT makes a big difference.
See LWN Jan 25 2001 for a summary of some very interesting discussions on linux-kernel about TCP_CORK and a possible alternative MSG_MORE.
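Here's a rough sketch of how TCP_CORK and sendfile() fit together on Linux when serving a static file; send_response() is a made-up helper, and a real server on nonblocking sockets would also have to cope with partial writes and EAGAIN.

    #include <netinet/in.h>
    #include <netinet/tcp.h>      /* TCP_CORK (Linux) */
    #include <sys/sendfile.h>     /* sendfile (Linux) */
    #include <sys/socket.h>
    #include <unistd.h>

    /* Send an HTTP header plus a file body in as few packets as possible:
       cork the socket, write the header, sendfile() the body, uncork. */
    static int send_response(int sock, const char *hdr, size_t hdrlen,
                             int filefd, off_t filesize)
    {
        int one = 1, zero = 0;
        off_t off = 0;

        setsockopt(sock, IPPROTO_TCP, TCP_CORK, &one, sizeof one);
        write(sock, hdr, hdrlen);
        while (off < filesize)
            if (sendfile(sock, filefd, &off, filesize - off) <= 0)
                break;
        setsockopt(sock, IPPROTO_TCP, TCP_CORK, &zero, sizeof zero);
        return off == filesize ? 0 : -1;
    }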
"I've compared the raw performance of a select-based server with a multiple-process server on both FreeBSD and Solaris/x86. On microbenchmarks, there's only a marginal difference in performance stemming from the software architecture. The big performance win for select-based servers stems from doing application-level caching. While multiple-process servers can do it at a higher cost, it's harder to get the same benefits on real workloads (vs microbenchmarks). I'll be presenting those measurements as part of a paper that'll appear at the next Usenix conference. If you've got postscript, the paper is available at http://www.cs.rice.edu/~vivek/flash99/"
For Linux, it looks like kernel bottlenecks are being fixed constantly. See my Mindcraft Redux page, Kernelnotes.org, Kernel Traffic, and the Linux-Kernel mailing list (example interesting posts: one from a user asking how to tune, and one from Dean Gaudet).
In March 1999, Microsoft sponsored a benchmark comparing NT to Linux at serving large numbers of http and smb clients, in which they failed to see good results from Linux. See also my article on Mindcraft's April 1999 Benchmarks for more info.
See also The Linux Scalability Project. They're doing interesting work, including Niels Provos' hinting poll patch, and some work on the thundering herd problem.
See also Mike Jagdis' work on improving select() and poll(); here's Mike's post about it.
Two tests in particular are simple, interesting, and hard:
Jef Poskanzer has published benchmarks comparing many web servers. See http://www.acme.com/software/thttpd/benchmarks.html for his results.
I also have a few old notes about comparing thttpd to Apache that may be of interest to beginners.
Chuck Lever keeps reminding us about Banga and Druschel's paper on web server benchmarking. It's worth a read.
IBM has an excellent paper titled Java server benchmarks [Baylor et al, 2000]. It's worth a read.
Copyright 1999-2001 Dan Kegel
dank@alumni.caltech.edu
Last updated: 11 July 2001