Web servers dish out billions of web pages a day. They tell you the weather, load up your online shopping carts, and let you find long-lost high-school buddies. Web servers are the workhorses of the World Wide Web. In this chapter, we:
•
Survey the many different types of software and hardware web servers.
•
Describe how to write a simple diagnostic web server in Perl.
•
Explain how web servers process HTTP transactions, step by step.
Where it helps to make things concrete, our examples use the Apache web server and its configuration options.s
1. Web Servers Come in All Shapes and Sizes
A web server processes HTTP requests and serves responses. The term “web server” can refer either to web server software or to the particular device or computer dedicated to serving the web pages.
Web servers comes in all flavors, shapes, and sizes. There are trivial 10-line Perl script web servers, 50-MB secure commerce engines, and tiny servers-on-a-card. But whatever the functional differences, all web servers receive HTTP requests for resources and serve content back to the clients (look back to Figure 1-5).
Figure 1-5. HTTP transactions consist of request and response messages
1.1. Web Server Implementations
Web servers implement HTTP and the related TCP connection handling. They also manage the resources served by the web server and provide administrative features to configure, control, and enhance the web server.
The web server logic implements the HTTP protocol, manages web resources, and provides web server administrative capabilities. The web server logic shares responsi- bilities for managing TCP connections with the operating system. The underlying operating system manages the hardware details of the underlying computer system and provides TCP/IP network support, filesystems to hold web resources, and process management to control current computing activities.
Web servers are available in many forms:
•
You can install and run general-purpose software web servers on standard com
puter systems.
•
If you don’t want the hassle of installing software, you can purchase a webserver appliance, in which the software comes preinstalled and preconfigured on a
computer, often in a snazzy-looking chassis.
•
Given the miracles of microprocessors, some companies even offer embedded
web servers implemented in a small number of computer chips, making them
perfect administration consoles for consumer devices.
Let’s look at each of those types of implementations.
1.2. General-Purpose Software Web Servers
General-purpose software web servers run on standard, network-enabled computer systems. You can choose open source software (such as Apache or W3C’s Jigsaw) or commercial software (such as Microsoft’s and iPlanet’s web servers). Web server software is available for just about every computer and operating system.
While there are tens of thousands of different kinds of web server programs (including custom-crafted, special-purpose web servers), most web server software comes from a small number of organizations.
In February 2002, the Netcraft survey (http://www.netcraft.com/survey/) showed three vendors dominating the public Internet web server market (see Figure 5-1):
•
The free Apache software powers nearly 60% of all Internet web servers.
•
Microsoft web server makes up another 30%.
•
Sun iPlanet servers comprise another 3%.
Figure 5-1. Web server market share as estimated by Netcraft's automated survey
너무 옛날 자료라, 최신자료 올림
Take these numbers with a few grains of salt, however, as the Netcraft survey is com- monly believed to exaggerate the dominance of Apache software. First, the survey counts servers independent of server popularity. Proxy server access studies from large ISPs suggest that the amount of pages served from Apache servers is much less than 60% but still exceeds Microsoft and Sun iPlanet. Additionally, it is anecdotally believed that Microsoft and iPlanet servers are more popular than Apache inside corporate enterprises.
1.3. Embedded Web Servers
Embedded servers are tiny web servers intended to be embedded into consumer products (e.g., printers or home appliances). Embedded web servers allow users to administer their consumer devices using a convenient web browser interface.
Some embedded web servers can even be implemented in less than one square inch, but they usually offer a minimal feature set. Two examples of very small embedded web servers are:
•
IPic match-head sized web server (http://www-ccs.cs.umass.edu/~shri/iPic.html)
•
NetMedia SitePlayer SP1 Ethernet Web Server (http://www.siteplayer.com)
2. A Minimal Perl Web Server
If you want to build a full-featured HTTP server, you have some work to do. The core of the Apache web server has over 50,000 lines of code, and optional processing modules make that number much bigger.
All this software is needed to support HTTP/1.1 features: rich resource support, virtual hosting, access control, logging, configuration, monitoring, and performance features. That said, you can create a minimally functional HTTP server in under 30 lines of Perl. Let’s take a look.
Example 5-1 shows a tiny Perl program called type-o-serve. This program is a useful diagnostic tool for testing interactions with clients and proxies. Like any web server, type-o-serve waits for an HTTP connection. As soon as type-o-serve gets the request message, it prints the message on the screen; then it waits for you to type (or paste) in a response message, which is sent back to the client. This way, type-o-serve pretends to be a web server, records the exact HTTP request messages, and allows you to send back any HTTP response message.
This simple type-o-serve utility doesn’t implement most HTTP functionality, but it is a useful tool to generate server response messages the same way you can use Telnet to generate client request messages (refer back to Example 5-1). You can download the type-o-serve program from http://www.http-guide.com/tools/type-o-serve.pl.
#!/usr/bin/perl
use Socket;
use Carp;
use FileHandle;
# (1) use port 8080 by default, unless overridden on command line
$port = (@ARGV ? $ARGV[0] : 8080);
# (2) create local TCP socket and set it to listen for connections
$proto = getprotobyname('tcp');
socket(S, PF_INET, SOCK_STREAM, $proto) || die;
setsockopt(S, SOL_SOCKET, SO_REUSEADDR, pack("l", 1)) || die;
bind(S, sockaddr_in($port, INADDR_ANY)) || die;
listen(S, SOMAXCONN) || die;
# (3) print a startup message
printf(" <<<Type-O-Serve Accepting on Port %d>>>\n\n",$port);
while (1) {
# (4) wait for a connection C
$cport_caddr = accept(C, S);
($cport,$caddr) = sockaddr_in($cport_caddr);
C->autoflush(1);
# (5) print who the connection is from
$cname = gethostbyaddr($caddr,AF_INET);
printf(" <<<Request From '%s'>>>\n",$cname);
# (6) read request msg until blank line, and print on screen
while ($line = <C>)
{
print $line;
if ($line =~ /^\r/) { last; }
}
# (7) prompt for response message, and input response lines,
# sending response lines to client, until solitary "."
printf(" <<<Type Response Followed by '.'>>>\n");
while ($line = <STDIN>)
{
$line =~ s/\r//;
$line =~ s/\n//;
if ($line =~ /^\./) { last; }
print C $line . "\r\n";
}
close(C);
}
Perl
복사
Example 5-1. type-o-server—a minimal Perl web server used for HTTP debugging
Figure 5-2 show show the administrator of Joe’s Hardware store might use type-o-serve to test HTTP communication:
•
First, the administrator starts the type-o-serve diagnostic server, listening on a
particular port. Because Joe’s Hardware store already has a production web
server listing on port 80, the administrator starts the type-o-serve server on port
8080 (you can pick any unused port) with this command line:
% type-o-serve.pl 8080
•
Once type-o-serve is running, you can point a browser to this web server. In Figure 5-2, we browse to http://www.joes-hardware.com:8080/foo/bar/blah.txt.
•
The type-o-serve program receives the HTTP request message from the browser
and prints the contents of the HTTP request message on screen. The type-o-serve
diagnostic tool then waits for the user to type in a simple response message, followed by a period on a blank line.
•
type-o-serve sends the HTTP response message back to the browser, and the
browser displays the body of the response message.
Figure 5-2. The type-o-serve utility lets you type in server responses to send back to clients
3. What Real Web Servers Do
The Perl server we showed in Example 5-1 is a trivial example web server. State-of- the-art commercial web servers are much more complicated, but they do perform several common tasks, as shown in Figure 5-3:
1.
Set up connection—accept a client connection, or close if the client is unwanted.
2.
Receive request—read an HTTP request message from the network.
3.
Process request—interpret the request message and take action.
4.
Access resource—access the resource specified in the message.
5.
Construct response—create the HTTP response message with the right headers.
6.
Send response—send the response back to the client.
7.
Log transaction—place notes about the completed transaction in a log file.
Figure 5-3. Steps of a basic web server request
The next seven sections highlight how web servers perfrom these basic tasks.
4. Step 1: Accepting Client Connections
If a client already has a persistent connection open to the server, it can use that connecti-on to send its request. Otherwise, the client needs to open a new connection to the server (refer back to Chapter 4 to review HTTP connection-management technology).
4.1. Handling New Connections
When a client requests a TCP connection to the web server, the web server estab-
lishes the connection and determines which client is on the other side of the connec-
tion, extracting the IP address from the TCP connection.* Once a new connection is
established and accepted, the server adds the new connection to its list of existing
web server connections and prepares to watch for data on the connection.
The web server is free to reject and immediately close any connection. Some web
servers close connections because the client IP address or hostname is unauthorized
or is a known malicious client. Other identification techniques can also be used.
4.2. Client Hostname Identification
Most web servers can be configured to convert client IP addresses into client host- names, using “reverse DNS”. Web servers can use the client hostname for detailed access control and logging. Be warned that hostname lookups can take a very long time, slowing down web transactions. Many high-capacity web servers either disable hostname resolution or enable it only for particular content.
You can enable hostname lookups in Apache with the HostnameLookups configuration directive. For example, the Apache configuration directives in Example 5-2 turn on hostname resolution for only HTML and CGI resources.
Example 5-2. Configuring Apache to look up hostnames for HTML and CGI resources
4.3. Determining the Client User Through ident
Some web servers also support the IETF ident protocol. The ident protocol lets servers find out what username initiated an HTTP connection. This information is particularly useful for web server logging—the second field of the popular Common Log Format contains the ident username of each HTTP request.*
If a client supports the ident protocol, the client listens on TCP port 113 for ident requests. Figure 5-4 shows how the ident protocol works. In Figure 5-4a, the client opens an HTTP connection. The server then opens its own connection back to the client’s identd server port (113), sends a simple request asking for the username corresponding to the new connection (specified by client and server port numbers), and retrieves from the client the response containing the username.
Figure 5-4. Using the ident protocol to determine HTTP client username
ident can work inside organizations, but it does not work well across the public Internet for many reasons, including:
•
Many client PCs don’t run the identd Identification Protocol daemon software.
•
The ident protocol significantly delays HTTP transactions.
•
Many firewalls won’t permit incoming ident traffic.
•
The ident protocol is insecure and easy to fabricate.
•
The ident protocol doesn’t support virtual IP addresses well.
•
There are privacy concerns about exposing client usernames.
You can tell Apache web servers to use ident lookups with Apache’s IdentityCheck on directive. If no ident information is available, Apache will fill ident log fields with hyphens (-). Common Log Format log files typically contain hyphens in the second field because no ident information is available.
5. Step 2: Receiving Request Messages
As the data arrives on connections, the web server reads out the data from the network connection and parses out the pieces of the request message (Figure 5-5).
Figure 5-5. Reading a request message from a connection
When parsing the request message, the web server:
•
Parses the request line looking for the request method, the specified resource
identifier (URI), and the version number, each separated by a single space, and
ending with a carriage-return line-feed (CRLF) sequence
•
Reads the message headers, each ending in CRLF
•
Detects the end-of-headers blank line, ending in CRLF (if present)
•
Reads the request body, if any (length specified by the Content-Length header)
When parsing request messages, web servers receive input data erratically from the network. The network connection can stall at any point. The web server needs to read data from the network and temporarily store the partial message data in memory until it receives enough data to parse it and make sense of it.
5.1. Internal Representations of Messages
Some web servers also store the request messages in internal data structures that make the message easy to manipulate. For example, the data structure might contain pointers and lengths of each piece of the request message, and the headers might be stored in a fast lookup table so the specific values of particular headers can be accessed quickly (Figure 5-6).
Figure 5-6. Parsing a request message into a convenient internal representation
5.2. Connection Input/Output Processing Architectures
High-performance web servers support thousands of simultaneous connections. These connections let the web server communicate with clients around the world, each with one or more connections open to the server. Some of these connections may be sending requests rapidly to the web server, while other connections trickle requests slowly or infrequently, and still others are idle, waiting quietly for some future activity.
Web servers constantly watch for new web requests, because requests can arrive at any time. Different web server architectures service requests in different ways, as Figure 5-7 illustrates:
Figure 5-7. Web server input/output architectures
Single-threaded web servers (Figure 5-7a)
Single-threaded web servers process one request at a time until completion. When the transaction is complete, the next connection is processed. This architecture is simple to implement, but during processing, all the other connections are ignored. This creates serious performance problems and is appropriate only for low-load servers and diagnostic tools like type-o-serve.
Multiprocess and multithreaded web servers (Figure 5-7b)
Multiprocess and multithreaded web servers dedicate multiple processes or higher-efficiency threads to process requests simultaneously.* The threads/ processes may be created on demand or in advance.† Some servers dedicate a thread/process for every connection, but when a server processes hundreds, thousands, or even tens of thousands of simultaneous connections, the resulting number of processes or threads may consume too much memory or system resources. Thus, many multithreaded web servers put a limit on the maximum number of threads/processes.
Multiplexed I/O servers (Figure 5-7c)
To support large numbers of connections, many web servers adopt multiplexed architectures. In a multiplexed architecture, all the connections are simultaneously watched for activity. When a connection changes state (e.g., when data becomes available or an error condition occurs), a small amount of processing is performed on the connection; when that processing is complete, the connection is returned to the open connection list for the next change in state. Work is done on a connection only when there is something to be done; threads and processes are not tied up waiting on idle connections.
Multiplexed multithreaded web servers (Figure 5-7d)
Some systems combine multithreading and multiplexing to take advantage of multiple CPUs in the computer platform. Multiple threads (often one per physical processor) each watch the open connections (or a subset of the open connections) and perform a small amount of work on each connection.
6. Step 3: Processing Requests
Once the web server has received a request, it can process the request using the
method, resource, headers, and optional body.
Some methods (e.g., POST) require entity body data in the request message. Other
methods (e.g., OPTIONS) allow a request body but don’t require one. A few methods (e.g., GET) forbid entity body data in request messages.
We won’t talk about request processing here, because it’s the subject of most of the
chapters in the rest of this book!
7. Step 4: Mapping and Accessing Resources
Web servers are resource servers. They deliver precreated content, such as HTML
pages or JPEG images, as well as dynamic content from resource-generating applica-
tions running on the servers.
Before the web server can deliver content to the client, it needs to identify the source
of the content, by mapping the URI from the request message to the proper content
or content generator on the web server.
7.1. Docroots
Web servers support different kinds of resource mapping, but the simplest form of resource mapping uses the request URI to name a file in the web server’s filesystem. Typically, a special folder in the web server filesystem is reserved for web content. This folder is called the document root, or docroot. The web server takes the URI from the request message and appends it to the document root.
In Figure 5-8, a request arrives for /specials/saw-blade.gif. The web server in this example has document root /usr/local/httpd/files. The web server returns the file /usr/ local/httpd/files/specials/saw-blade.gif.
Figure 5-8. Mapping request URI to local web server resource
To set the document root for an Apache web server, add a DocumentRoot line to the
httpd.conf configuration file:
DocumentRoot /usr/local/httpd/files
Servers are careful not to let relative URLs back up out of a docroot and expose other
parts of the filesystem. For example, most mature web servers will not permit this
URI to see files above the Joe’s Hardware document root:
http://www.joes-hardware.com/../
Virtaully hosted docroots
Virtually hosted web servers host multiple web sites on the same web server, giving each site its own distinct document root on the server. A virtually hosted web server identifies the correct document root to use from the IP address or hostname in the URI or the Host header. This way, two web sites hosted on the same web server can have completely distinct content, even if the request URIs are identical.
In Figure 5-9, the server hosts two sites: www.joes-hardware.com and www.marys- antiques.com. The server can distinguish the web sites using the HTTP Host header, or from distinct IP addresses.
•
When request A arrives, the server fetches the file for /docs/joe/index.html.
•
When request B arrives, the server fetches the file for /docs/mary/index.html.
Figure 5-9. Different docroots for virtually hosted requests
Configuring virtually hosted docroots is simple for most web servers. For the popular Apache web server, you need to configure a VirtualHost block for each virtual web site, and include the DocumentRoot for each virtual server (Example 5-3).
<VirtualHost www.joes-hardware.com>
ServerName www.joes-hardware.com
DocumentRoot /docs/joe
TransferLog /logs/joe.access_log
ErrorLog /logs/joe.error_log
</VirtualHost>
<VirtualHost www.marys-antiques.com>
ServerName www.marys-antiques.com
DocumentRoot /docs/mary
TransferLog /logs/mary.access_log
ErrorLog /logs/mary.error_log
</VirtualHost>
....
XML
복사
Look forward to “Virtual Hosting” in Chapter 18 for much more detail about virtual hosting.
User home directory docroots
Another common use of docroots gives people private web sites on a web server. A typical convention maps URIs whose paths begin with a slash and tilde (/~) followed by a username to a private document root for that user. The private docroot is often the folder called public_html inside that user’s home directory, but it can be configured differently (Figure 5-10).
Figure 5-10. Different docroots for different users
7.2. Directory Listings
A web server can receive requests for directory URLs, where the path resolves to a directory, not a file. Most web servers can be configured to take a few different actions when a client requests a directory URL:
•
Return an error.
•
Return a special, default, “index file” instead of the directory.
•
Scan the directory, and return an HTML page containing the contents.
Most web servers look for a file named index.html or index.htm inside a directory to represent that directory. If a user requests a URL for a directory and the directory contains a file named index.html (or index.htm), the server will return the contents of that file.
In the Apache web server, you can configure the set of filenames that will be interpreted as default directory files using the DirectoryIndex configuration directive. The
DirectoryIndex directive lists all filenames that serve as directory index files, in preferred order. The following configuration line causes Apache to search a directory for
any of the listed files in response to a directory URL request:
DirectoryIndex index.html index.htm home.html home.htm index.cgi
If no default index file is present when a user requests a directory URI, and if directory indexes are not disabled, many web servers automatically return an HTML file listing the files in that directory, and the sizes and modification dates of each file, including URI links to each file. This file listing can be convenient, but it also allows nosy people to find files on a web server that they might not normally find.
You can disable the automatic generation of directory index files with the Apache directive:
Options -Indexes
7.3. Dynamic Content Resource Mapping
Web servers also can map URIs to dynamic resources—that is, to programs that generate content on demand (Figure 5-11). In fact, a whole class of web servers called application servers connect web servers to sophisticated backend applications. The web server needs to be able to tell when a resource is a dynamic resource, where the dynamic content generator program is located, and how to run the program. Most web servers provide basic mechanisms to identify and map dynamic resources.
Figure 5-11. A web server can serve static resources as well as dynamic resources
Apache lets you map URI pathname components into executable program directories. When a server receives a request for a URI with an executable path component, it attempts to execute a program in a corresponding server directory. For example, the following Apache configuration directive specifies that all URIs whose paths begin with /cgi-bin/ should execute corresponding programs found in the directory /usr/local/etc/httpd/cgi-programs/:
ScriptAlias /cgi-bin/ /usr/local/etc/httpd/cgi-programs/
Apache also lets you mark executable files with a special file extension. This way, executable scripts can be placed in any directory. The following Apache configuration directive specifies that all web resources ending in .cgi should be executed:
AddHandler cgi-script .cgi
CGI is an early, simple, and popular interface for executing server-side applications. Modern application servers have more powerful and efficient server-side dynamic content support, including Microsoft’s Active Server Pages and Java servlets.
7.4. Server-Side Includes (SSI)
Many web servers also provide support for server-side includes. If a resource is flagged as containing server-side includes, the server processes the resource contents before sending them to the client.
The contents are scanned for certain special patterns (often contained inside special HTML comments), which can be variable names or embedded scripts. The special patterns are replaced with the values of variables or the output of executable scripts. This is an easy way to create dynamic content.
7.5. Access Controls
Web servers also can assign access controls to particular resources. When a request arrives for an access-controlled resource, the web server can control access based on the IP address of the client, or it can issue a password challenge to get access to the resource.
Refer to Chapter 12 for more information about HTTP authentication.
8. Step 5: Building Responses
Once the web server has identified the resource, it performs the action described in the request method and returns the response message. The response message contains a response status code, response headers, and a response body if one was generated. HTTP response codes were detailed in “Status Codes” in Chapter 3.
8.1. Response Entities
If the transaction generated a response body, the content is sent back with the
response message. If there was a body, the response message usually contains:
•
A Content-Type header, describing the MIME type of the response body
•
A Content-Length header, describing the size of the response body
•
The actual message body content
8.2. MIME Typing
The web server is responsible for determining the MIME type of the response body. There are many ways to configure servers to associate MIME types with resources:
mime.types
The web server can use the extension of the filename to indicate MIME type.
The web server scans a file containing MIME types for each extension to compute the MIME type for each resource. This extension-based type association is the most common; it is illustrated in Figure 5-12.
Figure 5-12. A web server uses MIME types of file to set outgoing Content-Type of resources
Magic typing
The Apache web server can scan the contents of each resource and pattern-match the content against a table of known patterns (called the magic file) to determine the MIME type for each file. This can be slow, but it is convenient, especially if the files are named without standard extensions.
Explicit typing
Web servers can be configured to force particular files or directory contents to have a MIME type, regardless of the file extension or contents.
Type negotiation
Some web servers can be configured to store a resource in multiple document formats. In this case, the web server can be configured to determine the “best” format to use (and the associated MIME type) by a negotiation process with the user. We’ll discuss this in Chapter 17.
Web servers also can be configured to associate particular files with MIME types.
8.3. Redirection
Web servers sometimes return redirection responses instead of success messages. A web server can redirect the browser to go elsewhere to perform the request. A redirec- tion response is indicated by a 3XX return code. The Location response header contains a URI for the new or preferred location of the content. Redirects are useful for:
Permanently moved resources
A resource might have been moved to a new location, or otherwise renamed, giving it a new URL. The web server can tell the client that the resource has been renamed, and the client can update any bookmarks, etc. before fetching the resource from its new location. The status code 301 Moved Permanently is used for this kind of redirect.
Temporarily moved resources
If a resource is temporarily moved or renamed, the server may want to redirect the client to the new location. But, because the renaming is temporary, the server wants the client to come back with the old URL in the future and not to update any bookmarks. The status codes 303 See Other and 307 Temporary Redirect are used for this kind of redirect.
URL augmentation
Servers often use redirects to rewrite URLs, often to embed context. When the request arrives, the server generates a new URL containing embedded state information and redirects the user to this new URL.* The client follows the redirect, reissuing the request, but now including the full, state-augmented URL. This is a useful way of maintaining state across transactions. The status codes 303 See Other and 307 Temporary Redirect are used for this kind of redirect.
Load balancing
If an overloaded server gets a request, the server can redirect the client to a less heavily loaded server. The status codes 303 See Other and 307 Temporary Redi- rect are used for this kind of redirect.
Server affinity
Web servers may have local information for certain users; a server can redirect the client to a server that contains information about the client. The status codes 303 See Other and 307 Temporary Redirect are used for this kind of redirect.
Canonicalizing directory names
When a client requests a URI for a directory name without a trailing slash, most web servers redirect the client to a URI with the slash added, so that relative links work correctly.
9. Step 6: Sending Responses
Web servers face similar issues sending data across connections as they do receiving. The server may have many connections to many clients, some idle, some sending data to the server, and some carrying response data back to the clients.
The server needs to keep track of the connection state and handle persistent connections with special care. For nonpersistent connections, the server is expected to close its side of the connection when the entire message is sent.
For persistent connections, the connection may stay open, in which case the server needs to be extra cautious to compute the Content-Length header correctly, or the client will have no way of knowing when a response ends (see Chapter 4).
10. Step 7: Logging
Finally, when a transaction is complete, the web server notes an entry into a log file, describing the transaction performed. Most web servers provide several configurable forms of logging. Refer to Chapter 21 for more details.