Author Topic: Cache-Friendly Web Pages  (Read 491 times)

akash.datasoft

Cache-Friendly Web Pages
« on: October 17, 2013, 07:01:00 PM »


There are a lot of HTTP caches out there. How long are they holding your pages? How long should they hold your pages? RFC 2616 (HTTP/1.1) specifies that caches must obey Expires and Cache-Control headers--but do your pages have them? How do you add them? What happens to your pages if you don't?

 
 
"The goal of caching in HTTP/1.1 is to eliminate the need to send requests in many cases, and to eliminate the need to send full responses in many other cases." RFC 2616


Advantages of cache-friendly pages

"HTTP caching works best when caches can entirely avoid making requests to the origin server. The primary mechanism for avoiding requests is for an origin server to provide an explicit expiration time in the future, indicating that a response MAY be used to satisfy subsequent requests. In other words, a cache can return a fresh response without first contacting the server." RFC 2616

The RFC was written with the expectation that Web pages would include expiration headers. If the expiration times in the headers are chosen carefully, caches can serve stored pages without losing any meaning.

When origin servers don't provide expiration headers, caches use heuristics based on headers like "Last-Modified" to guess at a suitable expiration. Heuristic methods are inefficient compared to expiry dates set by humans who know a page's content and frequency of changes.

"Since heuristic expiration times might compromise semantic transparency, they ought to be used cautiously, and we encourage origin servers to provide explicit expiration times as much as possible." RFC 2616


Notes about caching

The HTTP/1.1 specification (section 13) has strict requirements for caches: it requires them to provide semantic transparency (returning a cached response should provide the same data as a request to the origin server would), and it requires them to honor the freshness requirements provided by origin servers and clients.

Caches must pass on warnings provided by upstream caches or the origin server, and they must add warnings if providing a stale response. A cache may provide a stale response in limited circumstances, mostly if the cache cannot connect to the origin server and the client has stated that it will accept a stale response.

If a cache receives a request for a stale page, it sends a validation request asking the origin server whether the page has changed. The most common validator is the last modification time, but Last-Modified has only one-second resolution, so two changes made within the same second cannot be distinguished. For this reason, HTTP/1.1 also offers strong validation using the ETag (entity tag) header.
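As an illustration only (the host name, dates, and entity-tag value are made up), a revalidation exchange looks roughly like this: the cache sends a conditional request, and the origin server answers 304 Not Modified if the page has not changed.

GET /index.html HTTP/1.1
Host: www.example.org
If-Modified-Since: Thu, 17 Oct 2013 10:00:00 GMT
If-None-Match: "abc123"

HTTP/1.1 304 Not Modified
Date: Thu, 17 Oct 2013 12:00:00 GMT
ETag: "abc123"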

The simplest way to assist caching is to keep accurate time on your HTTP server and always send the Date and Last-Modified headers with your responses.

To be a really cache-friendly webmaster, though, include the cache headers in your pages.
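For example, a fully cache-friendly response might carry a set of headers along these lines (all values are illustrative; here the page is allowed to stay fresh for one day):

HTTP/1.1 200 OK
Date: Thu, 17 Oct 2013 12:00:00 GMT
Last-Modified: Thu, 17 Oct 2013 10:00:00 GMT
Expires: Fri, 18 Oct 2013 12:00:00 GMT
Cache-Control: max-age=86400
Content-Type: text/html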


Setting up cache headers in Apache

Main method: Expires header
To use the Expires header, you will need to be running Apache 1.2 or later and have mod_expires enabled. Uncomment the expires_module line in the "Dynamic Shared Object Support" section of the httpd.conf file, then recompile Apache.

#
# Dynamic Shared Object (DSO) Support
#

# LoadModule cern_meta_module /usr/lib/apache/1.3/mod_cern_meta.so
LoadModule expires_module /usr/lib/apache/1.3/mod_expires.so
(If you are running Apache 1.3 or later and it is configured to load modules at runtime, you can edit httpd.conf and then restart Apache instead of recompiling.)

mod_expires calculates the Expires header based on three directives. The directives can be used in any of the following contexts: server config, virtual host, directory, or .htaccess.
The Expires directives accept two syntaxes. One is fairly unreadable: it expects you to work out the number of seconds until expiry. Fortunately, the module also accepts a much more human-readable syntax, which is the one this article describes.



The directives are:

ExpiresActive on|off
ExpiresDefault "<base> [plus] {<num>  <type>}*"
ExpiresByType type/encoding "<base> [plus] {<num> <type>}*"
base is one of:

access
now (equivalent to "access")
modification
num is an integer value that applies to the type. type is one of:

years
months
weeks
days
hours
minutes
seconds
If you're using the Expires directives for a server, virtual host, or directory, edit the httpd.conf file and add the directives inside those realms.

<Directory /whichever/directory/here>
    # Everything else you want to add to this section
    ExpiresActive on
    ExpiresByType image/gif "access plus 1 year"
    ExpiresByType text/html "modification plus 2 days"
    # ExpiresDefault "now plus 0 seconds"
    ExpiresDefault "now plus 1 month"
</Directory>
If you're putting the Expires directives in a .htaccess file, you will need to edit httpd.conf and set the AllowOverride directive for the relevant directory. Apache will only honor mod_expires directives in .htaccess for directories which have the "Indexes" override set.

# Allow the Indexes override for the directories using .htaccess.
<Directory /whichever/directory/here>
    # Everything else you want to add to this section
    AllowOverride Indexes
</Directory>
Add the Expires directives to the .htaccess file in the relevant directory. The webmaster can edit the .htaccess file without needing access to httpd.conf.
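For example, the .htaccess file might contain nothing more than the same directives shown above (the values here are only illustrative):

ExpiresActive on
ExpiresByType image/gif "access plus 1 year"
ExpiresDefault "modification plus 1 week"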


The main problem with the ".htaccess" method is that the Indexes override and the .htaccess file give the webmaster more configuration options than just the Expires header. This may not be what the system administrator intends.

Alternative method: Cache-Control header
mod_cern_meta allows file-level control, and it also allows the use of Cache-Control headers (or any other header). The headers are put in a subdirectory of the origin directory, with a name based on the origin file's name.


Uncomment the cern_meta_module line and recompile, as for expires_module in the last section.

In the httpd.conf file, set MetaFiles on, MetaDirectory to the subdirectory name, and MetaSuffix to a suffix for the header files.

MetaFiles on
MetaDirectory .web
MetaSuffix .meta
Using these values, the file /var/www/www.example.org/index.html would have a meta file at /var/www/www.example.org/.web/index.html.meta.
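As an illustration, that meta file might contain nothing more than one or two extra headers to be added to the response, for instance (the date is invented):

Expires: Fri, 18 Oct 2013 12:00:00 GMT
Cache-Control: max-age=86400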

Any valid HTTP headers can be put in these files. This provides another way to apply the Expires header, and it's a way to add the Cache-Control headers. The relevant Cache-Control headers are:

Cache-Control: max-age=<delta-seconds>
Modifies the expiration mechanism, overriding the Expires header. Max-age implies Cache-Control: public.
Cache-Control: public
Indicates that the object may be stored in a cache. This is the default.
Cache-Control: private
Cache-Control: private=<field-name>
Indicates that the object (or the named header field) is intended for a single user and must not be stored in a shared cache. It may be stored in a private cache.
Cache-Control: no-cache
Cache-Control: no-cache=<field-name>
Indicates that the object (or the named header field) may be cached, but must not be served to a client without first being revalidated with the origin server.
Cache-Control: no-store
Indicates that the item must not be written to nonvolatile storage, and should be removed from volatile storage as soon as possible.
Cache-Control: no-transform
Proxies sometimes convert data from one format to another. This directive indicates that (most of) the response must not be transformed. (The RFC allows transformation of some fields even when this directive is present.)
Cache-Control: must-revalidate
Cache-Control: proxy-revalidate
Forces the cache to revalidate the page even if the client would accept a stale response; must-revalidate applies to all caches, proxy-revalidate only to shared caches. Read the RFC before using these directives, as there are restrictions on their use.
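These directives can be combined in a single header. For example, a personalised page that a browser may keep for ten minutes, but that shared caches must not store, could be sent with:

Cache-Control: private, max-age=600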

Caveats and gotchas

HTTP/1.0 has minimal cache control: the only cache directive it defines is Pragma: no-cache. HTTP/1.0 caches will ignore the Cache-Control header, although most of them do honor Expires, which was already part of HTTP/1.0.

None of the Cache-Control directives ensure privacy or security of data. The directives "private" and "no-store" assist in privacy and security, but they are not intended to substitute for authentication and encryption.

This article is not a substitute for the RFC. If you are implementing the Cache-Control headers, do read the RFC for a detailed description of what each header means and what its limits are.

melby@datasoft.ws

Re: Cache-Friendly Web Pages
« Reply #1 on: October 17, 2013, 11:52:52 PM »
Hi,

 Ten Tips for Building Cache-Friendly Web Sites

Avoid using CGI, Active Server Pages, and server-side includes unless absolutely necessary.

In general, these techniques are bad for caches because they usually produce dynamic content. Dynamic content is not a bad thing per se, but it may be abused. CGI and ASP can also generate cache-friendly, static content, but doing so requires special effort by the author and seems to be uncommon in practice.

The main problem with CGI scripts is that many caches simply do not store a response when the URL includes cgi-bin or even cgi. The reason for this heuristic is perhaps historical. When caching was first in use, this was the easiest way to identify dynamic content. Today, with HTTP 1.1, we only need to look at the response headers to determine what may be cached. Even so, the heuristic remains, and some caches might be hardwired to never store CGI responses.

From a cache's point of view, Active Server Pages (ASP) are very similar to CGI scripts. Both are generated by the server, on the fly, for each request. As such, ASP responses usually have neither a Last-Modified nor an Expires header. On the plus side, it is uncommon to find special cache heuristics for ASP (unlike CGI), probably because ASP was invented well after caching was in widespread use.

Finally, you should avoid server-side includes (SSI) for the same reasons. This is a feature of some HTTP servers to parse HTML at request time, and replace certain markers with special text. For example, with Apache you can insert the current date and time or the current file size into an HTML page. Because the server generates new content, the Last-Modified header is either absent in the response, or set to the current time. Both cases are bad for caches.
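For reference, an SSI marker in an Apache-parsed (.shtml) page looks something like the following; at request time the server replaces it with the current date, so the output changes on every request:

       <!--#echo var="DATE_LOCAL" -->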

Use the GET method instead of the POST method, if possible.

Both methods are used for HTML forms and query-type requests. With the POST method, query terms are transmitted in the request body. A GET request, on the other hand, puts the query terms in the URI (Uniform Resource Identifier). It's easy to see the difference in your browser's Location box. A GET query has all the terms in the box, with lots of & and = characters. This means POST is somewhat more secure because the query terms are hidden in the message body.

However, this difference also means that POST responses cannot be cached unless specifically allowed. POST requests may have side effects on the server (e.g., updating a database), and those side effects would not be triggered if a cache gave back a stored response. Section 9.1 of RFC 2616 explains the important differences between GET and POST. In practice, it is rare to find a cachable POST response, so I doubt that most caching products cache POST responses at all. If you want your query results to be cachable, you should use GET instead of POST.
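To make the difference concrete, the same hypothetical form submission could be sent either way (the host and parameters are made up):

       GET /search?q=caching&page=2 HTTP/1.1
       Host: www.example.org

       POST /search HTTP/1.1
       Host: www.example.org
       Content-Type: application/x-www-form-urlencoded
       Content-Length: 16

       q=caching&page=2

Only the GET response stands a realistic chance of being stored and reused by a cache.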

Avoid renaming Web site files; use unique filenames instead.

This might be difficult or impossible for some situations, but consider this example: A Web site lists a schedule of talks for a conference. For each talk there is an abstract stored in a separate HTML file. These files are named to match the order of their presentation during the conference. Something like talk01.html, talk02.html, talk03.html, and so on. At some point, the schedule changes and the filenames are no longer in order. If the files are renamed, so that they match the new order of the presentation, Web caches are likely to become confused. Renaming usually does not update the file-modification time, so an If-Modified-Since request for a renamed file can have unpredictable consequences. Renaming files in this manner is similar to cache poisoning.

In this example, it is better to use a file-naming scheme that does not depend on the order; perhaps base the file naming on the presenter's name. Then, if the order of presentation changes, the HTML file with the schedule must be rewritten, but the other files can still be served from the cache. Another solution is to touch the files to adjust the time stamp.

Give your content a default expiration time, even if it is very short.

If your content is relatively static, adding an Expires header can significantly speed up access to your site. The explicit expiration time means clients know exactly when they should issue revalidation requests. An expires-based cache hit is almost always faster than a validation-based near hit.

With Apache, you can use the mod_expires module to add expiration times to your responses. After configuring and compiling the server with mod_expires, you'll need to add the ExpiresActive directive to your httpd.conf file:

       ExpiresActive on

Then, you can use the ExpiresByType and ExpiresDefault directives to control expiration values for different responses. For example:

       ExpiresByType text/html "access plus 12 hours"
       ExpiresByType image/jpeg "access plus 1 day"
       ExpiresDefault "access plus 6 hours"

If you have content that changes at regular intervals (for example, daily), you can base the expiration time on the file-modification time:

       ExpiresByType image/gif "modification plus 1 day"

For more information on the mod_expires module, take a look at the Apache documentation.

If you have a mixture of static and dynamic content, it is helpful to have a separate HTTP server for each.

This way, you can set server-wide defaults to improve the cachability of your static content, without affecting the dynamic data. Since the entire server is dedicated to static objects, you only need to maintain one configuration file. A number of large Web sites have taken this approach. Yahoo serves all its images from a server at images.yahoo.com, as does CNN with images.cnn.com. Wired serves advertisements and other images from static.wired.com, and Hotbot uses a server named static.hotbot.com.
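A minimal sketch of what such a static-only host's configuration might look like, assuming mod_expires is loaded (the host name and path are invented):

       <VirtualHost *:80>
           ServerName static.example.org
           DocumentRoot /var/www/static
           ExpiresActive on
           ExpiresDefault "access plus 1 week"
       </VirtualHost>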

Don't use content negotiation.

Occasionally, people like to create pages that are customized for the user's browser. For example, Netscape may have a nifty feature that Internet Explorer does not have. An origin server can examine the User-agent request header and generate special HTML to take advantage of a browser feature. To use the terminology from HTTP, an origin server may have any number of variants for a single URI. The mechanism for selecting the most appropriate variant is known as content negotiation, and it has negative consequences for Web caches.

First of all, if either the cache or the origin server does not correctly implement content negotiation, a cache client might receive the wrong response. For example, if an HTML page has something specific to Internet Explorer and gets cached, the cache might send it to a Netscape user. To prevent this from happening, the origin server is supposed to add a response header telling caches that the response depends on the User-agent value:

       Vary: User-agent

If the cache ignores the Vary header, or if the origin server does not send it, cache users can get incorrect responses.

Even when content negotiation is correctly implemented, it reduces the number of cache hits for the URL. If a response varies on the User-agent header, a cache must store a separate response for every User-agent it encounters. Note, this is more than just Netscape or MSIE. Rather, it is a string like Mozilla/4.05 [en] (X11; I; FreeBSD 2.2.5-RELEASE i386; Nav). Thus, when a response varies on the User-agent header, we can only get a cache hit for clients running the exact same version of the browser, on the same operating system.

Synchronize your system clocks with a reference clock.

This ensures that your server sends accurate Last-modified and Expires time stamps in its responses. Even though newer versions of HTTP use techniques that are less susceptible to clock skew, many Web clients and servers still rely on the absolute time stamps. ntpd implements the Network Time Protocol (NTP) and is widely used to keep clocks synchronized on Unix systems. You can get the software and installation tips from the Time Synchronization Server Web site.
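As a sketch, pointing ntpd at the public NTP pool usually amounts to a couple of lines in /etc/ntp.conf (these are the conventional pool server names; check your system's defaults):

       server 0.pool.ntp.org iburst
       server 1.pool.ntp.org iburst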

Avoid using address-based authentication.

Most caching proxies hide the addresses of clients. An origin server sees connections coming from the proxy's address, not the client's. Furthermore, there is no standard and safe way to convey the client's address in an HTTP request.

Address-based authentication can also deny legitimate users access to protected information when they use a proxy cache. Many organizations use a DMZ network for the firewall between the Internet and their internal systems. A cache that runs on the DMZ network is probably not allowed to access internal Web servers. Thus, the users on the internal network cannot simply send all of their requests to a cache on the DMZ network. Instead, the browsers must be configured to make direct connections for the internal servers.

Think Different.

Sometimes, those of us in the United States forget about Internet users in other parts of the world. In some countries, Internet bandwidth is so constrained that we would find it appalling. What takes seconds or minutes to load in the U.S. may take hours or even days in some locations. I strongly encourage you to remember bandwidth-starved users when designing your Web sites, and remember that improved cachability speeds up your Web site for such users.

Even if you think shared proxy caches are evil, consider allowing single-user-browser caches to store your pages.

There is a simple way to accomplish this with HTTP/1.1. Just add the following header to your server's replies:

       Cache-Control: private

Thanks,
Melby.

silgy

Re: Cache-Friendly Web Pages
« Reply #2 on: October 18, 2013, 03:03:16 AM »
Hello...

Benefits of Web Caching
When a request is satisfied by a cache (whether it's a browser cache, or a proxy cache run by your organization or ISP), the content no longer has to travel across the Internet from the origin Web server to you, saving bandwidth for both the client or ISP and the origin site.

Instead of taking the time to establish a new connection with each origin server for each resource needed, the client can retain its connection to the proxy. This is particularly beneficial to clients behind high-latency connections such as modems or satellite links. Furthermore, a sufficiently busy proxy may be able to send requests for multiple clients over the same connection to a server, saving on connection establishment there as well. Recall that TCP has a fairly high overhead (in terms of time) for connection establishment, and that it sends data slowly at first. Combined with the fact that most requests on the Web are for relatively small resources, this means that reducing the number of necessary connections and holding them open for future requests (that is, making them persistent) has a significant, positive effect on client performance.

The cacheability of your Web site affects both its user-perceived performance and the scalability of your hosting solution. Instead of taking seconds or minutes to load, a cached page can appear almost instantaneously. And whether you spend $20 a month for a starter Web site that allows 1GB of transfer per month, or many thousands of dollars per month for high-end connectivity and multiple servers, a cache-friendly design will let you serve more pages before you need to upgrade to a more expensive solution. To extract the most performance for the dollar, make your site as cache-friendly as possible.

Thanks
Silgy