Apache config and .htaccess - URL rewriting

From Helpful
These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.


Intro

When you want to send a browser to different content, options include:

  • Scripts that emit the redirect headers
  • Apache-configured static redirects with mod_alias's Redirect
  • The regex-rewriting variation, RedirectMatch
  • For things RedirectMatch can't do, you'll want mod_rewrite, which is more configurable and flexible.


Note that most of these can do internal redirects, external redirects, or both.

External redirects tell the browser to make a new request elsewhere (at a given URL), while internal redirects cause the server to do an internal request elsewhere (this is invisible to the UA) and serve the result as if you had requested that other thing.


Internal redirects may serve the one request faster than a "no, look there instead" answer that leads to another request, but external redirects are often preferred because they avoid the same content being served under multiple URLs. That is a little impractical in itself, and it also matters to search engines (SEO people are fond of this reason because of pagerank-style concerns).
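
As a rough sketch of the difference in mod_rewrite terms (the file names are made up, and the patterns assume server or vhost context, where paths start with a slash):

```apache
RewriteEngine on

# Internal: the browser keeps seeing /old.html, the server serves /new.html
RewriteRule ^/old\.html$ /new.html [L]

# External: the browser is told (301) to request /new.html itself
RewriteRule ^/moved\.html$ /new.html [R=301,L]
```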


Apache modules that rely on request or response values may be affected by rewriting, as may script logic - consider the possible interactions when writing redirect logic.

mod_alias

Basic redirect matches and replaces a string prefix. For particular pages you can do:

Redirect permanent /app.html http://app.example.com/
Redirect permanent /foo.html /new/foo/foo.php


Because any path that is unmatched is appended to the redirect's target, beware of what that means for you. Say you have an old site you want to redirect to a new site's front page.

Redirect permanent / http://www.example.com/sub/index.html

...isn't what you want - sourcehost.example.com/test.html would result in going to http://www.example.com/sub/index.htmltest.html (verify)


You may want to swallow that input, for example using:

RedirectMatch permanent /(.*) http://www.example.com/sub/index.html


If you want to selectively remap based on patterns rather than directories, regexps may be friendly. Consider:

RedirectMatch (.*)\.jpg$ http://static.example.com$1.jpg


The redirect type (see also HTTP_notes#HTTP_redirect) is one of

  • permanent, referring to permanent moves (HTTP 301 status)
  • temp, referring to temporary redirects (302)
  • seeother, referring to a 303

There is a fourth option, gone, but it is really an error page trigger and not a redirect. It takes no URL.
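
For illustration, one line per type might look like this (the paths and hostname are made up):

```apache
Redirect permanent /old-a  http://www.example.com/a
Redirect temp      /old-b  http://www.example.com/b
Redirect seeother  /old-c  http://www.example.com/c
# gone takes no target URL
Redirect gone      /old-d
```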

mod_rewrite

mod_rewrite allows much funkier things. You may sometimes need it when the more basic apache redirect rules don't cut it.


(Side note: When you use it, you may want to enable it only on vhosts/dirs where you actually use it, simply for speed reasons: mod_rewrite hooks into a phase that applies to all requests, and even doing nothing takes some time. This is why you see a bunch of mentions of RewriteEngine on in examples. I'm going to assume you know this now, and leave out that enabler line in most examples below.)


To use mod_rewrite, you write a list of RewriteRules and RewriteConditions in each context you want rewriting. Their syntax:

RewriteCond tested_string condition_pattern [CondFlags]
# and
RewriteRule url_pattern url_substitution [RuleFlags]

RewriteCond primarily tests a request's environment, while RewriteRule primarily applies the rewrite on the request URL. Both act as filters, in different ways, and both may terminate rewrite processing(verify).

Both RewriteCond and RewriteRule can match using regexps, and both can capture values. A RewriteRule can also refer to a capture in a RewriteCond, but this is rarely necessary and so rarely used. See the hostname rewriting example below.

You can enable logging of the rewrite operations (see RewriteLog and RewriteLogLevel), which is useful to debug your rewrite rules (but which you want disabled in production, for speed reasons).
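
A sketch of what enabling that logging can look like. The log path is an assumption. Also note that RewriteLog and RewriteLogLevel exist only up to Apache 2.2; in 2.4 they were removed in favour of a per-module LogLevel, shown commented out here:

```apache
# Apache 2.2 and earlier (server or vhost config, not .htaccess)
RewriteLog      /var/log/apache2/rewrite.log
RewriteLogLevel 3

# Apache 2.4 and later: rewrite tracing goes to the error log instead
#LogLevel warn rewrite:trace3
```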


Note that most anything mod_rewrite can do can also be done by dynamic scripting, and sometimes actually using such a script is a lot more flexible, or at least the rewrite code more readable.


RewriteCond

Condition flags are:

  • [OR] chains this condition with the next condition in an OR-ing way (instead of the default AND)
  • [NC] makes text matching case insensitive
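
For example, to act on either of two hostnames regardless of case (the hostnames are placeholders, and the rule pattern assumes vhost context):

```apache
# Matches if the host is example.org OR example.net, in any capitalization
RewriteCond %{HTTP_HOST} ^example\.org$ [NC,OR]
RewriteCond %{HTTP_HOST} ^example\.net$ [NC]
RewriteRule ^/(.*)$ http://www.example.com/$1 [R=301,L]
```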


You test against any string, often from the environment using the form %{VARNAME}. The environment can be augmented by you, but includes at least the following:

  • Request details: REQUEST_URI, REMOTE_HOST, REMOTE_ADDR, REQUEST_METHOD (as in GET, POST, etc.), QUERY_STRING, and others
  • HTTP headers: HTTP_REFERER, HTTP_USER_AGENT, HTTP_COOKIE, HTTP_HOST, HTTP_ACCEPT and others
  • Server details and configuration: SERVER_NAME, SERVER_ADDR, SERVER_PORT, SERVER_ADMIN, SERVER_PROTOCOL, SERVER_SOFTWARE, SERVER_VERSION, DOCUMENT_ROOT, API_VERSION, and others
  • Various time variables (TIME_YEAR, TIME_MON, TIME_DAY, TIME_HOUR, TIME_MIN, TIME_SEC, TIME_WDAY)
  • (verify): THE_REQUEST, REQUEST_FILENAME, IS_SUBREQ
  • Others, in specific cases. For example, doing internal redirects adds a variable or two.
  • Anything you may have configured to be set, such as SetEnvIf rules.


RewriteRule

Tests against the current URL's path (so no, not the hostname or the query string - you need RewriteCond for those)

You usually restructure a URL with parts from the old one using regexp capturing, referencing groups with $1 and such. You can add things from the environment (%{}) in the rewritten URL if you wish.


RewriteRule flags include the following. Note that most people use the single-letter form, but spelling them out may be more readable.

Processing/logic:

  • nocase, NC: case insensitive URL match
  • noescape, NE: don't apply URL escaping to the rewrite part
  • next, N: restart rewriting (with the URL as it is currently rewritten). Take halting-problem-like care:)
  • chain, C: When you chain a number of rules, one being false skips the rest of the chain
  • skip=n, S=n: skip the following n rules. Can be used to imitate if-then-else
  • last, L: last entry: Stop processing the list.
  • forbid, F: forbid access: Send a 403 page
  • redirect, R: an external redirect (with an optional HTTP code; default is 302)

Change handling/content (these all take arguments; read up on their use):

  • type, T: set MIME type
  • env, E: set environment variable
  • qsappend, QSA: add data to URL query string
  • cookie, CO: add a cookie
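
As a small illustration of two of these (the script name, variable name, and URL layout are made up, and the patterns assume vhost context):

```apache
# QSA: /page/42?lang=en is served as /page.php?id=42&lang=en
# (without QSA, the original query string would be replaced, not merged)
RewriteRule ^/page/([0-9]+)$ /page.php?id=$1 [QSA,L]

# E: mark image requests with an environment variable, without rewriting anything
RewriteRule \.(gif|jpg)$ - [E=is_image:1]
```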

Unsorted:

  • passthrough, PT
  • proxy, P

Some examples

Simple redirect to page elsewhere

The simplest example is perhaps rewriting one page to another:

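RewriteRule ^alice\.html$ /bob.html [R]
 #has much the same effect as:
Redirect temp /alice.html /bob.html

Note the R flag: without it, the RewriteRule would be an internal rewrite rather than a redirect. The difference between the two lines is that not using mod_rewrite is lighter on your server.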

Note: If you're matching directories, you may want to do ^/dirname/?$ to be robust to requests omitting the final slash.

Site reorganization

When reorganizing sites, most people just break people's bookmarks.

You can avoid this by adding permanent redirects from all known old locations.

This and other options are sometimes served better - or just easier - by some dynamic, database-backed script or mod_rewrite logic emitting the right headers, because they can handle many cases at once.


Logging select requests

It may be interesting to log things like attempted attacks:

RewriteRule  .*cmd\.exe    /attacklog.php?what=nimda
RewriteRule  .*root\.exe   /attacklog.php?what=nimda
RewriteRule  .*Admin\.dll  /attacklog.php?what=nimda

Note that in some cases you may have a custom 404 handler script for a whole site, and it's easier to handle this sort of thing in its logic.

Server redirecting/proxying (internal)

In, say, a VirtualHost with very little else:

RewriteEngine on 
RewriteRule ^(.*) http://127.0.0.1:8080$1 [P]

That proxy flag actually works by handing over control to mod_proxy, so this is similar to reverse proxy configuration.


If apache complains that you made an "attempt to make remote request from mod_rewrite without proxy enabled", this may well be because the loading of mod_proxy is wrapped in an <IfDefine PROXY>, and you haven't configured apache to add -D PROXY to wherever such definitions go (/etc/conf.d/apache2 on my system), so apache never actually loads mod_proxy.


This can be handy to redirect to standalone webservers, FastCGI-style apps and/or hide port numbers from your URLs - hence the internal proxy.

Note that if the proxied application writes absolute URLs including the hostname, it may not make self-references well. You probably want to read up on using the ProxyPreserveHost directive, and/or on the X-Forwarded-Host and X-Forwarded-For headers.

Details often depend on how flexible/smart the backing app is.


Reverse proxy

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Acting as a web server for content elsewhere. The server will grab that content and present it as if it is generating it. (for contrast: a forward proxy is one client handling another client's request, often for transparent caching or anonymity - see also Cache_and_proxy_notes#Proxy)


See mod_proxy. Note that

  • in some cases you will be selectively relaying some requests to different servers
  • in others you may want a reverse proxying process in front of your real web servers, rather than as a part of one.

(You can also look at things like pound, nginx, and such. They are generally faster at this job)


For example, consider a host that wants to hand off requests to URLs under /img and /app:

#ProxyRequests is for forward proxying; not necessary for a *reverse* proxy
ProxyRequests Off
ProxyPass /img/ http://internal.example.com/img/
ProxyPass /app/ http://internal.example.com/app/

(note: you can hand options into ProxyPass that control connection limits and simple load balancing, though this proxying is probably not efficient enough for large sites)


Problems

ProxyPass results in a very simple pass-through of data, so web server self-references will easily be wrong. This includes:

  • External redirects (Location header)
    • including the slashless-to-slashful URL that web servers regularly use for directories
  • (absolute) paths in URLs in the content (unless the directory structure is the same on both hosts)
  • hostnames in URLs in the content (including protocol and host name, necessary e.g. in HTTP→HTTPS redirects)
  • paths in cookies, and hosts/domains in cookies, for the same reasons


Depending on the details of your proxypass rule, you may also create a trailing-slash problem.


Solutions

There are different solutions to different parts.

  • ProxyPassReverse is for redirects (and other header-related issues) - it makes the reverse proxy utility rewrite the Location, Content-Location, and URI headers. In the example above, the following rule works for both proxy-passes:
ProxyPassReverse / http://internal.example.com/


As to paths/hostnames in the HTML content:

If you are proxying apps that you can't easily make proxy-aware, you can get a works-mostly workaround by using an apache output filter that parses and rewrites the generated HTML: mod_proxy_html (a fast SAX parser, aware of HTML 4 and XHTML 1).

In general, however, you want to write your apps relative-directory-aware (which is usually quite simple and just a matter of a little consistency and a little rewriting) and reverse-proxy-host-aware, which is a little harder.


There is the ProxyPreserveHost directive, which sends the (proxying side's) Host: line for the proxy request instead of the one you specified in ProxyPass.

This makes the most sense when it is the backend server that does every evaluation necessary, and the proxier doesn't.
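
A minimal sketch of that setup, assuming a backend on 127.0.0.1:8080 that does its own host-based logic (address and port are placeholders):

```apache
# Pass the browser's Host: header through to the backend unchanged
ProxyPreserveHost On
ProxyPass        / http://127.0.0.1:8080/
ProxyPassReverse / http://127.0.0.1:8080/
```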


Alternatively, or additionally, you can make your proxier add some headers. The three generally relevant:

  • X-Forwarded-For: the actual client's IP address (so that a proxied app can tell real client apart from the more immediate client, the proxy) [1]
  • X-Forwarded-Host: The original request's Host:, usually so that the application can tell without the proxy having to touch the internal request's Host: header to that end.
  • X-Forwarded-Server: The proxy's hostname


Trailing slash

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

To demonstrate the problem:

Given an HTML page served at the browser-visible URL http://example.com/proxiedapp, a relative link in that page, say admin/foo.html, will resolve to http://example.com/admin/foo.html and not to the probably intended http://example.com/proxiedapp/admin/foo.html.

This is correct according to path semantics -- proxiedapp is not the (virtual or real) directory you think it is or want it treated as. If you want it treated that way, you need a final slash.

This is the reason web servers usually do such a redirect automatically.


...in proxied apps
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

This problem can also appear in reverse-proxied apps, in a strange half-working way: the proxy logic will recognize /proxiedapp and send a sensible internal path to the app(verify), so the index does get loaded correctly, but the application will appear broken if it makes any relative references. Somewhat ironically, relative references are exactly what it should use to work correctly behind a proxy in the first place.

Since the internal application cannot easily test for that slash (it does not know the browser URL it is visible under) and external redirects must include a full url, the proxied application itself cannot fix this using only proxied-app logic and HTTP. You could hack in a solution involving hardcoding, scripting and/or HTML redirects, but a cleaner solution is to set up the proxying correctly.


Apache's usually-automatic redirect from slashless to slashful is not applied here, since it doesn't know/assume anything about these paths.

A redirect at the proxying side is generally preferred over fixups on the proxied side.

Generally, the only problem case is the base URL for the app when it is missing the slash, so a common solution is to use mod_rewrite to fix exactly that one case:

RewriteEngine on
RewriteRule ^/proxiedapp$ /proxiedapp/ [R]

Hostname rewriting

You could catch people adding "www." when they shouldn't:

<VirtualHost 192.0.2.141>
  ServerName  www.thing.example.com
  Redirect permanent /  http://thing.example.com/
</VirtualHost>
 
<VirtualHost 192.0.2.141>
  ServerName  thing.example.com
  #The actual vhost: documentroot, etc.
</VirtualHost>

This shows you don't need mod_rewrite for this unless you want to do flashy things.


One such flashy thing (which you can still do with a script instead of mod_rewrite) is using DNS wildcard subdomains. Consider dict.die.net: a URL like http://foo.dict.die.net will cause an external redirect to http://dict.die.net/foo

This works at all because DNS allows wildcard subdomains (*), which means anything under a particular domain resolves as the same thing. The request ends up at the server, and since the browser reports the hostname as per HTTP/1.1, the server can look at it and rewrite the URL accordingly.

Say you want:

  • www.example.com to be the host you want
  • example.com to redirect to www.example.com
  • anythingelse.example.com to redirect to www.example.com/anythingelse

One solution would be:

# This logic should be placed in the default vhost, since it is the place that
# requests with hostnames we don't explicitly vhost end up in
 
#Add the www if someone forgot it - like the fake vhost example above
RewriteCond %{HTTP_HOST} ^example\.com$
RewriteRule (.*) http://www.example.com$1 [L,R=301]
 
#anything that isn't www.
RewriteCond %{HTTP_HOST} !^www\.example\.com$
#gets through to here, so we can capture it:
RewriteCond %{HTTP_HOST} ^(.*)\.example\.com$
RewriteRule ^[/]?(.*)$ http://www.example.com/%1/$1 [L,R=301]

This would redirect moo.example.com/bah to www.example.com/moo/bah. You could of course also rewrite it to something like www.example.com/complainabouthostname.php?host=%1&fullhost=%{HTTP_HOST}&path=$1 or some such thing, that depends purely on what further handling you have in mind.

Note the distinction between %1 and $1: the former refers to a capture in a RewriteCond, the latter to a capture in the RewriteRule.


Filter out Bad Things

It is fairly common to see files with loads of [NC,OR]ed RewriteConds in a row to deny:

  • by HTTP_USER_AGENT, particularly site harvesters and bad bots
  • by REMOTE_ADDR, often zealously blocking bad bots by IP
  • by HTTP_REFERER to turn referrer spam into rejections
  • by REQUEST_URI to identify attacks/virus scans


Nicer URLs

You can intentionally rewrite URLs, as a feature. For example, an out-of-the-box mediawiki has URLs that look like: http://example.com/wiki/index.php/Foo, but Wikipedia and such allow you to visit pages like http://example.com/wiki/Foo.

This is done by rewriting the second URL as if you had visited the first. This should be an internal rewrite (serving as if the client requested the rewritten URL) rather than an external one (telling the browser to go elsewhere).

For a simpler example (in the case of mediawiki there is some interaction with pages that should not have part of their URLs stripped), consider rewriting /news/2007/01/01/0001 to /news.php?id=200701010001

RewriteRule ^news/([0-9]+)/([0-9]+)/([0-9]+)/([0-9]+) /news.php?id=$1$2$3$4

or perhaps /widgets/playlist/blue/ to /version2/playlist.php?color=blue

RewriteRule ^widgets/playlist/([^/]+)/?$ /version2/playlist.php?color=$1


RewriteMap uses

RewriteMap allows you to read key-value pairs from files (plain text or simple file based database formats), or even call an application to do the rewrite logic, allowing you to use data external to apache.
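
A sketch of the plain-text variant, here used to redirect old URLs to new ones. The map name, the file path, and its contents are assumptions. RewriteMap itself has to be declared in server or vhost config.

```apache
# /etc/apache2/oldurls.txt contains key-value lines like:
#   about.html   /company/about
#   faq.html     /help/faq
RewriteMap oldurls txt:/etc/apache2/oldurls.txt

# Only redirect when the map has an entry for the requested path;
# NOTFOUND is the default value returned for unknown keys
RewriteCond ${oldurls:$1|NOTFOUND} !^NOTFOUND$
RewriteRule ^/(.*)$ ${oldurls:$1} [R=301,L]
```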

If performance is important, note that using scripting can help prevent IO load (not PHP unless you have a memcache).


Simple load balancing

You can use RewriteMap to specify a list of alternative values for a host to redirect to. (It won't scale as well as hardware balancing as all requests are probably still going to a main server you set this in, but if requests are heavy it's a nice start)

See http://httpd.apache.org/docs/2.0/mod/mod_rewrite.html#rewritemap

Note that using mod_backhand is often a better way of load balancing.




Hotlinking

Testing for hotlinking tests whether the contents of the Referer: header (a misspelling of referrer, standardized in HTTP, and used with both spellings in discussions) are something you allow or not - usually 'comes from my own domain.'

This header is set by browsers to the place that led to this request, usually:

  • when visiting a new page based on a clicked link (referrer is set to the page the link was on), and
  • for page-embedded content (referrer is set to the page the media is on)

If you add rules that require access to your media to have a referrer in the same domain and others you approve, other people embedding to and/or directly linking to the image won't see it.

It's possible to spoof and there is really no way for the server to see whether the UA is lying, but for spoofing to happen, the browser has to be forced. Only power users and random geeks will/can do this.


For example:

RewriteEngine on
 
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mydomain\.com/.*$ [NC]
RewriteRule \.(gif|jpg)$ - [F]

The above roughly means:

  • If Referer: is present (not an empty string), so that requests without a referrer, such as direct visits, are still allowed
  • AND (without mentioning a boolean, AND applies. If you want ORs, write [OR])
  • If Referer: doesn't have this domain ([NC] means case insensitive match)
  • If URL ends with ".gif" or ".jpg", [F]orbid it - that is, trigger apache's '403 Forbidden' page.

It's not uncommon to do an (external) redirect to a "This image is leeched" image URL.
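
A sketch of that variation. The image name is made up. Note the extra condition that excludes the replacement image itself, since the browser's request for it will typically carry the same foreign referrer and would otherwise be redirected again:

```apache
RewriteEngine on

RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mydomain\.com/ [NC]
# Don't redirect requests for the replacement image itself
RewriteCond %{REQUEST_URI}  !^/leeched\.gif$
RewriteRule \.(gif|jpg)$ http://www.mydomain.com/leeched.gif [R,L]
```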


'Image not found' image

Instead of letting browsers show broken images (or having them disappear), you could rewrite to an image that says 'missing image'

You can tell RewriteCond to check whether the URL is responsive (-U) or whether file serving would work (-F), for example:

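# request won't work? (-U tests the URL via a subrequest; -F is the file-based variant)
RewriteCond %{REQUEST_URI} !-U
# /missing.gif is a placeholder; point this at your own 'missing image' file
RewriteRule ^catalog_image_path/.*\.gif$ /missing.gif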

This is done with an internal subrequest, so if you care about performance you may not want this, or want it only under a fairly strict RewriteCond.

You may prefer to write a (local) custom 404 handler script to do the same, and possibly more.


Restrict user agents

It is simpler to use robots.txt, but some bots don't honor it (or take a long time to finally re-fetch it and put it into effect). If they still accurately report their user agent string, you can block them on that.

Example:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^SomeBadSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow    [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archive    [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver
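RewriteRule ^.*$ - [F]

(F sends a 403 Forbidden and implies L, so this is also the last rule applied to the request. The substitution is not used, hence the -.)

Basically, if any of these user agents match, the request is refused. If you would rather give them a particular page, rewrite to that page and drop the F flag, though a proper error is probably better.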


Another way of doing this is doing such tests without mod_rewrite by using environment variables (example lines from & thanks to this page):

SetEnvIf       User-Agent  CherryPicker               BAD_BOT
SetEnvIf       Request_URI ^/default\.ida             BAD_BOT=worm
SetEnvIfNoCase Referer     ^http://(www\.)?xopy\.com  BAD_BOT=spammer
#Then deny based on that variable
Order Allow,Deny
Allow from all
Deny from env=BAD_BOT
#Even doing further tests based on variables set earlier. 
# You can even filter things from your logs. First you set NOLOG for the cases you want to avoid
# (for example only when BAD_BOT is set to spammer)
SetEnvIf BAD_BOT spammer NOLOG
# Then you use a CustomLog entry (make sure the log path makes sense for your system!) 
#  with a condition like the following added:
CustomLog /var/httpd/logs/access.log combined env=!NOLOG


As the last lines above show, having such environment variables also allows you to ignore the corresponding log entries, or put them in a separate log file.

See also