Apache URL rewriting
| These are primarily notes. This is probably not going to be complete in any real sense, and exists to contain bits of useful information. |
Intro, mod_alias
When the goal is pointing a browser to some different contents, you have a few options. They include:
- Scripts that emit the redirect headers (effectively static or dynamic)
- Apache-configured static redirects with mod_alias' Redirect, for example:
Redirect permanent /app.html http://app.example.com/
Redirect permanent /foo.html /new/foo/foo.php
- The regex-rewriting version of the last, RedirectMatch, for example:
RedirectMatch (.*)\.gif$ http://static.example.com$1.jpg
- For things RedirectMatch doesn't manage, you'll want mod_rewrite, which is more configurable and flexible.
Note that many things can do internal redirects, external redirects, or both.
External redirects mean telling the browser to do a new request elsewhere (a given URL), while internal redirects cause the server to do an internal request on the new location, and serve as if you requested something else (under the original URL).
Apache modules that rely on request or response values may be affected by rewriting, as may script logic - consider the possible interactions when writing redirect logic.
The redirect type (see also HTTP_notes#HTTP_redirect) is one of
- permanent, referring to permanent moves (HTTP 301 status)
- temp, referring to temporary redirects (302)
- seeother, referring to a 303
There is a fourth option, gone (HTTP 410), but it is really an error page trigger and not a redirect. It takes no URL.
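As a rough sketch of the four keywords side by side (all paths and hostnames here are made up for illustration):

```apache
# 301: the page has moved for good
Redirect permanent /old-page.html http://www.example.com/new-page.html
# 302: the page is elsewhere for now
Redirect temp /sale.html http://www.example.com/summer-sale.html
# 303: go look at this other resource instead
Redirect seeother /form-done http://www.example.com/thanks.html
# 410: no target URL, the resource is simply gone
Redirect gone /retired.html
```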
Note that Redirect works through prefix matching and copying, so:
Redirect permanent / http://www.example.com/sub/index.html
...will probably not do what you want, because any path used in the redirected-from URL will be added to the path of the redirected-to URL. In this example, you would probably want to swallow that input, e.g. with:
RedirectMatch permanent /(.*) http://www.example.com/sub/index.html
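To make the prefix-copying behaviour more concrete, here is a small sketch. The paths and hostnames are assumptions for illustration only; the comments show what each request would be redirected to.

```apache
# Redirect: whatever follows the matched prefix is appended to the target
Redirect permanent /docs http://www.example.com/manual
#   /docs            -> http://www.example.com/manual
#   /docs/intro.html -> http://www.example.com/manual/intro.html

# RedirectMatch: only the groups you reference ($1, $2, ...) are copied
RedirectMatch permanent ^/blog/(.*)$ http://blog.example.com/$1
#   /blog/2007/post  -> http://blog.example.com/2007/post

# RedirectMatch without a group reference swallows the rest of the path
RedirectMatch permanent ^/shop(/.*)?$ http://www.example.com/closed.html
#   /shop/cart/view  -> http://www.example.com/closed.html
```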
mod_rewrite
When using mod_rewrite, you want to enable it only where you use it, for speed reasons. For this reason, RewriteEngine on appears in most examples. I'm going to assume you just read and therefore know this now.
To use mod_rewrite, you write a list of RewriteRules and RewriteConditions in each context you want rewriting.
Their syntax:
RewriteCond tested_string condition_pattern [CondFlags]
#and
RewriteRule url_pattern url_substitution [RuleFlags]
RewriteCond mostly tests a request's environment, while RewriteRule mostly does the actual rewrite logic on the request URL. Both act as filters in different ways, and both may terminate processing(verify).
Both Cond and Rule can match using regexps, and both can capture values. You don't see this used very often, but a RewriteRule can also refer to a capture in a RewriteCond. See the hostname rewriting example below.
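A minimal sketch of how the two fit together, assuming server or virtual host context (so the matched path starts with a slash) and a made-up /docs/ layout:

```apache
RewriteEngine on
# Conditions apply only to the RewriteRule that directly follows them
RewriteCond %{QUERY_STRING} ^lang=([a-z]+)$
# $1 is captured by the RewriteRule pattern, %1 by the RewriteCond above.
# The trailing "?" drops the original query string from the result.
RewriteRule ^/docs/(.*)$ /docs/%1/$1? [L]
#   /docs/intro.html?lang=nl -> /docs/nl/intro.html (internally)
```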
You can enable logging of the rewrite operations, which is handy to debug your rewrite rules (see RewriteLog and RewriteLogLevel in Apache 2.2 and earlier; Apache 2.4 replaced these with LogLevel and a rewrite:traceN level).
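Roughly, rewrite logging looks like the following. The log path is an assumption, and you would use only the variant that matches your Apache version, since each version rejects the other's directives.

```apache
# Apache 2.2 and earlier: a separate rewrite log, level 0 (off) to 9 (very noisy)
RewriteLog /var/log/apache2/rewrite.log
RewriteLogLevel 3

# Apache 2.4 and later: rewrite messages go to the error log instead,
# at trace1 (terse) to trace8 (very noisy)
LogLevel alert rewrite:trace3
```

Higher levels slow things down noticeably, so you probably want this off again once the rules work.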
Note that most anything mod_rewrite can do can also be done by a script - environment testing is usually simple. If you want logic, scripts are often easier to write than mod_rewrite logic.
RewriteCond
Condition flags are:
- [OR] chains this condition with the next using an OR; the default is AND
- [NC] makes text matching case insensitive
You test against any string, often from the environment using the form %{VARNAME}. The environment includes at least the following:
- Request details: REQUEST_URI, REMOTE_HOST, REMOTE_ADDR, REQUEST_METHOD (as in GET, POST, etc.), QUERY_STRING, and others
- HTTP headers: HTTP_REFERER, HTTP_USER_AGENT, HTTP_COOKIE, HTTP_HOST, HTTP_ACCEPT and others
- Server details and configuration: SERVER_NAME, SERVER_ADDR, SERVER_PORT, SERVER_ADMIN, SERVER_PROTOCOL, SERVER_SOFTWARE, SERVER_VERSION, DOCUMENT_ROOT, API_VERSION, and others
- Various time variables (TIME_YEAR, TIME_MON, TIME_DAY, TIME_HOUR, TIME_MIN, TIME_SEC, TIME_WDAY)
- (verify): THE_REQUEST, REQUEST_FILENAME, IS_SUBREQ
- Others, in specific cases. For example, doing internal redirects adds a variable or two.
- Anything you may have configured to be set, such as SetEnvIf rules
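As a small sketch of conditions and their flags in use (the hostnames are hypothetical):

```apache
RewriteEngine on
# Either hostname may match ([OR]); both tests ignore case ([NC])
RewriteCond %{HTTP_HOST} ^old\.example\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^legacy\.example\.com$ [NC]
# Without [OR], this third test is ANDed with the result of the two above
RewriteCond %{REQUEST_METHOD} ^GET$
# %{...} variables can also be used in the substitution
RewriteRule ^ http://www.example.com%{REQUEST_URI} [R=301,L]
```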
RewriteRule
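Tests against the current URL, that is, the path part. It excludes things like the hostname and the query string - you use RewriteCond for those. In per-directory context (.htaccess, <Directory>) the directory prefix, including the leading slash, is stripped before matching, which is why some examples here match ^alice.html$ and others ^/proxiedapp$.
You usually restructure a URL with parts from the old one using regexp capturing, referencing groups with $1 and such. You can add things from the environment (%{}) in the rewritten URL if you wish.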
RewriteRule flags include the following. Note that most people use the single-letter form, but spelling them out may be more readable.
Processing/logic:
- nocase, NC: case insensitive URL match
- noescape, NE: don't apply URL escaping to the rewrite part
- next, N: restart rewriting (with the URL as it is currently rewritten). Take halting-problem-like care:)
- chain, C: When you chain a number of rules, one being false skips the rest of the chain
- skip=n, S=n: skip the following n rules. Can be used to imitate if-then-else
- last, L: last entry: Stop processing the list.
- forbidden, F: forbid access: Send a 403 page
- redirect, R: an external redirect (with an optional HTTP code; default is 302)
Change handling/content (these all take arguments; read up on their use):
- type, T: set MIME type
- env, E: set environment variable
- qsappend, QSA: add data to URL query string
- cookie, CO: add a cookie
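Unsorted:
- passthrough, PT: hand the rewritten URL back to the URL mapping stage, so that things like Alias still apply to it
- proxy, P: hand the request to mod_proxy (see the proxying example below)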
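A few of these flags in use, as a sketch. It assumes server or virtual host context, and the paths and script names are made up.

```apache
RewriteEngine on

# [R=301,L]: external permanent redirect, then stop processing the rule list
RewriteRule ^/old/(.*)$ /new/$1 [R=301,L]

# [QSA]: keep the request's own query string and append it to the new one
#   /item/42?sort=asc -> /item.php?id=42&sort=asc
RewriteRule ^/item/([0-9]+)$ /item.php?id=$1 [QSA,L]

# [E=...]: set an environment variable; "-" means "don't change the URL"
RewriteRule \.pdf$ - [E=IS_PDF:1]

# [F]: answer with 403 Forbidden
RewriteRule \.bak$ - [F]
```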
Some examples
Simple redirect to page elsewhere
The simplest example is perhaps rewriting one page to another:
RewriteRule ^alice\.html$ /bob.html [R]
#has much the same effect as:
Redirect temp /alice.html /bob.html
The difference is that avoiding mod_rewrite is lighter on your server. (Without the [R] flag, the RewriteRule would be an internal rewrite rather than a redirect.)
Note: If you're matching directories, you may want to do ^/dirname/?$ since the slash may not appear.
Site reorganization
When reorganizing sites, most people just break people's bookmarks.
You can avoid this by adding permanent redirects from all known old locations.
This and other options are sometimes served better - or just easier - by some dynamic, database-backed script or mod_rewrite logic emitting the right headers, because they can handle many cases at once.
Logging select requests
It may be interesting to log things like attempted attacks:
RewriteRule .*cmd\.exe /attacklog.php?what=nimda
RewriteRule .*root\.exe /attacklog.php?what=nimda
RewriteRule .*Admin\.dll /attacklog.php?what=nimda
However, as these will also be 404's, it may be handier to do this with a custom 404 handler script.
Server redirecting/proxying (internal)
In, say, a VirtualHost with very little else:
RewriteEngine on
RewriteRule ^(.*) http://127.0.0.1:8080$1 [P]
That proxy flag actually works by handing over control to mod_proxy, so this is similar to reverse proxy configuration.
If apache complains that you made an "attempt to make remote request from mod_rewrite without proxy enabled", this may well be because the loading of mod_proxy is wrapped in an <IfDefine PROXY>, and you haven't configured apache to add -D PROXY to wherever such definitions go (/etc/conf.d/apache2 on my system) to cause apache to actually load mod_proxy.
This can be handy to redirect to standalone webservers, FastCGI-style apps and/or hide port numbers from your URLs - hence the internal proxy.
Note that if the proxied application writes absolute URLs including the hostname, it may not make self-references well. You probably want to read up on using the ProxyPreserveHost directive, and/or on the X-Forwarded-Host and X-Forwarded-For headers.
Details often depend on how flexible/smart the backing app is.
Reverse proxy
| This article/section is a stub — probably a pile of half-sorted notes and assertions some of which may well be wrong, and not verified as a whole. Feel free to add or refine. |
See mod_proxy. Note that in some cases you will be selectively relaying some requests to different servers, while in others you may want a reverse proxying process in front of your real web servers, rather than as a part of one. (You can also look at things like pound)
For example, consider a host that wants to hand off requests to URLs under /img and /app:
#not necessary for a *reverse* proxy
ProxyRequests Off
ProxyPass /img/ http://internal.example.com/img/
ProxyPass /app/ http://internal.example.com/app/
(note: you can hand options into ProxyPass that control connection limits and simple load balancing, though it is probably not enough for large sites)
Problems
ProxyPass results in a very simple pass-through of data, so web server self-references will easily be wrong. This includes:
- External redirects (Location header)
- including the slashless-to-slashful URL that web servers regularly use for directories
- (absolute) paths in URLs in the content (unless the directory structure is the same on both hosts)
- hostnames in URLs in the content (including protocol and host name, necessary e.g. in HTTP→HTTPS redirects)
- paths in cookies, and hosts/domains in cookies, for the same reasons
Depending on the details of your ProxyPass rule, you may also create a trailing-slash problem.
Solutions
There are different solutions to different parts.
- ProxyPassReverse is for redirects (and other header-related issues) - it makes the reverse proxy utility rewrite the Location, Content-Location, and URI headers. In the example above, the following rule works for both proxy-passes:
ProxyPassReverse / http://internal.example.com/
- ProxyPassReverseCookieDomain and ProxyPassReverseCookiePath are for correcting cookies in a similar way
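Put together, a sketch of a proxied app with header and cookie fixups might look like this. The public hostname www.example.com and the /backend/ path are assumptions, chosen so that the internal and public paths differ.

```apache
# Relay /app/ to a backend that serves the same app under /backend/
ProxyPass /app/ http://internal.example.com/backend/

# Rewrite Location and similar headers from the internal URL to the public one
ProxyPassReverse /app/ http://internal.example.com/backend/

# Rewrite Set-Cookie domain and path: internal value first, public value second
ProxyPassReverseCookieDomain internal.example.com www.example.com
ProxyPassReverseCookiePath /backend/ /app/
```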
As to paths/hostnames in the HTML content:
If you are proxying apps that you can't easily make proxy-aware, you can get a works-mostly workaround by using an apache output filter that parses and rewrites the generated HTML: mod_proxy_html (a fast SAX parser, aware of HTML 4 and XHTML 1).
In general, however, you want to write your apps relative-directory-aware (which is usually quite simple and just a matter of a little consistency and a little rewriting) and reverse-proxy-host-aware, which is a little harder.
There is the ProxyPreserveHost directive, which sends the (proxying side's) Host: line for the proxy request instead of the one you specified in ProxyPass.
This makes the most sense when it is the backend server that does every evaluation necessary, and the proxier doesn't.
Alternatively, or additionally, you can make your proxier add some headers. The three generally relevant:
- X-Forwarded-For: the actual client's IP address (so that a proxied app can tell real client apart from the more immediate client, the proxy) [1]
- X-Forwarded-Host: The original request's Host:, usually so that the application can tell without the proxy having to touch the internal request's Host: header to that end.
- X-Forwarded-Server: The proxy's hostname
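A minimal sketch of the ProxyPreserveHost variant, reusing the example backend from above:

```apache
# Send the browser's original Host: header to the backend,
# instead of internal.example.com from the ProxyPass line
ProxyPreserveHost On
ProxyPass /app/ http://internal.example.com/app/
ProxyPassReverse /app/ http://internal.example.com/app/
```

As far as the X-Forwarded-* headers go, mod_proxy_http already adds X-Forwarded-For, X-Forwarded-Host and X-Forwarded-Server itself when it acts as a reverse proxy, so the backend app usually only has to read them.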
Trailing slash
To demonstrate the problem:
Given an HTML page served at the browser-visible URL /proxiedapp (without a trailing slash), the browser resolves relative links in it against /, not against /proxiedapp/. This is correct according to path semantics -- proxiedapp is not the (virtual or real) directory you think it is or want it treated as. If you want it treated that way, you need a final slash.
This is the reason web servers usually do such a redirect automatically.
...in proxied apps
This problem can also appear in reverse-proxied apps. It seems to appear in a strange half-working way, in fact: The proxy logic will recognize /proxiedapp and send a sensible internal path to the app(verify), so that the index does get loaded correctly, but the application will appear broken if it makes any relative references -- which, somewhat ironically, it should do to work correctly behind a proxy in the first place.
Since the internal application cannot easily test for that slash (it does not know the browser URL it is visible under) and external redirects must include a full url, the proxied application itself cannot fix this using only proxied-app logic and HTTP. You could hack in a solution involving hardcoding, scripting and/or HTML redirects, but a cleaner solution is to set up the proxying correctly.
Apache's usual automatic redirect from slashless to slashful is not applied here, since it doesn't know/assume anything about these paths.
A redirect at the proxying side is generally preferred over fixups on the proxied side.
Generally, the only problem case is the base URL for the app when it is missing the slash, so a common solution is to use mod_rewrite to fix exactly that one case:
RewriteEngine on
RewriteRule ^/proxiedapp$ /proxiedapp/ [R]
Hostname rewriting
You could catch people adding "www." when they shouldn't:
<VirtualHost 192.0.2.141>
  ServerName www.thing.example.com
  Redirect permanent / http://thing.example.com/
</VirtualHost>
<VirtualHost 192.0.2.141>
  ServerName thing.example.com
  #The actual vhost: documentroot, etc.
</VirtualHost>
This shows you don't need mod_rewrite for this unless you want to do flashy things.
One such flashy thing (which you can still do with a script instead of mod_rewrite) is using DNS wildcard subdomains. Consider dict.die.net - a URL like http://foo.dict.die.net will cause an external redirect to http://dict.die.net/foo
This works at all because DNS allows wildcard subdomains (*), which means anything under a particular domain resolves as the same thing. The request ends up at the server, and since the browser reports the hostname as per HTTP/1.1, the server can look at it and rewrite the URL accordingly.
Say you want:
- www.example.com to be the host you want
- example.com to redirect to www.example.com
- anythingelse.example.com to redirect to www.example.com/anythingelse
One solution would be:
# This logic should be placed in the default vhost, since it is the place that
# requests with hostnames we don't explicitly vhost end up in

#Add the www if someone forgot it - like the fake vhost example above
RewriteCond %{HTTP_HOST} ^example\.com$
RewriteRule (.*) http://www.example.com$1 [L,R=301]

#anything that isn't www.
RewriteCond %{HTTP_HOST} !^www\.example\.com$
#gets through to here, so we can capture it:
RewriteCond %{HTTP_HOST} ^(.*)\.example\.com$
RewriteRule ^[/]?(.*)$ http://www.example.com/%1/$1 [L,R=301]
This would redirect moo.example.com/bah to www.example.com/moo/bah. You could of course also rewrite it to something like www.example.com/complainabouthostname.php?host=%1&fullhost=%{HTTP_HOST}&path=$1 or some such thing, that depends purely on what further handling you have in mind.
Note the distinction between %1 and $1: the former refers to a capture from a RewriteCond, the latter to one from the RewriteRule.
Filter out Bad Things
It is fairly common to see files with loads of [NC,OR]ed RewriteConds in a row to deny:
- by HTTP_USER_AGENT, particularly site harvesters and bad bots
- by REMOTE_ADDR, often zealously blocking bad bots by IP
- by HTTP_REFERER to turn referrer spam into rejections
- REQUEST_URI to identify attacks/virus scans
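A compact sketch of such a block. The user agent, address range and referrer patterns are invented for illustration; real lists tend to be much longer.

```apache
RewriteEngine on
# A harvester that announces itself
RewriteCond %{HTTP_USER_AGENT} ^BadHarvester [NC,OR]
# A whole address range (regexp match on the dotted address)
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\. [OR]
# Referrer spam
RewriteCond %{HTTP_REFERER} spam-site\.example [NC,OR]
# Old worm probes
RewriteCond %{REQUEST_URI} (cmd|root)\.exe [NC]
# No [OR] on the last condition; any one match leads to a 403
RewriteRule ^ - [F]
```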
Nicer URLs
You can intentionally rewrite URLs, as a feature. For example, an out-of-the-box MediaWiki has URLs that look like: http://example.com/wiki/index.php/Foo, but Wikipedia and such allow you to visit pages like http://example.com/wiki/Foo.
This is done by rewriting the second URL as if you'd visited the first. This should be an internal rewrite (serving as if the client requested the rewritten URL) rather than an external one (telling the browser to go elsewhere).
For a simpler example (in the case of MediaWiki there is some interaction with pages that should not have part of their URLs stripped), consider rewriting /news/2007/01/01/0001 to /news.php?id=200701010001
RewriteRule ^news/([0-9]+)/([0-9]+)/([0-9]+)/([0-9]+) /news.php?id=$1$2$3$4
or perhaps /widgets/playlist/blue/ to /version2/playlist.php?color=blue
RewriteRule ^widgets/playlist/([^/]+)/?$ /version2/playlist.php?color=$1
RewriteMap uses
RewriteMap allows you to read key-value pairs from files (plain text or simple file based database formats), or even call an application to do the rewrite logic, allowing you to use data external to apache.
If performance is important, note that using scripting can help prevent IO load (not PHP unless you have a memcache).
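A minimal sketch of a plain-text map, assuming a hypothetical map file and script name. Note that RewriteMap itself has to be declared in server or virtual host context, not in .htaccess.

```apache
# /etc/apache2/maps/products.txt (hypothetical path) holds "key value" lines:
#   teapot   1042
#   kettle   1043

RewriteEngine on
# Declare the map under the name "products"
RewriteMap products txt:/etc/apache2/maps/products.txt

# ${mapname:key|default} looks the key up, falling back to the default
#   /product/teapot -> /catalog.php?id=1042
#   /product/spoon  -> /catalog.php?id=notfound
RewriteRule ^/product/(.+)$ /catalog.php?id=${products:$1|notfound} [L]
```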
Simple load balancing
You can use RewriteMap to specify a list of alternative values for a host to redirect to. (It won't scale as well as hardware balancing as all requests are probably still going to a main server you set this in, but if requests are heavy it's a nice start)
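One way to sketch this is with the rnd: map type, which picks one of several |-separated values at random. The map path and backend hostnames below are assumptions.

```apache
# /etc/apache2/maps/servers.txt (hypothetical path) would contain one line:
#   backend www1.example.com|www2.example.com|www3.example.com

RewriteEngine on
RewriteMap servers rnd:/etc/apache2/maps/servers.txt

# Each request is proxied to a randomly chosen backend from the list
RewriteRule ^/(.*)$ http://${servers:backend}/$1 [P,L]
```

This is random selection rather than real balancing: it knows nothing about load, or whether a backend is up.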
See http://httpd.apache.org/docs/2.0/mod/mod_rewrite.html#rewritemap
Note that using mod_backhand is often a better way of load balancing.
Hotlinking
Testing for hotlinking tests whether the contents of the Referer: header (a misspelling of referrer, standardized in HTTP, and used with both spellings in discussions) are something you allow or not - usually '...comes from my domain.'
This header is set by browsers to the place that led to this request, usually:
- when visiting a new page based on a clicked link (referrer is set to the page the link was on), and
- for page-embedded content (referrer is set to the page the media is on)
If you add rules that require access to your media to have a referrer in the same domain (or others you approve), other people embedding and/or directly linking to the image won't see it.
It's possible to spoof and there is really no way for the server to see whether the UA is lying, but for spoofing to happen, the browser has to be forced. Only power users and random geeks will/can do this.
For example:
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mydomain\.com/.*$ [NC]
RewriteRule \.(gif|jpg)$ - [F]
The above roughly means:
- If Referer: is present (not an empty string), so that requests that send no referrer at all are still served
- AND (without mentioning a boolean, AND applies. If you want ORs, write [OR])
- If Referer: doesn't have this domain ([NC] means case insensitive match)
- If URL ends with ".gif" or ".jpg", [F]orbid it - that is, trigger apache's '403 Forbidden' page.
It's not uncommon to do an (external) redirect to a "This image is leeched" image URL.
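A sketch of that redirecting variant, assuming a hypothetical replacement image called leeched.png:

```apache
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mydomain\.com/ [NC]
# The replacement is a .png, so this rule cannot match it and loop
RewriteRule \.(gif|jpg)$ http://www.mydomain.com/leeched.png [R,L]
```

If the replacement image had an extension the rule matches, you would need an extra RewriteCond to exclude it, or the redirect would chase its own tail.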
'Image not found' image
Instead of letting browsers show broken images (or having them disappear), you could rewrite to an image that says 'missing image'
You can tell RewriteCond to check whether the URL is responsive (-U) or whether file serving would work (-F), for example:
# request won't work?
RewriteCond %{REQUEST_URI} !-U
# hypothetical target; point this at your own 'missing image' file
RewriteRule ^catalog_image_path/.*\.gif$ /missing_image.gif
This is done with an internal subrequest, so if you care about performance you may not want this, or want it only under a fairly strict RewriteCond.
You may prefer to write a (local) custom 404 handler script to do the same, and possibly more.
Restrict user agents
While it is simpler to use robots.txt, some robots don't honor it (or take a long time to re-fetch it and put it into effect). If they still accurately report their user agent string, you can block them on that.
Example:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^SomeBadSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archive [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver
RewriteRule ^.*$ /badrobot.html [F,L]
(L, incidentally, forces this to be the [L]ast applied rule to the request)
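Basically, if any of these user agents match, the request is refused. Note that with [F], Apache answers with its 403 Forbidden response and the substitution (/badrobot.html here) is not served; drop the [F] if you want to give them a particular page instead (a proper error may be better).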
Another way of doing this is doing such tests without mod_rewrite by using environment variables (example lines from & thanks to this page):
SetEnvIf User-Agent CherryPicker BAD_BOT
SetEnvIf Request_URI ^/default\.ida BAD_BOT=worm
SetEnvIfNoCase Referer ^http://(www\.)?xopy\.com BAD_BOT=spammer
#Then deny based on that variable
Order Allow,Deny
Allow from all
Deny from env=BAD_BOT
#Even doing further tests based on variables set earlier.
# You can even filter things from your logs. First you set NOLOG for the cases you want to avoid
# (for example only when BAD_BOT is set to spammer)
SetEnvIf BAD_BOT spammer NOLOG
# Then you use a CustomLog entry (make sure the log path makes sense for your system!)
# with a condition like the following added:
CustomLog /var/httpd/logs/access.log combined env=!NOLOG
Having such environment variables also allows you to ignore the corresponding log entries, or put them in a separate log file.
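Splitting flagged requests into their own log could look roughly like this. The log paths are assumptions; adjust them for your system.

```apache
SetEnvIf User-Agent CherryPicker BAD_BOT

# Normal traffic: everything that was not flagged
CustomLog /var/log/apache2/access.log combined env=!BAD_BOT

# Flagged requests go to their own file
CustomLog /var/log/apache2/badbot.log combined env=BAD_BOT
```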

