Apache config and .htaccess - redirects and proxies

From Helpful
Revision as of 12:08, 6 May 2024 by Helpful

Related to web development, lower level hosting, and such: (See also the webdev category)

📃 These are primarily notes, intended to be a collection of useful fragments, that will probably never be complete in any sense.


Redirect or proxy?

When contents you want to serve are in another place, you have roughly two options:

  • tell the client to go there instead, OR
  • fetch it for them, and answer as us


If it's on a different public server and the UA should be told that, you use an HTTP redirect (a 3xx status with a Location header), which can be emitted

  • from your own scripts
  • from apache config
    • static, e.g. with mod_alias's Redirect
    • patterns, e.g. with mod_alias's RedirectMatch, or mod_rewrite for some things RedirectMatch can't do


If it's on distinct private servers that the UA cannot reach directly, you would be reverse proxying, often done when

  • parts of the whole run on separate servers, ports and/or
  • parts of the whole run on a swarm and/or
  • you want to put load balancing in front and/or
  • you want to put HTTPS offloading in front and/or
  • you want to put caches in front and/or
  • you have other architectural reasons you largely don't want to expose


Apache has some further naming here, in particular internal versus external redirects

  • Internal redirects mean apache serves something other than what was asked for, on behalf of the UA and invisibly to it -- mostly just reverse proxying.
  • External redirects mean sending a permanent redirect or temporary redirect to the UA
  • URL rewriting refers to the fact that a few apache modules can do either of the above, depending on what you tell them to do.
    • also related to apache's phases allowing interaction between rewriting and other modules (sometimes quite useful, sometimes only confusing)


If you care about SEO pageranky reasons, there's more discussion.

mod_alias (headers)

Redirect and RedirectMatch take an optional status argument, one of

  • temp (302): the default if omitted
  • permanent (301): indicates that the resource has moved permanently
  • seeother (303 See Other)
  • gone (410 Gone): takes no URL argument, and only really gives an error page rather than a redirect

See also HTTP_notes#HTTP_redirect for more on redirect types.


Redirect matches-and-replaces a string prefix. For individual paths you could do:

Redirect permanent /app.html http://app.example.com/
Redirect permanent /foo.html /new/foo/foo.php


Note that the logic is roughly 'swallow the matched path, paste on the rest', which is likely to bite you if you didn't just move hosts but also restructured.

For example,

Redirect permanent / http://www.example.com/sub/index.html

...isn't what you want - sourcehost.example.com/test.html would lead to http://www.example.com/sub/index.htmltest.html (verify)

You'd probably move to use RedirectMatch to match patterns, in the last example swallowing the path like:

RedirectMatch permanent /(.*) http://www.example.com/sub/index.html


In general, if you want to selectively remap based on patterns other than (directory) prefixes, regexps may be friendly. Consider:

 RedirectMatch (.*)\.jpg$ http://static-img.example.com$1.jpg
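As a sketch of where the prefix behaviour is exactly what you want: when a whole directory moved but kept its internal structure, a single Redirect carries the rest of the path along. The hostnames and paths below are made up for illustration. Within one context the first matching directive wins, so the more specific pattern goes first.

```apache
# Only PDFs under /docs/ go to a separate file host (anchored pattern)
RedirectMatch permanent ^/docs/(.*)\.pdf$ http://files.example.com/pdf/$1.pdf

# Everything else under /docs/ keeps its relative path:
#   /docs/intro.html  ->  http://docs.example.com/manual/intro.html
Redirect permanent /docs/ http://docs.example.com/manual/
```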

mod_rewrite (headers or proxy)

You may sometimes need mod_rewrite over mod_alias when basic rules aren't flexible enough.

Also, avoid mixing them: if both apply to the same URL, the result can be hard to predict.

(Side note: When you use it, you may want to enable it (RewriteEngine on) only on vhosts/dirs where you actually use it, simply because of speed reasons: mod_rewrite hooks into a phase that applies to all requests, and even doing nothing takes some time, if not a lot.)


In mod_rewrite you write a list of RewriteRules and RewriteConditions in each context you want rewriting.

Their syntax:

RewriteCond tested_string condition_pattern [CondFlags]
# and
RewriteRule url_pattern url_substitution [RuleFlags]


Both act as filters, in different ways, and both may terminate rewrite processing(verify).

RewriteCond primarily tests a request's environment, while RewriteRule primarily applies the rewrite on the request URL.

Both RewriteCond and RewriteRule can match using regexps, and both can capture values.

A RewriteRule can also refer to a capture in a RewriteCond, but this is often not necessary in practice, so rarely used. See the hostname rewriting example below.


Notes:

  • You can enable logging of the rewrite operations (RewriteLog and RewriteLogLevel in apache 2.2; in 2.4 these were replaced by a per-module LogLevel such as rewrite:trace3), which is useful while debugging your rewrite rules, but disable it in production.
  • Note that most anything mod_rewrite can do can also be done by scripting you write.
That'd often be a little slower, but can be a lot more flexible, and/or readable, and perhaps more easily administered by e.g. having the redirects in a database
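For the rewrite logging mentioned in these notes, a minimal sketch of both spellings follows. Use only the one that matches your apache version (2.4 refuses to start on the 2.2 directives). The log path and the chosen levels are just examples.

```apache
# Apache 2.2 and earlier: a dedicated rewrite log
RewriteLog      /var/log/apache2/rewrite.log
RewriteLogLevel 3

# Apache 2.4: rewrite messages go to the error log,
# with the verbosity set per module
LogLevel warn rewrite:trace3
```

Set the level back down (or remove the lines) once the rules behave, since tracing every request is slow and noisy.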


RewriteCond

Condition flags are:

  • [OR] chains this condition with the next condition in an OR-ing way (instead of the default AND)
  • [NC] makes text matching case insensitive


You can get a basic string comparison by prepending = (or, negated, !=), otherwise you get a regexp match (where you probably want \. or [.] instead of a bare . to match a literal dot).
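To illustrate the difference between string comparison and regexp matching, here is a small sketch meant for a vhost context. The hostname is an assumption, and each RewriteCond only applies to the RewriteRule directly following it.

```apache
RewriteEngine on

# Exact string comparison (=): refuse anything sent as TRACE
RewriteCond %{REQUEST_METHOD} =TRACE
RewriteRule ^ - [F]

# Negated exact comparison (!=): anything not asked of www.example.com
# is sent there. No dot escaping needed, since this is not a regexp.
RewriteCond %{HTTP_HOST} !=www.example.com
RewriteRule ^/?(.*)$ http://www.example.com/$1 [R=301,L]
```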


You can test against various strings, often from the environment using the form %{VARNAME}. You can augment this, but the basis should include at least the following:

  • Request details: REQUEST_URI, REMOTE_HOST, REMOTE_ADDR, REQUEST_METHOD (as in GET, POST, etc.), QUERY_STRING, and others
    • HTTP headers: HTTP_REFERER, HTTP_USER_AGENT, HTTP_COOKIE, HTTP_HOST, HTTP_ACCEPT and others
  • Server details and configuration: SERVER_NAME, SERVER_ADDR, SERVER_PORT, SERVER_ADMIN, SERVER_PROTOCOL, SERVER_SOFTWARE, SERVER_VERSION, DOCUMENT_ROOT, API_VERSION, and others
  • Various time variables (TIME_YEAR, TIME_MON, TIME_DAY, TIME_HOUR, TIME_MIN, TIME_SEC, TIME_WDAY)
  • (verify): THE_REQUEST, REQUEST_FILENAME, IS_SUBREQ
  • Others, in specific cases. For example, doing internal redirects adds a variable or two.
  • Anything you may have configured to be set, such as SetEnvIf rules.


RewriteRule

Tests against the current URL's path -- to test against the hostname or the query string you need RewriteCond.


RewriteRule flags include the following. Note that most people use the single-letter form, but spelling them out may be more readable.

Processing/logic:

  • nocase, NC: case insensitive URL match
  • noescape, NE: don't apply URL escaping to the rewrite part
  • next, N: restart rewriting (with the URL as currently rewritten). Take halting-problem-like care :)
  • chain, C: When you chain a number of rules, one being false skips the rest of the chain
  • skip=n, S=n: skip the following n rules. Can be used to imitate if-then-else
  • last, L: last entry: Stop processing the list.
  • forbid, F: forbid access: Send a 403 page
  • redirect, R: an external redirect (with an optional HTTP code; default is 302)

Change handling/content (these all take arguments; read up on their use):

  • type, T: set MIME type
  • env, E: set environment variable
  • qsappend, QSA: add data to URL query string
  • cookie, CO: add a cookie

Unsorted:

  • passthrough, PT
  • proxy, P


When you are restructuring URLS, you might well pick out parts from the old one using regexp capturing, referencing groups with $1 and such. You can add things from the environment (%{}) in the rewritten URL if you wish.
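A sketch of a few of these flags and both kinds of reference working together, assuming a vhost context (so paths start with a slash). The script names are hypothetical.

```apache
RewriteEngine on

# /shop/item/42?ref=mail  ->  /shop/item.php?id=42&ref=mail
# QSA appends the original query string, L stops processing here.
RewriteRule ^/shop/item/([0-9]+)$ /shop/item.php?id=$1 [QSA,L]

# skip=1 as a crude if-then-else: if the path is under /static/,
# skip the one rule that follows; otherwise that rule applies.
RewriteRule ^/static/ - [S=1]
# $1 is the regexp capture, %{HTTP_HOST} comes from the environment
RewriteRule ^/(.*)$ /index.php?path=$1&host=%{HTTP_HOST} [QSA,L]
```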

mod_proxy (proxy)

mod_proxy can do forward proxying - you know, the thing you have to specifically configure a browser (and all other UAs) for, so that all requests go to that proxy server, which talks to the real world for you. These are useful for admins to add a cache, or filter, between them and the real world, or to allow only HTTP to the internet and nothing else. (there are also ways to make such a proxy invisible, but even so...)

...but we don't really do that much, so in most cases you'd do:

ProxyRequests Off


What we're primarily here for is HTTP reverse proxies, where a UA requesting something from a server leads to that server fetching it from elsewhere.


For example, consider a host that wants to hand off requests to URLs under /img and /app:

# Off disables forward proxying, which a *reverse* proxy does not need
ProxyRequests Off
ProxyPass /img/ http://internal.example.com/img/
ProxyPass /app/ http://internal.example.com/app/


Note that

  • in some cases you will be selectively relaying some requests to different servers
  • in others you may want a reverse proxy in front of your real web servers, rather than as a part of one. (You can also look at things like pound, nginx, and such. They are generally faster at this job)
  • if offloading to multiple servers, also look at mod_proxy_balancer (but again, also take a look at nginx and similar)
  • You can hand options into ProxyPass that control connection limits and simple load balancing, though sufficiently large sites probably want something more lightweight than apache for reverse proxying
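As a rough sketch of the balancer and ProxyPass options just mentioned: the backend addresses, the cluster name and the numbers are all assumptions. It needs mod_proxy, mod_proxy_http, mod_proxy_balancer, and in 2.4 also a balancing method module such as mod_lbmethod_byrequests.

```apache
<Proxy "balancer://appcluster">
    # loadfactor 2 versus 1: the first member gets about twice the requests
    BalancerMember "http://10.0.0.11:8080" loadfactor=2
    BalancerMember "http://10.0.0.12:8080" loadfactor=1
</Proxy>

ProxyRequests Off
ProxyPass        "/app/" "balancer://appcluster/app/"
ProxyPassReverse "/app/" "balancer://appcluster/app/"

# Connection options on a plain ProxyPass: at most 20 backend connections,
# a 30 second timeout, and a 60 second wait before retrying a failed backend
ProxyPass "/img/" "http://internal.example.com/img/" max=20 timeout=30 retry=60
```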


Problems

ProxyPass results in a very simple pass-through of data.

Things that can go wrong include any self-reference:

  • External redirects (Location header)
    • including the slashless-to-slashful redirect that web servers regularly use for directories
  • (absolute/parent) paths in URLs in the content (unless the directory structure is the same on both hosts)
  • hostnames in URLs in the content, e.g. the web page
    • including e.g. HTTP→HTTPS redirects
  • paths in cookies, and hosts/domains in cookies, for the same reasons


Depending on the details of your proxypass rule, you may also create a trailing-slash problem.


Solutions
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

There are different solutions to different parts.

For self-references in response headers, ProxyPassReverse adjusts headers like Location, Content-Location, and URI.
In the example above, the following rule works for both proxy-passes:
ProxyPassReverse / http://internal.example.com/
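When the public path differs from the backend path, cookies need the same treatment as headers. A sketch, assuming a made-up public path /shop/ in front of a backend /app/, and www.example.com as the public hostname:

```apache
ProxyPass        /shop/ http://internal.example.com/app/
# Fix Location and similar headers in responses from the backend
ProxyPassReverse /shop/ http://internal.example.com/app/

# Fix the domain and path attributes in Set-Cookie headers
# (arguments are: internal value, then public value)
ProxyPassReverseCookieDomain internal.example.com www.example.com
ProxyPassReverseCookiePath   /app/ /shop/
```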


As to paths/hostnames in the HTML content:


Apps expecting to be run behind proxies, which is getting somewhat more common in container world, may be aware of the issue and help you

you often configure them to say what they'd look like to the outside (domain, path / root URL)


If you are proxying apps that you can't easily make proxy-aware, you can get a works-mostly workaround by using an apache output filter that parses and rewrites the generated HTML: mod_proxy_html (a fast SAX parser, aware of HTML 4 and XHTML 1).
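A minimal sketch of such a filter setup, assuming apache 2.4 (where mod_proxy_html is bundled, together with mod_xml2enc) and mod_headers. The hostnames and paths are the example ones used earlier, not anything specific to your setup.

```apache
ProxyPass        /app/ http://internal.example.com/app/
ProxyPassReverse /app/ http://internal.example.com/app/

<Location "/app/">
    # Ask the backend for uncompressed content, so the filter can parse it
    RequestHeader unset Accept-Encoding
    ProxyHTMLEnable On
    # Rewrite absolute links pointing at the backend to the public path
    ProxyHTMLURLMap http://internal.example.com/app/ /app/
</Location>
```

Links built in JavaScript or hidden in unusual attributes may still slip through, which is why this is a works-mostly workaround.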

In general, however, you want to write your apps relative-directory-aware (which is usually quite simple and just a matter of a little consistency and a little rewriting) and reverse-proxy-host-aware, which is a little harder.


There is the ProxyPreserveHost directive, which passes the Host: line of the incoming request (the name the proxying side was addressed as) on to the backend, instead of the hostname you specified in ProxyPass.

This makes the most sense when it is the backend server that does every evaluation necessary, and the proxier doesn't.


Alternatively, or additionally, you can make your proxier add some headers. The three that are generally relevant:

  • X-Forwarded-For: the actual client's IP address (so that a proxied app can tell the real client apart from the more immediate client, the proxy)
  • X-Forwarded-Host: The original request's Host:, usually so that the application can tell without the proxy having to touch the internal request's Host: header to that end.
  • X-Forwarded-Server: The proxy's hostname
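With apache as the proxier, a sketch of how this might look follows. mod_proxy adds the three headers above by itself when reverse proxying (ProxyAddHeaders, on by default in 2.4), so the only explicit header here is X-Forwarded-Proto, a common convention that these notes do not otherwise cover. It is hardcoded on the assumption that this vhost only serves HTTPS, and needs mod_headers.

```apache
ProxyPass        /app/ http://internal.example.com/app/
ProxyPassReverse /app/ http://internal.example.com/app/

# Hand the client's original Host: header to the backend
ProxyPreserveHost On

# Tell the backend which scheme the client used (assumes an HTTPS-only vhost)
RequestHeader set X-Forwarded-Proto "https"
```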


Some examples

Simple redirect to page elsewhere

The simplest example is perhaps rewriting one page to another:

RewriteRule ^alice\.html$ bob.html

Note:

  • As written this is an internal rewrite: the client is served bob.html without being told. Add [R] and it has roughly the same effect as mod_alias's Redirect temp /alice.html /bob.html
  • If you're matching directories, you may want to do ^/dirname/?$ to be robust to requests omitting the final slash.
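To make the internal versus external distinction concrete, a sketch for a vhost context (hence the leading slashes). The page names other than alice and bob are made up.

```apache
RewriteEngine on

# Internal rewrite: the client asked for alice.html and is served bob.html,
# while its address bar still says alice.html
RewriteRule ^/alice\.html$ /bob.html [L]

# External redirect: the client is told to fetch bob.html itself (302)
RewriteRule ^/carol\.html$ /bob.html [R=302,L]

# The mod_alias spelling of an external redirect
Redirect temp /dave.html /bob.html
```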

Logging select requests

It may be interesting to log things like attempted attacks:

RewriteRule  .*cmd\.exe    /attacklog.php?what=nimda
RewriteRule  .*root\.exe   /attacklog.php?what=nimda
RewriteRule  .*Admin\.dll  /attacklog.php?what=nimda

Note that in some cases you may have a custom 404 handler script for a whole site, and it's easier to handle this sort of thing in its logic.

Server redirecting/proxying (internal)

In, say, a VirtualHost with very little else:

RewriteEngine on 
RewriteRule ^(.*) http://127.0.0.1:8080$1 [P]

That proxy flag actually works by handing over control to mod_proxy, so this is similar to reverse proxy configuration.


If apache complains that you made an "attempt to make remote request from mod_rewrite without proxy enabled", this may well be because the loading of mod_proxy is wrapped in an <IfDefine PROXY>, and you haven't configured apache to add -D PROXY to wherever such definitions go (/etc/conf.d/apache2 on my system) to cause apache to actually load mod_proxy.


This can be handy to redirect to standalone webservers, FastCGI-style apps and/or hide port numbers from your URLs - hence the internal proxy.

Note that if the proxied application writes absolute URLs including the hostname, it may not make self-references well. You probably want to read up on using the ProxyPreserveHost directive, and/or on the X-Forwarded-Host and X-Forwarded-For headers.

Details often depend on how flexible/smart the backing app is.


Trailing slash

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

To demonstrate the problem:

Given a HTML page served at the browser-visible URL http://example.com/proxiedapp, if that HTML page contains a relative link, say, admin/foo.html, that will resolve to http://example.com/admin/foo.html and not to the probably intended http://example.com/proxiedapp/admin/foo.html.

This is correct according to path semantics -- proxiedapp is not the (virtual or real) directory you think it is or want it treated as. If you want it treated that way, you need a final slash.

This is the reason web servers usually redirect slashless directory URLs to the slashful form automatically.


...in proxied apps
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

This problem can also appear in reverse-proxied apps. It seems to appear in a strange half-working way, in fact: The proxy logic will recognize /proxiedapp and send a sensible internal path to the app(verify), so the index page does load, but the application will appear broken as soon as it makes any relative references -- which, somewhat ironically, is what it should do to work correctly behind a proxy in the first place.

Since the internal application cannot easily test for that slash (it does not know the browser URL it is visible under) and external redirects must include a full url, the proxied application itself cannot fix this using only proxied-app logic and HTTP. You could hack in a solution involving hardcoding, scripting and/or HTML redirects, but a cleaner solution is to set up the proxying correctly.


Apache's usually automatic redirect from slashless to slashful is not applied here, since it doesn't know/assume anything about these paths.

A redirect at the proxying side is generally preferred over fixups on the proxied side.

Generally, the only problem case is the base URL for the app when it is missing the slash, so a common solution is to use mod_rewrite to fix exactly that one case:

RewriteEngine on
RewriteRule ^/proxiedapp$ /proxiedapp/ [R]
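Put together with the proxying itself, a sketch of the whole arrangement in a vhost might look like this. The backend address 127.0.0.1:8080 is a stand-in for wherever the app really runs.

```apache
RewriteEngine on
# Only the bare /proxiedapp gets an external redirect that adds the slash
RewriteRule ^/proxiedapp$ /proxiedapp/ [R=301,L]

# Everything under /proxiedapp/ is handed to the backend.
# Trailing slashes are kept consistent on both sides of the mapping.
ProxyPass        /proxiedapp/ http://127.0.0.1:8080/
ProxyPassReverse /proxiedapp/ http://127.0.0.1:8080/
```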

Hostname rewriting

You could catch people adding "www." when they shouldn't:

<VirtualHost 192.0.2.141>
  ServerName  www.thing.example.com
  Redirect permanent /  http://thing.example.com/
</VirtualHost>

<VirtualHost 192.0.2.141>
  ServerName  thing.example.com
  #The actual vhost: documentroot, etc.
</VirtualHost>


Because Redirect matches by prefix, that also carries along any path under / (the 'paste on the rest' logic from the mod_alias notes). If you would rather do the same in mod_rewrite, for example because both names share a vhost, you would do

RewriteCond %{HTTP_HOST} ^www\.thing\.example\.com$ [NC]
RewriteRule (.*) http://thing.example.com$1 [L,R=301]

(there are a few footnotes to this, with slashes and https and such)



A slightly flashier thing (which you can still do with a script instead of mod_rewrite) is using DNS wildcard subdomains.

For example, http://foo.dict.die.net would cause an external redirect to http://dict.die.net/foo

This works at all because DNS allows wildcard subdomains (*), which means anything under a particular domain resolves as the same thing. The request ends up at the same server, and in the same apache virtualhost, and since the browser reports the hostname as per HTTP1.1, the server logic can look at it and rewrite the URL accordingly.


Say you want:

  • www.example.com to be the host you want
  • example.com to redirect to www.example.com
  • anythingelse.example.com to redirect to www.example.com/anythingelse

One solution would be:

# This logic should be placed in the default vhost, since it is the place that
# requests with hostnames we don't explicitly vhost end up in

#Add the www if someone forgot it - like the fake vhost example above
RewriteCond %{HTTP_HOST} ^example\.com$
RewriteRule (.*) http://www.example.com$1 [L,R=301]

#anything that isn't www.
RewriteCond %{HTTP_HOST} !^www\.example\.com$
#gets through to here, so we can capture it:
RewriteCond %{HTTP_HOST} ^(.*)\.example\.com$
RewriteRule ^[/]?(.*)$ http://www.example.com/%1/$1 [L,R=301]

This would redirect moo.example.com/bah to www.example.com/moo/bah. You could of course also rewrite it to something like www.example.com/complainabouthostname.php?host=%1&fullhost=%{HTTP_HOST}&path=$1 or some such thing, that depends purely on what further handling you have in mind.

Note the distinction between %1 and $1: the former matches out of a RewriteCond, the latter out of the RewriteRule.

Filter out Bad Things

It is fairly common to see files with loads of [NC,OR]ed RewriteConds in a row to deny:

  • by HTTP_USER_AGENT, particularly site harvesters and bad bots
  • by REMOTE_ADDR, often zealously blocking bad bots by IP
  • by HTTP_REFERER to turn referrer spam into rejections
  • by REQUEST_URI to identify attacks/virus scans
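A shortened sketch of what such a file tends to look like. All bot names, the address range and the referrer domain are invented. Note that the last condition in the chain must not carry [OR].

```apache
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} (SomeHarvester|AnotherBadBot)  [NC,OR]
RewriteCond %{REMOTE_ADDR}     ^192\.0\.2\.                   [OR]
RewriteCond %{HTTP_REFERER}    spammy-referrer\.example       [NC,OR]
RewriteCond %{REQUEST_URI}     (cmd|root)\.exe                [NC]
# Any one of the above matching is enough to get a 403
RewriteRule ^ - [F]
```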


Nicer URLs

Some people internally rewrite URLs, as a feature.

For example, an out-of-the-box mediawiki has URLs that look like:

http://example.com/wiki/index.php/Foo

With a little fiddling you can configure it so that browsers can instead visit

http://example.com/wiki/Foo

This should be an internal rewrite (serving as if the client requested the rewritten URL) rather than an external one (telling the browser to go elsewhere).


While it's probably better to write apps to accept the paths you want in the first place - rather than counting on a little extra work on every request - in this case it can make sense because doing this with PHP is more bother.


For a simpler example (in the case of mediawiki there is some interaction with pages that should not have part of their URLs stripped), consider rewriting /news/2007/01/01/0001 to /news.php?id=200701010001

 RewriteRule ^news/([0-9]+)/([0-9]+)/([0-9]+)/([0-9]+) /news.php?id=$1$2$3$4

or perhaps /widgets/playlist/blue/ to /version2/playlist.php?color=blue

 RewriteRule ^widgets/playlist/([^/]+)/?$ /version2/playlist.php?color=$1

RewriteMap uses

RewriteMap allows you to read key-value pairs from files (plain text or simple file based database formats), or even call an application to do the rewrite logic, allowing you to use data external to apache.

If performance is important, note that using scripting can help prevent IO load (not PHP unless you have a memcache).
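A minimal sketch of a plain text map, closely following the example in the apache documentation. The file path and the prods.php script are assumptions. RewriteMap itself has to be declared in server or virtual host context, not in .htaccess.

```apache
# productmap.txt is a plain text file with one key-value pair per line:
#   television 993
#   stereo     198
RewriteMap productmap "txt:/etc/apache2/productmap.txt"

RewriteEngine on
# /product/television -> /prods.php?id=993
# Keys not in the map fall back to the value after the | (here NOTFOUND)
RewriteRule "^/product/(.*)" "/prods.php?id=${productmap:$1|NOTFOUND}" [PT]
```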


Simple load balancing

You can use RewriteMap to specify a list of alternative values for a host to redirect to. (It won't scale as well as hardware balancing as all requests are probably still going to a main server you set this in, but if requests are heavy it's a nice start)

See http://httpd.apache.org/docs/2.0/mod/mod_rewrite.html#rewritemap

Note that using mod_backhand is often a better way of load balancing.
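The RewriteMap variant of this uses the rnd: map type, which picks one of several |-separated values at random. A sketch, with made-up hostnames and file path, to be placed in server or vhost context:

```apache
# servers.txt holds one line, with | separating the alternatives:
#   static www1.example.com|www2.example.com|www3.example.com
RewriteMap servers "rnd:/etc/apache2/servers.txt"

RewriteEngine on
# Redirect each image request to one of the hosts, chosen at random
RewriteRule "^/(.*\.(?:png|gif|jpg))$" "http://${servers:static}/$1" [NC,R,L]
```

Swapping R for P would proxy the request instead of redirecting, at the cost of all traffic still flowing through this server.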




Hotlinking

Testing for hotlinking tests whether the contents of the Referer: header (a misspelling of referrer, standardized in HTTP, and used with both spellings in discussions) are something you allow or not - usually 'requests that come from my own domain.'

This header is set by browsers to the place that led to this request, usually:

  • when visiting a new page based on a clicked link (referrer is set to the page the link was on), and
  • for page-embedded content (referrer is set to the page the media is on)

If you add rules that require access to your media to have a referrer in the same domain and others you approve, other people embedding to and/or directly linking to the image won't see it.

It's possible to spoof and there is really no way for the server to see whether the UA is lying, but for spoofing to happen, the browser has to be forced. Only power users and random geeks will/can do this.


For example:

RewriteEngine on

RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mydomain\.com/.*$ [NC]
RewriteRule \.(gif|jpg)$ - [F]

The above roughly means:

  • If Referer: is present (not an empty string), so that requests without a referrer, such as direct visits and UAs that don't send one, are still served
  • AND (without mentioning a boolean, AND applies. If you want ORs, write [OR])
  • If Referer: doesn't start with this domain ([NC] means case insensitive match)
  • If URL ends with ".gif" or ".jpg", [F]orbid it - that is, trigger apache's '403 Forbidden' page.

It's not uncommon to do an (external) redirect to a "This image is leeched" image URL.
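A sketch of that redirect variant. The replacement image path /img/leeched.png is hypothetical, and it has to be excluded from the rule, or hotlinked requests would be redirected to it forever.

```apache
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?mydomain\.com/ [NC]
# Do not redirect requests for the replacement image itself
RewriteCond %{REQUEST_URI} !^/img/leeched\.png$
RewriteRule \.(gif|jpg|png)$ http://mydomain.com/img/leeched.png [R,L]
```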


'Image not found' image

Instead of letting browsers show broken images (or having them disappear), you could rewrite to an image that says 'missing image'


This is done with an additional, internal subrequest, so if you care about performance (or dynamically generated images(verify)) you may not want this, or want it only under a fairly strict RewriteCond.

You may prefer to write a (local) custom 404 handler script to do the same, and possibly more.


You can tell RewriteCond to check whether the URL is responsive (-U) or whether file serving would work (-F), for example:

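# URL would not resolve (checked via an internal subrequest)?
RewriteCond %{REQUEST_URI} !-U
# file serving would not work?
RewriteCond %{REQUEST_URI} !-F
# then serve a placeholder; /images/missing.gif is a made-up path
RewriteRule ^catalog_image_path/.*\.gif$ /images/missing.gif [L]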

Restrict user agents

While it is simpler to use robots.txt, some bots don't honor it (or take a long time to re-fetch it and put it into effect) but still accurately report their user agent string, in which case you can refuse them based on that string.

Example:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^SomeBadSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow    [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archive    [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver
RewriteRule ^.*$ - [F]

(F already implies L, so this is the last rule applied to the request)

Basically, if any of these user agents apply, they get a 403 Forbidden. With [F] any substitution is ignored, so if you would rather give them a particular page, drop the F and rewrite to something like /badrobot.html instead.


Another way of doing this is doing such tests without mod_rewrite by using environment variables (example lines from & thanks to this page):

SetEnvIf       User-Agent  CherryPicker               BAD_BOT
SetEnvIf       Request_URI ^/default\.ida             BAD_BOT=worm
SetEnvIfNoCase Referer     ^http://(www\.)?xopy\.com  BAD_BOT=spammer

#Then deny based on that variable
Order Allow,Deny
Allow from all
Deny from env=BAD_BOT

#Even doing further tests based on variables set earlier. 
# You can even filter things from your logs. First you set NOLOG for the cases you want to avoid
# (for example only when BAD_BOT is set to spammer)
SetEnvIf BAD_BOT spammer NOLOG
# Then you use a CustomLog entry (make sure the log path makes sense for your system!) 
#  with a condition like the following added:
CustomLog /var/httpd/logs/access.log combined env=!NOLOG
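The Order/Allow/Deny lines in that example are apache 2.2 syntax. On 2.4 (without mod_access_compat) the same deny-by-variable would be written with Require. A sketch, where wrapping it in a Location for the whole site is an assumption about where you want it to apply:

```apache
<Location "/">
    <RequireAll>
        # Everyone is allowed...
        Require all granted
        # ...except requests for which BAD_BOT was set
        Require not env BAD_BOT
    </RequireAll>
</Location>
```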


Having such environment variables also allows you to ignore the corresponding log entries, or put them in a separate log file.
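For example, the following sketch keeps flagged requests out of the main log and collects them elsewhere. The bad_bot.log file name is made up, and the paths follow the earlier example, so adjust them to your system.

```apache
# Normal log, minus the requests marked NOLOG
CustomLog /var/httpd/logs/access.log  combined env=!NOLOG
# Everything flagged as a bad bot, in its own file
CustomLog /var/httpd/logs/bad_bot.log combined env=BAD_BOT
```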

See also