HTTP notes



Related to web development, lower level hosting, and such: (See also the webdev category)


This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

HTTP 1.x

Minimal requests

In some cases, such as debugging via netcat, or perhaps doing requests from microcontrollers or other embedded devices, it's useful to do a quick and dirty HTTP request with a minimal implementation on your side (for HTTP 1.0 and a very simple response body, you can get away with just a few strtoks and such).

For HTTP 1.0, the most basic is just GET / HTTP/1.0 followed by an empty line. You regularly want to support servers that do named virtual hosts, in which case you want:

GET / HTTP/1.0
Host: www.example.com

As in, those are all the bytes on a bare TCP connection required to get a response. To wit, a command line example:

echo -e "GET /index.html HTTP/1.0\nHost: www.example.com\n" | netcat www.example.com 80

(echo's -e interprets the \n escapes inside that, which makes it possible to write this on a single line. Strictly speaking HTTP wants CRLF line endings, but most servers also accept bare LF.)


HTTP 1.1 requires that Host header.

And, technically, a bunch more. When doing minimal requests to HTTP/1.1 servers you probably still want to say you're 1.0, because saying you're 1.1 suggests you support various features you probably do not. You can often get away with it, but you can't count on it.
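When netcat isn't available, e.g. from embedded or scripting contexts, the same request is easy to do on a bare socket. A minimal sketch in Python (hostname is illustrative):

import socket

# Minimal HTTP/1.0 request over a bare TCP connection.
# HTTP wants CRLF line endings, and a blank line ends the headers.
host = "www.example.com"
request = "GET / HTTP/1.0\r\nHost: %s\r\n\r\n" % host

with socket.create_connection((host, 80)) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:          # HTTP/1.0: server closes the connection when done
            break
        response += chunk

headers, _, body = response.partition(b"\r\n\r\n")
print(headers.decode("latin-1"))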

Some result status notes

The major groups are roughly:

  • 1xx Informational
  • 2xx Success
  • 3xx Redirection
  • 4xx Client Error (bad request, unknown url, bad auth, etc.)
  • 5xx Server Error (scripting error, gateway timeout, server overloaded, etc.)

A few trigger automatic client handling. A decent number only have well defined use within specific protocols (e.g. WebDAV).


Some of the better known codes:

  • 200 OK - the most generic indication that a request and response went well
  • 404 Not Found - no resource under that URL
  • 500 Internal Server Error - the most generic error you get from dynamic page generation, roughly meaning "...something broke"
  • 401 Unauthorized - authentication is required (or the credentials you sent were refused); you may be able to get at this resource once you authenticate
  • 403 Forbidden - valid request, but server won't serve this resource to you (e.g. valid credentials but no authorization -- but there may be entirely different reasons)
  • 301 Moved Permanently - permanent redirect (see below)
  • 302 Found - temporary redirect (see below)
  • 400 Bad Request - request that doesn't make sense to the server (often malformed syntax).
Also sometimes used as an application error code - e.g. REST-like things complaining about nonsensical requests to the service being provided


Some others I've seen repeatedly:

  • 503 Service Unavailable - roughly means "something is temporarily wrong; trying again later should probably work."
what triggers this varies. Possibly deliberate load shedding; sometimes internal timeouts report as this (instead of as a 504).
  • 504 Gateway timeout - server doing proxying to backend servers didn't get a (timely) response from a backend server
if that process is doing lots of work (consider offloading not-immediately-necessary work to a background process)
if that process is waiting on something else (like database), figure out why
  • 502 Bad Gateway - usually means invalid response from a backend server
that is, most setups have workers, and something in front doing the delegating. This is the thing in front complaining about the workers


Also:

  • 408 Request Timeout
"The client did not produce a request within the time that the server was prepared to wait".
May be a thrashing client, a DoS attack, or network load.
If it happens rarely, it may be hard to find out why. If it happens regularly, your chances may be slightly better.
  • 410 Gone - "yes, something used to be here. It isn't now."
Sometimes means a temporary configuration problem. Probably more often means the web server knows something is permanently gone and informs the client about it.(verify)
  • 204 No Content - "I return no body, and that's intentional"
e.g. one way of detecting captive portals: fetch a URL known to return 204, and if anything else comes back (like a portal's login page), something intercepted the request.


HTTP redirect


What

The most commonly used redirect statuses are 301 and 302, probably largely because they were the only ones in HTTP/1.0(verify).

  • 301 Moved Permanently (MOVED_PERMANENTLY)
    • meaning "we have moved this, go there (and it'd be useful if you always did so in the future)"
    • cacheable by default (so not good for temporary workarounds -- re-visiting clients may not notice in a while)
    • used for moving domains, redirecting between domains and sites (e.g. from example.com to www.example.com), site reorganisation (...using it as URL aliasing so that you don't break all old URLs - though many people are too lazy for this), moving the pagerank from an old to a new location (not sure this is as true as some people make it sound, but Google itself suggests using 301 over 302), and others
  • 302 Found (MOVED_TEMPORARILY)
    • meaning "fetch this content over there for now, but in the future come back to the URL you just used."
    • cacheable only with explicit instructions (verify)
    • used for redirect services such as TinyURL (though 301 is also justifiable?), for temporarily redirecting to backup content while restoring the main content (though in practice backups are often out of date enough to make a "we're working on restoring the site" notice more practical), and others


And since HTTP/1.1(verify):

  • 307 Temporary Redirect (TEMPORARY_REDIRECT)
    • useful for pages that use POST, since it instructs the browser to POST the same thing to the new URL (while 301 and 302 seem to imply GET or be undefined)(verify)
    • cacheable only with explicit instructions (verify)
  • 303 See Other
    • like 307, but instructs the client to do a GET instead. Primarily useful for scripts that take POST requests and want to redirect to a plain URL / GET. (301/302 are relatively method-agnostic)(verify)
    • not cacheable


In all cases you also need to add a Location response header, the value of which is the new URL (it should be an absolute URL, although browsers may choose to work with relative ones as well).
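For example, a complete minimal temporary redirect response could look like (URL is illustrative):

HTTP/1.1 302 Found
Location: https://www.example.com/otherpage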


Notes:

  • older and simpler user agents may understand only 301 and 302, not 303 and/or 307.
Current browsers can be assumed to understand. Crawlers vary a bunch more.
  • It is reasonable for a 301 to be cached (by browsers, ISPs, and other caches) - after all, they are intended to be permanent. A 302 usually won't be (unless you instruct it(verify)). This is one real reason not to use 301 for something intended to be temporary - the originating server can't change this until the cache expires.
  • Whether 302s are cached seems to be mostly controlled by caching headers. (HTTP 1.1 mentions "This response is only cacheable if indicated by a Cache-Control or Expires header field.")

Practice

Can I have relative redirects?

Originally not: RFC 2616 required an absolute URI in Location.

But that was replaced by RFC 7231, which says "When it has the form of a relative reference [...], the final value is computed by resolving it against the effective request URI".

Some UAs were more permissive, but you couldn't count on it. You still can't count on all UAs being up to date with RFC 7231, but for browsers you basically can.



Mass redirects

When moving sites, the lazy fix is to redirect all paths under the old site to the new site's root, but you may prefer to have redirects for every page to its actual new location.

Whether that's based on a rule or a big list, you probably want to offload these individual redirects to the web server, using something like mod_rewrite in Apache, or the equivalent in nginx.



On rel="canonical"

rel=canonical is page content that tells crawlers what the preferred URL of the web page is, typically within the same domain.

This is only used by crawlers, and does not act as a redirect for UAs, so it's partly an SEO thing and partly a help towards slightly cleaner search results.


This can be useful if there are multiple correct URLs, like a blog having a shortened URL and a more expressive URL for posts, or e-commerce systems having multiple ways to view the same product.

<link rel="canonical" href="https://example.org/" />

(See also link rel)

Service Unavailable (503)

Useful to signal 'come back later', mostly for spiders so that they are less likely to decide you've dropped off the internet.
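If you know roughly when things will be back, you can also send a Retry-After header (seconds, or an HTTP date) to suggest when to come back, e.g.:

HTTP/1.1 503 Service Unavailable
Retry-After: 3600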


Some header notes

User-Agent

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

UA, in the context of the web, is short for User Agent.

While a more general concept, it often relates to the User-Agent HTTP header.

...and it frequently means 'browser', but the term 'User Agent' is often used when someone wants to point out that spiders/crawlers/scrapers, mechanical browsers, and whatnot are also consumers of the same content.


There's not too much of a standard to the value. Yes, there are things like RFC9110 (HTTP Semantics), and previously the HTTP 1.1 and HTTP 1.0 RFCs, but they all mostly just say:

User-Agent = product *( RWS ( product / comment ) )

...meaning

one or more product identifiers (if multiple, then in order from most to least identifying),
each followed by zero or more comments

They were once a little simpler and more identifying, for example:

CERN-LineMode/2.15 libwww/2.17b3

...but browser UA strings now look like a mash of mentions, and almost every one starts by claiming to be Mozilla/5.0.

For reasons: in part because there was a time webdevs did browser detection via this header. There were a bunch of browsers that genuinely used the same engine, but IE also impersonated Netscape to defeat said detection (ah, IE, we'll miss blaming almost everything non-standard on you). And instead of strongly discouraging such non-standard things, we just kept doing browser detection in increasingly poor ways, later moving on to doing it poorly within javascript instead, which is part of why UA header detection has been fairly irrelevant for probably two decades.

But "who knows what servers may still do weird shit, why risk it?", so "Mozilla/5.0" became the magic incantation that starts the UA string, and lo, it was stupid and empty.

Now rinse poorly, and repeat repeatedly.

Apple came in and created Safari based on KDE's KHTML, and changed enough to call the engine by a new name, WebKit. But it didn't want servers to omit things that KHTML/KDE users might get, so it mentions Mozilla/5.0 and AppleWebKit and KHTML, like Gecko and Safari/version (putting the most interesting bits last, in opposition to the RFC), something like:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36

Did I fool you? I lied! That's actually Chrome, once that started existing. Chrome, which used WebKit then and for a good while after, pretended to be KHTML, which pretended to be like Mozilla, which also makes it sort of like Safari, so it mentions all of those things.


And then there's Edge, which (except for a few versions where Edge still used a fork of Trident) is a thin skin over Chrome, minus the Google stuff, so they just stuck what they really are at the far end (again opposite to the RFC's point):

Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36 Edge/12.0


It seems Firefox is the least confusing one, at least keeping its dissociative disorder in the family:

Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0


The going convention seems to be

Mozilla/5.0 (<system-information>) <platform> (<platform-details>) <extensions>


Bots/crawlers/spiders are sometimes simpler, often roughly

BotName/2.1 (+http://botname.example.com)

Or possibly

Mozilla/5.0 (compatible; botname/1.0; +http://botname.example.com)  

...but then there's even more variation.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent

https://en.wikipedia.org/wiki/User_agent#Format_for_human-operated_web_browsers

https://webaim.org/blog/user-agent-string-history/

Location

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Quoth RFC 2616:

The Location response-header field is used to redirect the recipient
to a location other than the Request-URI
for completion of the request or identification of a new resource.

Mostly useful for (external/HTTP) redirects - see #HTTP_redirect. Also used in a few other places, such as 201 Created to refer to the URL that was created.


Value classically should be an absolute URL, though most browsers these days will also deal with relative URLs.



Content-Location

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

(Note: has little to do with Location, and you are often looking for that instead of this)


For multiple-entity resources, such as perhaps the same page in other languages, this header lets you signal those as alternative URIs.

Says RFC 2616:

The Content-Location entity-header field MAY be used to supply
the resource location for the entity enclosed in the message
when that entity is accessible from a location separate from the requested resource's URI.


Seems to be done mostly in response to an applicable Accept header in the request, and is more or less a means of content negotiation.


The value may be an absolute or relative URI. It is undefined for POST and PUT, so you should probably stick to GET.

Has implications for caching, which might e.g. use this association to flush all variants of stale content.

Rarely used in web browsing(verify); perhaps most applicable to (MIME) multipart content[1] (and some things that can use that, like SOAP), or perhaps for HTTP-based protocols where you can apply somewhat stronger meaning, which seems to describe how it is used in Atom (see RFC 5023).

Content-Disposition

A response header, not part of HTTP/1.1, but fairly widely supported anyway.


The value of the header is a disposition-type plus optional parameters, separated by semicolons.

There are two disposition-types:

  • attachment
useful to force a download for a MIME type the UA could otherwise handle itself (side note: your other main option for that is to set the MIME type to something the browser will always download, typically application/octet-stream)
  • inline
process as you normally would, which is the default, so specifying this is only useful when you specify parameters.


Parameters:

  • filename - specify the filename the UA ought to save as.
filename characters must be in ISO-8859-1 (Latin1)
Should be quoted if it contains spaces
  • filename*
filename characters can use RFC 5987 (see also below)
Not supported by all UAs, so if you use this, you should also add filename


Example from the RFC:

Content-Disposition: attachment; filename="EURO rates"; filename*=UTF-8''%e2%82%ac%20rates
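A sketch of building such a value in Python, percent-escaping the UTF-8 form for filename* (the filenames are made up):

from urllib.parse import quote

filename = "€ rates.txt"    # made-up name with a character that isn't in Latin1
fallback = "rates.txt"      # plain name for UAs that only read filename=

value = 'attachment; filename="%s"; filename*=UTF-8\'\'%s' % (fallback, quote(filename, safe=""))
print("Content-Disposition: " + value)
# Content-Disposition: attachment; filename="rates.txt"; filename*=UTF-8''%E2%82%AC%20rates.txt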

See also:

  • RFC 6266 (earlier mentions in RFC 2616, RFC 2183, RFC 1806)

RFC 5987 text coding

...mainly for strings in HTTP headers.


The minimum required set of encodings specified this way is ISO-8859-1 and UTF-8 (and producers must use one of these).

As you often use this to break out of being restricted to ISO-8859-1, in practice it is mostly used to get UTF-8.


The value is basically

  • charset (ISO-8859-1 or UTF-8, possible future additions)
  • '
  • optional language tag
  • '
  • percent-escaped bytestring (where the bytestring is coded according to the charset)

Examples:

iso-8859-1'en'%A3%20rates
UTF-8''%e2%82%ac%20rates
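Going the other way, a sketch of decoding such a value in Python (the helper name is made up):

from urllib.parse import unquote

def decode_rfc5987(value):
    # value looks like:  charset ' language ' percent-escaped-bytes
    charset, language, escaped = value.split("'", 2)
    return unquote(escaped, encoding=charset)

print(decode_rfc5987("UTF-8''%e2%82%ac%20rates"))    # € rates
print(decode_rfc5987("iso-8859-1'en'%A3%20rates"))   # £ rates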

WWW-Authenticate and Authorization

HTTP Basic Auth notes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

HTTP Basic auth is Base64-ing of username:password


When you do a regular request, a server can respond with 401 ('Authorization Required') to signal HTTP-based auth is needed, and signals using Basic auth by adding a response header like:

WWW-Authenticate: Basic realm="Friendly name of area you are authenticating"

...and the body of that response is usually a very minimal HTML document telling you that you are unauthorized -- which most browsers will only show if you fail or cancel the login: most browsers will pop up a login dialog (mentioning the realm name), and automatically repeat the HTTP request on the same URL with this authentication attempt in it.


The second (and successive) requests will include an Authorization header, like:

Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==

...which is the Base64 encoding of plain text, here of Aladdin:open sesame, a colon-separated username and password.
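For illustration, producing and reversing that in Python:

import base64

creds = "Aladdin:open sesame"
token = base64.b64encode(creds.encode("utf-8")).decode("ascii")
print("Authorization: Basic " + token)    # Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==

# ...and trivially reversible, which is the point made below:
print(base64.b64decode(token).decode("utf-8"))    # Aladdin:open sesame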


There is not really a means of logout, no HTTP signaling for it.


Upside:

  • means people have to enter something valid to get in

Downside:

  • sends passwords in what amounts to plain text (as base64 is reversible trivially, and by design)


This is not secure

When this is sent over anything unencrypted, anyone with the ability to sniff traffic can read it out.

Over plain HTTP, this was a bad thing to do. Now that HTTPS is (mostly) the default, it is much less of an issue -- but HTTPS is configured separately, and you must never accidentally disable it.

So over plain HTTP the only value is identifying users, not security; don't use Basic auth at all unless you have a very good reason to consider it fine.



HTTP Digest Auth notes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Originally described by RFC 2069, replaced by RFC 2617 which also describes a somewhat stronger variant (that can be backwards compatible with 2069).


Once you understand Basic auth, you can consider this a variant that

  • sends a hash instead
  • uses a client-generated nonce (lessens replay attacks, some pre-image attacks)

It's not ideal, but it's better than Basic auth.



It uses the same header names as Basic, but the values start with Digest and contain more values/options.

Exactly how the exchange should go depends on two variables of the exchange, but it'll be some variation on the theme of:

HA1 = MD5( username + ':' + realm + ':' + password )
HA2 = MD5( method   + ':' + digestURI )
response = MD5( HA1 + ':' + nonce + ':' + HA2 )
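A sketch of that computation in Python (the values are illustrative, borrowed from the RFC 2617 example; a real client gets realm and nonce from the server's WWW-Authenticate challenge):

import hashlib

def md5hex(s):
    return hashlib.md5(s.encode("utf-8")).hexdigest()

username, realm, password = "Mufasa", "testrealm@host.com", "Circle Of Life"
method, digest_uri = "GET", "/dir/index.html"
nonce = "dcd98b7102dd2f0e8b11d0f600bfb0c093"   # from the server's challenge

HA1 = md5hex(username + ":" + realm + ":" + password)
HA2 = md5hex(method + ":" + digest_uri)
response = md5hex(HA1 + ":" + nonce + ":" + HA2)
print(response)   # goes into the response= field of the Authorization: Digest header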

Note that:

  • the RFC 2069 style (without qop) isn't as strong
  • most servers and clients support RFC 2069 style and qop=auth, but fewer support qop=auth-int


Security

Basically, it's an MD5 hash. Or rather a few, which is an okay idea, and it's a solid few steps better than Basic auth.

Also, it allows a client nonce, which should help against chosen-plaintext attacks and things like rainbow table optimization, and some replay attacks (...in that a server can more easily do certain checks - it's not guaranteed that it will).

Still vulnerable to man-in-the-middle attacks.

Semi-sorted

Expect

OPTIONS

The idea behind OPTIONS is to check which HTTP methods are allowed for a specific URL (or the server in general, via *), communicated via headers (mostly Allow(verify)).
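As an illustration (the path and the allowed methods here are made up), an exchange might look like:

OPTIONS /resource HTTP/1.1
Host: www.example.com

HTTP/1.1 204 No Content
Allow: OPTIONS, GET, HEAD, POST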


It may have a body, but no use of it is described in RFC 2616. This seems meant for protocols on top of HTTP to do something potentially useful.


It seems OPTIONS is mostly seen in the context of things like WebDAV (where it's part of the protocol) and CORS (for its preflight requests), so you rarely really have to respond to these yourself.

And since OPTIONS doesn't allow caching, not seeing it generally used may be a good thing.


CONNECT

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

https://en.wikipedia.org/wiki/HTTP_tunnel

connection types

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
  • AJAX-style requests (XMLHttpRequest, fetch)
requests from a webpage to its originating server, e.g. to support single-page apps
  • WebSockets
a bidirectional connection initiated via HTTP
chunks of content are sent as separate messages. Beyond that, you decide your protocol.
useful for live content, push messages, etc.
also a bit of a firewall cheat (since it uses port 80/443, which is typically allowed)
  • Comet[2] refers to implementations that effectively allow server push
    • long-polling - basically means the client does a request, and the server only responds when it has something to say, which may be much later.
    • streaming - uses a persistent connection. Browser implementation variation means it is hard to make this robustly portable(verify).
  • HTTPS is HTTP over TLS or SSL
  • Server-sent events[3]
one-way server push: the server sends messages over a long-lived HTTP response
  • HTTP tunneling (CONNECT)
basically a TCP connection initiated via an HTTP proxy
  • BOSH, Bidirectional-streams Over Synchronous HTTP
means using two HTTP connections to simulate one bidirectional one (verify)
Only really used by XMPP?


Range requests

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Mainly to support resuming downloads, and otherwise conserve bandwidth.


Server may indicate support via:

Accept-Ranges: bytes

It may also choose to indicate lack of support via

Accept-Ranges: none

A server may also choose to ignore a Range request.


The above header is advice. A client may always ask anyway, and should figure out lack of support from the response (basically, whether it's a 206 with a Content-Range response header).


A client can do a request like

Range: bytes=0-499

The server (unless it needs to answer with 416 Range Not Satisfiable) would respond with a 206 Partial Content response with a response header like:

Content-Range: bytes 0-499/10000

which describes the returned range and the complete length, in the form range/completelength (here: bytes 0 through 499 of a 10000-byte resource).

Notes:

  • only applies to GETs; range must be ignored on everything else
  • byte ranges are zero-based offset, and inclusive
so 0-499/500 is the entire file; watch for off-by-one bugs
  • completelength may be * for unknown
  • the last position can be omitted, implies 'until the end'
e.g. 0- means everything
  • the first position can be omitted
e.g. -500 means last 500 bytes (but you still have to know the size)


  • You can ask for multiple sub-ranges in one request via comma separation
e.g. 0-0,-1 means the first and last bytes
this is more complex on both the client (bookkeeping) and server (multipart dealie, and it may coalesce adjacent ranges)
few clients will ever do this
clients that do not understand multipart responses shouldn't ask for them :)
not all servers implement this (technically a violation of the RFC)
response is a little more detailed


  • If-Range is the combination of this with ETag
basically an "entire file if it changed, subrange if it's still the same" deal(verify)
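As a sketch, doing a range request from Python's standard library (the URL is made up; a server that ignores Range will answer 200 with the whole file):

import urllib.request

req = urllib.request.Request("https://www.example.com/big.bin",
                             headers={"Range": "bytes=0-499"})
with urllib.request.urlopen(req) as resp:
    print(resp.status)                        # 206 if the range was honoured
    print(resp.headers.get("Content-Range"))  # e.g. bytes 0-499/10000
    data = resp.read()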


See also:

  • RFC 7233

See also

HTTP/2

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

HTTP/2 smooths over a few things that were workarounds in HTTP/1.x, and tends to somewhat lower latency in the process.

(it seems that SPDY was the experimental version, and everything useful ended up in HTTP/2(verify))


HTTP/2 does not change HTTP1's semantics, but is a completely different transport at byte level.

You can see it as an API where the byte level is now handled for you. (Which is just as well. In HTTP/1.0 you could still do it all yourself because it was minimal, while proper 1.1 compliance was already quite hard)

Because of the same semantics, dropping it in shouldn't break anything at application level, though the change can still be a bunch of work.


Interesting things it adds include:

  • Request/response multiplexing
basically a better version of pipelining, in that it does not have head-of-line blocking issues
...except that under packet loss it still does have head-of-line blocking, because of how TCP recovers in-order (see also QUIC, which avoids this by being UDP based)
  • server push
Basically means the server can pre-emptively send responses
...though only to prime a browser's cache
...and note that the server has to know precisely what to push -- doing that efficiently can actually be much more complex than you probably think
  • Request/response priorities
e.g. could send css first, js second, images last


And details like:

  • compresses HTTP headers
...though this helps (only) when they're not trivial
...and primarily applies to request headers, much less to response headers
arguably mostly useful for some CDNs, but not much else



Some notes:

  • Browsers seem to have chosen to only support the TLS variant(verify)
  • single connection, so can be more sensitive to packet loss (which is essentially head-of-line blocking at TCP level)
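As a sketch, you can check whether a server will negotiate HTTP/2 over TLS (via ALPN) with Python's standard library (hostname is illustrative):

import socket
import ssl

host = "www.example.com"
ctx = ssl.create_default_context()
ctx.set_alpn_protocols(["h2", "http/1.1"])

with socket.create_connection((host, 443)) as tcp:
    with ctx.wrap_socket(tcp, server_hostname=host) as tls:
        print(tls.selected_alpn_protocol())   # 'h2' if the server speaks HTTP/2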


https://www.smashingmagazine.com/2017/04/guide-http2-server-push/


HTTP/2 is now widely supported.


QUIC

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


QUIC is an always-encrypted transport, acting mostly like TCP+TLS, yet implemented on UDP.


It's a general-purpose transport, and has a design similar to HTTP/2(verify)


Upsides:

  • always encrypted and authenticated
  • faster initial encryption setup than with TLS (in part because encryption was designed into the protocol, not wrapped around it)
  • does connection multiplexing


Downsides:

  • because it's sort of imitating TCP over UDP, firewalling is harder
  • more complex to set up


There are two of 'em now, Google QUIC and IETF QUIC, which have diverged enough to be considered separate beasts, though hopefully the two will converge again.


https://en.wikipedia.org/wiki/QUIC

HTTP/3

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Basically HTTP over QUIC, which solves the issue of head-of-line blocking under packet loss (well, improves the recovery).


https://en.wikipedia.org/wiki/HTTP/3

At the time of writing it has only partial support.