Uniform Resource Somethings
Related to web development, lower level hosting, and such. (See also the webdev category.)
These are primarily notes.
They won't be complete in any sense.
They exist to contain fragments of useful information.
(Note these are sometimes misnamed Universal resource somethings.)
Uniform Resource Locators (URLs) specify where to locate something. This often implies or at least suggests its availability.
Uniform Resource Names (URNs) specify what something is: usually identifiers (or names) of content or concepts, typically with namespaces to signal what sort of thing you are identifying. For example, you might find someone using isbn:0898156122 to identify a book.
In the formal syntax (RFC 2141) a URN starts with urn: followed by the namespace, as in urn:isbn:0898156122. In informal use the urn: prefix is often left off.
There are a few registered namespaces. Having an official definition is useful for resolvers to do standardized things. There's nothing stopping you from inventing your own for a specific use.
Uniform Resource Identifiers (URIs) encompass the two: a URI may be a URN, a URL, or in some cases both (such as 'look up this identifier at a specific service').
On the internet, almost all URIs are specifically URLs.
There are a few more specific real-world contexts where both URLs and URNs are used.
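As a rough illustration of the distinction, Python's standard urllib.parse splits both kinds the same way; what differs is the scheme and whether there is a host to go looking at. The classify helper is a hypothetical heuristic of my own (the RFCs define no such test), and the example.org address is made up.

```python
from urllib.parse import urlsplit

def classify(uri: str) -> str:
    """Very rough guess at the kind of URI (a heuristic, not taken from any RFC)."""
    parts = urlsplit(uri)
    if parts.scheme.lower() == "urn":
        return "URN"
    if parts.scheme and parts.netloc:
        # a scheme plus a host says where to go looking
        return "URL"
    return "other URI"

for example in ("urn:isbn:0898156122",
                "https://example.org/books/0898156122"):
    parts = urlsplit(example)
    print(classify(example), parts.scheme, repr(parts.netloc), parts.path)
```

Things like mailto: addresses fall into the "other URI" bucket here, which is a limitation of the heuristic rather than a statement about what they are.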
- RFC 3986: Uniform Resource Identifiers (URI) (updates: RFC 1738, obsoletes: RFC 1808, RFC 2396, RFC 2732)
- RFC 2141: Uniform Resource Names (URN)
- RFC 3406: Uniform Resource Names (URN) Namespace Definition Mechanisms
- RFC 1738: Uniform Resource Locators (URL)
- RFC 3987: Internationalized Resource Identifier (IRI) generalizes URIs for Unicode, and defines a mapping from IRI to URI syntax.
- PURL usually refers to URLs that redirect to and are more persistent than the end content, rather like e.g. DOI.
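The IRI-to-URI mapping mentioned for RFC 3987 above can be approximated with the standard library: percent-encode the UTF-8 bytes of path, query and fragment, and turn the host into its ASCII (punycode) form. This is a simplified sketch, not a full RFC 3987 implementation. It assumes the authority part is a bare hostname (no port or userinfo), and Python's built-in idna codec follows the older IDNA 2003 rules. The function name and the example address are made up.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched: reserved characters plus '%' so that
# existing percent-escapes are not encoded a second time.
_PATH_SAFE = "/:@!$&'()*+,;=%"
_QUERY_SAFE = _PATH_SAFE + "?"

def iri_to_uri(iri: str) -> str:
    """Approximate IRI -> URI conversion (assumes a bare hostname)."""
    parts = urlsplit(iri)
    host = parts.netloc.encode("idna").decode("ascii")
    # quote() encodes str input as UTF-8 before percent-encoding
    path = quote(parts.path, safe=_PATH_SAFE)
    query = quote(parts.query, safe=_QUERY_SAFE)
    fragment = quote(parts.fragment, safe=_QUERY_SAFE)
    return urlunsplit((parts.scheme, host, path, query, fragment))

print(iri_to_uri("https://b\u00fccher.example/wiki/\u00dcn\u00efcode?q=\u00e9"))
```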
URL length is considered limited in HTTP requests, partially to avoid DoS attacks that tie up resources in URL parsing. (The cost is in the parsing, not really in receiving that data, and note that the DoS argument also applies to large requests of other kinds, e.g. large POSTed attachments.) Servers may therefore deem a request silly and send a "413 Request Entity Too Large" or "414 Request-URI Too Long" instead of doing something useful.
You can generally assume you can easily get away with ~2000 characters, which is high enough for most useful things.
If you're sending serious data in the URL, you should be able to explain why you are not POSTing the data instead.
Note that most uses for very long URLs are for data in the request line, which can go into a POST body instead. It often should anyway, as POST has better semantics when used to update server state: browsers will not send it without explicit interaction, and most spiders won't send it at all.
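A client that cannot predict how much data it will send could make that choice at request time. The sketch below is one way to do it, assuming the ~2000 character figure from above as the threshold; plan_request and the example.org address are made-up names, and whether the server accepts the same parameters via POST is something you would have to know.

```python
from urllib.parse import urlencode

MAX_URL_LENGTH = 2000  # the conservative figure from the notes; not a standard

def plan_request(base_url, params, max_len=MAX_URL_LENGTH):
    """Return (method, url, body): GET while the URL stays short, else POST."""
    encoded = urlencode(params)
    url = f"{base_url}?{encoded}" if encoded else base_url
    if len(url) <= max_len:
        return ("GET", url, None)
    # Too long for comfort: move the data into a form-encoded body.
    return ("POST", base_url, encoded.encode("ascii"))

print(plan_request("https://example.org/search", {"q": "uri"}))
print(plan_request("https://example.org/search", {"q": "x" * 3000})[0])
```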
Notes on standards and real-world implementations:
- RFC 2616 (HTTP 1.1) does not specify an upper limit
- It does mention that some very old (proxy) software might not support lengths above 255, though this is an estimate and a very cautious one at that.
- In earlier Internet Explorer versions (which?(verify)), URIs can be no longer than 2083 characters, of which at most 2048 can be the path
- Browsers tend to have no apparent limits, or more to the point, limits higher than most server limits
- servers/frameworks may impose their own limits, which are sometimes configurable, and regularly are not.
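When the server in front of an application has no such limit, or not the one you want, a framework-level limit could look like the WSGI middleware sketched below. The names (limit_uri_length, hello) and the default of 2000 are assumptions for illustration. Note also that WSGI hands over an already-decoded PATH_INFO, so the length measured here only approximates the raw request line.

```python
def limit_uri_length(app, max_len=2000):
    """WSGI middleware: answer 414 when the request target exceeds max_len."""
    def wrapped(environ, start_response):
        # Rebuild an approximation of the request target from CGI-style variables.
        target = environ.get("SCRIPT_NAME", "") + environ.get("PATH_INFO", "")
        query = environ.get("QUERY_STRING", "")
        if query:
            target += "?" + query
        if len(target) > max_len:
            body = b"URI too long\n"
            start_response("414 Request-URI Too Long",
                           [("Content-Type", "text/plain"),
                            ("Content-Length", str(len(body)))])
            return [body]
        return app(environ, start_response)
    return wrapped

def hello(environ, start_response):
    """Trivial application to wrap."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]

application = limit_uri_length(hello)
```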
On a related note, standards do not mention POST body length limits, though in practice body size is often limited by server configuration or implementation. For example, things that load the POST body into memory for speed may limit it to avoid DoSing themselves.
- Defaults may be on the order of a handful of MBs (see e.g. PHP and nginx defaults),
- maxima may be on the order of GBs (assume 2GB server-side, as in the signed 32-bit integer maximum, though it can be more),
- in theory you could use figures up to host RAM, or beyond if you make it stream to storage.
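The last two points (a cap, and streaming to storage rather than RAM) can be sketched for a WSGI-style environ as follows. The 8 MB default, the chunk size, and the names spool_body and BodyTooLarge are placeholders of my own. The sketch relies on CONTENT_LENGTH being present, so chunked uploads without a declared length are not handled.

```python
import io
import tempfile

class BodyTooLarge(Exception):
    """Body exceeds the configured cap (a caller would map this to HTTP 413)."""

def spool_body(environ, max_bytes=8 * 1024 * 1024, chunk_size=64 * 1024):
    """Copy the request body to a temp file, holding one chunk in RAM at a time."""
    declared = int(environ.get("CONTENT_LENGTH") or 0)
    if declared > max_bytes:
        raise BodyTooLarge(declared)  # reject before reading anything
    spool = tempfile.TemporaryFile()
    remaining = declared
    while remaining > 0:
        chunk = environ["wsgi.input"].read(min(chunk_size, remaining))
        if not chunk:
            break  # client sent less than it declared
        spool.write(chunk)
        remaining -= len(chunk)
    spool.seek(0)
    return spool

# Usage with a fake request body:
fake = {"CONTENT_LENGTH": "11", "wsgi.input": io.BytesIO(b"hello world")}
print(spool_body(fake).read())
```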
URI parsing, escaping, valid characters
- On PATH_INFO
With or without the CGI-style split/movement between SCRIPT_NAME and PATH_INFO, there are a few cases for PATH_INFO:
- empty string
- if at root, or if it maps to a directory or virtual directory, servers will usually send an external redirect to the same URL with an added slash
- browsers may already add it, figuring they'll save you a few hundred milliseconds.
- not an empty string, in which case it must start with a slash to be valid. It can be just a slash, or a longer path.
It depends on server configuration whether it then...
- maps to a directory - when it is a slash or ends with a slash, and the server config maps it to a filesystem/virtual directory. This often means an index is served.
- leads to a redirect to add a final slash so that the above case applies more cleanly (e.g. Apache does this, based on its configuration)
- hooks into a module / dynamic app, in which case control is handed over and further treatment is completely up to it. Usually happens after PATH_INFO and SCRIPT_NAME are altered to reflect app path mounting.
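The cases above, and the SCRIPT_NAME / PATH_INFO shift that happens when an app is mounted under a path, can be sketched with Python's wsgiref. describe_path_info is a hypothetical helper that only labels the cases; what a real server does with each still depends on its configuration. The /app mount point is made up.

```python
from wsgiref.util import shift_path_info

def describe_path_info(path_info):
    """Label which of the PATH_INFO cases a value falls into."""
    if path_info == "":
        return "empty: servers usually redirect to add a slash"
    if not path_info.startswith("/"):
        return "invalid: a non-empty PATH_INFO must start with a slash"
    if path_info.endswith("/"):
        return "directory-like: may map to a directory or an index"
    return "path: file, redirect or app, depending on configuration"

# Mounting: move the first path segment from PATH_INFO to SCRIPT_NAME.
environ = {"SCRIPT_NAME": "", "PATH_INFO": "/app/items/3"}
mounted_at = shift_path_info(environ)
print(mounted_at, environ["SCRIPT_NAME"], environ["PATH_INFO"])
# prints: app /app /items/3
```

After the shift, the app mounted at /app sees only /items/3 as its PATH_INFO, which is what lets the same app be mounted anywhere.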