Uniform Resource Somethings

These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.

(Note these are sometimes misnamed Universal resource somethings.)

Basic concepts

URL

Uniform Resource Locators (URLs) specify where to locate something. This often implies or at least suggests its availability.


URN

Uniform Resource Names (URNs) specify what something is: usually identifiers (or names) of content or concepts, typically with namespaces to signal what sort of thing you are identifying. For example, you might find someone using isbn:0898156122 to identify a book.

Formally (RFC 2141), URNs are written with a leading urn: scheme, as in urn:isbn:0898156122. In practice the prefix is often dropped when context makes it clear that a URN is meant.


There are a few registered namespaces, see [1]. Having an official definition is useful for resolvers to do standardized things. There's nothing stopping you from inventing your own for a specific use.

URI

Uniform Resource Identifiers (URIs) may be either a URN, a URL, or in some cases both (such as 'look up this identifier at a specific service').

On the internet, almost all URIs are specifically URLs.

There are a few specific real-world contexts where both URLs and URNs are used.
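As a rough illustration of the difference, Python's standard urllib.parse.urlsplit can take apart both forms. This is only a sketch: the example.com address is made up, and the URN is the ISBN from above written with the urn: prefix.

```python
from urllib.parse import urlsplit

# A URL: the scheme is followed by a network location and a path.
url = urlsplit("https://example.com/books/0898156122")  # hypothetical address
print(url.scheme, url.netloc, url.path)

# A URN: no network location, just a namespace and an identifier in the path.
urn = urlsplit("urn:isbn:0898156122")
namespace, _, identifier = urn.path.partition(":")
print(urn.scheme, namespace, identifier)
```

The URL tells you where to ask; the URN only tells you what is meant, and leaves finding it to some resolver.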

Also

IRI

This article/section is a stub: probably a pile of half-sorted notes, probably a first version, and not well-checked, so it may have incorrect bits. (Feel free to ignore, or tell me)

Internationalized Resource Identifiers (IRIs; RFC 3987, 2005) extend URIs with Unicode.


Mapping an IRI to a URI mostly amounts to percent-encoding the UTF-8 bytes of the non-ASCII characters. We had been doing that for a while anyway, but it's nice to have a standard and some restrictions.
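A minimal sketch of that percent-encoding step, using Python's urllib.parse.quote, which encodes text as UTF-8 by default. The function name iri_path_to_uri_path is made up for illustration, and this handles only a path: it ignores the rest of the RFC 3987 mapping (for example host names), and it would double-encode input that already contains %-escapes.

```python
from urllib.parse import quote, unquote

def iri_path_to_uri_path(iri_path):
    """Percent-encode the UTF-8 bytes of non-ASCII characters in a path."""
    # safe="/" keeps path separators as they are; unreserved ASCII is never encoded.
    return quote(iri_path, safe="/")

encoded = iri_path_to_uri_path("/wiki/caf\u00e9")  # "/wiki/café"
print(encoded)           # /wiki/caf%C3%A9
print(unquote(encoded))  # back to the original text
```

The é becomes two escapes because it takes two bytes in UTF-8 (C3 A9).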


https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier

XRI

IDN

See also


  • RFC 3986: Uniform Resource Identifiers (URI) (updates: RFC 1738, obsoletes: RFC 1808, RFC 2396, RFC 2732)


  • RFC 2141: Uniform Resource Names (URN)
  • RFC 3406: Uniform Resource Names (URN) Namespace Definition Mechanisms


  • RFC 1738: Uniform Resource Locators (URL)


And also:

  • RFC 3987: Internationalized Resource Identifier (IRI) generalizes URIs for Unicode, and defines a mapping from IRI to URI syntax.


  • PURL usually refers to URLs that redirect to and are more persistent than the end content, rather like e.g. DOI.


In practice

Maximum URL length

URL length is limited in practice in HTTP requests, partly to avoid DoS attacks that tie up resources in URL parsing (though not so much in receiving that data, and note that the DoS argument also applies to large requests in other forms, e.g. large POSTed attachments)

...so servers may deem a request silly and send a "413 Request Entity Too Large" or "414 Request-URI Too Long" instead of handling that request.


I've seen quoted figures like 4kB, 8kB (8190 for Apache, see LimitRequestLine), 16kB (documented for IIS), and 32kB. This has probably changed (verify).

You can generally assume you can get away with ~2000 characters, which is high enough for most useful things.

If you're sending serious data in the URL, you should be able to explain why you are not POSTing the data instead.


Note that most uses for very long URLs are for data in the request line, which can go into a POST body instead. And often should anyway, as this has better semantics when used to update server state, e.g. browsers will not repeat the request without explicit interaction, and most spiders won't do so at all.
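One way to act on this from the client side is sketched below: build a GET while the URL stays under a conservative length, and fall back to a form-encoded POST otherwise. The name build_request and the 2000-character threshold are assumptions for illustration (the threshold is the rule of thumb from above, not a standard), and it only makes sense if the server accepts both methods for that endpoint. Nothing is sent; it only constructs the request object.

```python
from urllib.parse import urlencode
from urllib.request import Request

SAFE_URL_LENGTH = 2000  # rule-of-thumb figure, not taken from any standard

def build_request(base_url, params):
    """Return a GET Request if the URL stays short, else a form-encoded POST."""
    query = urlencode(params)
    url = f"{base_url}?{query}"
    if len(url) <= SAFE_URL_LENGTH:
        return Request(url, method="GET")
    # Too long for comfort: move the parameters into the request body.
    return Request(
        base_url,
        data=query.encode("ascii"),  # urlencode output is plain ASCII
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

small = build_request("http://example.com/search", {"q": "x"})
large = build_request("http://example.com/search", {"q": "x" * 3000})
print(small.get_method(), large.get_method())
```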


Notes on standards and real-world implementations:

  • RFC 2616 (HTTP 1.1) does not specify an upper limit. It does mention that some very old (proxy) software might not support lengths above 255, though this is an estimate and a very cautious one at that.
  • In earlier Internet Explorer versions (which? (verify)), URIs can be no longer than 2083 characters, of which at most 2048 can be the path
  • Browsers tend to have no apparent limits, or more to the point, limits higher than most server limits[2]
  • Servers/frameworks may impose their own limits, which are sometimes configurable, and regularly not.
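To make the framework-level case concrete, here is a small sketch of a WSGI wrapper that answers 414 when the requested URI is longer than a chosen limit. The name limit_uri_length and the default of 2000 are assumptions. Real servers usually enforce their limit on the raw request line before an application is ever called, and WSGI hands over an already-decoded PATH_INFO, so the length measured here is only approximate.

```python
MAX_URI_LENGTH = 2000  # hypothetical limit, pick your own

def limit_uri_length(app, max_length=MAX_URI_LENGTH):
    """Wrap a WSGI app so that overly long URIs get a 414 response."""
    def wrapper(environ, start_response):
        # Reassemble an approximation of the requested URI from CGI-style variables.
        uri = environ.get("SCRIPT_NAME", "") + environ.get("PATH_INFO", "")
        query = environ.get("QUERY_STRING", "")
        if query:
            uri += "?" + query
        if len(uri) > max_length:
            body = b"URI too long\n"
            start_response(
                "414 Request-URI Too Long",
                [("Content-Type", "text/plain"),
                 ("Content-Length", str(len(body)))],
            )
            return [body]
        # Short enough: hand the request to the wrapped app unchanged.
        return app(environ, start_response)
    return wrapper
```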


On a related note, POST body length limits are not mentioned in standards, though in practice they are often limited by server configuration or implementation. For example, things that load the POST body into memory for speed may limit it to avoid self-DoSing.

Defaults may be on the order of a handful of MBs (see e.g. PHP and nginx defaults).
Maxima may be on the order of GBs (assume 2GB server-side, as in the maximum of a signed 32-bit integer, though it can be more).
In theory you could use figures up to host RAM, or beyond if you make it stream to storage.
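A sketch of the load-into-memory case with a cap, written against the WSGI environ. The names read_body and BodyTooLarge and the 8MB default are assumptions for illustration. It trusts the declared CONTENT_LENGTH and does not handle chunked uploads; an application would typically turn the exception into a 413 response.

```python
class BodyTooLarge(Exception):
    """Raised when a request declares a body larger than we are willing to read."""

def read_body(environ, max_bytes=8 * 1024 * 1024):
    """Read the request body into memory, refusing bodies above max_bytes."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        length = 0  # unparseable header: treat as no body
    if length > max_bytes:
        # Refuse before reading anything, so a large upload costs no memory.
        raise BodyTooLarge(f"{length} bytes declared, limit is {max_bytes}")
    return environ["wsgi.input"].read(length)
```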

URI parsing, escaping, valid characters

See Escaping_and_delimiting_notes#URI_parsing.2C_escaping.2C_and_some_related_concepts



On PATH_INFO

See also CGI,_FastCGI,_SCGI,_WSGI,_servlets_and_such#CGI_variables.2C_HTTP_variables.2C_and_similar


With or without the CGI-style split/movement between SCRIPT_NAME and PATH_INFO, there are a few cases for PATH_INFO:

  • empty string
    • if at root, or if it maps to a directory or virtual directory, servers will usually send an external redirect to the same URL with an added slash
    • browsers may already add it, figuring they'll save you a few hundred milliseconds.
  • not an empty string, in which case it must start with a slash to be valid. It can be just a slash, or a longer path.
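The cases above can be written down as a small sketch. Both function names are made up for illustration, and the redirect target is only the path part (plus any query string), not a full URL.

```python
def classify_path_info(path_info):
    """Label a PATH_INFO value according to the cases listed above."""
    if path_info == "":
        return "empty"    # a server would typically redirect to add a slash
    if not path_info.startswith("/"):
        return "invalid"  # non-empty values must start with a slash
    if path_info == "/":
        return "slash"
    return "path"

def slash_redirect_target(script_name, path_info, query_string=""):
    """Return the path to redirect to when PATH_INFO is empty, else None."""
    if path_info != "":
        return None
    target = script_name + "/"
    if query_string:
        target += "?" + query_string
    return target
```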


It depends on server configuration whether it then...

  • maps to a directory - when it is a slash or ends with a slash, and the server config maps it to a filesystem/virtual directory. This often means serving an index.
  • leads to a redirect to add a final slash so that the above case applies more cleanly (e.g. Apache does this, based on its configuration)
  • hooks into a module / dynamic app, in which case control is handed over and further treatment is completely up to it. This usually happens after PATH_INFO and SCRIPT_NAME are altered to reflect app path mounting.
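The SCRIPT_NAME/PATH_INFO alteration for a mounted app can be seen with Python's standard wsgiref.util.shift_path_info, which moves one path segment from PATH_INFO to SCRIPT_NAME. The request path and the idea of an app mounted at /app are assumptions for illustration.

```python
from wsgiref.util import shift_path_info

# Hypothetical request for /app/users/42, before any mounting is applied.
environ = {"SCRIPT_NAME": "", "PATH_INFO": "/app/users/42"}

# Move the first path segment from PATH_INFO to SCRIPT_NAME.
mounted_at = shift_path_info(environ)

print(mounted_at)              # app
print(environ["SCRIPT_NAME"])  # /app
print(environ["PATH_INFO"])    # /users/42
```

After the shift, the app sees only the part of the path below its mount point, and can rebuild full URLs by putting SCRIPT_NAME back in front.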



Omitting scheme