Escaping and delimiting notes

URI parsing, escaping, and some related concepts

(See also Uniform Resource Somethings)

Percent escaping (URL encoding)

Percent encoding, often also called URL encoding, refers to writing bytes in %hh style (a zero-padded two-digit hexadecimal value) for bytestrings where necessary.


Doing this lets you dump arbitrary data into URLs, into XML/HTML attributes, and into HTML POST bodies (more specifically, application/x-www-form-urlencoded data).


For example,

foo=bar quu

becomes

foo%3Dbar%20quu

...or (see notes below)

foo%3Dbar+quu 
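
As a quick check, the two forms above can be reproduced with Python 3's standard urllib.parse module. This is only a small sketch, and the variable names are made up for illustration:

```python
from urllib.parse import quote, quote_plus

raw = "foo=bar quu"

# safe="" means no character is exempt, so '=' is escaped as well
with_percent20 = quote(raw, safe="")
# quote_plus() is the form-style variant: space becomes '+'
with_plus = quote_plus(raw)

print(with_percent20)  # foo%3Dbar%20quu
print(with_plus)       # foo%3Dbar+quu
```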


In contexts that do percent encoding, it is typically

  • required for byte values 128 through 255
  • either required or a good idea for non-printable characters
mostly meaning the C0 control codes (0x00 through 0x1F) and 0x7F
which includes whitespace such as tab and newline; space (0x20) is printable, but in many cases you'd want it escaped rather than literal as well
  • often a good idea for most symbols (because between URLs and markup languages, most are significant as delimiters)
  • allowed for any character - but for many (at least alphanumerics) it just wastes space
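
To make those rules concrete, here is a minimal hand-rolled encoder in Python 3. It is a sketch rather than a replacement for library functions: the function name and the choice of which bytes to keep literal (letters, digits, and - . _ ~) are assumptions made for this example.

```python
import string

# Bytes left as-is: ASCII letters, digits, and - . _ ~
# (an assumption for this sketch; real libraries let you configure this set).
KEEP = frozenset((string.ascii_letters + string.digits + "-._~").encode("ascii"))

def percent_encode(data: bytes) -> str:
    """Percent-encode every byte that is not in KEEP, as zero-padded %hh."""
    return "".join(chr(b) if b in KEEP else "%{:02X}".format(b) for b in data)

print(percent_encode(b"foo=bar quu"))        # foo%3Dbar%20quu
print(percent_encode(b"tab\there"))          # tab%09here
print(percent_encode("é".encode("utf-8")))   # %C3%A9
```

Control codes, symbols, and bytes 128 through 255 all fall outside KEEP, so they all get escaped; alphanumerics pass through untouched.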



On exceptions.

While it's good to escape all symbols, there are specific exceptions.

For example, when constructing different parts of a URL from code, different details apply to different parts.

Probably the most common case is safely dumping arbitrary bytestrings, or a whole URL, into a URL variable in a way that will make it past URL syntax parsing unscathed.

It matters that different parts of the URL have different rules, and space is a special case.

See notes below.


Note that

  • browsers tend to be forgiving about various technically-incorrect things
  • percent encoding isn't very good normalization: the same URL can be written in many equivalent ways, so URL comparison is sort of hard, as is thinking of all the possible weird edge cases. There have been many URL (de)coding bugs and a few exploits because of this.


Reserved characters

Reserved characters refer to those that have special meaning (primarily) to URL parsing.

You would percent-escape these when you want to dump these as non-delimiting values into URLs.

People do occasionally forget this, though it isn't always an issue.


The set of reserved characters changed over time. From recent to older:

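  • ;/?:@&=+$,!*'()#[]
    according to RFC 3986 (2005), 'Uniform Resource Identifier (URI): Generic Syntax' (% is not formally part of this set, but as the escape character itself it always needs escaping when meant literally)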
  • ;/?:@&=+$,
    according to RFC 2396 (1998), 'Uniform Resource Identifiers (URI): Generic Syntax'
  • ;/?:@&=
    according to RFC 1738 (1994), 'Uniform Resource Locators (URL)' (same set mentioned in RFC 1808)


The best known special case is that space can be %20 or +.

  • RFC 1866 (HTML2) standardized the use of + as an alternative in application/x-www-form-urlencoded data
8.2.1 mentions "space characters are replaced by `+', and then reserved characters are escaped as per" ...
...it then also uses the + in example URLs elsewhere, without formalizing it, which has led to some confusion (verify)


TODO: dig through further standards, but it seems that:

  • parsing a query string (the part after ?) should turn both %20 and + into space (because W3C says so(verify))
and since everything should parse that way, it's safe to insert + there (at least until it gets formally standardized away)
  • URL parts before the query (usually primarily the path) should use only %20 (so that a + can be left unencoded?) (verify)
  • in practice
    • most non-browser code is likely to not treat URL parts differently, because most of us are too lazy and/or don't know about this
    • there are other border cases
    • your language may have different functions, to distinguish between + and %20, and/or to control what not to encode.
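
Python 3 is one example of a language with separate functions for the two conventions. The sketch below only illustrates the difference; which one is right depends on the URL part you are handling.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

encoded = "a+b%20c"
path_style = unquote(encoded)        # + stays a literal plus
query_style = unquote_plus(encoded)  # + is read as a space
print(path_style)   # a+b c
print(query_style)  # a b c

# The encoding direction has the same split
print(quote("a b+c"))       # a%20b%2Bc
print(quote_plus("a b+c"))  # a+b%2Bc
```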

Unicode

Since URLs can be considered to carry bytes, unicode text first needs to be encoded into bytes, which are then written in %hh form.

At least, that's by far the most common way to put bytes in URLs, following RFC2718, section 2.2.5 (also seems the default in the IRI (Internationalized Resource Identifiers) draft), which is also the path of least bother.


As to text coding, UTF8 is a good default because of its wide applicability -- but note that some older product-specific code has been known to use cp1252 and/or latin1 or such


(There is also a %uhhhh format mentioned in the IRI draft, but apparently not in a decisive way. It isn't standard, and seems unlikely to become one.)
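
A short Python 3 sketch of why the byte coding matters: the same text gives different escapes under UTF-8 and latin1, and the decoder has to assume the coding the encoder used. The sample string is arbitrary.

```python
from urllib.parse import quote, unquote

text = "café"

utf8_form = quote(text)                        # UTF-8 is the default
latin1_form = quote(text, encoding="latin-1")
print(utf8_form)    # caf%C3%A9
print(latin1_form)  # caf%E9

# Decoding with the matching coding round-trips...
print(unquote(latin1_form, encoding="latin-1"))  # café
# ...while decoding latin1 escapes as UTF-8 does not give the text back
print(unquote(latin1_form) == text)              # False
```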

URL parsing, GET query strings, POST data

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

We often parse URLs as structures like:

<scheme>://<net_loc>/<abs_path>?<query>#<fragment>
<scheme>://<net_loc>/<abs_path>;<object_params>?<query>#<fragment>

The parts (according to RFC 1808):

  • scheme ":"     scheme name - RFC 1738, section 2.1
  • "//" net_loc   network location and login information RFC 1738, section 3.1
  • "/" path       URL path - RFC 1738, section 3.1
  • ";" params     object parameters - RFC 1738, section 3.2.2, e.g. ";type=a" (used in FTP, occasionally in HTTP URLs)
  • "?" query      query information - RFC 1738, section 3.3, e.g. "?a=b&c=d"
  • "#" fragment   fragment identifier


(TODO: add the newer styles as well)


Most language libraries have a function that splits a URL into these RFC-defined parts, often without any other decoding (such as decoding percent escapes or splitting variables).

Such a function may take:

'http://example.com/goo;type=a;s=search?foo=bar&quu=%3d#help'

And respond with:

'http'
'example.com'
'/goo'
'type=a;s=search'
'foo=bar&quu=%3d'
'help'

Note that such splitters:

  • ...may also parse netloc's parts (username, password, hostname, and port)
  • ...may not be aware of object parameters, and leave them as part of the path.
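
In Python 3, for instance, urllib.parse.urlparse() is such a splitter, and urlsplit() is a variant that is not aware of object parameters. A sketch using the example URL from above (the login URL is made up):

```python
from urllib.parse import urlparse, urlsplit

url = "http://example.com/goo;type=a;s=search?foo=bar&quu=%3d#help"

parts = urlparse(url)
print(parts.scheme)    # http
print(parts.netloc)    # example.com
print(parts.path)      # /goo
print(parts.params)    # type=a;s=search
print(parts.query)     # foo=bar&quu=%3d   (percent escapes are left alone)
print(parts.fragment)  # help

# urlsplit() leaves the object parameters as part of the path
print(urlsplit(url).path)  # /goo;type=a;s=search

# netloc's parts are available as attributes
login = urlparse("http://user:secret@example.com:8080/")
print(login.username, login.password, login.hostname, login.port)
```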


There are usually different functions to parse specific parts of a URL, such as parsing out variable-value maps out of the query string, which happens according to application/x-www-form-urlencoded, which can be summarized as:

  • name/value pairs are split with &
  • name is split from value with =
  • names and values are escaped:
    • space is replaced with +
    • reserved characters (and non-ASCII bytes) are percent-encoded
  • form control names/values are listed in the order they appear in the document
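
As an illustration of these rules, Python 3's urllib.parse can parse and produce such query strings. The query below is a made-up example; note how a repeated name is kept, in order.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

query = "foo=bar+quu&quu=%3d&foo=second"

# Ordered list of (name, value) pairs, with + and %hh decoded
pairs = parse_qsl(query)
print(pairs)    # [('foo', 'bar quu'), ('quu', '='), ('foo', 'second')]

# Dict of name -> list of values
as_dict = parse_qs(query)
print(as_dict)  # {'foo': ['bar quu', 'second'], 'quu': ['=']}

# And back again
print(urlencode(pairs))  # foo=bar+quu&quu=%3D&foo=second
```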



HTML forms are often encoded either into the URL's query, or placed into the data area of a HTTP POST.

When using POST, basic HTML forms are often sent using the application/x-www-form-urlencoded encoding (the default in HTML standards), which mirrors how a GET request would put form elements in the URL (see URL/percent encoding above), but avoids possible problems related to URL length limits.

This is simple to implement, but for binary file uploads this can be rather inefficient (with non-ASCII bytes represented using three ASCII bytes), so there are alternative ways of encoding a POST's data for file uploads, which follow the format of multipart MIME messages:

  • multipart/form-data (see RFC 2388)
  • multipart/mixed (for more complex MIME streams, such as multiple files in a single form entry -- which at least browsers generally avoid doing(verify))


Note that while var/vals in POST and var/vals in GET are separate things, most web-aware scripting makes the basic interface to them transparent, so that you can pick from either one (and which also lets the handler support both GET and POST).

You can usually get at the separated values and/or the original data.


URN parsing

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

A URN looks like:

"urn:" <namespace> ":" <namespace_specific_string>

For example:

urn:isbn:0451450523
urn:ietf:rfc:2648
urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66

urn: is regularly omitted.

You can only parse from left to right, and only with knowledge of how each namespace works, since the parsing of the value is entirely namespace-specific.


Since most URN-using applications deal with very few namespaces, you usually need to do no more than split on the first : to separate the top-level namespace from the value at that level.

Practical URN parsing systems will often either hardcode what they know, or allow registration/hooks for handlers per namespace.
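
A minimal sketch of the "registered handler per namespace" approach, in Python 3. All names here (HANDLERS, parse_urn, parse_isbn) are hypothetical, and the ISBN handler does nothing more than strip hyphens.

```python
def parse_isbn(value):
    # Hypothetical handler: a real one would validate the number
    return {"isbn": value.replace("-", "")}

# Registry of namespace -> handler; extend with whatever you need to know about
HANDLERS = {"isbn": parse_isbn}

def parse_urn(urn):
    """Split a URN into (namespace, parsed value), left to right."""
    # The leading "urn:" is regularly omitted, so treat it as optional
    if urn[:4].lower() == "urn:":
        urn = urn[4:]
    namespace, sep, value = urn.partition(":")
    if not sep:
        raise ValueError("no namespace separator in %r" % urn)
    namespace = namespace.lower()  # namespace identifiers are case-insensitive
    handler = HANDLERS.get(namespace)
    if handler is None:
        return namespace, value    # unknown namespace: hand back the raw split
    return namespace, handler(value)

print(parse_urn("urn:isbn:0451450523"))  # ('isbn', {'isbn': '0451450523'})
print(parse_urn("ietf:rfc:2648"))        # ('ietf', 'rfc:2648')
```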


There is an official URN namespace registry, at http://www.iana.org/assignments/urn-namespaces/

Your self-defined private-use namespaces should be prepended with X- (similar to unregistered MIME types), to signal them as experimental. For more global use, you should register your namespace.

Notes:

  • The urn:ietf:rfc:2648 example isn't really namespace nesting as far as the URN standard is concerned, but such nesting conventions may be very practical (and in this specific case, is actually formalized, in RFC 2648)
  • The "urn:" part is sometimes left off of URNs.

Entities within (X)HTML

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

If you produce or consume HTML or XML, you want to know how to write values in a way that avoids messing up (strict) parsing.


  • Text nodes
    • < as &lt; and > as &gt;, to avoid interpretation as tags (which would cause things to magically disappear, become invalid, and in some cases allows injection of nasty things).
    • & as &amp; (because entities will be parsed in both text nodes and in attributes)
  • Attribute text
    • & as &amp; (as HTML parses entities in both text nodes and in attributes)
    • should have quotes replaced, at least " with &#x22; or &#34;, or &quot;, to avoid nasty parse/injection cases.
    • Replacing ' with &#x27; or &#39; is generally also a good idea as some standards (e.g. XML) allow attribute delimiting using single quotes. Note that using the &apos; named entity is invalid for some HTML variants, but valid for XML/XHTML. While browsers will be robust to both, strict parsers may trip over &apos;
you could either only do it when you know the attribute uses ' (most people seem to use "), or always.


Further notes:

  • you can always escape each character; a proper parser has to decode it anyway. The main difference between &#97; for a and &#39; for ' is that the first is pretty much never necessary and the latter can be in various contexts.
also it's a lot more bytes, so people try to avoid the truly unnecessary escapes
  • URLs have no special status in most places and must follow text/attribute encoding. In the case of attributes (href, src, and such), this means dumping them in is quickly invalid, most commonly because of &
  • Using a function that is often called something like "HTML escape" or "CGI escape" solves most entity-related problems, replacing at least &<> with entities. It may or may not also do " and/or ', so check before relying on it.
  • &gt; is not strictly necessary, in node text nor attributes, as parsers usually only look for < as a tag-significant character, but it's consistent and readable to do both.
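
As one example of such an "HTML escape" function, Python 3 has html.escape(). The sketch below shows what it does and does not touch, and the URL-in-attribute case mentioned above; the URL and link text are made up.

```python
from html import escape

# Text node: only & < > need handling
node = escape("This & That <3", quote=False)
print(node)  # This &amp; That &lt;3

# With quote=True (the default) both quote characters are escaped too
quotes = escape("\"'", quote=True)
print(quotes)  # &quot;&#x27;

# A URL dumped into an attribute needs its & escaped like any other text
url = "http://example.com/?a=1&b=2"
link = '<a href="%s">%s</a>' % (escape(url, quote=True), escape("a & b", quote=False))
print(link)  # <a href="http://example.com/?a=1&amp;b=2">a &amp; b</a>
```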


On control codes and arbitrary binary data in XML

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Short story: don't expect this to just work.

Short suggestion: When in doubt about data serialization, use some escaping (one that escapes to all-safe characters, which most of them do).


XML has details pertaining to representing control codes (note: not related to containing them on a binary level; UTF-16 for western text contains a lot of 0x00 bytes)

These carry through to XHTML, and note that XHTML has some additional restrictions.


Of the C0 control codes (U+00 through U+1F):

  • HTML and XML allow common whitespace characters CR (U+0D), NL (U+0A), and Tab (U+09), both literally and as escaped character references(verify)
  • XML 1.0 does not allow any other C0 control codes.
  • XML 1.1 allows use of U+01 through U+1F (note: not U+00). These new characters are required to appear as escaped character references(verify)


C1 control codes (U+80 through U+9F) and DEL (U+7F):

  • XHTML 1.0 does not allow them
  • XML 1.0 allows them literally and as escaped character references
  • XML 1.1 requires them to appear as escaped character references (the document specifically excludes them, even if the char definition includes them as in XML 1.0), except:
  • XML 1.1 allows only U+85 (NEL) to appear in both forms. It seems it will be translated to a line feed on parsing(verify).


This may be as the standards tell it (though this list hasn't been checked for completeness or even correctness), but parsers may have their own, often slightly simpler view. It's generally best to avoid control codes to be safe.


XML cannot contain raw binary data. To programmers, this means you can't use it as a serialization format for arbitrary data without applying an encoding step. Also, you strictly speaking should not assume XML does 1-to-1 serialization-deserialization for data (or even filesystem paths, as some filesystems allow a wider set of characters than others), for these and further reasons. (Another is that XML 1.1 parsers may transparently apply Unicode canonicalization).

Binary-to-ASCII codings are the simplest workarounds. URL percent escaping may be the easiest to use, Base64 another, but there is no pure-XML or easy XSL-compatible solution, and few general-purpose solutions are storage-efficient.
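
A sketch of the Base64 workaround in Python 3, using the standard xml.etree.ElementTree module. The element name "blob" is made up, and both sides have to agree out of band that its content is Base64.

```python
import base64
import xml.etree.ElementTree as ET

payload = bytes(range(256))   # every byte value, including all control codes

# Writing: encode to ASCII text first, then store that as node text
root = ET.Element("blob")
root.text = base64.b64encode(payload).decode("ascii")
document = ET.tostring(root)

# Reading: parse, then undo the encoding step
parsed = ET.fromstring(document)
restored = base64.b64decode(parsed.text)
print(restored == payload)  # True
```

Base64 costs roughly a third more space than the raw bytes; percent escaping of mostly-binary data costs more than that.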


CDATA, PCDATA, CDATA section

(Partially related to URI escaping and such)

  • CDATA, 'character data', is a minimally parsed string, as used for example in XML attributes.
    • Special meaning to parsers: any of &<', and in attributes also the delimiter, ".
    • Note that an & points to entities so a literal & must be encoded as &amp;, something often forgotten in href attributes.


  • PCDATA, as used in XML, denotes the concept that a tag may contain text nodes (which may contain entities that should be parsed)(verify), or text nodes mixed with elements (that should be parsed).
    • Special meaning to parsers: any of &<, (verify) should be escaped if meant literally.


  • CDATA sections have little directly to do with CDATA. The relation is that they too can contain (almost) any character data.
    • Main use/point: Include sections to exclude from page parsing (though not from being parsed itself; see [1]). This is also interesting for strict XML validation.
    • Special meaning to parsers: only the string ]]>, which cannot appear inside it.
      • If you don't know for sure that string will never occur in the data you want to store, the cleanest solution is probably encoding all your data with some abstraction code - or not using XML.

CDATA sections look like:

<![CDATA[
 and -- can " contain ' most & anything ] unescaped
 ]]>
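
If you want to keep using CDATA sections for text that might contain ]]>, a commonly used workaround (a different technique than the encoding suggested above) is to split the terminator across two adjacent sections. A Python 3 sketch, with a hypothetical helper name; it does nothing about characters XML disallows altogether, such as most control codes.

```python
import xml.etree.ElementTree as ET

def cdata_section(text):
    """Wrap text in CDATA, splitting any "]]>" over two sections."""
    # "]]>" becomes "]]" + end of section + start of new section + ">"
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

data = 'and -- can " contain \' most & anything ] unescaped, even ]]> now'
document = "<root>" + cdata_section(data) + "</root>"

# A parser hands back the adjacent sections as one piece of text
print(ET.fromstring(document).text == data)  # True
```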


In code

Python

Percent escaping for bytestrings (the names below are Python 2's; in Python 3 the same functions live in urllib.parse), either:

urllib.quote(bytestr,'')
that second parameter means "Escape everything, no exceptions", so also escapes / (like JS's encodeURIComponent()). Useful to make values safe no matter what they are.
If instead of "add these things safely to existing URL string" you do "this is an URL I added things to, take the whole thing and make text safer" (not the best approach but sometimes enough), consider something like urllib.quote(bytestr,':/;?') (imitation of JS's encodeURI, so result can still be an URL)

For unicode, you usually want to hand in unistring.encode('utf8') to the above


Percent escaping dicts or sequences(verify) into an &'d result is easiest via urllib.urlencode()

urllib.urlencode( {'v1':'Q3=32', 'v2':'===='})      ==  'v1=Q3%3D32&v2=%3D%3D%3D%3D'
urllib.urlencode( [('v1','Q3=32') ,('v2','====')])  ==  'v1=Q3%3D32&v2=%3D%3D%3D%3D'
note that urllib.urlencode() is mostly a little wrapping around quote_plus(), and you can do the same yourself.
Percent escaping values for POST body, i.e. application/x-www-form-urlencoded style
comes down to the "parts of a dict/list &'d together" case
using urllib.urlencode() is often easiest (since you usually have a bunch of variables)

(For the last two, dealing with unicode means you may want some helper functions(verify))


Text into HTML/XML text nodes i.e. entities for <, >, &:

cgi.escape("This & That <3") == 'This &amp; That &lt;3'


Text into HTML/XML attributes

cgi.escape(s).replace('"','&#x22;').replace("'",'&#x27;')


Notes:

  • I like to have helper functions named escape.nodetext(), and escape.attr() (and escape.uri_component())
making it easier to skim code for whether it's conceptually doing the right thing.
  • you could also use &quot; instead of &#x22; / &#34; most of the time (defined in at least HTML 3.2, HTML 2, HTML 4 and 5, and XHTML 1.0, XML 1.0, and more. Not in quite everything(verify), yet in everything that you'll be using these days. In fact, cgi.escape(s, quote=True), likely a slightly faster call than the above, uses &quot;)
  • Two ' related issues
    • the named entity &apos; is defined in XML, and e.g. HTML5, but not HTML4[2]. And some browsers are less forgiving about this, so a generic escaping function should use &#39; / &#x27; instead of &apos;
    • various XML, HTML allow wrapping attributes in either " or '
basic cgi.escape doesn't escape ' to &apos; so would break if you wrap in '
this isn't usually an issue, for entirely practical reasons: most people type out " if they're writing their own HTML, and if you use a templating thing it'll typically be doing the escaping for you (correctly)
...yet since it's sometimes safer, and as valid to entity-ize ' as any character, you might as well
  • The above does not care about unicode, because it assumes you (or your framework) are working in unicode, and translating it to byte coding at the edges of the app (note this relies on string functions passing through unicode as-is, which is typically true in python).
...rather than making sure each inserted string is correctly byte-coded, because that's usually more work (but equally valid if you do it consistently, and there are a few cases where it's handier - or even the only option, e.g. in PHP(verify)).
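
The calls above use Python 2 names. In Python 3, quote() and urlencode() moved to urllib.parse, and cgi.escape() was removed in 3.8 in favour of html.escape(). Below is a sketch of what the helper functions mentioned in the notes might look like in Python 3; the function names follow the notes, the bodies are my assumption of a reasonable implementation, and you would put them in a module named escape to get the escape.nodetext() style of call.

```python
from html import escape as _html_escape
from urllib.parse import quote as _quote

def nodetext(s):
    """Escape text for an HTML/XML text node: & < >"""
    return _html_escape(s, quote=False)

def attr(s):
    """Escape text for an attribute value: & < > and both quote styles.

    Uses numeric references rather than &quot; / &apos;.
    """
    return _html_escape(s, quote=False).replace('"', "&#x22;").replace("'", "&#x27;")

def uri_component(s):
    """Percent-escape a value for use inside a URL (str is encoded as UTF-8)."""
    return _quote(s, safe="")

print(nodetext("This & That <3"))    # This &amp; That &lt;3
print(attr("say \"hi\" & 'bye'"))    # say &#x22;hi&#x22; &amp; &#x27;bye&#x27;
print(uri_component("a/b c=é"))      # a%2Fb%20c%3D%C3%A9
```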

Javascript

Percent escaping

Since javascript 1.5: (TODO: summarize situation before javascript 1.5)

encodeURIComponent()
  • percent-escapes everything except alphanumerics and - _ . ! ~ * ' ( ), so that includes all the URL delimiters (point being that you can transport most anything in an URL value (even if it's an URL itself))
  • escapes Unicode as %-encoded UTF-8 bytes
encodeURI()
  • additionally does not percent-escape the URL delimiters ; / ? : @ & = + $ , # (point being that the result is still formatted/usable as a URL)
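
For comparison, the two can be imitated in Python 3 by handing quote() the matching sets of characters to leave alone. This is a sketch under the assumption that the inputs are ordinary strings; the function names are made up and edge cases (such as lone surrogates) are not handled.

```python
from urllib.parse import quote

# Characters both JS functions leave alone, besides alphanumerics
_MARKS = "-_.!~*'()"
# URL delimiters that only encodeURI() leaves alone
_DELIMS = ";/?:@&=+$,#"

def encode_uri_component(s):
    """Rough equivalent of JS encodeURIComponent()."""
    return quote(s, safe=_MARKS)

def encode_uri(s):
    """Rough equivalent of JS encodeURI()."""
    return quote(s, safe=_MARKS + _DELIMS)

print(encode_uri_component("a b&c/d"))
# a%20b%26c%2Fd
print(encode_uri("http://example.com/a b?x=1&y=é#top"))
# http://example.com/a%20b?x=1&y=%C3%A9#top
```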


There is also
escape()
but I avoid it:
  • Of the reserved characters, escape does not escape *@+/, but does escape the rest(verify)
  • It escapes Unicode characters in a non-standard javascript-specific way (e.g. U+2222 encodes as %u2222), which most servers will not understand
Note however that you can abuse it to have the net effect of only UTF8 encoding/decoding, by combining it with the above. In the following, the percent part cancels out:
function encode_utf8(s) { return unescape(encodeURIComponent(s)); }
and
function decode_utf8(s) { return decodeURIComponent(escape(s)); }