Escaping and delimiting notes

Attempt at some clarifying / some distinctions

On delimiting

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


Delimiting is using one or more characters/bytes to signify a boundary between separate regions, in text or in data.

Examples include:

a comma in text files, roughly the definition of CSV ('comma separated values')
a tab in the fairly-analogous tab separated value (TSV) files
a colon in things like passwd files
a semicolon
and further flavours of the same idea


And in a wider sense also

HTTP splitting header and body at the first appearance of two CRLFs
MIME's multipart delimiter (see e.g. RFC 1341 [1])
a pause in morse code


The main upside and downside is that these will always and only have that meaning.


Upside, because that parsing is simple and unambiguous: all instances of that delimiting character/sequence will mean a split, there is no context in which it can mean other things. Code never has to consider or store anything about what characters came before.

The main downside is that the data must never contain these characters/sequences. As such, when you produce such data, you must take care to remove the delimiter from your data. If you do not, it may later be hard or impossible (or, occasionally, still easy) to figure out which instances were meant to separate and which were not.

There are in fact many examples you can think of that are mostly but not quite delimited, in that if you were to tokenize them on what you think is the delimiter, you will occasionally do incorrect things. Say, delimiting programming instructions with semicolons and newlines? ...well, both might also appear in string literals - so it's really an escaped format.


Nesting one kind of delimiting in another may be unambiguous - but only when you know or control both delimiters. Also, this makes parsing fairly custom - whereas e.g. TSV and CSV are relatively standard.


Alternatives:

  • escaping
  • describing the length of the data that follows. Examples include netstrings and HTTP chunked coding
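A netstring is probably the simplest length-prefixed example: the length in ASCII digits, a colon, the payload, and a trailing comma. A minimal Python sketch (not a complete or robust implementation; the function names are made up):

def netstring_encode(data: bytes) -> bytes:
    # length in ASCII digits, a colon, the payload, and a trailing comma
    return b'%d:%s,' % (len(data), data)

def netstring_decode(buf: bytes) -> bytes:
    # read digits up to the colon, then take exactly that many bytes
    length_str, rest = buf.split(b':', 1)
    length = int(length_str)
    payload, trailer = rest[:length], rest[length:length + 1]
    assert trailer == b',', 'malformed netstring'
    return payload

# netstring_encode(b'hello, world')  ==  b'12:hello, world,'
# note that the payload's comma needs no special treatment at all

HTTP's chunked coding is the same idea, with the length in hexadecimal and CRLFs as the framing.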




Examples - delimiting

Probably the simplest example of delimiting would be CSV - comma separated values. A file containing
1,2,7,4,3
1.2,2.1,7.1,4.5,3.6

is easy to interpret as a matrix or spreadsheet.

And unambiguously so when it only ever contains numbers. ...well, until you meet locales: various non-English locales use . for digit grouping and , as the decimal separator, rather than the other way around. This breaks CSV immediately.


TSV

TSV is exactly the same idea as CSV, but uses the TAB character (0x09) instead of a comma, to avoid the above problem.


Also, CSV becomes more peculiar if you also want to store text, or anything else that may indeed contain commas (or maybe numbers in countries that use , as a decimal separator) - and there are CSV files that are not purely delimited. Particularly spreadsheet programs exporting to CSV tend to make life harder on code reading this again.

An alternative then is TSV: Tab Separated Values. In theory it only moves the problem, but the tab character (0x09) is simply much rarer to see outside of free-form text.


In theory you could choose things even less likely to be in text, like ASCII FS, GS, RS, US [2] (intended for data, but not used much in the real world).


When CSV isn't just delimited

The most basic form of CSV delimits records with newlines, and fields with commas.


In particular, spreadsheet programs (Microsoft's, really) have tried to fix the "CSV data can't contain a comma" problem by creating some relatively custom variants of CSV. For example, such a variant allows:

"1.2","3,4",5

This led to more variant implementations.

...which never quite do the same thing. Yes, there is an RFC 4180 - except that unlike most RFCs, it specifically mentions that it does not specify an internet standard of any kind, and is intended as a description of how most people seem to do it. (That is, it describes the basics, plus what Excel does)


If a field starts with ", it means the field is enclosed by double quotes. If not, it may not contain double quotes.

A quoted field may contain

line breaks
a literal ", by doubling it


Note that this is no longer a format purely delimited by newlines and commas; it is now an escaped format, as both commas and newlines have context-dependent meaning.
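To illustrate, Python's csv module implements roughly this RFC 4180 / Excel style quoting in its default dialect (output shown as comments is approximate; the default line terminator is actually CRLF):

import csv, io

buf = io.StringIO()
writer = csv.writer(buf)                  # default dialect is 'excel', RFC-4180-ish
writer.writerow(['1.2', '3,4', 'say "hi"', 'two\nlines'])
print(buf.getvalue())
# 1.2,"3,4","say ""hi""","two
# lines"

rows = list(csv.reader(io.StringIO(buf.getvalue())))
# [['1.2', '3,4', 'say "hi"', 'two\nlines']]   -- round-trips, including the embedded newline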


Notes:

  • Such spreadsheet programs will generally ask whether to treat a text file as delimited in such a way, and what it is delimited with (to also support TSV and other variations)
  • with edge cases. For example recent versions of Excel will only ask about importing if the extension is .txt or such - it specifically does not ask anymore when the extension is .csv
  • Excel listens to localization, so numbers are formatted differently depending on where it was saved anyway.
it seems to actually refuse to load specific cases, so it seems they broke portability even in their own product?
  • in practice you may rely on specifying what specific dialect you're reading (or on trying to detect it)


On escapes

Escape characters

On control codes and control sequences

Special meanings to serial lines, modems, text terminals, shells

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


Control sequences are generally understood to be characters that alter device state instead of being printed as a character.

These are often non-printable characters, though not always - consider e.g. Hayes/AT style modem commands.


They are frequently what ASCII considers control codes - values 0x00 through 0x1F

On things like serial lines (which includes terminals and modems) there is no separate control channel, so you have to do in-band signaling of any such control.

You would mostly send just literal text/data, but also intersperse commands, toggling between the two with special characters.

(And yes, this means that sending arbitrary data through such a line is more involved, and is the source of various conventions and protocols).



When 'terminal' meant "automated typewriter", there were few control codes, but once it started meaning "something with a screen" we got a bunch more.

For example, look at ANSI escape codes, and ECMA-48, "Control Functions for Coded Character Sets" (note there are also a bunch of privateish extensions to this in practice).


These are sequences of bytes, and a lot of these cases start with the byte value 0x1B (you'll also see it as \033 in octal, 27 in decimal), which is what the Esc key emits (roughly why you could call these escape sequences).

The length of the sequences varies with what's in them.

A lot of cases are: escape character, subset character, (optional parameters), command character,
where that command character also implicitly terminates the escape sequence


Within the ANSI set (the real world of 'any terminal ever' is a lot messier) there is a loose grouping of these commands into types. For example,

  • DCS, "Device Control String"
Start with ESC P (0x1b 0x50)
things like fetching capabilities (verify)
  • OSC, "Operating System Command"
Start with ESC ] (0x1b 0x5d)
things like window title
  • CSI, "Control Sequence Introducer"
Start with ESC [ (0x1b 0x5b)
more expandable, e.g. with ;-separated arguments
clear screen, cursor and control stuff
https://en.wikipedia.org/wiki/ANSI_escape_code#CSI_sequences
includes SGR, "Select Graphic Rendition"
which is typically used for color/bold/blink type stuff
all terminated by m
https://en.wikipedia.org/wiki/ANSI_escape_code#SGR_(Select_Graphic_Rendition)_parameters
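As a rough illustration of the shapes involved, in Python (these particular sequences are widely, but not universally, supported):

import sys

ESC = '\x1b'
sys.stdout.write(ESC + ']0;my window title\x07')         # OSC 0: set window/icon title, BEL-terminated
sys.stdout.write(ESC + '[2J' + ESC + '[H')               # CSI 2J clears the screen, CSI H homes the cursor
sys.stdout.write(ESC + '[1;31mwarning' + ESC + '[0m\n')  # CSI ... m is SGR: bright red, then reset
sys.stdout.flush()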



See also:

ANSI color

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Color in shells is mostly a subset of SGR escapes (see also mention above)


Most of which are in the form:

ESC [ textspec m


Some of the more interesting and better-supported codes (search for Select Graphic Rendition)

  • 1: bright foreground color
bright variants can be shorthanded
  • 0 clears to default colors/no bright/underline(verify)
useful to start a sequence, because by default you only change the current state
also useful in itself, e.g. to stick at the end of a colored prompt
  • 4: set underline
  • selective clearing:
39 default foreground color (default color decided by the terminal itself(verify))
49 default background color
22 clears bold
24 clears underline

The classical set of colors:

  • 30 foreground: black (bright+black is how you get dark gray),
  • 31 foreground: red
  • 32 foreground: green
  • 33 foreground: yellow (non-bright yellow is brown or orange on some schemes),
  • 34 foreground: blue
  • 35 foreground: magenta
  • 36 foreground: cyan
  • 37 foreground: light grey (bright+lightgrey is how you get real white)
  • bright foreground colors can be shorthanded: 90..97 are effectively shorthand for 1;30 .. 1;37 (verify)
  • 40 background: black
  • 41 background: red
  • 42 background: green
  • 43 background: yellow
  • 44 background: blue
  • 45 background: magenta
  • 46 background: cyan
  • 47 background: light grey
note that bright applies only to foreground


Note that you can specify multiple codes at once, separated by ;, before the m

So, for example:

  • 35 sets the foreground to magenta
  • 37;44 is grey on blue
  • 1;37;44 is white on blue
  • 1;33;45 is bright yellow on magenta
  • 0;35 clears foreground and background flags and colors, then sets the foreground to magenta
  • 0;1;35 same but with bright magenta
  • 0;1;4;35 same but also underlined
  • 1;47;30 is grey (bright black) on white

To play around in a prompt, try something like:

printf "\x1b[0;1;33;45mtest\x1b[0m\n"
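From a script you might wrap the same thing in a tiny helper, something like this sketch (sgr is a made-up name):

def sgr(text, *codes):
    # wrap text in an SGR sequence, resetting attributes afterwards
    return '\x1b[' + ';'.join(str(c) for c in codes) + 'm' + text + '\x1b[0m'

print(sgr('white on blue', 1, 37, 44))
print(sgr('bright yellow on magenta', 1, 33, 45))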


See also:


More colors

Modern terminals may also have 256-color and true color modes.

This is usually advertised via TERM, but there's a bunch of footnotes to that (TODO: write them up)

Assuming you've detected it'll work, it's basically just another color-set code with a new set of possible values squeezed within the regular SGR [ and m

38 set foreground color with one of these newer codes
48 set background color with one of these newer codes


Followed by either

  • for 256 color: ;5;n where n is
0..7: (like 30..37): black red green yellow blue magenta cyan gray
8..15: (like 90..97, the bright variants of the above)
16..231: 6×6×6 rgb cube. To calculate (also sketched in code below):
multiply red (range 0..5) by 36,
multiply green (range 0..5) by 6,
use blue (range 0..5) as-is
and add 16 to put it in this range
232..255: 24 shades of gray


For example, to test support you could do:

for i in {0..255} ; do
    printf "\x1b[38;5;${i}mcol${i}\x1b[0m\t"
done; echo



  • for truecolor: ;2;r;g;b
where r, g, and b can each be 0..255

Supporting terminals seem to include xterm, konsole, and libvte-derived ones including GNOME's

e.g.

for i in {0..255} ; do
    printf "\x1b[38;2;100;100;${i}m100,100,${i}\x1b[0m\n"
done
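To make the 6×6×6 cube arithmetic from the 256-color case above concrete, a small sketch (cube_index is a made-up name):

def cube_index(r, g, b):
    # r, g, b each in 0..5; lands in the 16..231 range of the 256-color palette
    return 16 + 36 * r + 6 * g + b

print('\x1b[38;5;%dmtest\x1b[0m' % cube_index(5, 2, 0))   # an orange-ish foreground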




What happens with these when the terminal doesn't support it?

In general, it will be ignored.

Some seem to try the closest supported color.


See also:

More practical: preparing for a specific parser

Preparing a string to hand it to a known parser, to ensure the data that comes out of that parser is what you intended.

And yes, this has overlap with serialization, but is often a l


URLs

Percent encoding (URL encoding)

Percent encoding / percent-escaping, also often called URL encoding, refers to writing bytes in %hh style (two-character zero-padded hexadecimal) where necessary.


This is one way of dumping arbitrary data into URL variables, and is the classical way of doing the same in a form submission (more specifically: HTML POST bodies according to application/x-www-form-urlencoded data)


For example,

foo=bar quu

becomes

foo%3Dbar%20quu

...or (see notes below)

foo%3Dbar+quu 


In contexts that do percent encoding, it is typically

  • required for byte values 128 through 255
  • required for non-printable characters, e.g. RFC 1738 says:
required for 0x00 through 0x1F (0..31), the C0 control codes
which includes whitespace, which in many cases you'd indeed want escaped instead of literal (though in some cases you can get away with not escaping some)
and 0x7F (127)
  • often a good idea for most symbols
not required, but between URLs and markup languages, a lot of symbols are significant as delimiters, and you can sometimes carry data through with less headache by percent-escaping them
see the reserved section below
  • allowed for any character - but for many (at least alphanumerics) it's just wasted space



On exceptions.

While it's good to escape all symbols, there are specific exceptions.

For example, when constructing different parts of a URL from code, different details apply to different parts.

Probably the most common case is safely dumping arbitrary bytestrings, or URLs, into a URL variable in a way that will make it past URL syntax parsing unscathed.

It matters that different parts of the URL have different rules, and space is a special case.

See notes below.


Note that

  • browsers tend to be forgiving about various technically-incorrect things
  • this isn't very good normalization, e.g. URL comparison is sort of hard, as is thinking of all the possible weird edge cases. There have been many URL (de)coding bugs and a few exploits because of this.


Reserved characters

Reserved characters refer to those that have special meaning (primarily) to URL parsing.

You would percent-escape these when you want to dump these as non-delimiting values into URLs.

People do occasionally forget this. Which isn't always an issue.


The set of reserved characters has changed over time. From recent to older:

  • ;/?:@&=+$,!*'()%#[] according to RFC 3986 (2005), 'Uniform Resource Identifier (URI): Generic Syntax'
  • ;/?:@&=+$, according to RFC 2396 (1998), 'Uniform Resource Identifiers (URI): Generic Syntax'
  • ;/?:@&= according to RFC 1738 (1994), 'Uniform Resource Locators (URL)' (same set mentioned in RFC 1808)



The best known special case is that space can be %20 or +.

  • Already mentioned as %20 in RFC 1738
  • RFC 1866 (HTML2) standardized the use of + as an alternative in application/x-www-form-urlencoded data
section 8.2.1 mentions "space characters are replaced by `+', and then reserved characters are escaped as per" ...
...it then also uses the + in example URLs elsewhere, without formalizing it, which has led to some confusion (verify)


This seems to have been done to make URLs a little more readable (and a little shorter?)


TODO: dig through further standards, but it seems that:

  • parsing a query string (the part of an URL after ?) should turn both %20 and + into space (because W3C says so(verify))
and since everything should parse that way, it's safe to insert + there
  • URL parts before the query (usually primarily the path) should use only %20 (verify)
(so that a + can be left unencoded?) (verify)
  • in practice
    • most non-browser code is likely to not treat URL parts differently, because most of us are too lazy and/or don't know about this
    • there are other bordercases
    • your language may have different functions, to distinguish between + and %20, and/or to control what not to encode.
  • so when producing URLs with space, it may be less confusing and less edge-casey to just use %20 everywhere
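A quick illustration of that distinction, using Python's urllib:

from urllib.parse import quote, quote_plus, unquote_plus

quote('a b')              # 'a%20b'  -- what you'd use for path parts
quote_plus('a b')         # 'a+b'    -- what you'd use for query strings / form data
unquote_plus('a+b%20c')   # 'a b c'  -- query-string decoding accepts both forms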


Unicode

Since URLs can be considered to carry bytes, unicode first needs to become bytes, so that we can then put it in there %hh style.

At least, that's by far the most common way to put bytes in URLs, following RFC 2718, section 2.2.5 (it also seems to be the default in the IRI (Internationalized Resource Identifiers) draft), and is also the path of least bother.


Codepoints becoming bytes means the text first needs to be put through a byte encoding.

For text, UTF8 is a good default because of its wide applicability, covering all of unicode.

Some older product-specific code has been known to use cp1252 and/or latin1 or such
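For example, Python's urllib percent-escapes via UTF-8 by default (the cp1252 variant shown is only for interoperating with the older code just mentioned):

from urllib.parse import quote, unquote

quote('café')                     # 'caf%C3%A9'  (UTF-8 bytes, then %hh)
quote('café', encoding='cp1252')  # 'caf%E9'     (only if you must match legacy behaviour)
unquote('caf%C3%A9')              # 'café'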


(There is also a %uhhhh format mentioned in the IRI draft, but apparently not in a decisive way. It isn't standard, and seems unlikely to become one.)

application/x-www-form-urlencoded

Text/data into (X)HTML

tl;dr mostly:

  • escaping text that goes into node text (dealing with <, >, and &)
  • escaping text that goes into attributes (dealing with <, >, &, ", and ')
  • escaping text/data to be included in an URI
see #Percent_escaping_.28URL_encoding.29. If then put within a page, you then pass it through the above (mostly attributes)



If you produce or consume HTML or XML, you want to know how to write values in a way that avoid messing up (strict) parsing.


  • Text nodes
    • < becomes &lt; and > becomes &gt;, to avoid interpretation as tags (which would cause things to magically disappear, become invalid, and in some cases allows injection of nasty things).
    • & becomes &amp; (because entities will be parsed in almost all places, including text nodes and attributes)
  • Attribute text
    • & becomes &amp; (as HTML parses entities in both text nodes and in attributes)
    • should have quotes replaced, at least " with &#x22; or &#34;, or &quot;, to avoid nasty parse/injection cases.
    • Replacing ' with &#x27; or &#39; is generally also a good idea, as some standards (e.g. XML) allow attribute delimiting using single quotes as well as double quotes. Note that using the &apos; named entity is invalid for HTML before 5, but valid for XML and thereby XHTML. While browsers will be robust to both, strict parsers trip over &apos;
you could either only do it when you know the attribute uses ' (most people seem to use "), or always.


Further notes:

  • you can always escape each character, a proper parser has to decode it anyway. The main difference between &#97; for a and &#39; for ' is that the first is pretty much never necessary and the latter can be in various contexts.
also it's a lot more bytes, so people try to avoid the truly unnecessary escapes
  • URLs have no special status in most places and must follow text/attribute encoding. In the case of attributes (href, src, and such), this means dumping them in is quickly invalid, most commonly because of &
  • Using a function that is often called something like "HTML escape" or "CGI escape" solves most entity-related problems, replacing at least &<> with entities. It may or may not also do " and/or ', so check before relying on it.
  • &gt; is not strictly necessary, in node text or attributes, as parsers usually only look for < as a tag-significant character, but it's consistent and readable to do both.
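A minimal sketch of those two cases using Python's html.escape (see also the Python notes further down; the variable names are just for illustration):

from html import escape   # py3; py2's rough equivalent was cgi.escape

text = 'Fish & chips <b>for</b> "everyone"'

node_text = escape(text, quote=False)   # handles & < >     -> fine for element content
attr_text = escape(text, quote=True)    # also " and '      -> fine for attribute values
fragment  = '<p title="%s">%s</p>' % (attr_text, node_text)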


XML related notes

On control codes and arbitrary binary data in XML
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Short story: don't expect this to just work.


Suggestion: When in doubt about data serialization, use some escaping.


Control codes

XML has details pertaining to representing control codes (note: not related to containing them on a binary level; UTF-16 for western text contains a lot of 0x00 bytes)

These carry through to XHTML, though note that XHTML has some additional restrictions.


Of the C0 control codes (U+00 through U+1F):

  • HTML and XML allow common whitespace characters CR (U+0D), NL (U+0A), and Tab (U+09), both literally and as escaped character references(verify)
  • XML 1.0 does not allow any other C0 control codes.
  • XML 1.1 allows use of U+01 through U+1F (note: not U+00). These new characters are required to appear as escaped character references(verify)


C1 control codes (U+80 through U+9F) and DEL (U+7F):

  • XHTML 1.0 does not allow them
  • XML 1.0 allows them literally and as escaped character references
  • XML 1.1 requires them to appear as escaped character references (the document specifically excludes them, even if the char definition includes them as in XML 1.0), except:
  • XML 1.1 allows only U+85 (NEL) to appear in both forms. It seems it will be translated to a line feed on parsing(verify).


This may be as the standards tell it (roughly - I'd need to double check).

And many users neither know nor care, so may produce invalid XML.

Except that parsers may have their own, often slightly simpler view, that allow for and possibly correct such abuse.

It's generally best to avoid control codes, just to be safe.


Binary data

XML cannot contain raw binary data.

To programmers, this means either

  • it's not a serialization format at all
  • it's a serialization format that needs a whole bunch of custom wrapping (try SOAP if you're nasty), and is not very efficient

Also, you strictly speaking should not assume XML does 1-to-1 serialization-deserialization for data (or even filesystem paths, as some filesystems allow a wider set of characters than others), for these and a handful of other further reasons. (One other is that XML 1.1 parsers may transparently apply Unicode canonicalization).


Binary-to-text codings are the simplest workarounds.

URL percent escaping may be the easiest to use (with functions almost certainly in your standard library).
Base64 is another, slightly less space-inefficient one.
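For example, in Python both are a few lines (the blob here is just stand-in data):

import base64
from urllib.parse import quote, unquote_to_bytes

blob = bytes(range(256))                      # arbitrary binary data

b64 = base64.b64encode(blob).decode('ascii')  # ~33% overhead, safe to put in a text node
assert base64.b64decode(b64) == blob

pct = quote(blob, safe='')                    # percent escaping also works, but is bulkier here
assert unquote_to_bytes(pct) == blob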

No such solution is pure-XML, or something XSL can do anything with.


See also:

CDATA, PCDATA, CDATA section
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


CDATA and PCDATA are concepts within XML (and SGML(verify)) parsing.


It's sometimes relevant to understand distinctions between e.g. PCDATA and CDATA (and RCDATA), but that's probably not what you were looking for.


But CDATA technically has a few different meanings, depending on context:

  • In DTDs, attributes can be declared CDATA (rather than ID, NAME, NUMBER, or such)
Some of those alternatives imply interpretation as if uppercased, tokenized
...and CDATA implies 'no such processing'.
(This has no effect on entity parsing, though)
  • In DTDs, entity declaration (external or internal) may or may not say CDATA
if it does, it roughly means that the value of that entity declaration should be taken literally, not parsed (verify)
  • In DTDs, elements may be declared with CDATA contents.
which seems to be one of those "don't do this, unless you can very clearly explain why this is not a bad idea" things.
  • In documents, marked sections are a specific syntax (that can be declared CDATA, RCDATA, and a few others):
<![ CDATA [
and -- can " contain ' most & anything ] unescaped
]]>

...that tells XML parsers to treat that section literally, or rather, as little parsing as you can ask of this sort of parser.


CDATA sections (more fully: marked sections using CDATA) are what most of us were thinking of.

The point of CDATA sections is often that they can include markup-like characters that you now don't have to escape. The most recognizable use is probably in <script> tags in XHTML, so that you don't have to escape < and & in your script.

(Note that a CDATA section doesn't mean that embedded markup remains untouched. After parsing, a < may well come back out as &lt;. But that's something for the script parser to deal with(verify). The reason we do this is purely that we coders only ever see/edit the document in serialized form)


Doing this has no meaning in HTML, where

  • <script> is specifically defined to be parsed that way already
  • HTML parsers officially do not know about this syntax, and will treat that as a literal string. (browsers might be resilient to this in the script tag, just because various coders copy-pasted it as some magic incantation they didn't fully understand)

As such, a CDATA section that may go to both XHTML and HTML could be commented out in the relevant language, like:

<script type="text/javascript">
//<![CDATA[
document.write("<");
//]]>
</script>


CDATA sections are sometimes used to dump more arbitrary data into XML.

If you care about XML validation and strict parsers, there are footnotes to this.

  • as implied above, markup like < may, depending on what consumes it, be turned into its entities.
as such, CDATA never really means 'include verbatim'
  • CDATA sections cannot include the string ]]> - seeing that will end that section regardless of context.
It doesn't seem there's an escape for this, so if you cannot know for sure that string will never occur in the data you want to store, the cleanest solution is probably another layer of encoding that you then put within a CDATA section. Or to not use XML.
  • This also implies nesting CDATA isn't possible, but there are workarounds like stopping and restarting the CDATA section (see the sketch after this list).
  • CDATA sections may only use characters available in the encoding. This means you may want to escape them - but the other side will need to know to unescape them.
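For the ]]> case, the usual workaround mentioned above is to end the CDATA section just before it and immediately open a new one, so the three characters never appear together. A sketch (cdata_wrap is a made-up name):

def cdata_wrap(text):
    # split any ']]>' across two adjacent CDATA sections, so it never appears literally
    return '<![CDATA[' + text.replace(']]>', ']]]]><![CDATA[>') + ']]>'

print(cdata_wrap('stray ]]> in the data'))
# <![CDATA[stray ]]]]><![CDATA[> in the data]]>
# a parser sees two sections, whose contents concatenate back to the original string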


If you want to dump binary data verbatim, HTML/XML cannot really guarantee this, and you're probably better off using a binary-to-text coding on both ends.


See also:

Shell escaping

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


What?

Crafting a string to be handed into a shell command that will not be interpreted by the shell, where that string might include characters that have special meaning to a shell.

...which you will commonly run into in the form of filenames.


POSIX lists characters with special meaning to the shell[3] as:

"  '  \  $  |  &  ;  <  >  (  ) ` space  tab  newline
*   ?   [   #   ~   =   %

Some contexts are safer (where e.g. that second line is less troublesome than the first), but we probably want a solution that works for all cases.


Handling of non-ASCII is not defined in POSIX (and control codes, being nonprintable, would be hard to type), yet in practice, any character that doesn't have special meaning to shell parsing can be in there as-is (with some minor worry about getting mangled by something else).



From your own code

Probably the easiest approach to most of them is to put single-quotes around the entire string, which avoids almost all shell processing. The leftover detail is dealing with the presence of ' itself being in the string.

And "A single quote may not occur between single quotes, even when preceded by a backslash."[4], which seems to break this approach.

However, because shells will append strings that aren't separated, you can deal with it in chunks. For example, the string a'b can become 'a'"'"'b', and this style (over 'a'\''b') also avoids backslashes, which can avoid one type of confusion, that of multiple layers of escaping.

As such, this style is probably the easiest to implement, and potentially less fragile too.
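In Python, shlex.quote() does essentially this (wrap the whole thing in single quotes, and splice any embedded single quote in via double quotes), for example:

import shlex, subprocess

name = "it's a file"                               # a filename containing a single quote
print(shlex.quote(name))                           # 'it'"'"'s a file'
subprocess.run('ls -l ' + shlex.quote(name), shell=True)
# (when you control the invocation, passing an argument list without shell=True avoids the issue entirely)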


Tools like ls seem to follow a similar style. They may also use ANSI C-quoting, $'\022' style (note that many but not all shells support this, not being POSIX) for control bytes (and, apparently, non-ASCII bytes that aren't valid in the current encoding(verify)), probably so that you can also safely print the result in a shell without risk of that including control codes that the shell interprets and may make it go weird.

This should not be necessary when constructing commands.


https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html#Quoting

https://stackoverflow.com/questions/15783701/which-characters-need-to-be-escaped-when-using-bash

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_02_01


URN parsing

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

A URN looks like:

"urn:" <namespace> ":" <namespace_specific_string>

For example:

urn:isbn:0451450523
urn:ietf:rfc:2648
urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66

urn: is regularly omitted.
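Parsing the general form is just splitting on the first two colons; a sketch (parse_urn is a made-up name, and it does no validation of the namespace or the namespace-specific string):

def parse_urn(s):
    # split off the scheme and the namespace; the rest is namespace-specific
    scheme, namespace, nss = s.split(':', 2)
    assert scheme.lower() == 'urn'
    return namespace, nss

parse_urn('urn:isbn:0451450523')    # ('isbn', '0451450523')
parse_urn('urn:ietf:rfc:2648')      # ('ietf', 'rfc:2648')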


💤 namespacey?

The urn:ietf:rfc:2648 example isn't really namespace nesting as far as the URN standard is concerned, but such nesting conventions may be very practical (and in this specific case is actually formalized, in RFC 2648)

If you do that, you can still only parse from left to right, and each layer basically implies going to a different standard, that you have to know about, and you won't know how the rest of it will encode their data.

Not a lot of people really do this, and if they do, it's usually kept simple.


So by and large, these are considered identifiers as a whole, and only very specific contexts will actually look at their content.

Most practical URN-using applications deal with very few namespaces in the first place, so will either hardcode what they need to know, or sometimes allow registration/hooks for handlers per namespace.


There is an official URN namespace registry, at http://www.iana.org/assignments/urn-namespaces/

Your self-defined private-use namespaces should be prepended with X-, to signal them as experimental (similar to unregistered MIME types). For more global use, you should register your namespace. (...for app-internal use you can often get away with doing neither)

URI parsing, escaping, and some related concepts (TODO: sort into above)

(See also Uniform Resource Somethings)

Shoving variables into URLs and/or POST data

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

We often parse URLs as structures like:

<scheme>://<net_loc>/<abs_path>?<query>#<fragment>
<scheme>://<net_loc>/<abs_path>;<object_params>?<query>#<fragment>

The parts (according to RFC 1808):

  • scheme ":"     scheme name - RFC 1738, section 2.1
  • "//" net_loc   network location and login information RFC 1738, section 3.1
  • "/" path       URL path - RFC 1738, section 3.1
  • ";" params     object parameters - RFC 1738, section 3.2.2, e.g. ";type=a" (used in FTP, occasionally in HTTP URLs)
  • "?" query      query information - RFC 1738, section 3.3, e.g. "?a=b&c=d
  • "#" fragment   fragment identifier


(TODO: add the newer styles as well)


Most language libraries have some function that splits a URL into these RFC-defined parts, often without any other decoding (such as decoding percent escapes or splitting variables).

Such a function may take:

'http://example.com/goo;type=a;s=search?foo=bar&quu=%3d#help'

And respond with:

'http'
'example.com'
'/goo'
'type=a;s=search'
'foo=bar&quu=%3d'
'help'

Note that such splitters:

  • ...may also parse netloc's parts (username, password, hostname, and port)
  • ...may not be aware of object parameters, and leave them as part of the path.
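For example, Python's urllib.parse.urlparse does this kind of split (and, unlike urlsplit, does know about object parameters):

from urllib.parse import urlparse, parse_qs

p = urlparse('http://example.com/goo;type=a;s=search?foo=bar&quu=%3d#help')
# ParseResult(scheme='http', netloc='example.com', path='/goo',
#             params='type=a;s=search', query='foo=bar&quu=%3d', fragment='help')

parse_qs(p.query)    # {'foo': ['bar'], 'quu': ['=']}   -- a separate step, which also percent-decodes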


There are usually different functions to parse specific parts of a URL, such as parsing out variable-value maps out of the query string, which happens according to application/x-www-form-urlencoded, which can be summarized as:

  • name/value pairs are split with &
  • name is split from value with =
  • names and values are escaped:
    • space is replaced with +
    • reserved characters (and non-ASCII bytes) are percent-encoded
  • form control names/values are listed in the order they appear in the document



HTML forms are often encoded either into the URL's query, or placed into the data area of a HTTP POST.

When using POST, basic HTML forms are often sent using the application/x-www-form-urlencoded encoding (the default in HTML standards), which mirrors how a GET request would put form elements in the URL (see URL/percent encoding above), but avoids possible problems related to URL length limits.

This is simple to implement, but for binary file uploads this can be rather inefficient (with non-ASCII bytes represented using three ASCII bytes), so there are alternative ways of encoding a POST's data for file uploads, which follow the format of multipart MIME messages:

  • multipart/form-data (see RFC 2388)
  • multipart/mixed (for more complex MIME streams, such as multiple files in a single form entry -- which at least browsers generally avoid doing(verify))


See also:


Note that while var/vals in POST and var/vals in GET are separate things, most web-aware scripting makes the basic interface to them transparent, so that you can pick from either one (and which also lets the handler support both GET and POST).

You can usually get at the separated values and/or the original data.


In code

Python

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


Percent escaping for bytestrings

To say "escape everything, no exceptions" (like JS's encodeURIComponent()), try the following (that second parameter is what to not escape, which defaults to '/' and is a discussion for another place and/or time)

urllib.quote(bytestr,'')          # py2
urllib.parse.quote(bytestr, '')   # py3 reorganized where some functions went


Percent-escaping unicode strings

  • python2's urllib.quote() would raise a KeyError when given non-ASCII (since 2.4.2, anyway; before then it would pass it through, which was potentially confusing)
so you convert it first. UTF-8 is the convention.
  • python3's urllib.parse.quote(), when given unicode, will .encode(encoding) where that encoding defaults to 'utf-8'



Percent escaping dicts or sequences(verify) into an &'d result is easiest via urlencode

urllib.urlencode( {'v1':'Q3=32', 'v2':'===='})      ==  'v1=Q3%3D32&v2=%3D%3D%3D%3D'
urllib.urlencode( [('v1','Q3=32') ,('v2','====')])  ==  'v1=Q3%3D32&v2=%3D%3D%3D%3D'
(this urllib.urlencode() is mostly a little wrapping around quote_plus(), and you can do the same yourself)


Escaping values for POST body i.e. application/x-www-form-urlencoded style

comes down to "percent-escaping and joining the parts with &"
using urlencode() as mentioned above is often most practical, since you usually have a bunch of variables in a dict



Text into HTML/XML text nodes i.e. entities for <>&:

  • in py2, e.g. cgi.escape("This & That <3") == 'This &amp; That &lt;3'
  • py3 moved this to html.escape


Text into HTML/XML attributes can be done with entities for <>&'"


Notes:

  • I like having helper functions named escape.nodetext(), and escape.attr(), escape.uri_component()
makes it easier to skim code for whether it's conceptually doing the right thing.
  • Two ' related issues
    • the named entity &apos; is defined in XML, and e.g. HTML5, but not HTML4[5]. And some browsers are less forgiving about this, so a generic escaping function should use &#39; / &#x27; instead of &apos;
    • various XML, HTML allow wrapping attributes in either " or '
basic cgi.escape doesn't escape ' to &apos; so would break if you wrap in '
this isn't usually an issue, for entirely practical reasons: most people type out " if they're writing their own HTML, and if you use a templating thing it'll typically be doing the escaping for you (correctly)
...yet since it's sometimes safer, and as valid to entity-ize ' as any character, you might as well
  • you could also use &quot; instead of &#x22; / &#34; most of the time (defined in at least HTML 3.2, HTML 2, HTML 4 and 5, and XHTML 1.0, XML 1.0, and more. Not in quite everything(verify), yet in everything that you'll be using these days. In fact, cgi.escape(s, quote=True), likely a slightly faster call than the above, uses &quot;)








Javascript

Percent escaping

Since javascript 1.5: (TODO: summarize situation before javascript 1.5)

encodeURIComponent()

  • percent-escape all reserved characters (point being that you can transport most anything in an URL value (even if it's an URL itself))
  • escapes Unicode as %-encoded UTF-8 bytes

encodeURI()

  • does not percent-escape ":", "/", ";", and "?" (point being that the result is still formatted/usable as a URL)


There is also escape() but I avoid it:

  • Of the reserved characters, escape does not escape *@+/, but does escape the rest(verify)
It escapes Unicode characters in a non-standard javascript-specific way (e.g. U+2222 encodes as %u2222), which most servers will not understand
Note however that you can abuse it to have the net effect of only UTF8 encoding/decoding, by combining it with the above. In the following, the percent part cancels out: function encode_utf8(s) { return unescape(encodeURIComponent(s)); }​ and function decode_utf8(s) { return decodeURIComponent(escape(s)); }​