Escaping and delimiting notes


"Escape sequences"

Some distinctions

Escape characters

Escape characters are used in particular to represent special characters.

This is common within string literals in a lot of programming languages. Command line interfaces may have similar behaviour.


The exact set of characters varies per language, as does the syntax.

The following set (and syntax) is fairly common:

Sorted roughly per reason

  • you specify this way because you cannot type them literally, e.g.
\a
0x07 Alert (Beep, Bell) (added in C89)[1]
\b
0x08 Backspace
\f
0x0C Formfeed
\n
0x0A Newline (Line Feed) (but see possible translation)
\r
0x0D Carriage Return
\t
0x09 Horizontal Tab
\v
0x0B Vertical Tab
\e
0x1B escape character (not always)
  • you specify this way because the context (delimiting and/or escaping syntax) makes it a special case, e.g.
\'
0x27 Single quotation mark
\"
0x22 Double quotation mark
\\
0x5C Backslash
\?
literal question mark, specifically in C, to avoid trigraphs
  • you specify this way because you want to avoid the literal it implies - because it's clearer, leaves the file as ASCII, or such
\xhh
hexadecimal number
\ooo
octal number
\Uhhhhhhhh, \uhhhh
or similar for unicode, in languages that have unicode strings as a type. These are a lot less common than the above.
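As a quick check of the table above, here is a small sketch in Python 3, whose string literals follow most of this common set. It only asserts the byte values listed above; note that Python is one of the languages without \e, and the name ESC is just chosen for this example.

```python
# Each escape in a string literal stands for exactly one character.
pairs = {
    "\a": 0x07, "\b": 0x08, "\f": 0x0C, "\n": 0x0A,
    "\r": 0x0D, "\t": 0x09, "\v": 0x0B,
    "\'": 0x27, "\"": 0x22, "\\": 0x5C,
}
for ch, code in pairs.items():
    assert len(ch) == 1 and ord(ch) == code

# Python has no \e, so the escape character is written numerically instead.
ESC = "\x1b"
assert ESC == "\033" == chr(27)

# Hex, octal and unicode forms all name the same character as the literal.
assert "\x41" == "\101" == "\u0041" == "\U00000041" == "A"

# A raw string switches escape processing off: backslash and n stay separate.
assert len(r"\n") == 2
```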



Special meanings to serial lines, modems, text terminals, shells

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Control sequences are generally understood to be those that alter device state, rather than being printed as a character.


These often used non-printable characters, though not always - consider Hayes/AT style modem commands.


On things like serial lines (which includes terminals and modems) there is no separate control channel, so you have to do in-band signaling of any control.

You would mostly send just literal text/data, but also intersperse commands, toggling between the two with special characters.

(And yes, this means that sending arbitrary data through such a line is more involved, and is the source of various conventions and protocols).


Once 'terminal' started meaning something with a screen (rather than an automated typewriter), there were a bunch of controls aimed at controlling that screen.

For example, look at ANSI escape codes, and ECMA-48, "Control Functions for Coded Character Sets", though note there are a bunch of privateish extensions to this in practice.


These are sequences of bytes, and a lot of these cases start with the byte value 0x1B (you'll also see it as \033 in octal, 27 in decimal), which is what the Esc key emits (roughly why you can call these escape sequences).

The length of the sequences varies with what's in there.

In a lot of cases the structure is: escape character, a character selecting the subset, (optional parameters), command character,
where that command character also implicitly terminates the sequence


Within the ANSI set (the real world of 'any terminal ever' is a lot messier) there is a loose grouping of these commands into types. For example,

  • DCS, "Device Control String"
Start with ESC P (0x1b 0x50)
things like fetching capabilities (verify)
  • OSC, "Operating System Command"
Start with ESC ] (0x1b 0x5d)
things like window title
  • CSI, "Control Sequence Introducer"
Start with ESC [ (0x1b 0x5b)
more expandable, e.g. with ;-separated arguments
clear screen, cursor and control stuff
https://en.wikipedia.org/wiki/ANSI_escape_code#CSI_sequences
includes SGR, "Select Graphic Rendition"
which is typically used for color/bold/blink type stuff
all terminated by m
https://en.wikipedia.org/wiki/ANSI_escape_code#SGR_(Select_Graphic_Rendition)_parameters
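The structure described above (ESC, a subset character, optional parameters, a final command character) can be sketched in Python 3 as follows. The helper names csi() and osc_title() are made up for this example; the specific sequences used (2 J to clear the screen, H for cursor home, and OSC 0 terminated by BEL for the window title) are common xterm-style conventions rather than something every terminal is guaranteed to honour.

```python
ESC = "\x1b"

def csi(final, *params):
    """Build ESC [ p1;p2;... <final>, e.g. csi("H", 5, 10) to position the cursor."""
    return ESC + "[" + ";".join(str(p) for p in params) + final

def osc_title(title):
    """Build an OSC sequence that asks an xterm-like terminal to set its window title."""
    return ESC + "]0;" + title + "\a"   # BEL terminates the string

clear_screen = csi("J", 2)       # ESC [ 2 J
cursor_home = csi("H")           # ESC [ H, no parameters
bold_red = csi("m", 1, 31)       # an SGR sequence: ESC [ 1;31 m
```

Writing any of these strings to a terminal, e.g. with sys.stdout.write(clear_screen + cursor_home), is what triggers the effect; the strings themselves are just bytes.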





ANSI color


Color in shells is mostly a subset of SGR escapes (see also mention above)


And then mostly in the form

ESC [ textspec m


Some of the more interesting and better-supported codes (search for Select Graphic Rendition)

  • 1: bright foreground color
bright variants can be shorthanded
  • 0 clears to default colors/no bright/underline(verify)
useful to start a sequence, because by default you only change the current state
also useful in itself, e.g. to stick at the end of a colored prompt
  • 4: set underline
  • selective clearing:
39 default foreground color (default color decided by the terminal itself(verify))
49 default background color
22 clears bold
24 clears underline

The classical set of colors:

  • 30 foreground: black (bright+black is how you get dark gray),
  • 31 foreground: red
  • 32 foreground: green
  • 33 foreground: yellow (non-bright yellow is brown or orange on some schemes),
  • 34 foreground: blue
  • 35 foreground: magenta
  • 36 foreground: cyan
  • 37 foreground: light grey (bright+lightgrey is how you get real white)
  • bright foreground colors can be shorthanded: 90..97 are effectively shorthand for 1;30 .. 1;37 (verify)
  • 40 background: black
  • 41 background: red
  • 42 background: green
  • 43 background: yellow
  • 44 background: blue
  • 45 background: magenta
  • 46 background: cyan
  • 47 background: light grey
note that bright applies only to foreground


Note that you can specify multiple codes at once, separated by ; and placed before the m

So, for example:

  • 35
    sets the foreground to magenta
  • 37;44
    is grey on blue
  • 1;37;44
    is white on blue
  • 1;33;45
    is bright yellow on magenta
  • 0;35
    clears foreground and background flags and colors, then sets the foreground to magenta
  • 0;1;35
    same but with bright magenta
  • 0;1;4;35
    same but also underlined
  • 1;47;30
    is dark grey (bright black) on light grey

To play around in a prompt, try something like:

printf "\x1b[0;1;33;45mtest\x1b[0m\n"
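The same experiment from code, as a minimal Python 3 sketch. The helpers sgr() and colored() are hypothetical names for this example; they only join the codes listed above with ; between ESC [ and m.

```python
def sgr(*codes):
    """Build an SGR sequence such as ESC [ 0;1;33;45 m from numeric codes."""
    return "\x1b[" + ";".join(str(c) for c in codes) + "m"

RESET = sgr(0)

def colored(text, *codes):
    """Wrap text in an SGR sequence and reset to defaults afterwards."""
    return sgr(*codes) + text + RESET

if __name__ == "__main__":
    # Same output as the printf line above.
    print(colored("test", 0, 1, 33, 45))
    # The classical foreground colors, plain and with code 1 added.
    for fg in range(30, 38):
        print(colored("fg %d" % fg, fg), colored("with 1", 1, fg))
```

Ending every colored run with the reset code keeps a stray color from leaking into whatever is printed next.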




More colors

Modern terminals may also have 256-color and true color modes.

This is usually advertised via TERM, but there's a bunch of footnotes to that (TODO: write them up)

Assuming you've detected that it will work, it's basically just another color-setting code, with a new set of possible values squeezed in between the regular SGR [ and m:
38 set foreground color with one of these newer codes
48 set background color with one of these newer codes


Followed by either

  • for 256 color:
    ;5;n
    where n is
0..7: (like 30..37): black red green yellow blue magenta cyan gray
8..15: (like 90..97, the bright variants of the above)
16..231: 6×6×6 rgb cube. To calculate:
multiply red (range 0..5) by 36,
multiply green (range 0..5) by 6,
use blue (range 0..5) as-is
and add 16 to put it in this range
232..255: 24 shades of gray


For example, to test support you could do:

for i in {0..255} ; do
    printf "\x1b[38;5;${i}mcol${i}\x1b[0m\t"
done; echo



  • for truecolor:
    ;2;r;g;b
where r, g, and b can each be 0..255

Supporting terminals seem to include xterm, konsole, and libvte-derived ones including GNOME's

e.g.

for i in {0..255} ; do
    printf "\x1b[38;2;100;100;${i}m100,100,${i}\x1b[0m\n"
done
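The index arithmetic for the 256-color palette, and the two sequence shapes, can be written out as a short Python 3 sketch. The function names are made up for this example; the numbers follow the ranges described above.

```python
def cube_index(r, g, b):
    """Palette index of a color in the 6x6x6 cube; r, g, b are each in 0..5."""
    if not all(0 <= c <= 5 for c in (r, g, b)):
        raise ValueError("cube components must be in 0..5")
    return 16 + 36 * r + 6 * g + b

def gray_index(level):
    """Palette index of one of the 24 gray shades; level is in 0..23."""
    if not 0 <= level <= 23:
        raise ValueError("gray level must be in 0..23")
    return 232 + level

def fg_256(n):
    """Foreground from the 256-color palette: ESC [ 38;5;n m."""
    return "\x1b[38;5;%dm" % n

def fg_rgb(r, g, b):
    """Truecolor foreground: ESC [ 38;2;r;g;b m, each component in 0..255."""
    return "\x1b[38;2;%d;%d;%dm" % (r, g, b)

if __name__ == "__main__":
    print(fg_256(cube_index(5, 2, 0)) + "orange-ish from the cube" + "\x1b[0m")
    print(fg_rgb(100, 100, 255) + "truecolor, if supported" + "\x1b[0m")
```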




What happens with these when the terminal doesn't support it?

In general, it will be ignored.

Some seem to try the closest supported color.



Preparing for a specific parser

Preparing a string to hand it to a known parser, to ensure the data that comes out of that parser is what you intended.

And yes, this has overlap with serialization, but is often a l


Text/data into (X)HTML

tl;dr mostly:

  • escaping text that goes into node text (dealing with <>&)
  • escaping text that goes into attributes (dealing with <>&"')
  • escaping text/data to be included in a URI
see Percent escaping (URL encoding). If that is then put within a page, you then pass it through the above (mostly attributes)



If you produce or consume HTML or XML, you want to know how to write values in a way that avoids messing up (strict) parsing.


  • Text nodes
    • < to &lt; and > to &gt;, to avoid interpretation as tags (which would cause things to magically disappear, become invalid, and in some cases allows injection of nasty things).
    • & to &amp; (because entities will be parsed in almost all places, including text nodes and attributes)
  • Attribute text
    • & to &amp; (as HTML parses entities in both text nodes and in attributes)
    • should have quotes replaced, at least " with &#x22; or &#34; or &quot;, to avoid nasty parse/injection cases.
    • Replacing ' with &#x27; or &#39; is generally also a good idea, as some standards (e.g. XML) allow attribute delimiting using single quotes as well as double quotes. Note that using the &apos; named entity is invalid for HTML before 5, but valid for XML and thereby XHTML. While browsers will be robust to both, strict parsers trip over &apos;.
you could either only do it when you know the attribute uses ' (most people seem to use "), or always.


Further notes:

  • you can always escape each character, as a proper parser has to decode it anyway. The main difference between &#97; for a and &#39; for ' is that the first is pretty much never necessary and the latter can be in various contexts.
also it's a lot more bytes, so people try to avoid the truly unnecessary escapes
  • URLs have no special status in most places and must follow text/attribute encoding. In the case of attributes (href, src, and such), this means that dumping them in as-is quickly makes the document invalid, most commonly because of &
  • Using a function that is often called something like "HTML escape" or "CGI escape" solves most entity-related problems, replacing at least &<> with entities. It may or may not also do " and/or ', so check before relying on it.
  • &gt; is not strictly necessary, in node text nor attributes, as parsers usually only look for < as a tag-significant character, but it's consistent and readable to do both.


Shell escaping



What?

Crafting a string to be handed into a shell command will often include filenames, and filenames could contain characters that have special meaning to a shell.

POSIX lists these[1] as:

"  '  \  $  |  &  ;  <  >  (  ) ` space  tab  newline
*   ?   [   #   ~   =   %

Some contexts are safer (e.g. that second line is less trouble than the first), but we probably want a solution that works for all cases.


Handling of non-ASCII is not defined in POSIX (and control codes, being nonprintable, would be hard to type), yet in practice, any character that doesn't have special meaning to shell parsing can be in there as-is (with some minor worry about getting mangled by something else).



From your own code

Probably the easiest approach is to put single-quotes around the string, which avoids basically all shell processing - you just have to deal with the case of ' itself being in the string.

As the bash manual puts it: "A single quote may not occur between single quotes, even when preceded by a backslash."

However, because shells will append strings that aren't separated, you can deal with it in chunks.

For example, a'b can become 'a'"'"'b', and this style (over 'a'\''b') also avoids backslashes, which can avoid some confusion in cases that use multiple layers of escaping.

As such, this style is probably the easiest to implement, and potentially less fragile too.
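A minimal Python 3 sketch of that single-quote style. The function name sh_quote() is made up for this example; the standard library's shlex.quote() uses the same replacement for strings that need quoting, and shlex.split() is used here only as a stand-in for POSIX-like shell parsing to check the round trip.

```python
import shlex

def sh_quote(s):
    """Single-quote s for a POSIX-style shell, writing each ' as '"'"'."""
    return "'" + s.replace("'", "'\"'\"'") + "'"

quoted = sh_quote("a'b")          # the characters: 'a'"'"'b'

# Parsing the quoted form the way a POSIX shell would gives back the input.
assert shlex.split(quoted) == ["a'b"]

# Characters that are otherwise special are inert inside single quotes.
nasty = "$(reboot); `x` \"q\" \\ * ~ file name.txt"
assert shlex.split(sh_quote(nasty)) == [nasty]

# The standard library produces the same text for this input.
assert shlex.quote("a'b") == quoted
```

When the command does not actually need a shell, passing an argument list to subprocess.run() without shell=True sidesteps the quoting question entirely.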


Tools like ls seem to follow a similar style. They may also use ANSI C-quoting, $'\022' style (note that many but not all shells support this, as it is not POSIX), for control bytes (and, apparently, non-ASCII bytes that aren't valid in the current encoding(verify)), probably so that you can also safely print the result in a shell without risk of it including control codes that the shell interprets and may make it go weird.

This should not be necessary for constructing commands.


https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html#Quoting

https://stackoverflow.com/questions/15783701/which-characters-need-to-be-escaped-when-using-bash

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_02_01


URI parsing, escaping, and some related concepts

(See also Uniform Resource Somethings)

Percent escaping (URL encoding)

Percent encoding, also often called URL encoding, refers to using the %hh style (zero-padded two-character hexadecimal) for bytes in bytestrings where necessary.


Doing all of these (particularly the last) lets you dump arbitrary data into URLs, into XML/HTML attributes, and into HTML POST bodies (more specifically, application/x-www-form-urlencoded data).


For example,

foo=bar quu

becomes

foo%3Dbar%20quu

...or (see notes below)

foo%3Dbar+quu 
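The example above can be reproduced with Python 3's urllib.parse, as a small sketch. The only thing assumed here is that you want everything escaped, hence safe=''.

```python
from urllib.parse import quote, quote_plus, unquote

# Escape everything that is not an unreserved character.
assert quote("foo=bar quu", safe="") == "foo%3Dbar%20quu"

# quote_plus() is the form-style variant that writes space as +.
assert quote_plus("foo=bar quu") == "foo%3Dbar+quu"

# By default quote() leaves / alone, which suits paths but not arbitrary values.
assert quote("a/b") == "a/b"
assert quote("a/b", safe="") == "a%2Fb"

# Arbitrary bytes work too, including control codes and values above 127.
assert quote(b"\x00\x7f\xff", safe="") == "%00%7F%FF"

# Alphanumerics pass through, since escaping them would only waste space.
assert quote("abc123", safe="") == "abc123"
assert unquote("foo%3Dbar%20quu") == "foo=bar quu"
```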


In contexts that do percent encoding, it is typically

  • required for byte values 128 through 255
  • either required or a good idea for non-printable characters
mostly meaning the C0 control codes: 0x00 through 0x1F, and 0x7F
which includes whitespace, which in many cases you'd indeed want escaped instead of literal
  • often a good idea for most symbols (because between URLs and markup languages, most are significant as delimiters)
  • allowed for any character - but for many (at least alphanumerics) it's just wasted space



On exceptions.

While it's good to escape all symbols, there are specific exceptions.

For example, when constructing different parts of a URL from code, different details apply to different parts.

Probably the most common case is safely dumping arbitrary bytestrings, or URL, into a URL variable in a way that will make it past URL syntax parsing unscathed.

It matters that different parts of the URL have different rules, and space is a special case.

See notes below.


Note that

  • browsers tend to be forgiving about various technically-incorrect things
  • this isn't very good normalization, e.g. URL comparison is sort of hard, as is thinking of all the possible weird edge cases. There have been many URL (de)coding bugs and a few exploits because of this.


Reserved characters

Reserved characters refer to those that have special meaning (primarily) to URL parsing.

You would percent-escape these when you want to dump these as non-delimiting values into URLs.

People do occasionally forget this, though it doesn't always cause an issue.


The set of reserved characters changed over time. From recent to older:

  • ;/?:@&=+$,!*'()%#[]
    according to RFC 3986 (2005), 'Uniform Resource Identifier (URI): Generic Syntax'
  • ;/?:@&=+$,
    according to RFC 2396 (1998), 'Uniform Resource Identifiers (URI): Generic Syntax'
  • ;/?:@&=
    according to RFC 1738 (1994), 'Uniform Resource Locators (URL)' (same set mentioned in RFC 1808)



The best known special case is that space can be %20 or +.

  • RFC 1866 (HTML2) standardized the use of + as an alternative in application/x-www-form-urlencoded data
8.2.1 mentions "space characters are replaced by `+', and then reserved characters are escaped as per" ...
...it then also uses the + in example URLs elsewhere, without formalizing it, which has led to some confusion (verify)


TODO: dig through further standards, but it seems that:

  • parsing a query string (the part of a URL after ?) should turn both %20 and + into space (because W3C says so(verify))
and since everything should parse that way, it's safe to insert + there
  • URL parts before the query (usually primarily the path) should use only %20
(so that a + can be left unencoded?) (verify)
  • in practice
    • most non-browser code is likely to not treat URL parts differently, because most of us are too lazy and/or don't know about this
    • there are other bordercases
    • your language may have different functions, to distinguish between + and %20, and/or to control what not to encode.
  • so when producing URLs with space, it may be less confusing and less edge-casey to just use %20 everywhere
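As an illustration of a language having separate functions for the two conventions, here is how Python 3's urllib.parse behaves. This is only a sketch of that one library, not a statement about what the standards require.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# The plain variants treat + as an ordinary character...
assert unquote("a+b%20c") == "a+b c"
# ...while the _plus variants apply the form-style "+ means space" rule.
assert unquote_plus("a+b%20c") == "a b c"

# When encoding, a literal + has to be escaped in both styles,
# otherwise a form-style decoder would turn it into a space.
assert quote_plus("1+1 = 2") == "1%2B1+%3D+2"
assert quote("1+1 = 2", safe="") == "1%2B1%20%3D%202"

# %20 survives either decoder, which is why it is the less surprising choice.
assert unquote(quote("a b", safe="")) == unquote_plus(quote("a b", safe="")) == "a b"
```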

Unicode

Since URLs can be considered to carry bytes, unicode first needs to become bytes, so that we can then put it in there %hh style.

At least, that's by far the most common way to put bytes in URLs, following RFC 2718, section 2.2.5 (it also seems the default in the IRI (Internationalized Resource Identifiers) draft), which is also the path of least bother.


Codepoints becoming bytes means the text first needs to be put through a byte encoding.

For text, UTF8 is a good default because of its wide applicability, covering all of unicode.

Some older product-specific code has been known to use cp1252 and/or latin1 or such


(There is also a %uhhhh format mentioned in the IRI draft, but apparently not in a decisive way. It isn't standard, and seems unlikely to become one.)
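A short Python 3 sketch of the "encode to bytes first, then percent-escape" order described above. urllib.parse.quote() does the UTF-8 step itself when given a str; the latin-1 lines are only there to show what older, codepage-based code would have produced.

```python
from urllib.parse import quote, unquote

e_acute = "\u00e9"   # the single character é

# UTF-8 turns it into two bytes, each written as %hh.
assert e_acute.encode("utf-8") == b"\xc3\xa9"
assert quote(e_acute, safe="") == "%C3%A9"
assert quote(e_acute.encode("utf-8"), safe="") == "%C3%A9"

# The same character under latin-1 is one byte, so the URL text differs.
assert quote(e_acute, safe="", encoding="latin-1") == "%E9"

# Decoding only works if both sides agree on the byte encoding.
assert unquote("%C3%A9") == e_acute
assert unquote("%E9", encoding="latin-1") == e_acute
assert unquote("%E9") == "\ufffd"   # invalid as UTF-8, so a replacement character
```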

URL parsing, GET query strings, POST data


We often parse URLs as structures like:

<scheme>://<net_loc>/<abs_path>?<query>#<fragment>
<scheme>://<net_loc>/<abs_path>;<object_params>?<query>#<fragment>

The parts (according to RFC 1808):

  • scheme ":"     scheme name - RFC 1738, section 2.1
  • "//" net_loc   network location and login information RFC 1738, section 3.1
  • "/" path       URL path - RFC 1738, section 3.1
  • ";" params     object parameters - RFC 1738, section 3.2.2, e.g. ";type=a" (used in FTP, occasionally in HTTP URLs)
  • "?" query      query information - RFC 1738, section 3.3, e.g. "?a=b&c=d"
  • "#" fragment   fragment identifier


(TODO: add the newer styles as well)


Most language libraries have a function that splits a URL into these RFC-defined parts, often without any other decoding (such as decoding percent escapes or splitting variables).

Such a function may take:

'http://example.com/goo;type=a;s=search?foo=bar&quu=%3d#help'

And respond with:

'http'
'example.com'
'/goo'
'type=a;s=search'
'foo=bar&quu=%3d'
'help'

Note that such splitters:

  • ...may also parse netloc's parts (username, password, hostname, and port)
  • ...may not be aware of object parameters, and leave them as part of the path.
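Python 3's urllib.parse happens to show both behaviours noted above, so it makes a convenient sketch: urlparse() splits off object parameters, urlsplit() leaves them in the path, and both expose the parts of the network location. The user:pw URL is a made-up example.

```python
from urllib.parse import urlparse, urlsplit

url = "http://example.com/goo;type=a;s=search?foo=bar&quu=%3d#help"

p = urlparse(url)
assert (p.scheme, p.netloc, p.path, p.params, p.query, p.fragment) == (
    "http", "example.com", "/goo", "type=a;s=search", "foo=bar&quu=%3d", "help")

# urlsplit() does not know about object parameters and keeps them in the path.
s = urlsplit(url)
assert s.path == "/goo;type=a;s=search"

# Neither function decodes percent escapes; %3d is still there in the query.
assert "%3d" in s.query

# The network location can be picked apart further.
n = urlsplit("http://user:pw@example.com:8080/")
assert (n.username, n.password, n.hostname, n.port) == ("user", "pw", "example.com", 8080)
```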


There are usually different functions to parse specific parts of a URL, such as parsing out variable-value maps out of the query string, which happens according to application/x-www-form-urlencoded, which can be summarized as:

  • name/value pairs are split with &
  • name is split from value with =
  • names and values are escaped:
    • space is replaced with +
    • reserved characters (and non-ASCII bytes) are percent-encoded
  • form control names/values are listed in the order they appear in the document
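The summary above maps directly onto two functions in Python 3's urllib.parse, sketched here. parse_qsl() is used rather than parse_qs() because it keeps the order and any repeated names.

```python
from urllib.parse import parse_qsl, urlencode

query = "foo=bar&quu=%3d&foo=a+b"

# Split on &, split name from value on =, then undo + and %hh.
pairs = parse_qsl(query)
assert pairs == [("foo", "bar"), ("quu", "="), ("foo", "a b")]

# Going the other way escapes values again and joins with &, in the same order.
assert urlencode(pairs) == "foo=bar&quu=%3D&foo=a+b"
```

The hex digits come back uppercase, which is a reminder that the encoded text is not a unique normal form of the data.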



HTML forms are often encoded either into the URL's query, or placed into the data area of a HTTP POST.

When using POST, basic HTML forms are often sent using the application/x-www-form-urlencoded encoding (the default in HTML standards), which mirrors how a GET request would put form elements in the URL (see URL/percent encoding above), but avoids possible problems related to URL length limits.

This is simple to implement, but for binary file uploads this can be rather inefficient (with non-ASCII bytes represented using three ASCII bytes), so there are alternative ways of encoding a POST's data for file uploads, which follow the format of multipart MIME messages:

  • multipart/form-data (see RFC 2388)
  • multipart/mixed (for more complex MIME streams, such as multiple files in a single form entry -- which at least browsers generally avoid doing(verify))




Note that while var/vals in POST and var/vals in GET are separate things, most web-aware scripting makes the basic interface to them transparent, so that you can pick from either one (and which also lets the handler support both GET and POST).

You can usually get at the separated values and/or the original data.


URN parsing


A URN looks like:

"urn:" <namespace> ":" <namespace_specific_string>

For example:

urn:isbn:0451450523
urn:ietf:rfc:2648
urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66

urn: is regularly omitted.

You can only parse from left to right, and only with knowledge of how each namespace works, since the parsing of the value is entirely namespace-specific.


Since most URN-using applications deal with very few namespaces, you usually need to do no more than split on the first : to separate the top-level namespace from the value at that level.

Practical URN parsing systems will often either hardcode what they know, or allow registration/hooks for handlers per namespace.
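A minimal Python 3 sketch of that "split, then hand off to a per-namespace handler" approach. All names here (split_urn, HANDLERS, handler, parse_urn) and the shape of the returned dicts are invented for illustration; the handlers do no real validation.

```python
def split_urn(urn):
    """Return (namespace, namespace_specific_string), tolerating a missing 'urn:'."""
    rest = urn[4:] if urn.lower().startswith("urn:") else urn
    namespace, sep, nss = rest.partition(":")
    if not sep or not namespace or not nss:
        raise ValueError("not a URN: %r" % urn)
    return namespace.lower(), nss

HANDLERS = {}   # namespace name -> function that parses the rest

def handler(namespace):
    """Decorator that registers a parser for one namespace."""
    def register(func):
        HANDLERS[namespace] = func
        return func
    return register

@handler("isbn")
def parse_isbn(nss):
    return {"isbn": nss.replace("-", "")}

@handler("ietf")
def parse_ietf(nss):
    # urn:ietf:rfc:2648 style: the nesting is this namespace's own convention.
    series, _, number = nss.partition(":")
    return {"series": series, "number": int(number)}

def parse_urn(urn):
    namespace, nss = split_urn(urn)
    func = HANDLERS.get(namespace)
    # Unknown namespaces are passed through unparsed instead of guessed at.
    return func(nss) if func else {"namespace": namespace, "value": nss}
```

For example, parse_urn("urn:ietf:rfc:2648") returns {"series": "rfc", "number": 2648}, while a uuid URN falls through to the generic case because no handler is registered for it.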


There is an official URN namespace registry, at http://www.iana.org/assignments/urn-namespaces/

Your self-defined private-use namespaces should be prepended with X- (similar to unregistered MIME types), to signal them as experimental. For more global use, you should register your namespace.

Notes:

  • The urn:ietf:rfc:2648 example isn't really namespace nesting as far as the URN standard is concerned, but such nesting conventions may be very practical (and in this specific case, is actually formalized, in RFC 2648)
  • The "urn:" part is sometimes left off of URNs.

In code

Python

Percent escaping for bytestrings, either:

urllib.parse.quote(bytestr, safe='')    (Python 3; in Python 2 this was urllib.quote(bytestr,''))
that second parameter means "Escape everything, no exceptions", so also escapes / (like JS's encodeURIComponent()). Useful to make values safe no matter what they are.
If instead of "add these things safely to existing URL string" you do "this is a URL I added things to, take the whole thing and make text safer" (not the best approach but sometimes enough), consider something like urllib.parse.quote(bytestr, safe=':/;?') (imitation of JS's encodeURI, so the result can still be a URL)

For unicode in Python 2, you usually want to hand in unistring.encode('utf8') to the above; Python 3's quote() does that UTF-8 encoding itself when handed a str


Percent escaping dicts or sequences of pairs into an &'d result is easiest via urllib.parse.urlencode() (Python 2: urllib.urlencode())

from urllib.parse import urlencode
urlencode( {'v1':'Q3=32', 'v2':'===='})      ==  'v1=Q3%3D32&v2=%3D%3D%3D%3D'
urlencode( [('v1','Q3=32') ,('v2','====')])  ==  'v1=Q3%3D32&v2=%3D%3D%3D%3D'
note that urlencode() is mostly a little wrapping around quote_plus(), and you can do the same yourself.
Percent escaping values for a POST body, i.e. application/x-www-form-urlencoded style,
comes down to "the parts of a dict/list &'d together" case
using urlencode() is often easiest (since you usually have a bunch of variables)

(For the last two, dealing with unicode means you may want some helper functions(verify))


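Text into HTML/XML text nodes i.e. entities for <, >, &:

html.escape("This & That <3", quote=False) == 'This &amp; That &lt;3'
(Python 3, after import html; Python 2 had cgi.escape(s), which was removed in Python 3.8)


Text into HTML/XML attributes

html.escape(s, quote=False).replace('"','&#x22;').replace("'",'&#x27;')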


Notes:

  • I like to have helper functions named
    escape.nodetext()
    , and
    escape.attr()
    (and
    escape.uri_component()
    )
making it easier to skim code for whether it's conceptually doing the right thing.
  • you could also use &quot; instead of &#x22; / &#34; most of the time (defined in at least HTML 2, HTML 3.2, HTML 4 and 5, XHTML 1.0, XML 1.0, and more. Not in quite everything(verify), yet in everything that you'll be using these days. In fact, Python 3's html.escape(s), with its default quote=True and likely a slightly faster call than the above, uses &quot;)
  • Two ' related issues
    • the named entity &apos; is defined in XML, and e.g. HTML5, but not HTML4[2]. And some browsers are less forgiving about this, so a generic escaping function should use &#39; / &#x27; instead of &apos;
    • various XML, HTML allow wrapping attributes in either " or '
escaping with quote=False (or with Python 2's basic cgi.escape) leaves ' alone, so would break if you wrap in '
this isn't usually an issue, for entirely practical reasons: most people type out " if they're writing their own HTML, and if you use a templating thing it'll typically be doing the escaping for you (correctly)
...yet since it's sometimes safer, and as valid to entity-ize ' as any character, you might as well
  • The above does not care about unicode, because it assumes you (or your framework) are working in unicode, and translating it to byte coding at the edges of the app (note this relies on string functions passing through unicode as-is, which is typically true in python).
...rather than making sure each inserted string is correctly byte-coded, because that's usually more work (but equally valid if you do it consistently, and there are a few cases where it's handier - or even the only option, e.g. in PHP(verify)).
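The helper functions mentioned in the notes above could look roughly like this in Python 3. This is a sketch of one possible implementation, written as plain functions (they would live in a module called escape to get the names used above); the example URL and strings are made up.

```python
import html
from urllib.parse import quote

def nodetext(s):
    """Escape text for use between tags: handles &, < and >."""
    return html.escape(s, quote=False)

def attr(s):
    """Escape text for use inside an attribute, whichever quote delimits it."""
    return nodetext(s).replace('"', "&#x22;").replace("'", "&#x27;")

def uri_component(s):
    """Percent-escape a value so it can sit inside a URL, escaping / as well."""
    return quote(s, safe="")

# A value goes through uri_component() first; the finished URL then goes
# through attr() like any other attribute text.
url = "/search?q=" + uri_component("R&D 100%") + "&lang=en"
link = '<a href="%s" title="%s">%s</a>' % (
    attr(url), attr('say "hi"'), nodetext("R&D <beta>"))
```

Keeping the three names distinct makes the layering visible at the call site: the & inside the value becomes %26, while the & that separates URL variables becomes &amp;.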












Javascript

Percent escaping

Since javascript 1.5: (TODO: summarize situation before javascript 1.5)

encodeURIComponent()
  • percent-escape all reserved characters (point being that you can transport most anything in an URL value (even if it's an URL itself))
  • escapes Unicode as %-encoded UTF-8 bytes
encodeURI()
  • does not percent-escape reserved characters such as ":", "/", ";", and "?" (nor @ & = + $ , #), the point being that the result is still formatted/usable as a URL


There is also
escape()
but I avoid it:
  • Of the reserved characters, escape does not escape *@+/, but does escape the rest(verify)
It escapes Unicode characters in a non-standard javascript-specific way (e.g. U+2222 encodes as %u2222), which most servers will not understand
Note however that you can abuse it to have the net effect of only UTF8 encoding/decoding, by combining it with the above. In the following, the percent part cancels out:
function encode_utf8(s) { return unescape(encodeURIComponent(s)); }
and
function decode_utf8(s) { return decodeURIComponent(escape(s)); }



XML parsing notes

On control codes and arbitrary binary data in XML


Short story: don't expect this to just work.

Short suggestion: When in doubt about data serialization, use some escaping (that escapes to all-safe characters, which is most of them, so ).


XML has details pertaining to representing control codes (note: not related to containing them on a binary level; UTF-16 for western text contains a lot of 0x00 bytes)

These carry through to XHTML, and note that XHTML has some additional restrictions.


Of the C0 control codes (U+00 through U+1F):

  • HTML and XML allow common whitespace characters CR (U+0D), NL (U+0A), and Tab (U+09), both literally and as escaped character references(verify)
  • XML 1.0 does not allow any other C0 control codes.
  • XML 1.1 allows use of U+01 through U+1F (note: not U+00). These new characters are required to appear as escaped character references(verify)


C1 control codes (U+80 through U+9F) and DEL (U+7F):

  • XHTML 1.0 does not allow them
  • XML 1.0 allows them literally and as escaped character references
  • XML 1.1 requires them to appear as escaped character references (the document specifically excludes them, even if the char definition includes them as in XML 1.0), except:
  • XML 1.1 allows only U+85 (NEL) to appear in both forms. It seems it will be translated to a line feed on parsing(verify).


This may be as the standards tell it (though this list hasn't been checked for completeness or even correctness), but parsers may have their own, often slightly simpler view. It's generally best to avoid control codes to be safe.


XML cannot contain raw binary data. To programmers, this means you can't use it as a serialization format for arbitrary data without applying an encoding step. Also, you strictly speaking should not assume XML does 1-to-1 serialization-deserialization for data (or even filesystem paths, as some filesystems allow a wider set of characters than others), for these and further reasons. (Another is that XML 1.1 parsers may transparently apply Unicode canonicalization).

Binary-to-ASCII codings are the simplest workarounds. URL percent escaping may be the easiest to use, Base64 another, but there is no pure-XML or easy XSL-compatible solution, and few general-purpose solutions are storage-efficient.
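A small Python 3 sketch of both halves of this: a raw control code making a document unparseable, and Base64 as the encoding step on both ends. It uses the standard library's ElementTree (an XML 1.0 parser); the element name blob and its encoding attribute are made up for the example, and the receiving side has to know about that convention.

```python
import base64
import xml.etree.ElementTree as ET

data = bytes(range(256))   # every byte value, including all control codes

# A raw C0 control code inside an XML 1.0 document is rejected by the parser.
try:
    ET.fromstring("<blob>\x01</blob>")
    raw_control_parsed = True
except ET.ParseError:
    raw_control_parsed = False

# With a binary-to-text coding in between, the bytes survive the round trip.
elem = ET.Element("blob", encoding="base64")
elem.text = base64.b64encode(data).decode("ascii")
serialized = ET.tostring(elem)

parsed = ET.fromstring(serialized)
restored = base64.b64decode(parsed.text)
```

Base64 costs about a third extra in size, which is the storage-efficiency trade-off mentioned above.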



CDATA, PCDATA, CDATA section



CDATA and PCDATA are concepts within XML (and SGML(verify)) parsing.

It's relevant to understand the distinctions between e.g. PCDATA and CDATA (and RCDATA), though that is not a full explanation of what you were probably looking for.


CDATA technically has a few different meanings depending on context:

  • In DTDs, attributes can be declared CDATA - as opposed to ID, NAME, NUMBER, or such.
Some of those alternatives imply uppercasing, tokenization
...and CDATA just means 'no such processing'.
This has no effect on entity parsing.
  • In DTDs, entity declaration (external or internal) may or may not say CDATA
if it does, it roughly means that the value of that declaration won't be parsed
  • In DTDs, elements may be declared with CDATA contents.
which seems to be one of those "don't do this, unless you can very clearly explain why this is not a bad idea" things.
  • In documents, marked sections are a specific syntax (that can be declared CDATA, RCDATA, and a few others):
<![ CDATA [
and -- can " contain ' most & anything ] unescaped
]]>

...that tells XML parsers to treat that section literally, basically to not parse most things within it.


CDATA sections (more fully: marked sections using CDATA) is what most of us were thinking of.

The point of CDATA sections is often that they can include markup-like characters that you don't have to escape. The most recognizable use is probably in <script> tags in XHTML, so that you don't have to escape < and & in your script.

(Note that CDATA sections don't mean that embedded markup remains untouched. After parsing <, querying the document would still give you &lt;. But that's something for the script parser to deal with(verify). The reason we do this is purely that us coders only ever see/edit the document in serialized form)


Doing this has no meaning in HTML, where

  • <script> is specifically defined to be parsed that way already
  • HTML parsers officially do not know about this syntax, and will treat that as a literal string. (browsers might be resilient to this in the script tag, just because various coders copy-pasted it as some magic incantation they didn't fully understand)

As such, a CDATA section that may go to both XHTML and HTML could be commented out in the relevant language, like:

<script type="text/javascript">
//<![CDATA[
document.write("<");
//]]>
</script>


CDATA sections are sometimes used to dump more arbitrary data into XML.

If you care about XML validation and strict parsers, there are footnotes to this.

  • as implied above, markup like < may, depending on what consumes it, be turned into its entities.
as such, CDATA never really means 'include verbatim'
  • CDATA sections cannot include the string
    ]]>
    - seeing that will end that section regardless of context.
It doesn't seem there's an escape for this, so if you cannot know for sure that string will never occur in the data you want to store, the cleanest solution is probably another layer of encoding that you then put within a CDATA section. Or to not use XML.
  • This also implies nesting CDATA isn't possible, but there are workarounds like stopping and starting CDATA section.
  • CDATA sections may only use characters available in the encoding. This means you may want to escape them - but the other side will need to know to unescape them.
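The stop-and-restart workaround for ]]> mentioned above can be sketched in Python 3 like this. The function name cdata() is made up; the check uses the standard library's ElementTree parser, and the payload is an invented script fragment that happens to contain ]]>.

```python
import xml.etree.ElementTree as ET

def cdata(s):
    """Wrap s in CDATA, ending and reopening the section inside each ]]>."""
    return "<![CDATA[" + s.replace("]]>", "]]]]><![CDATA[>") + "]]>"

# Contains <, &, quotes, and the forbidden ]]> (in a[b[0]]>1).
payload = 'if (a[b[0]]>1 && c<2) { x = "&"; }'

doc = "<script>" + cdata(payload) + "</script>"

# The parser joins the adjacent sections back into the original text.
assert ET.fromstring(doc).text == payload
```

This only deals with ]]>. Characters that XML does not allow at all, such as most control codes, still need a separate encoding layer.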


If you want to dump binary data verbatim, HTML/XML cannot really guarantee this, and you're probably better off using a binary-to-text coding on both ends.

