XML notes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Intro

Some upsides and downsides

On control codes and arbitrary binary data in XML

See Escaping and delimiting notes#On_control_codes_and_arbitrary_binary_data_in_XML


On externally defined entities

XML allows the definition of entities in a DTD.


This implies that parsing the XML may require fetching any DTDs it mentions.

Depending on the XML library you use (and possibly settings)

  • it may have the DTD in a catalogue (if it's a very common one, or you put it in there)
  • it may have to fetch it
  • it may fail

...so yes, there are XML documents that cannot be parsed offline.
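
As a small illustration of the "it may fail" case, here is a sketch assuming Python's standard-library xml.etree.ElementTree, which as far as I know never downloads an external DTD. The document is made up for the example: it uses an entity (&nbsp;) that is defined only in the external XHTML DTD, so the parser does not know it.

```python
import xml.etree.ElementTree as ET

# Illustrative document: &nbsp; is defined only in the external XHTML DTD,
# which this parser does not download.
DOC = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<html><body>a&nbsp;b</body></html>'
)

def try_parse(text):
    """Return (True, root) if parsing works, (False, error message) if not."""
    try:
        return True, ET.fromstring(text)
    except ET.ParseError as err:
        return False, str(err)

# Fails: the entity is undefined as far as this parser can tell.
ok, result = try_parse(DOC)
print(ok, result)

# The same document without the DTD-defined entity parses fine offline.
ok_plain, root = try_parse(DOC.replace("&nbsp;", " "))
print(ok_plain, root.tag)
```

Other libraries may instead consult a catalogue or go to the network, depending on their settings.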


namespaces in XML


In XML, each element can be in its own namespace.

Each distinct attribute can be in its own namespace too, but 99% of the time that's just masochism.




On uses

What are namespaces useful for?

A few things.

  • Overall document namespace - tack one onto the root element as a definition of the document type and version, so that something aware of specific namespaces (i.e. more than just a parser) can deal with different document types automatically
there are other ways to do this, and some can be more practical, but this is another perfectly valid way
  • Evolving complex document formats over time
In that it lets you segment off different concepts, different versions, etc.
  • Allows (mechanical) validation of such mixes
  • might make it easy for a parser to say "I know of the type but not the version, bug us to update it" instead of "error parsing document"
...but I'm not sure I have ever seen this done via namespaces, possibly because this is probably cleaner to do via some attributes.
  • Embedding fragments, or whole documents defined by another standard
Say, if you embed SVG in XHTML or indeed HTML, you should probably start it with <svg xmlns="http://www.w3.org/2000/svg">
upsides:
namespaces avoid ambiguity of what standard each node/attribute refers to,
namespaces avoid potential clashes if the standards use the same node/attribute name,
it's easy for a program to simply ignore anything it doesn't know.
arguables/downsides:
in many cases, a mix of standards either
is already completely standardized by explicit design (e.g. office documents, de facto standards used by any one program), so any embedding is essentially hardcoded in its specific parser, or
basically doesn't happen: arbitrarily dumping XML in other XML is rare, because it is practically unclear how to relate the part to the whole, even if you know how to parse it perfectly well.
The best uses I can think of are
having an XML container format that can contain others -- e.g. specific things like an XML-based search interface serving only XML metadata
having formats be future compatible based on "you can completely ignore records whose namespace you do not know"
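
The "ignore what you don't know" idea can be sketched roughly as follows, assuming Python's standard-library xml.etree.ElementTree. The namespace URIs, element names, and helper functions are all hypothetical, made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaces, made up for this example.
KNOWN = {"http://example.com/records/v1"}

DOC = """
<records xmlns:r="http://example.com/records/v1"
         xmlns:x="http://example.com/some-extension">
  <r:record>one</r:record>
  <x:fancy>something this reader has never heard of</x:fancy>
  <r:record>two</r:record>
</records>
"""

def namespace_of(element):
    """Return the namespace URI of an element, or None if it has none."""
    tag = element.tag
    if tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return None

def known_children(root):
    """Yield only the children whose namespace we know; skip the rest."""
    for child in root:
        if namespace_of(child) in KNOWN:
            yield child

root = ET.fromstring(DOC)
texts = [child.text for child in known_children(root)]
print(texts)
```

A reader written this way keeps working when a newer writer adds elements from a namespace it has never seen.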


Namespaces as a hack on XML



On aliases

While you can stick the value/URI explicitly on every node that is to be in that namespace, this is unreadably verbose and very space-inefficient, so the typical way is to define an alias (in practice a.k.a. prefix) to be used as a shorthand.


Aliases are scoped, in that they are valid references only in the subtree under the node the alias is declared on.

...yet it is relatively typical to declare them all document-wide.
...and in fact typical to add the things you might use, even if you don't


Aliases are primarily useful for keeping human-written and serialized documents less verbose than they could be.


They are also arguably a broken abstraction that has led to a lot of confusion.


For starters, the namespace is that value/URI, not the alias.

While looking at an example this looks like semantics, but there are real-world reasons this is weird. This can be a little abstract to grasp intuitively, so a few angles:

An XML parser, or XSLT transform, cannot tell you what that alias was in serialized form.
Loading from disk and writing to disk means aliases stay unique, but the actual string used for the alias does not and cannot persist.
The alias does not exist in the represented data, even though it exists in the (typically[1] human-readable) file that represents that data.
When processing matches on a namespace, it only cares about the value/URI, and when saving the result of processing in XML, the alias name cannot be saved (it's essentially just generating random unique identifiers, usually via enumeration as the ns0, ns1, ... convention shows).


Sure, examples often use human-sensible aliases.

And examples will use that alias consistently in the XML and in related documents, e.g. the XSLT that transforms it.

But it turns out that aliases being readable names is a convenience only relevant to humans hand-crafting documents.

To machine parsing, this has no meaning. To a parser there isn't any real difference between:

<root xmlns:example="http://example.com"><example:element/></root>

and

<root xmlns:fhwdgads="http://example.com"><fhwdgads:element/></root>

and:

<root><ns0:element xmlns:ns0="http://example.com" /></root>

The last came from a pass of parsing and writing it out again. It looks different (here the namespace definition moved down to the highest place in the tree that actually uses it), but it is actually equivalent.
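
A quick way to check this for yourself, sketched here with Python's standard-library xml.etree.ElementTree (other libraries differ in details, such as which prefix they generate and where they put the declaration):

```python
import xml.etree.ElementTree as ET

DOCS = [
    '<root xmlns:example="http://example.com"><example:element/></root>',
    '<root xmlns:fhwdgads="http://example.com"><fhwdgads:element/></root>',
    '<root><ns0:element xmlns:ns0="http://example.com" /></root>',
]

# After parsing, each child is named by namespace URI plus local name;
# the prefix used in the file is gone.
tags = [ET.fromstring(doc)[0].tag for doc in DOCS]
print(tags)

# Writing the first document back out: the serializer picks its own prefix.
rewritten = ET.tostring(ET.fromstring(DOCS[0]), encoding="unicode")
print(rewritten)
```

All three parse to the same element name, and the rewritten text no longer contains the "example" alias.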


Notes:

  • If you do try to consider aliases part of the document model in any way, that implies the document model is not fully defined by the DTD (or such), only by the document itself.
  • ...in practice, the DTD/Schema has no say about aliases
  • In transforms, like XSLT, this amounts to matching on the namespace value/URI.
aliases are there just for human readability



Namespaces and DTDs

tl;dr:

DTDs do not support XML namespaces at all
so if you want validation and namespaces, you need XML schema


DTDs have no syntax to define a namespace declaration or alias.

You can put a prefix: on names -- but it won't be a prefix, or namespace, in the XML sense of being separate, or of representing a value/URI.

It essentially becomes part of the node/attribute name.


You can make a DTD with colons in its node names, but you cannot create one that is actually namespace-aware (e.g. how could you tell that a not-really-an-alias name maps to different things in different parts of a document?).

"Usually won't in practice" is not why we do strict validation, which is why we use XML Schema instead (which works around this by itself being expressed in XML).



Notes on (ab)use of namespaces

Namespaces in XSLT


You must, in your XSL, declare all namespaces that you wish to match in the input XML, because you won't be able to match those otherly-namespaced things if you don't.


This means XSL must always be specifically hand-crafted for every specific transform you want. Dealing with deviations in documents, or even just different variants and versions of what conceptually is the same namespace (which would be easy to express in code), may be awkward or even impossible to express in XSL.


It's also quite wordy, because every part must match the namespace. You can save some typing by using the default namespace, which is anonymous (letting you write xmlns="bla" instead of the aliased xmlns:dc="bla"), but you still need to get the identifier right.

To match that in XSL, you need to define an alias with the same (URL) identifier (e.g. xmlns:x="http://www.w3.org/1999/xhtml" if you happen to know the input is XHTML), after which you must use it on every tag you wish to be matched - which tends to be almost everything.

When that's a pain, blame the creators of the schema for unnecessary use of bothersome namespaces. ...or use XSLT2, which solves much of the alias bother.
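
The same bother shows up in plain XPath-style matching, not just XSLT. A small sketch, assuming the limited XPath subset in Python's standard-library xml.etree.ElementTree (this is not XSLT, just the same matching rule); the document is made up and the alias "x" is an arbitrary choice:

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
DOC = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p>hello</p></body></html>'
)
root = ET.fromstring(DOC)

# Without a namespace, the path matches nothing: the element is not
# called "body", it is called "body in the XHTML namespace".
unqualified = root.find("body")

# The alias "x" is our own choice; only the URI has to match the input.
ns = {"x": XHTML}
qualified = root.find("x:body/x:p", ns)

print(unqualified, qualified.text)
```

Note that the input used the default namespace and no alias at all, yet the query still has to name the namespace on every step.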

Related - XSLT, XPath, XQuery, XSL-FO, XSL, etc.

  • XPath - XML Path Language
a query language that selects nodes in a tree document, in a path-like way.
see also #XPath notes below
While part of XSL(verify) and used in XSLT and XSL-FO, it's also seen in a few other places, e.g. CSS, sometimes limited forms of it in HTML/XML parsers/scrapers
https://en.wikipedia.org/wiki/XPath


  • XSD - XML Schema Definitions, a.k.a. XML Schema
Lets you
specify structure that XML should adhere to
run a validator on it, to say that it does or doesn't
https://en.wikipedia.org/wiki/XML_Schema_(W3C)
Note that there is also RelaxNG, which is a little easier to handle (more DTD-like), and can be converted to XSD. In the end, it depends a little on your tooling and your goals.


  • Schematron
Contains a collection of rules (patterns to impose on XML document structure) that can be machine-evaluated. Allows some things that cannot be expressed within just DTD or XML Schema.
https://en.wikipedia.org/wiki/Schematron
https://www.xml.com/pub/a/2003/11/12/schematron.html



  • XSLT - specifies (in XML) how to transform XML into other XML
https://en.wikipedia.org/wiki/XSLT


  • XQuery - XML Query


  • XSL-FO "Formatting Objects" - specifies (in XML) how to transform XML into other types of documents
mostly used for article/print formats like PDF, PostScript, PCL, AFP,
also things like text (TXT, RTF), images, SVG
https://en.wikipedia.org/wiki/XSL_Formatting_Objects
e.g. implemented by Apache FOP


  • XSL (Extensible Stylesheet Language) is a name that groups XSLT, XSL-FO, and XPath
  • XDM (XQuery and XPath Data Model)
data model (mostly types(verify)) shared by the XPath, XSLT, XQuery, and XForms
  • XPointer
https://en.wikipedia.org/wiki/XPointer
  • XLink
lets you add links to XML (...variants that didn't get that already by merit of being XHTML), and lets you add metadata to those links
  • XProc
https://www.w3.org/TR/xproc/



Returning questions

  • "Isn't XQuery basically the same as XSLT?"
"XQuery is typed, XSLT is untyped"
...that is, XSLT1 was untyped. XSLT2 and XSLT3 are typed (but XSLT in browsers is often still XSLT1)
"XQuery was made only to fetch XML fragments, XSLT transforms to create a new XML document"
...but things like Qizx/open allow file output from XQuery.
XQuery has various extensions; XSLT has no unofficial ones (verify)
XQuery Scripting Extension made it even more like a procedural language
XQuery usually makes more sense selecting from XML databases
XQuery is more brief and probably easier to learn, XSLT is quite verbose
XQuery is declarative, XSLT is functional.
Interesting: [2] [3]


  • "what's the difference between XSL and XSLT?"
officially, XSL is a name grouping XSLT, XSL-FO, and XPath
practically, people may use XSL to refer to
XSLT (which often uses xsl to refer to its own namespace, and as the filename suffix)
XSL-FO
an old, non-standard MS implementation of XSLT (which was far more scriptable - though the languages you could use changed over time(verify)[4])



Unsorted

See also


XPath notes

/bookstore/book/title
/bookstore/book[1]/title 
/bookstore/book[@price>35]/price 
/bookstore/book/title
book/*[position()=1]
//tagname[@attribute='value']
//body/main/main-text//paragraph
//a | //b

/ is the root; // is anywhere under.
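
To try a few of these, here is a sketch using Python's standard-library xml.etree.ElementTree, which supports only a subset of XPath (full XPath needs another library). The bookstore document, its titles, and its category attribute are made up for the example.

```python
import xml.etree.ElementTree as ET

DOC = """
<bookstore>
  <book category="web" price="39.95"><title>Learning XML</title></book>
  <book category="cooking" price="30.00"><title>Everyday Italian</title></book>
</bookstore>
"""
root = ET.fromstring(DOC)  # root is the <bookstore> element

# Paths are relative to the element you call them on, so
# /bookstore/book/title becomes book/title here.
all_titles = [t.text for t in root.findall("book/title")]
first_title = [t.text for t in root.findall("book[1]/title")]
cooking = [t.text for t in root.findall(".//book[@category='cooking']/title")]

# Numeric comparisons like [@price>35] are not in ElementTree's subset,
# so that kind of filter is done in Python instead.
expensive = [
    b.findtext("title") for b in root.findall("book")
    if float(b.get("price")) > 35
]
print(all_titles, first_title, cooking, expensive)
```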