Markup language notes
|This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)|
...that is, markup used more for documents -- for similar things mostly used for data, see e.g. Programming_notes/Communicated_state_and_calls#Data_and_serialization
|These are primarily notes|
It won't be complete in any sense.
It exists to contain fragments of useful information.
Markdown is used in places like GitHub, reddit, Stack Overflow and SourceForge.
# First-level heading
## Second-level heading
### Third-level heading
etc.
Alternatively (for the first two only):
First-level heading
===================

Second-level heading
--------------------
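To make the two heading styles concrete, here is a minimal Python sketch that picks headings out of a block of text. The function name `find_headings` and the regular expression are assumptions made for illustration; real Markdown parsers handle many more edge cases (for example, a list item followed by a line of minus signs).

```python
import re

# "#" style: one to six hashes, whitespace, the title, optional closing hashes.
ATX = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")

def find_headings(text):
    """Return (level, title) pairs for '#' headings and underlined headings."""
    lines = text.splitlines()
    headings = []
    for i, line in enumerate(lines):
        m = ATX.match(line)
        if m:
            headings.append((len(m.group(1)), m.group(2)))
            continue
        # Underlined style: a non-empty line followed by a line of only = or -
        if i + 1 < len(lines) and line.strip():
            underline = lines[i + 1].strip()
            if underline and set(underline) == {"="}:
                headings.append((1, line.strip()))
            elif underline and set(underline) == {"-"}:
                headings.append((2, line.strip()))
    return headings
```

Note that the underlined style only gives you two levels, which is why the `#` style is the one to use for anything deeper.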
- links, images
[inline link](url.here)
[footnote link][label] (with a line reading "[label]: url.here" elsewhere in the document)
![alt text](image.url.here)
- emphasis
*emphasis (italic)*
_emphasis (italic)_
**strong emphasis (boldface)**
__strong emphasis (boldface)__
***very strong emphasis (italic and boldface)***
___very strong emphasis (italic and boldface)___
Text with `some_code()`
Longer code should be indented with four spaces
- Text layout
Paragraphs of natural text are separated by one or more empty lines.
Within a paragraph, you sometimes want manual line breaks. You can get those by ending a line with two or more spaces.
Unordered bullet lists via +, - and *
Ordered lists via numbers.
+ Thing
    1. Numbered subthing
    1. Another numbered subthing
+ Other thing
If you need a paragraph to belong to an item, indent it with four spaces (or a tab).
Horizontal rules: A line containing three or more asterisks or minus signs, e.g.
* * *
****
- - -
---------------------------------------
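The horizontal-rule description translates fairly directly into a regular expression. The following is a sketch only; the name `is_horizontal_rule` is made up for this example, and it assumes the markers on one line must all be the same character, possibly separated by spaces.

```python
import re

# Three or more of the same marker (* or -), optionally separated by spaces.
HR = re.compile(r"^ {0,3}([*-])(?: *\1){2,} *$")

def is_horizontal_rule(line):
    """True if the line consists of three or more asterisks or minus signs."""
    return HR.match(line) is not None
```

A line of minus signs directly under a line of text is read as a second-level heading instead, so a parser has to check for that case before treating the line as a rule.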
reStructuredText, sometimes abbreviated rst or reST (not to be confused with REST).
BBCode ('bulletin board code') allows a simple alternative to HTML, for users in forums, etc.
It is a little simpler to type but, perhaps more importantly, it makes sanitizing input easier, both against invalid or unbalanced HTML that could disturb the page and against things like XSS script injection, and it does so with a "whitelist, don't blacklist" approach.
Generally, you would remove all HTML, then parse and convert BBCode to HTML, though the removal of HTML is itself sometimes done unsafely. (One alternative is to have your BBCode parser escape all HTML, so that exploit code is simply displayed verbatim.)
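As an illustration of the "escape everything, then convert a whitelist" alternative, here is a minimal Python sketch. The tag whitelist, the function name `bbcode_to_html` and the http(s)-only URL rule are assumptions chosen for the example, not a description of any particular forum package. It is regexp-based, so it does not guarantee well-nested HTML for badly nested input.

```python
import html
import re

# Hypothetical whitelist: only these simple paired tags are converted.
SIMPLE_TAGS = {"b": "b", "i": "i", "u": "u", "s": "s"}

def bbcode_to_html(text):
    # 1. Escape everything, so any HTML the user typed is displayed verbatim.
    out = html.escape(text, quote=True)
    # 2. Convert whitelisted paired tags; anything else stays as literal text.
    for bb, tag in SIMPLE_TAGS.items():
        pattern = re.compile(r"\[%s\](.*?)\[/%s\]" % (bb, bb),
                             re.IGNORECASE | re.DOTALL)
        out = pattern.sub(r"<%s>\1</%s>" % (tag, tag), out)
    # 3. [url]...[/url], restricted to http(s) so javascript: links stay inert.
    out = re.sub(
        r"\[url\](https?://[^\s\[\]\"<>]+)\[/url\]",
        r'<a href="\1">\1</a>',
        out,
        flags=re.IGNORECASE,
    )
    return out
```

Because the escaping happens first, the only HTML in the output is what the converter itself produced, which is the whitelist idea in practice.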
BBCode is not a standard, so there is variation in what tags parsers will accept, and in what form they will or won't accept them. Consider:
- unbalanced bbtags
- unknown arguments, usage of arguments at all (see the various [url] styles)
- how they transform it, and whether they guarantee correct HTML output (regexp-based implementations regularly do not)
- whether they actually live up to the mentioned safety.
This depends mainly on how well the implementer understands the intricacies.
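To show what checking for unbalanced bbtags can look like, here is a small stack-based sketch in Python. The set of known tags, the tag regular expression and the name `unbalanced_tags` are all assumptions for illustration. Real parsers differ in which tags and argument forms they accept.

```python
import re

# Matches [tag], [/tag] and [tag=argument]; other forms are treated as text.
TAG = re.compile(r"\[(/?)([a-z]+)(?:=[^\]]*)?\]", re.IGNORECASE)

def unbalanced_tags(text, known=("b", "i", "u", "s", "url", "quote", "code")):
    """Return a list of problems; an empty list means known tags nest properly."""
    stack, problems = [], []
    for m in TAG.finditer(text):
        closing, name = m.group(1) == "/", m.group(2).lower()
        if name not in known:
            continue  # unknown tags are left alone as literal text
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append("unexpected [/%s]" % name)
    # Whatever is still open at the end was never closed.
    problems.extend("unclosed [%s]" % name for name in stack)
    return problems
```

A parser could refuse such input, display the stray tags as literal text, or close them itself. Which of these a given implementation does is part of the variation mentioned above.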
For example, the core tags seem to consist of roughly:
[b]bolded text[/b]
[i]italicized text[/i]
[u]underlined text[/u]
[s]strikethrough[/s]
[img]http://example.com/pic.png[/img]
[url]http://example.com[/url]
and sometimes also:
[url=http://example.com]Link name[/url]
It's not uncommon to see:
[email]email@example.com[/email]
[color=red]Red Text[/color]
[size=15]Large Text[/size]
[center]horizontal centering[/center]
[pre]preformatted text[/pre]
[quote]quoted text[/quote]
also:
[quote=Will]quoted text[/quote]
[quote Will said]quoted text[/quote]
[code]monospaced text[/code]
[:-)]
And I've seen mention of:
[link] (same functionality as url)
[list]
* Item
* Item
[/list]
[google], [wiki] (search link, by term)
[spoiler]Dumbledore likes to boogie.[/spoiler]
[whisper=username]Psst.[/whisper] (private message to specific user on bboard)
[html]Freeform HTML. If available at all, only admins should ever get to use this.[/html]
[flash], [audio] (embedding, with various options)
There is no real reference; authors say "see the parser code".
For information extraction, it may be simpler to parse the resulting HTML, partly because the parser code does a little correction and normalization.
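For instance, pulling the plain text and link targets out of already-rendered HTML can be done with Python's standard `html.parser` module. The class and function names below are made up for this sketch, and it assumes you only care about text and `a href` values.

```python
from html.parser import HTMLParser

class LinkAndTextExtractor(HTMLParser):
    """Collects text content and link targets from rendered HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)

def extract(html_text):
    parser = LinkAndTextExtractor()
    parser.feed(html_text)
    parser.close()  # flushes any text still buffered
    return "".join(parser.text_parts), parser.links
```

This way the different `[url]` styles have already been normalized into one `a href` form by the time you look at them.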