Markup language notes

This article/section is a stub: probably a pile of half-sorted notes, probably a first version, and not well-checked, so it may have incorrect bits. (Feel free to ignore, or tell me)

...that is, markup used more for documents -- for similar things mostly used for data, see e.g. Programming_notes/Communicated_state_and_calls#Data_and_serialization


These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.

Markdown is used in places like GitHub, reddit, Stack Overflow and SourceForge.

# First-level heading
## Second-level heading
### Third-level heading

Alternatively (for the first two only), underline the heading text:

First-level heading
===================

Second-level heading
--------------------

links, images
[inline link](
[footnote link][]
![alt text](

*emphasis (italic)*
_emphasis (italic)_
**strong emphasis (boldface)**
__strong emphasis (boldface)__
***very strong emphasis (italic and boldface)***
___very strong emphasis (italic and boldface)___ 
Text with `some_code()`
    Longer code 
    should be indented with
    four spaces

Text layout

Paragraphs of natural text are separated by one or more empty lines.

Like this.

For code, you sometimes want manual line breaks. You can get those
by ending a line with two or more spaces.

> Blockquote


Unordered bullet lists via +, - and *

Ordered lists via numbers.

+ Thing
  1. Numbered subthing
  1. Another numbered subthing
+ Other thing
    If you need a paragraph to belong to an item, use four spaces (or a tab)


Horizontal rules: A line containing three or more asterisks or minus signs, e.g.

* * *
- - -
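
If you ever need to recognize such lines yourself, the rule above fits in one regular expression. This is only a sketch: the name `is_horizontal_rule` is made up for illustration, and allowing up to three leading spaces is an assumption borrowed from CommonMark, not something stated above.

```python
import re

# Three or more of the same character (* or -), optionally separated by spaces.
# Assumption: up to three leading spaces are tolerated, as in CommonMark.
HR_PATTERN = re.compile(r"^ {0,3}([*-])(?: *\1){2,} *$")


def is_horizontal_rule(line: str) -> bool:
    """Return True if the line looks like a Markdown horizontal rule."""
    return HR_PATTERN.match(line) is not None
```

Note that real parsers also look at context: a line of minus signs directly under a line of text is read as the underline of a second-level heading rather than as a rule.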


reStructuredText is sometimes abbreviated rst or reST (not to be confused with REST)


BBCode ('bulletin board code') allows a simple alternative to HTML, for users in forums, etc.

It is a little simpler to type but, perhaps more importantly, it makes sanitizing your input easier, both for invalid/unbalanced HTML that could disturb the page and for things like nasty XSS script inserts, and it does so with a "whitelist, don't blacklist" approach.

Generally, you would remove all HTML, then parse the BBCode and convert it to HTML, though the removal of HTML is itself sometimes done unsafely. (One alternative is having your BBCode parser escape all HTML, so that exploit code is simply displayed verbatim.)
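
The escape-first alternative could look roughly like the following. It is a minimal sketch, not any real forum's implementation: the function name `bbcode_to_html` and the three-tag whitelist are assumptions made for illustration, and because it is regexp-based it does not check nesting at all.

```python
import html
import re

# Hypothetical whitelist: only these simple paired tags are converted.
SIMPLE_TAGS = ("b", "i", "u")


def bbcode_to_html(text: str) -> str:
    """Escape all HTML first, then convert whitelisted BBCode tag pairs."""
    # Step 1: neutralize any HTML in the input, so it is displayed verbatim.
    out = html.escape(text, quote=True)
    # Step 2: convert only known tag pairs; anything else stays literal text.
    for tag in SIMPLE_TAGS:
        pattern = re.compile(
            r"\[%s\](.*?)\[/%s\]" % (tag, tag), re.IGNORECASE | re.DOTALL
        )
        out = pattern.sub(r"<%s>\1</%s>" % (tag, tag), out)
    return out
```

With this, `[b]hi[/b] <script>alert(1)</script>` comes out as `<b>hi</b> &lt;script&gt;alert(1)&lt;/script&gt;`: the only tags in the output are ones the converter produced itself, which is the whitelist idea.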

BBCode is not a standard, so there is variation in what tags parsers will accept, and in what form they will or won't accept them. Consider:

  • capitalizing
  • nesting
  • spacing
  • unbalanced bbtags
  • unknown arguments, usage of arguments at all (see the various [url] styles)
  • how they transform it, and whether they guarantee correct HTML output (regexp-based implementations regularly do not)
  • whether they actually live up to the mentioned safety.

This depends mainly on how well the implementer understands the intricacies.
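
To make the point about regexp-based implementations concrete, here is a sketch that contrasts a naive per-tag substitution with a small stack-based converter. Everything in it is hypothetical (the names `naive_bbcode` and `bbcode_balanced`, the whitelist, and the policy of showing stray closing tags as plain text); it only illustrates one way to keep the output balanced.

```python
import html
import re

ALLOWED = {"b", "i", "u"}  # hypothetical whitelist of argument-less tags
TOKEN = re.compile(r"\[(/?)([a-z]+)\]", re.IGNORECASE)


def naive_bbcode(text: str) -> str:
    """Per-tag regexp substitution; can emit mis-nested HTML."""
    for tag in ("b", "i"):
        text = re.sub(
            r"\[%s\](.*?)\[/%s\]" % (tag, tag), r"<%s>\1</%s>" % (tag, tag), text
        )
    return text


def bbcode_balanced(text: str) -> str:
    """Stack-based conversion; every opened HTML tag gets closed in order."""
    out = []
    stack = []  # names of currently open tags, innermost last
    pos = 0
    for m in TOKEN.finditer(text):
        out.append(html.escape(text[pos:m.start()]))
        pos = m.end()
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        if name not in ALLOWED:
            out.append(html.escape(m.group(0)))  # unknown tag: show as text
        elif not closing:
            stack.append(name)
            out.append("<%s>" % name)
        elif name in stack:
            # Close everything opened after `name`, then `name` itself.
            while stack:
                top = stack.pop()
                out.append("</%s>" % top)
                if top == name:
                    break
        else:
            out.append(html.escape(m.group(0)))  # stray closer: show as text
    out.append(html.escape(text[pos:]))
    while stack:  # close whatever the user left open
        out.append("</%s>" % stack.pop())
    return "".join(out)
```

For the input `[b][i]text[/b][/i]`, `naive_bbcode` returns the mis-nested `<b><i>text</b></i>`, while `bbcode_balanced` returns `<b><i>text</i></b>[/i]`. An unclosed `[b]open` becomes `<b>open</b>`.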

For example, the core tags seem to consist of roughly:

[b]bolded text[/b]

[i]italicized text[/i]

[u]underlined text[/u]



and sometimes also:
[url=]Link name[/url]

It's not uncommon to see:


[color=red]Red Text[/color]

[size=15]Large Text[/size]

[center]horizontal centering[/center]


[quote]quoted text[/quote]
   [quote=Will]quoted text[/quote]
   [quote Will said]quoted text[/quote]
[code]monospaced text[/code]


And I've seen mention of:

[link]  (same functionality as url)

[list]
[*]Item
[*]Item
[/list]

[google], [wiki]  (search link, by term)

[spoiler]Dumbledore likes to boogie.[/spoiler]

[whisper=username]Psst.[/whisper] (private message to specific user on bboard)

[html]Freeform HTML. If available at all, only admins should ever get to use this.[/html]

[flash], [audio] (embedding, with various options)


No real reference; authors say "see the parser code".

For information extraction, it may be simpler to parse the resulting HTML, partly because the parser code does a little correction and normalization.
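
As a rough illustration of working from the rendered HTML instead of the BBCode, the standard library's `html.parser` is enough to pull out links and plain text. The class and function names here (`LinkAndTextExtractor`, `extract`) are made up for this sketch, and it assumes links end up as ordinary `<a href="...">` elements in the board's output.

```python
from html.parser import HTMLParser


class LinkAndTextExtractor(HTMLParser):
    """Collect href values of <a> tags and all text content."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)


def extract(html_text: str):
    """Return (list of link targets, concatenated text) for an HTML fragment."""
    parser = LinkAndTextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return parser.links, "".join(parser.text_parts)
```

For example, `extract('<b>Hello</b> <a href="http://example.com/">site</a>')` gives `(['http://example.com/'], 'Hello site')`.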
