Programming notes/Communicated state and calls

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


Local(ish) and Network geared

Varying technologies, techniques, and implementation types include the following:

'IPC'

Inter-process communication is a generic term, and some use it for many of the local-ish methods, dealing with one or more of:

  • message passing
  • synchronization
  • shared memory
  • remote procedure calls


In the broad sense, this can describe things of varying type, complexity, speed, goals, guarantees, and cross-platformness, including:

  • File-related methods (e.g. locking)
  • Sockets (network, local)
  • Message queues
  • process signals
  • named pipes
  • anonymous pipes
  • semaphores
  • shared memory

...and more.


One of the easiest methods to get cross-platformness is probably the use of (network) sockets - mostly since Windows isn't POSIX compliant, and most other things are even less cross-platform.

Threads (same-process)

In multi-threaded applications, IPC-like methods are sometimes used for safe message passing.

And sometimes for convenience, as some ease communication between multiple threads spread among multiple processes.

Same-computer

Fast same-computer (same-kernel, really) process interaction:

  • POSIX named pipes
  • POSIX (anonymous) pipes
  • POSIX shared memory
  • POSIX semaphore
  • SysV IPC (queues, semaphores, shared memory)
  • unix sockets (non-routable network-style IO on unix-style OSes), not unlike...
  • windows' LPC

Interaction between applications, nodes in a cluster, etc.

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


Relatively manual

That is, interoperation support mechanisms that leave their use up to you.

See also MPI notes


Networkable application embedding


Some of the relatively general-purpose ones (see below) may also be practical enough for this.

Relatively general-purpose (networked)

...and often cross-language frameworks (that often create fairly specific-purpose protocols), such as in:

  • RPC - "Remote Procedure Call", usually an explicitly exposed API, used via some data serialization. See #RPC variations
  • the web-geared SOAP (see below)


  • Apache's Thrift, geared to expose something local via a daemon




Language-specific data transport frameworks

...such as


Fairly narrow-purpose protocols

...like

  • Flash's remoting
  • Various proprietary protocols (some of which are mentioned above)


RPC variations

A Remote Procedure Call is a function/procedure call that does not necessarily happen inside a process's local/usual context/namespace. It is often done via a network socket (even when calling something local), often to another computer, and possibly across the internet.

RPC may in some cases be nothing more than a medium that carries function calls to the place they should go, or a means of modularizing a system into elements that communicate using data rather than with linked code.


XML-RPC

XML-RPC is a Remote Procedure Call protocol that communicates via simple, standardized XML.

It is mainly a way to move function calls between computers. It adds typing, although only for simpler data like numbers and strings.


It is not the fastest or most expressive way to do remote calls, but is simpler than most others and can be very convenient, and not even that slow when implemented well. Usually, the focus is one of:

  • Run code elsewhere as transparently as possible
  • Provide a service over a well defined network interface
  • Provide a webservice

Arguably there's not much difference, but there can be in terms of what an implementation requires you to do: Does it make existing functions usable? All of them? Implicitly or explicitly? How much code do you need to actually hook in new functions? You'll run into this difference in example code and explanation when you look around for implementations.
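For a concrete sense of the explicit-exposure style, a minimal sketch using Python's standard-library xmlrpc modules (the function name and port are just examples):

#!/usr/bin/python
# server - explicitly expose one function over XML-RPC
from xmlrpc.server import SimpleXMLRPCServer

def add(x, y):
    return x + y

server = SimpleXMLRPCServer(('localhost', 8000))
server.register_function(add)      # exposed under its own name, 'add'
server.serve_forever()

...and a client that calls it:

from xmlrpc.client import ServerProxy

proxy = ServerProxy('http://localhost:8000/')
print(proxy.add(2, 3))   # 5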


See also:

XML-RPC, transparency angle:

XML-RPC, webservice angle:


ONC-RPC
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Open Network Computing RPC, the basis of NFS

See also:


Other RPC
  • ISO RPC
  • MSRPC (a modified DCE/RPC) [1]

DBus

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

DBus is a means of IPC, a specific standard that is part of freedesktop[2].

It is often present on modern graphical desktop distros, and almost necessarily when they are running GNOME and/or KDE.


DBus provides

  • a single system bus, mostly useful for services but available to all processes
    the bus part is a reference to hardware buses: a shared thing where you talk/listen to interfaces known to you
    this avoids needing to be point-to-point (and parts having to know where to connect, and how to avoid collisions)
  • a session bus per graphical desktop login session
    meant for integration of that environment - and security between sessions


Each participant on a bus

  • gets assigned a unique name, like :1.87, unchanging for that connection
    many will also register a well-known name, e.g. org.freedesktop.UDisks2
  • has one or more objects, with
    unique object paths like /org/freedesktop/NetworkManager/Devices/0
    signals, which are the broadcast things (consider them notifications)
    methods you can invoke
    Interfaces describe how both signals and methods work (and could be considered groupings of them)


So roughly speaking,

  • you can implement API-like things
    dealing mostly with Method call / Method return / Error messages
  • or just listen for some messages you care about
    dealing mostly with Signal messages

The signal stuff is pretty easy to use and avoids the unknown-failure frailties common in quick and dirty point-to-point solutions.

Even when used in a point-to-point app integration, there's already a name resolution thing done for you.
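As a small taste, a sketch using the third-party pydbus Python bindings (an assumption; the other bindings mentioned below work similarly), asking the bus daemon itself who is connected:

from pydbus import SessionBus

bus = SessionBus()
daemon = bus.get('org.freedesktop.DBus')   # the bus daemon's own well-known name
print(daemon.ListNames())                  # unique and well-known names on this bus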


There are language bindings for loads of things, and the daemon is there on most distros.

The messaging is a little slow, so it's not ideal for everything.



Implementations include:

  • libdbus - reference implementation from freedesktop
  • GDBus (GNOME)
  • QtDBus (Qt/KDE)
  • dbus-java[9]
  • sd-bus (systemd)

See also

Unsorted (localish)

  • DDE
  • COM, which seems an umbrella term for at least OLE, ActiveX, COM+, DCOM
  • .NET remoting [3]
  • WCF [4]


Object libraries

  • SOM/DSOM
  • Distributed Objects Everywhere (DOE)
  • Portable Distributed Objects (PDO)
  • ObjectBroker
  • Component Object Model (COM/DCOM)


ICE

.ICEauthority is part of the Inter-Client Exchange (ICE) protocol.

WCF

Communication as a service

Amazon

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Amazon MQ is basically a managed ActiveMQ or RabbitMQ instance.

e.g. useful to move a codebase already using these open/standard protocols to cloudiness with minimal changes


Amazon SQS (Simple Queue Service) is their own, distributed thing.

It is indeed simple, in that it doesn't do routing and such - though you can get some of that if you use both SQS and SNS.


Web-geared

On REST

For context
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

REST is a set of architectural guidelines introduced by Roy Fielding (around 2000) and, in that typology, is one of a number of broad types of state/storage systems he identifies.

REST in this sense helps build a specific kind of distributed system, and applies in particular to those structured like hypermedia systems, which distribute objects in different places.

REST's "representational state transfer" is a concept relating to what such a system should and in particular shouldn't do.


Say, HTTP serves one webpage per URL - an idea heavily influenced by REST via the design of its application layer. HTTP's 1.1 redesign was guided by Fielding's REST, and such concepts remained.

RESTful ideas are ingrained in how webdevs even think about HTTP at all, in many protocols built on it, in what standards say about e.g. cacheability of specific requests - without us necessarily realizing the type of design this came from.


Probably almost all HTTP requests on the web conform to REST ideas, and are REST requests, in this sense.

APIs in general are at some risk of being non-RESTful (regardless of whether the marketing uses the word REST) in part because their aim is changing state that may be nontrivial.


Also relevant is the concept of Hypermedia as the Engine of Application State, which is more specific. It e.g. suggests you can have clients that need minimal prior knowledge about how to interact with an application (unlike something that uses an interface description language, or just gives documentation for coders on how to interact with you).

But what REST means today (instead)

Today, we have abused the term REST so hard that it now means "API built on top of HTTP", regardless of whether it meets Fielding's properties. It may, it may not, either way it's often not fully intentional.


Yes, many specific things are said of REST, yet they are all merely habits that not everyone adheres to.

(REST originally didn't even mean 'API'. But that's semantics, because something that complies with REST ideas is a great basis for APIs - a specific design of API, anyway)

"RESTfulness" is usually more about an API that centers on resources -- compared to APIs that focus on operations like RPC and SOAP, which are unRESTful in either meaning.


While RESTful APIs do tell you to center on resources, that's usually the only part of the original REST they invoke, and arguably CRUD describes that better than REST.



Ask yourself: if someone says REST API, what is that guaranteed to give you?

  • that interactions are complete when stateless, without sessions or cookies required to function?
  • distinct endpoint URLs per modeled thing?
  • different HTTP methods mapping to different operations on that modeled thing?
  • that the read/GET operation is valid to cache?

The first is likely, the rest are merely probable.
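The typical habits, to the degree they exist, look something like (a sketch of convention, not a guarantee):

GET    /things      list the collection
POST   /things      create a new item
GET    /things/42   read one item
PUT    /things/42   replace/update that item
DELETE /things/42   delete that item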


It moves quickly to "whatever you dream up":

  • Exactly what kind of data model it alters? Case specific.
  • At what level of detail? Case specific.
  • It models things, but whether those are object level or service level? Case specific.
  • URL path structuring? Case specific.
    There are conventions, but none strong enough that you don't have to read the docs. Some don't use the path at all. Some specify the type. Some specify the item as well.
    ...and some people insist on making it thinly disguised RPC.
  • HTTP methods in use? Case specific.
    Various things that like to define a protocol actually ignore HTTP semantics a little, which is a little ironic since that is the closest any of this gets to the REST concept. But this is also rarely a problem, because you end up writing a fairly custom client anyway...
  • HTTP response codes? Case specific.
    Please don't try to map application-level errors to HTTP codes, because the implied semantics are frequently wrong.
  • Authentication? Case specific, though
    most make a point of not relying on cookies (stateless, after all), and
    many make a point of requiring TLS -- or should, because per-request login would otherwise be terrible security.


Even if you've set up endpoints to verb all your nouns, it is not hard to set up that HTTP API in a moderately RPC-operations style, one of the things REST was specifically trying to avoid.


That said, this is also largely curmudgeonry


Beyond REST becoming a vague term (much like MVC and friends), none of this really matters too much in practice.

HTTP APIs not being RESTful makes them no less valid, nor any less good at whatever they need to get done.

(it makes it slightly harder to discuss why their object model may be awkward, but that's about it)

HTTP APIs not being RESTful doesn't really matter, because we rarely use REST to architect them.

Most HTTP APIs today are a thin layer of frontend-backend transfer, which is by its nature not about representational state transfer, and does not distribute objects.

Those HTTP APIs used to expose some data may happen to work somewhat RESTfully - mostly by accident.



Pragmatism

Just make your thing and ignore people like me. Or if you care a little, maybe call them "HTTP API" or "web API" or something, and avoid the whole discussion.

(But maybe still spend some time reading up on good and bad ways to architect APIs)



See also:

OData

OData (Open Data Protocol) takes the RESTful idea and adds more specification and suggested best practices.

When you have structured data you want to expose, following OData starts you off with a better-defined basis, meaning less reading up and/or less coding for you and/or the people who access it.

See also:


GraphQL

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

GraphQL is a query language, intended to be used mainly for APIs.


Perhaps the biggest upside, and downside, is that it is often a single point of access that allows more complex structured queries than REST.

It's an upside to frontend devs, who now have a single entry point with most views of the data they want.
But it can also be the reason things are more fragile or slower.
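For illustration: a client asks for exactly the shape of data it wants in one request (hypothetical schema and field names):

{
  user(id: "42") {
    name
    posts(last: 3) {
      title
    }
  }
}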


It's easy to think of GraphQL from a backend perspective, but it seems the backend benefits are relatively marginal (other than some tooling making it easier to write).

So it can be a great way to speed up early development, but it can also come at the cost of good design, and be a great way to trip yourself into a lot of tech debt.

This is up to you, because the upsides, and potential downsides, still depend a lot on how you implement it - but it is common to not think about that (it autogenerates, whee) until it becomes an issue.



gRPC

https://en.wikipedia.org/wiki/GRPC

SOAP

SOAP is a well-typed serialization format in XML, useful between systems that like typed and structured interchange without having to think about binary compatibility.

For example, .NET uses it as one of its object serialization formats.


SOAP also describes an HTTP-based protocol for exchanging SOAP data, often used to do remote procedure calls.

This use resembles XML-RPC in various ways.

  • it could be said to be an improvement over XML-RPC in terms of typing, basically being better-defined remote function calls
  • however, you can argue that pragmatism was ignored when SOAP was defined:
    SOAP implementations may differ on what parts they implement/allow (meaning interoperation can actually be difficult)
    SOAPAction's format is a little underspecified
    interactions are moderately entangled with HTTP
    also meaning that if you want security, you need to add certificates
    it's not the fastest to parse, probably because of its verbosity - another reason it's not practical for latency-critical needs
    the amount of XML namespaces is mildly ridiculous -- you basically never want to write any SOAP content yourself. You will also probably not manage a quick and dirty parser - you will likely need a decent SOAP implementation.


You can argue that this makes SOAP a good serialization format but not a very good RPC medium.

SOAP the protocol seems to have fallen out of style a little more than SOAP the serialization.


If you want to avoid a SOAP stack (I did because the options open to me at the time were flawed), the main and often only difference with a normal POST request (with the SOAP XML in the body) is that you need to add a SOAPAction header.


The SOAPAction header is used to identify the specific operation being accessed.

The exact format of the value is chosen by the service provider (rather than standardized, which is why you would generally need a SOAP stack for more generic SOAP based access).
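So a hand-rolled call can look like the following sketch, using Python's third-party requests library - the URL, action value, and envelope body are placeholders, since the real ones come from the service's documentation or WSDL:

import requests

# placeholder envelope - the real structure comes from the service's WSDL/docs
envelope = '''<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetThing xmlns="http://example.com/ns"><id>1</id></GetThing>
  </soap:Body>
</soap:Envelope>'''

resp = requests.post(
    'http://example.com/service',
    data=envelope.encode('utf-8'),
    headers={
        'Content-Type': 'text/xml; charset=utf-8',
        'SOAPAction': '"http://example.com/ns/GetThing"',  # value/format is provider-specific
    },
)
print(resp.status_code)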


See also


Describing things you can talk to

The entire concept of formally describing endpoints isn't very common.

Most people find services because they're popular, or because they already know they're necessary for their job - and either way you can just read the documentation.

Yes, in theory you can take a description and turn it into code. Or you could provide a library or plugin. Or make the thing easy enough to interact with that you don't really need to.


So endpoint description languages are often entangled with some of the more verbosely specced interchange standards, like SOAP.

Say, WSDL and UDDI have been falling away just as much as SOAP has.


WSDL

WSDL describes web services, often (but not necessarily) SOAP.

WSDL mostly just mentions:

  • the URL at which you can interact (often using SOAP),
  • The functions you can call there and the structure of the data you should send and will get (using XML-Schema)

WSDL allows you to bootstrap SOAP RPC based on a single URL, without writing code for the SOAP yourself. You can use a program that converts WSDL to code, or compiles it on the fly.

This doesn't always work as well as the theory says, though, mostly because of the complex nature of SOAP.


UDDI

UDDI (Universal Description, Discovery, and Integration) is a way to list exposed web services.

It is primarily useful as a local (publish-only) service registry, to allow clients to find services and servers: the idea is that it yields WSDL documents, via SOAP.

It was apparently intended to be a centralized, yellow-pages type of registry for business-to-business exchange, but is not often used this way.(verify)


See also:

Data and serialization

This section is focused more on the "formats you can easily dump structured data to, and read it back from later"


May focus on different aspects, like

  • interchange (more focus on correctness, possibly self-description)
  • storage (more focus on compactness)
  • human readability



JSON

JSON ('JavaScript Object Notation') is the idea of using javascript data serialized as text, as a data exchange format.

Content type is application/json

See RFC 4627.


It writes into a string that Javascript itself could eval() directly -- but for security reasons you should never do that, and be in the habit of always using a JSON parser (library or, by now, browser built-ins).

It's common enough that many other languages can also easily consume and generate JSON.


It can contain numbers, strings (without worrying about unicode characters), arrays, associative arrays (a.k.a. hashes, dictionaries).

Limitations

  • Date does not serialize to JSON
    There are probably two common conventions:
    the format emitted by Date's toJSON(), standard enough since ~2011 because it conforms to ISO 8601. However, browsers parsing this format is merely common, not standard(verify).
    seconds or milliseconds since epoch (float or int)
    http://stackoverflow.com/questions/10286204/the-right-json-date-format


  • not very useful for binary data, as strings are expected to be decoded as unicode. Escaping or text coding (e.g. base85 is decently efficient) can work around that, but you will have to implement that.
  • There is no NaN in JSON even though it is a thing in JS.
Some JSON libraries, e.g. Python's, do (incorrectly) generate NaN, presumably to deal with numpy. You should probably rewrite these as None so they become null instead.
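To make the above concrete in Python (a minimal sketch, assuming its standard json module):

import json, datetime

# Dates: the two common conventions
now = datetime.datetime.now(datetime.timezone.utc)
json.dumps({'t': now.isoformat()})    # ISO 8601 text, like JS Date's toJSON()
json.dumps({'t': now.timestamp()})    # seconds since epoch, as a float

# NaN: Python's json emits a bare NaN token by default, which is not valid JSON.
# allow_nan=False makes it raise instead, so you can catch it and substitute None.
try:
    json.dumps({'x': float('nan')}, allow_nan=False)
except ValueError:
    pass  # rewrite such values to None so they become null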


Practical variants

JSONP
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Not actually a variant of the serialization, but a specific way to use it in JS.

JSONP is the client telling the server to wrap some text (a function call) around the JSON data.

It is typically mentioned explicitly by tutorials, probably because you're doing <script> tag injection - and typically not explained as well, particularly how it can be a bad idea.

The fact that this was sometimes somewhat hidden (jQuery's library would let you specify datamode=json, which basically fetched text, and datamode=jsonp, which actually did script tag injection for you) made for some further confusion.


So why wrap text around JSON?

Say you are www.example.org and want to feed data from api.example.org into your code after the page loads.

You pick XHR, and find out that the browser's same-origin policy just says 'no'.

You could use a <script> tag, which is not bound by same-origin policy.

But now instead of fetching data as data you might feed into your own parser, you're fetching JS as code.

This is where that wrapping text comes in: it lets the client decide what function call should happen around the data - basically the client telling the server how the fetched code should hand itself to the client's own code.


...but the entire thing still amounts to an eval, and basically must trust the server.


Now that CORS exists, servers get a say in selectively relaxing the same-origin policy that browsers enforce(verify), which lets us fetch and parse data rather than resort to code injection. It's better, though not the final answer.
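Mechanically, that amounts to the server adding a response header, such as (a minimal example):

Access-Control-Allow-Origin: https://www.example.org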


API calls that did JSONP might do so optionally, so that they could both return valid JSON to be parsed as JSON (the parentheses, necessary to wrap a few types, fall away):

([{url:'http://example.com/post1', id:'1', t:['tag','foo']},
  {url:'http://example.com/post2', id:'2', t:['bar','quu']}
])

And on request (typically based on a URL parameter) return the same data JSONP-style, like:

callbackfuncname([{url:'http://example.com/post1', id:'1', t:['tag','foo']},
                  {url:'http://example.com/post2', id:'2', t:['bar','quu']}
])


Reasons to not use JSONP include:

  • security: you are basically asking the browser to do remote code execution
this is a bad idea unless you trust the source absolutely always
  • these days, CORS is often more convenient
avoids the above-mentioned issues
  • there were other ways to work around the XHR-same-source policy issue before CORS, e.g. proxying that remote script via your own domain (avoids execution injection)
  • error handling - it works, or it doesn't. You don't have a request object to inspect when it fails

JSON-LD

Encodes linked data in JSON on a webpage.

See Knowledge_representation_/_Semantic_annotation_/_structured_data_/_linked_data_on_the_web#JSON-LD

Multiple JSON objects together

Some people and software like JSON as a structured way to store larger amounts of data than a single object.


One way is to put one JSON object per line in a text file (avoiding newlines within objects, which basically means 'not pretty-printed').

And why have one standard when you can have multiple not-really-standards?

  • JSONL (.jsonl)
  • NDJSON (.ndjson)
  • LDJSON (.ldjson or .ldj) (not to be confused with JSON-LD)

...are usually the same thing and compatible, if differing in some spec details.

Note that when regular JSON parsers see such a file, they will either consider the file as a whole not valid JSON, or only return the first object. As such, you have to know a file is structured like this, because you have to split the lines yourself.
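A minimal sketch in Python (standard json module assumed; the filename is just an example):

import json

records = [{'id': 1}, {'id': 2}]

# write: one compact (non-pretty-printed) object per line
with open('data.jsonl', 'w') as f:
    for rec in records:
        f.write(json.dumps(rec) + '\n')

# read: split lines yourself, parse each separately
with open('data.jsonl') as f:
    parsed = [json.loads(line) for line in f if line.strip()]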



Another way is Concatenated JSON, which requires a parser to realize that after a complete object, it should emit / consider another.


Another way is JSON Text Sequences (RFC 7464), which use a specific delimiter


Another way is Length-prefixed JSON, which works something like netstrings.



See also:

"Invalid label" error
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

The usual cause is that the server side sends JSON that is not technically correct.

It seems like eval() is peculiar when you do not add brackets around certain objects. Add them for robustness, preferably at the server side, but you could also do it in javascript like:

var o = eval( '('+instring+')' );


Also, note that some serialization formats merely look a lot like JSON. For example, python doing repr() on a dict looks very similar to JSON, and may sometimes be valid as it, but often is not.

Unsorted (JSON)

JSON is potentially less secure than pure-data exchange, in that you must trust the source of the JSON not to contain bad code, if it gets fed directly to the JS interpreter.

Even if JSON usually deals with data, what you eval() may also contain code, which will get executed and so can easily enable things like XSS exploits.

This is only a problem if the code can be tricked to either load from a different place than it ought to load from, or if the server side can be tricked into sending something potentially dangerous, neither of which is generally likely.


TypeError: 0 is not JSON serializable

Meaning Python's json library.

In general, just inspect the type()

If it's an integer, it's probably a numpy scalar / 0D array.
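A sketch of the usual case and fix (assuming numpy is indeed the culprit):

import json
import numpy as np

x = np.int64(0)
# json.dumps({'x': x})     # raises TypeError: not JSON serializable
json.dumps({'x': int(x)})  # convert numpy scalars to plain Python types first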

See also (JSON)

YAML

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

YAML (YAML Ain't a Markup Language) is a data serialization format intended to be readable and writable by both computers and humans, stored in plain text.

A lot of YAML values aren't delimited, other than by following the basic YAML syntax. Only complex features and complex strings (binary data, unicode, and such) require more reading up.


One nice thing is that YAML spends a lot of its language definition pushing complexity into the parser and away from the person who wants to write YAML.

Various things are possible in two styles (one more compact, the other a little more readable).

It does rely on indenting, which some people don't like (seemingly mostly those with editors that don't care so much about whitespace). There are also some complexities that may be powerful (e.g. allowing cyclic graphs) but have been shown to cause a lot of confusion.


There are some rough edges, like the implicit typing meaning that yes may become a bool when you wanted a string, the ability to store cyclic graphs that you may never need, and the implication that a document doesn't necessarily have a fully equivalent expression as plain data.

Yes, you can get specific behaviour via schemas (if you use a YAML 1.2 compliant parser, not in 1.1), but I'm not sure I consider that more than a workaround.



Scalars

Null:

~
null

Integers (dec, hex, oct):

1234
0x4D2 
02322

Floats:

1.2
0.
1e3
-3.1e+10 
2.7e-3
.inf
-.inf
.nan

Booleans:

true
false

Basic syntax

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Items are often split by explicit syntax or unindents, e.g.:

foo: [1,2]
bar: [3]

foo:
  - 1
  - 2
bar:
  - 3

quu: {
   r: 3.14,
   i: 0.
}


Splitting documents(/records):

---

Usually sits on its own line, though it doesn't have to. Not unusually followed by a #comment on what the next item is.

If you want to emit many records and mark start and end, use --- and ...



Composite/structure types

Lists

- milk
- pumpkin pie
- eggs 
- juice

or inline style:

[milk, pumpkin pie, eggs, juice]


Maps:

name: John Smith
age: 33

or inline style:

{name: John Smith, age: 33}


Comments

# comment

Strings, data

Strings

most any text
"quoted if you like it"

Also:

"unicode \u2222 \U00002222"
"bytestrings \xc2"

Note that strings are taken to be unicode strings, and there is no formal way of distinguishing them from bytestrings.

(If you want to distinguish them and/or want the exact string type detail to be preserved through YAML, you may want to use a tag (perhaps !!binary for base64 data), or perhaps code some schema/field-specific assumptions)

Mixing structures

Generally works how you would expect, given that YAML is indent-sensitive.

For example, a nested list (note that each inner list needs its own item marker; merely indenting further would continue the plain scalar above it):

- - feature 1
  - feature 2
- - caveat 1
  - caveat 2

...a list of hashes:

- {name: John Smith, age: 33}
- name: Mary Sue
  age: 27

...a hash containing lists:

men: [John Smith, Bill Jones]
women:
  - Mary Sue
  - Susan Williams

Tags

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

YAML has a duck-type sort of parser, which won't always do what you want.

Tags force parsing as a specific type, e.g.

not_a_date: !!str 2009-12-12
flt: !!float 123

The tags that currently have meaning in YAML include:

!!null
!!int
!!float
!!bool
!!str
!!binary
!!timestamp

!!seq
!!map
!!omap
!!pairs
!!set

!!merge
!!value 
!!yaml

You can also include your own.


See also:

Advanced features

Relations (anchor/alias)

Further notes

Note that the inline style for the basic data structures (strings, numbers, lists, hashes) is often close to JSON syntax, and may occasionally be valid JSON. JSON can be seen as a subset of YAML (apparently specifically since YAML 1.2, because of Unicode handling details).

Given some constraints, you can probably produce text that can be parsed by both YAML and JSON parsers.


StrictYAML

TOML

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

TOML (Tom's Obvious, Minimal Language) seems to be an attempt to be a human-usable config-file format.

  • compared to JSON, TOML has a date format and can be a little easier to type
  • compared to INI, TOML has types; simpler TOML may look almost exactly like INI, but more complex data is also possible
  • compared to YAML, TOML has fewer edge cases (but when you avoid those, it's just... another thing used for config stuff)


Short example (from its README):

title = "TOML Example"
 
[owner]
name = "Tom Preston-Werner"
dob = 1979-05-27T07:32:00-08:00   # First class dates
 
[database]
enabled = true
ports = [ 8000, 8001, 8002 ]
data = [ ["delta", "phi"], [3.14] ]
temp_targets = { cpu = 79.5, case = 72.0 }


It's meant to be thought of as a hash table (with possible nesting), which makes it usable under various languages that have those as first-class types.



https://github.com/toml-lang/toml/blob/main/README.md

https://stackoverflow.com/questions/65283208/toml-vs-yaml-vs-strictyaml

INI

INI files ('initialization', from the 8.3 days)



Netstrings

Netstrings are a way of transmitting bytestrings by prepending their length, instead of delimiting them.

This makes it easy to place them in other datastreams/protocols in a well-defined way.
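A minimal sketch in Python (the helper names are hypothetical):

def netstring_encode(data: bytes) -> bytes:
    # "<length>:<bytes>,"  e.g. b'hello' -> b'5:hello,'
    return str(len(data)).encode() + b':' + data + b','

def netstring_decode(buf: bytes):
    # returns (payload, remainder of the stream)
    length, _, rest = buf.partition(b':')
    n = int(length)
    if rest[n:n+1] != b',':
        raise ValueError('malformed netstring')
    return rest[:n], rest[n+1:]

assert netstring_encode(b'hello') == b'5:hello,'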


See also:

MessagePack

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Messagepack, a.k.a. msgpack, is a binary serialization format meant for lightweight interchange.


The pitch seems to be "like JSON but more compact and more precisely typed".

It also tends to parse faster than JSON.


Types include:

  • nil
  • boolean (true and false)
  • int (8 to 64 bits, signed and unsigned)
  • float - single or double precision IEEE
  • str - UTF-8 string
  • bin - binary data
  • ext
  • array
  • map, an associative array
  • timestamp, one of three options:
    32-bit seconds (good until 2106)
    64-bit nanoseconds (34 bits for seconds, good until 2514 AD)
    96-bit nanoseconds (64 bits for seconds, good until 292277026596 AD)


There was some confusion around distinguishing binary strings and unicode strings, which matters to practicalities in different languages.
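A quick sketch using the third-party msgpack package in Python:

import msgpack

packed = msgpack.packb({'id': 1, 'blob': b'\x00\xff'})
print(msgpack.unpackb(packed))   # bin and str stay distinct types, unlike in JSON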


See also:

Bencode

Bencode is similar to netstrings, but can also code numbers, lists, and dictionaries. It is used in the BitTorrent protocol.

Apparently not really formally defined(verify), but summarized easily and well enough.
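For a sense of it: strings are netstring-like, and the other types are wrapped in a type letter and a closing 'e'. So 4:spam is the string 'spam', i42e is the integer 42, l4:spami42ee is the list ['spam', 42], and d3:foo3:bare is the dictionary {'foo': 'bar'}.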

See also:


XML

While XML itself has upsides over self-cooked solutions (e.g. in that parsing, encoding, and error handling is well-defined and done for you), its delimited nature and character restrictions mean it is not an easy or efficient way to transport binary data without imposing some extra non-XML encoding/decoding step. (CDATA almost gets there, but requires that the string "]]>" never appears in your data.)

One trick I've seen used more than once (e.g. filenames in a xml-based database, favicons in bookmark formats) is to encode data in Base64 or URL encoding. This is fairly easy to encode and decode, and transforms all data into ASCII free of the delimiters XML uses (and byte values XML specifically disallows). It's safe, but does waste space/bandwidth.

Of course, storage of arbitrary binary data is often not strictly necessary, or a rare enough use that overhead is not a large problem.


Serialization

At its core, XML is a markup format for data, primarily focusing on data in text form (because there are limitations, see On control codes and arbitrary binary data in XML).

Its structure is well-defined, it is possible to write a relatively simple and pretty fast parser (many aren't as fast as they could be), and it is well-defined in terms of text coding.

This has made it a convenient transmission format, particularly for interoperability. It's generally not as fast to parse or as space-efficient as binary formats, but usually saves a lot of bother (mess, work, interoperability problems, possibly speed) that comes with custom solutions.

From this perspective you could say that it is not a data format in itself, it is a (text) representation for one, and even that XML is primarily about serialization.

Restrictions and semantics

Serialization doesn't cover everything, though. There is much to be said for adding strong restrictions with that serialization, fitting data to a well-settled structure, so that there are guarantees and you can actually make some assumptions while using the data, and define transforms that make sense.

This of course has little to do with XML per se - semantics and assumptions are useful because they are, not because this is related to XML. However, XML has become associated with this sort of approach because it is one of the popular formats used (see e.g. semantic web), and also because there are systems that work on the XML directly (such as XSLT), and they have to have such strong restrictions.

Criticism

XML's widespread use in representing information (currently) makes it a low-barrier convenience. XML was a historically appropriate improvement, a step in the right direction.

However, for much the same reasons it should not be approached without criticism, because in many uses it is not the best solution (in terms of things like complexity, efficiency) for the useful functionality it provides, and because many people now use it out of default rather than because/when it's a good idea.

In itself, it doesn't standardize anything except itself, for meanings of standardization worth mentioning in practical reality.

(Of course, it shouldn't get criticism purely because of common abuse either.)




Correctness

Well-formedness largely excludes things that a parser could trip over.

Structure constraints:

  • well-balanced
  • properly nested
  • contains one or more elements
  • contains exactly one root element

Syntax constraints

  • does not use invalid characters (see section below)
  • & is not used in a meaning other than starting a character reference (except in CDATA)
  • < is not used in a meaning other than starting a tag (except in CDATA)
  • entities
    • does not use undeclared named entities (such as HTML entities in non-XHTML XML)
  • attributes:
    • attribute names may not appear more than once per tag
    • attribute values do not contain <
  • comments:
    • <!-- ends with -->, and may not contain --, and does not end with --->
    • cannot be nested. The attempt leads to syntax errors.


Valid documents must:

  • be well-formed
  • contain a prologue
  • be valid to a DTD if they refer to one.

Well-formed documents are not necessarily valid.


Characters that you can use are:

  • U+09
  • U+0A
  • U+0D
  • U+20 to U+D7FF
  • U+E000 to U+FFFD
  • U+10000 to U+10FFFF

Another way to describe that is "All unicode under the current cap, except...":

  • ASCII control codes other than newline, carriage return, and tab - so 0x00-0x08, 0x0B, 0x0C, and 0x0E-0x1F are not allowed
  • Surrogates (U+D800 to U+DFFF)
  • U+FFFE and U+FFFF (semi-special non-characters) (perhaps to avoid BOM confusion?)


Note that this means binary data can't be stored directly. Percent-escaping and base64 are not unusual for this.


URIs have no special status, which means they are simply text and should be entity-escaped as such.

For example, & characters used in URLs should appear as &amp;.

(Note that w3 suggests ; as an alternative to & in URIs to avoid these escaping problems, but check that your server-side form parsing code knows about this before you actually use this)


As to special characters:

  • non-ASCII characters are UTF8 escaped (to ensure that only byte-range values appear)
  • disallowed characters in the result are %-hex-escaped, which includes:
    • 0x00 to 0x1F, 0x20 (note this includes newline, carriage return and tab)
    • 0x7F and non-ASCII (note this includes all UTF-8 bytes)
    • <>"}|`^[]\



Avro

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Typed data exchange, using a more dynamic-type, schema'd protocol than things like Apache Thrift, Protocol Buffers and similar. Does not need pregenerated code.

Written around JSON.

See also:



Others

See e.g.

  • XDR
  • Apache's Avro, a more dynamic-type, schema'd protocol that does not need pregenerated code. Written around JSON.
  • Language-specific serialization implementations can also be convenient (e.g. in memcaches), but with obvious drawbacks when cooperating between multiple languages.

Specific

Pickle

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)
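The basic use, for reference (Python standard library):

import pickle

blob = pickle.dumps({'a': 1})   # bytes, in a Python-specific format
obj = pickle.loads(blob)        # arbitrary code can run during unpickling -
                                # never unpickle untrusted data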


Message brokers/queues; job management

Introduction, and properties

Message queues are useful to send messages when more than two endpoints are involved.

...between different parts of the same application on the same host, between hosts, within and between datacenters, from embedded sensors, etc.



Details beyond that vary a lot, like

  • whether it does "deliver as soon as possible" messages and/or "keep until someone asks and consumes" job queues
  • whether it works from a central broker or not
  • delivery model
    ordered (e.g. FIFO) or not
    single consumer, broadcast, pub-sub, other
    one-way or two-way
    awareness of balancing
  • whether things are removed only once acknowledged (useful for job queues)
  • whether there are priorities in the queue
  • whether it tries for low latency or not
  • whether delivery is acknowledged, and whether you are notified of failures to deliver
  • whether retry is a thing, and messages are kept until delivery, or will be removed after some time regardless
  • whether messages can be federated
  • whether it's backed by disk, so that a restart won't lose what was still queued
  • whether security is a concern
  • whether it tries to avoid blocking publishers (brokers are an easy and common way to do so)
  • how it deals with backlogs
  • whether it avoids being/having single points of failure

Note that some of the above are at odds with each other, and you may want a messaging system that gives you a choice in the matter.


In practice,

  • many are publish-subscribe style, probably largely because it is common for parts of a system to have selective interest in other parts, so having the concept of selective-and-configurable-interest be part of the protocol can be both convenient, and more efficient and scalable (filtering at/near the source is often better than broadcasting everything everywhere).
  • many use a central, separate broker
    this relates to the guarantees you can make, and the necessity of service discovery - it makes various design easier
    yet a broker
    is a potential point of congestion (but if it merely coordinates, it's still often better than everybody-broadcasts)
    is a potential (central) point of failure
    may allow you to consider interesting topologies (e.g. relaying between datacenters)
    ...so some systems make a point of being brokerless/peer-to-peer, or can be, which can alleviate some of the mentioned issues
    that may just mean a broker is automatically (re)elected
    which may also mean there is some period where messages can get lost
    There are other sensible steps between brokered and brokerless, and reasons to consider them complementary. See e.g. [5] [6]
  • many are in-memory only for speed reasons
    though some can be disk-backed (often "snapshot every x time") to not lose very much


Publish-subscribe

On protocols

Aside from things that use their own protocol, there are a few that are shared, including:

STOMP

STOMP (Simple Text-Oriented Messaging Protocol) is more of a wire protocol, and doesn't come with brokering models itself.

http://stomp.github.io/

OpenWire

A binary protocol; ActiveMQ's default.

http://activemq.apache.org/configuring-wire-formats.html

AMQP

An open ISO standard; peer-to-peer, optionally brokered.

http://activemq.apache.org/amqp.html
MQTT
http://mqtt.org/

Designed to be lightweight, e.g. for simple hardware like embedded, IoT stuff.

Brokered-only (so needs a known server).

Has some retransmission, useful on unreliable networks - though this is sometimes overstated, and you do not always want it.


Libraries tend to support MQTT over SSL(verify), and you could use authentication.

💤 Actually two versions
  • the original over TCP/IP
  • MQTT-SN, over UDP and some other transports, which seems barely used right now


Concepts:

  • message - bytestring
  • topic - a tree-like way of organizing specific things to subscribe to
    the broker doesn't really keep a list of what exists; that's something origins and clients have to know and be consistent about
    you can wildcard to subscribe to entire subtrees, and to named nodes under multiple branches
  • broker - central server
    connected to by IP address
    knows about client subscriptions
    removes the message once relayed
    e.g. Mosquitto, Mosca, RSMB, HiveMQ
  • QoS thinks somewhat about unreliable environments
    level 0 is fire and forget; okay to lose messages (but on TCP/IP you're usually fine anyway)
    level 1 will re-transmit until acknowledged (or timed out?)
    level 2 will re-transmit, while also ensuring all clients get it at most once (which is more work/overhead; not all brokers even choose to support it)
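A minimal sketch in Python using the third-party paho-mqtt library (1.x-style API assumed), against a local broker such as Mosquitto:

import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    print(msg.topic, msg.payload)          # payload is a bytestring

client = mqtt.Client()
client.on_message = on_message
client.connect('localhost', 1883)
client.subscribe('sensors/#', qos=0)       # '#' wildcards a whole subtree
client.publish('sensors/attic/temp', b'21.5')
client.loop_forever()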


Limits:

  • message size may be limited by the broker
    might be up to 256MiB[7]
    ...but given the kind of devices you may be working with, you should probably think of messages in the kilobytes at most
  • there is often a queue size limit set by the broker
    which may further vary with QoS
  • topic name limit is 64K bytes (note: it's UTF-8, so that may be as few as 16K codepoints)[8]
    but again, you probably want to stay well under that
  • client ID - 64 characters (or bytes?)


Brokers that have to handle more than a tiny amount of data should probably run on something with some CPU and bandwidth. DIY setups rarely do, which is why there a broker is often run on something like a Raspberry Pi.

In things that stay small you could run it on more minimal hardware, like uMQTTBroker that runs on an ESP8266.


See also:

Software

0MQ

While most of the below focuses on being network daemons, things like 0MQ are a C library - basically the basic conveniences you'd like, leaving the rest up to you - and can be a good choice e.g. to coordinate within a single app.

http://zeromq.org/
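A small sketch via the pyzmq Python binding (an assumption; bindings exist for many languages): a REQ/REP pair within one process:

import zmq

ctx = zmq.Context()
rep = ctx.socket(zmq.REP)
rep.bind('inproc://demo')      # tcp:// and ipc:// work the same way
req = ctx.socket(zmq.REQ)
req.connect('inproc://demo')

req.send(b'ping')
print(rep.recv())              # b'ping'
rep.send(b'pong')
print(req.recv())              # b'pong'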

RabbitMQ

Speaks STOMP, AMQP, MQTT, and some things over HTTP.


https://www.rabbitmq.com/


Kafka

Speaks its own protocol


Java

May well want to combine it with zookeeper


https://kafka.apache.org/

Celery

https://docs.celeryproject.org/en/stable/
  • A server that applies python functions to incoming messages
    seems made so that you can run functions out of an existing codebase and use them for background jobs, without having to separate them
    often used for asynchronous background work
  • typically brokered, because...
    typically messaging through RabbitMQ (AMQP transport) or Redis
    other options include Amazon SQS, IronMQ, MongoDB, CouchDB, zookeeper, and databases via SQLAlchemy or Django's ORM. (These are marked experimental and/or may not have active maintainers)
  • by default all on the same queue, but you can create multiple
  • can (optionally) store results
    to one of a dozen types of storage backends


You don't need to trigger this with python (there are APIs for other languages), but the code you run is python, so it can be a more natural fit for an already-python project.

...also because the way you declare the code that needs to get run can be part of the codebase you want to trigger it from.


Quick start:

  • install celery
  • choose and install broker
  • start the worker
  • call tasks

The example from the tutorial is a tasks.py containing

#!/usr/bin/python

from celery import Celery

app = Celery('tasks', broker='pyamqp://guest@localhost//')

@app.task
def add(x, y):
    return x + y

Starting the worker process (here in the foreground for ease of testing; you would normally daemonize this):

celery -A tasks worker --loglevel=INFO

And calling that task like

from tasks import add
add.delay(4,5)

delay() returns an AsyncResult that

  • you can inspect state of via ready(), successful()/failed(), and traceback if it failed
  • you can collect() results from if you configured a backend


Getting the result (get(), collect(), wait()) is only possible if you also configured a storage backend, e.g. redis.


Notes:

  • if you use a storage backend, you must get() or forget() every AsyncResult
in some cases it makes more sense to have the task interact with your own storage, and using celery only for success/failure
  • For fire-and-forget tasks you would probably do
tasks.add.delay(4,5).forget()
  • if you change the config of the celery app object, you probably also want to reload any persistent workers that use it


https://docs.celeryproject.org/en/latest/getting-started/first-steps-with-celery.html#id8

https://www.youtube.com/watch?v=EWFd_CiBJbc

ActiveMQ

Speaks OpenWire (its own), Stomp, AMQP, MQTT


Java


http://activemq.apache.org/

Qpid

Speaks AMQP


https://qpid.apache.org/

Kestrel

Speaks its own protocol

Scala (on JVM)

https://github.com/twitter-archive/kestrel

Beanstalkd, beanstalkg

https://beanstalkd.github.io/

https://github.com/beanstalkg/beanstalkg

Darner

https://github.com/wavii/darner

Delayed_job

https://github.com/collectiveidea/delayed_job

Disque

https://github.com/antirez/disque


Faktory

https://github.com/contribsys/faktory


Gearman

http://gearman.org/


HornetQ

http://hornetq.jboss.org/

Huey

https://huey.readthedocs.io/en/latest/


IronMQ

https://www.iron.io/mq

Kue

https://github.com/Automattic/kue

Mappedbus

http://mappedbus.io/


nanomsg

https://nanomsg.org/


NATS

https://nats.io/

nsq

https://github.com/nsqio/nsq

Unsorted

Openstack projects related to storage:

  • SWIFT - Object Store. Distributed, eventually consistent
  • CINDER - Block Storage
  • MANILA - Shared Filesystems
  • KARBOR - Application Data Protection as a Service
  • FREEZER - Backup, Restore, and Disaster Recovery


See also


Semi-sorted

Apache Parquet

A column-oriented data file format.

https://parquet.apache.org/


Apache Thrift

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Automatically create programming interfaces in Java, C#, Python, C++, Ruby, Perl, PHP, Smalltalk, OCaml, Erlang, Haskell, Cocoa, and Squeak, based on a service description.


Compared to protobuf

  • much the same basic idea
  • thrift seems geared a little more towards being an RPC protocol (much like gRPC),
  • thrift supports a few more languages
  • thrift is fairly easily combined with a software stack (using boost) to run services as independent daemons.


As an example, there are thrift descriptions bundled with Hadoop that let you interface with HDFS, HBase and such.



CRUD

CRUD (Create, Read, Update and Delete) refers to the four basic operations in persistent storage.

The term seems to come from data-accessing GUI design.

Things like HTTP, SQL, filesystems, have these as central operations, with various semantics attached.


It is now also used to describe any other API that exposes these operations fairly specifically. In fact, when people say REST, they may mean CRUD more than REST.


CRUD is now often brought up in reference to these semantics.

...or sometimes the fact that some of them are prioritized, e.g. timeseries databases focusing on create and read, whereas update and delete are secondary.


Protocol buffers

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

IdeaLingua

Things like protobuf, versus things like JSON

CoAP

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Constrained Application Protocol (CoAP) is a UDP/IP protocol with a REST-style interface:

  • URLs
  • commands like GET, PUT, POST, and DELETE
  • payloads (like JSON, XML, CBOR)

It also has discovery.

Aimed at machine-to-machine interfaces (IoT if you're fancy).


https://coap.technology/

ASN.1 notes

Summary

ASN.1 (Abstract Syntax Notation One) is used to store and communicate structured data. It defines a number of types, and allows nesting.

The 'abstract' is there because the typed data structures by themselves do not imply the ability to communicate.


...though the ways to serialize the data structures are closely related. They include:

  • DER (Distinguished Encoding Rules) - seems a fairly simple byte coding
  • BER (Basic Encoding rules) (flexible enough to be more complex to implement (fully) than most others(verify))
  • CER (Canonical Encoding Rules)
  • PER (Packed Encoding Rules) is more compact than DER, but requires that the receiving end knows the ASN.1 syntax that was used to encode the data
  • XER (XML Encoding Rules)

Most software uses just one of these. It seems DER and BER are most common(verify).


ASN.1 and its coding rules can be seen as an alternative to, for example, (A)BNF style data descriptions, XML, custom data packing, JSON, and such.

ASN.1 is useful in defining platform-independent standards as it is fairly precise and still abstract. It does not seem very common, perhaps because its use and typing can be overly complex, and codings like BER can be more bloated (amount-of-bytes-wise) than you might want in simple exchanges, though DER is efficient enough for a lot of things.

ASN.1 (and usually its encoding rules) is used in places like SSL, X.509, LDAP, H.323 (VoIP), SNMP, Z39.50, and a number of pretty specific, niche-like uses (e.g. this one in medicine).


Notes on data types

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

The tag numbers mentioned here enumerate ASN.1 types (Universal Tag), of which there are about 30.


Notes on strings

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

String types seem to define both encoding (byte coding unless otherwise mentioned) and allowed character set.

  • NumericString (tag 18) - 0-9 and space
  • GraphicString (tag 25) -
  • VisibleString (tag 26) - Visible characters of IA5 (mostly / subset of ASCII?(verify) also called ISO646String?(verify))
  • IA5String (tag 22) - International (Reference) Alphabet 5, see ITU-T T.50[9]
  • UniversalString (tag 28) - ISO10646 (Unicode) character set (UCS4?(verify))
  • BMPString (tag 30) - (UCS-2 encoded characters from the ISO10646 Basic Multilingual Plane)
  • PrintableString (tag 19) - [A-Za-z0-9'()+,-./:=?] and space
  • TeletexString / T61String (tag 20) - CCITT Recommendation T.61[10]. Also ITU-T 100 plus ITU-T 101 characters?(verify)
  • VideotexString (tag 21) - see ITU-T T.101 [11]
  • UTF8String (tag 12)

Notes on serialization

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

BER
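(To give a flavour of the type-length-value structure that BER and DER share: the INTEGER 5 is encoded as the three bytes 02 01 05 - tag 0x02 for INTEGER, length 1, then the value itself.)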

See also (ASN.1)

General information:

Standards, reference:

Unsorted:


Libraries

There are libraries that will turn the syntax into parsers, and/or convert specific data coded in specific encoding rules into data structures (often limited, e.g. to BER, DER and PER)

  • C, C++: [12] (most encoding rules)
  • Python: [13]
  • Java, .NET: [14]
  • Java: [15]
  • ...and others.