Programming notes/Communicated state and calls
Revision as of 10:33, 29 July 2020

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)



Local(ish) and Network geared

Varying technologies, techniques, and implementation types include the following:

'IPC'

Inter-process communication is a generic name, and some use it for many of the local-ish methods, dealing with one or more of:

  • message passing
  • synchronization
  • shared memory
  • remote procedure calls


In the broad sense, this can describe things of varying type, complexity, speed, goals, guarantees, and cross-platformness, including:

  • File-related methods (e.g. locking)
  • Sockets (network, local)
  • Message queues
  • process signals
  • named pipes
  • anonymous pipes
  • semaphores
  • shared memory

...and more.


One of the easiest ways to get cross-platform IPC is probably (network) sockets - mostly because Windows isn't POSIX-compliant, and most other mechanisms are even less portable.
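As a minimal sketch of that, the same loopback-socket code runs unchanged on Windows and POSIX systems (the echo behaviour is illustrative; a real protocol would also frame its messages):

```python
# Minimal sketch of portable local IPC over a loopback TCP socket.
import socket
import threading

def serve_once(server_sock):
    conn, _ = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(b"echo: " + data)

server = socket.create_server(("127.0.0.1", 0))  # port 0: let the OS pick
port = server.getsockname()[1]
threading.Thread(target=serve_once, args=(server,), daemon=True).start()

with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"hello")
    reply = client.recv(1024)

print(reply)  # b'echo: hello'
```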

Threads (same-process)

In multi-threaded applications, IPC-like methods are sometimes used for safe message passing, and sometimes just for convenience (some can ease communication between multiple threads spread among multiple processes).


Same-computer

Fast same-computer (same-kernel, really) process interaction:

  • POSIX named pipes
  • POSIX (anonymous) pipes
  • POSIX shared memory
  • POSIX semaphore
  • SysV IPC (queues, semaphores, shared memory)
  • unix sockets (non-routable network-style IO on unix-style OSes), not unlike...
  • windows' LPC
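As one concrete and portable flavour of the above, a parent and child process can talk over anonymous pipes via the child's stdin/stdout (the child's one-liner is illustrative; any program reading stdin and writing stdout would do):

```python
# Sketch of anonymous-pipe IPC: the parent and child communicate over
# the child's stdin/stdout, which the OS implements as pipes.
import subprocess
import sys

child = subprocess.run(
    [sys.executable, "-c",
     "import sys; sys.stdout.write(sys.stdin.read().upper())"],
    input="hello from the parent",
    capture_output=True,
    text=True,
)
print(child.stdout)  # HELLO FROM THE PARENT
```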

Interaction between applications, nodes in a cluster, etc.



Relatively manual

That is, interoperation support mechanisms that leave their use up to you.

See also MPI notes


Networkable application embedding
  • DCOP (KDE 2, 3)
  • D-Bus (KDE 4, GNOME, others)


Some of the relatively general-purpose ones (see below) may also be practical enough.

Relatively general-purpose (networked)

...and often cross-language frameworks (that often create fairly specific-purpose protocols), such as:

  • RPC - "Remote Procedure Call", usually an explicitly exposed API, used via some data serialization. See #RPC variations
  • the web-geared SOAP (see below)


  • Apache's Thrift, geared to expose something local via a daemon




Language-specific data transport frameworks

...such as


Fairly narrow-purpose protocols

...like

  • Flash's remoting
  • Various proprietary protocols (some of which are mentioned above)


RPC variations

A Remote Procedure Call is a function/procedure call that is not necessarily inside a process' local/usual context/namespace, often done via a network socket (even if calling something locally), also often on another computer, and possibly across the internet.

RPC may in some cases be nothing more than a medium that carries function calls to the place they should go, or a means of modularizing a system into elements that communicate using data rather than with linked code.


XML-RPC

XML-RPC is a Remote Procedure Call language that communicates via a simple, standardized XML protocol.

It is mainly a way to move function calls between computers. It adds typing, although only for simpler data like numbers and strings.


It is not the fastest or most expressive way to do remote calls, but is simpler than most others and can be very convenient, and not even that slow when implemented well. Usually, the focus is one of:

  • Run code elsewhere as transparently as possible
  • Provide a service over a well defined network interface
  • Provide a webservice

Arguably there's not much difference, but there can be in terms of what an implementation requires you to do: Does it make existing functions usable? All of them? Implicitly or explicitly? How much code do you need to actually hook in new functions? You'll run into this difference in example code and explanations when you look around for implementations.
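With Python's standard library, the whole round trip fits in a few lines - the server explicitly registers a (made-up) add function, and the client calls it almost as if it were local:

```python
# Self-contained XML-RPC round trip using only the standard library.
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lambda a, b: a + b, "add")  # explicitly exposed
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

proxy = xmlrpc.client.ServerProxy(f"http://127.0.0.1:{port}/")
result = proxy.add(2, 3)  # travels as a small, typed XML document
server.shutdown()
print(result)  # 5
```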


See also:

XML-RPC, transparency angle:

XML-RPC, webservice angle:


ONC-RPC

Open Network Computing RPC, the basis of NFS

See also:


Others
  • ISO RPC
  • MSRPC (a modified DCE/RPC) [1]


Unsorted

  • DDE
  • COM, which seems an umbrella term for at least OLE, ActiveX, COM+, DCOM
  • .NET remoting [2]
  • WCF [3]


ICE

.ICEauthority is part of the Inter-Client Exchange (ICE) protocol.

Message brokers / queues; job management


Introduction, and properties

Message queues are useful to send messages when more than two endpoints are involved - between different parts of the same application on the same host, between hosts, within and between datacenters, etc.

In practice, many are publish–subscribe style, largely because making selective interest part of the protocol can be both convenient, and efficient/scalable (filtering at/near the source is often better than broadcasting everything everywhere).


Whether pub/sub or other, it's also common to do this via a central, separate broker, because this eases how (and how well) guarantees work, removes the need for service discovery, and makes other design easier. It is a central point of failure, and potentially a point of congestion, but it's still often better than everybody-broadcasts, and brokers may allow you to consider interesting topologies (e.g. relaying between datacenters).

Some are brokerless/peer-to-peer, or can be. This can make more sense to avoid the single point of failure, or when your setup is less arbitrary-workers and e.g. more stream-data-through-a-bunch-of-sequential-steps.

There are other sensible inbetweens. And even reasons to consider them complementary. See e.g. [4] [5]


A common application is managing jobs that need to be done, so various queues add features with this in mind, like "store on disk so we don't lose jobs at reboot", and "only remove once acknowledged within X time", which makes it a lot easier to hand out a job to someone else when the first worker seems to have crashed, is a straggler, or never received it due to a network error (which is, quite fundamentally, hard to tell apart from 'no response').
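The acknowledge-or-requeue idea can be sketched in a few lines - this is a toy illustration, not any particular queue's API:

```python
# Toy illustration of "only remove once acknowledged": handed-out jobs
# stay in a pending table, and unacknowledged ones can be handed out again.
import collections

class AckQueue:
    def __init__(self):
        self._ready = collections.deque()
        self._pending = {}
        self._next_tag = 0

    def put(self, job):
        self._ready.append(job)

    def get(self):
        """Hand out a job; it stays pending until ack() or requeue()."""
        job = self._ready.popleft()
        self._next_tag += 1
        self._pending[self._next_tag] = job
        return self._next_tag, job

    def ack(self, tag):
        del self._pending[tag]  # done for good

    def requeue(self, tag):
        self._ready.append(self._pending.pop(tag))  # worker presumed dead

q = AckQueue()
q.put("resize image 1")
tag, job = q.get()
q.requeue(tag)        # first worker never acknowledged
tag2, job2 = q.get()  # the same job goes to the next worker
q.ack(tag2)
print(job2)  # resize image 1
```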

...whereas in-memory-only, remove-once-delivered can be enough for e.g. aggregation within sensor networks, and great on resource-limited platforms, as they just want to hear what's there, and don't need backlogs, replayability, and such.


Generally useful properties

  • avoid blocking publishers
(which is why a broker that does its best is a common solution)
  • optional delivery acknowledgement / guaranteed delivery
  • have multiple-receiver queues
  • deal with backlogs
  • avoid being/having single points of failure
  • low latency
(because some of these have origins in high-frequency trading)


  • transaction management
  • federation
  • security


Note that some of the above are at odds with each other, and you may want a messaging system that gives you a choice in the matter.

On protocols

Aside from things that use their own protocol, there are a few that are shared, including:

STOMP

STOMP (Simple Text-Oriented Messaging Protocol)

  • more of a wire protocol; doesn't have brokering models itself
  • http://stomp.github.io/

OpenWire

  • binary
  • ActiveMQ's default
  • http://activemq.apache.org/configuring-wire-formats.html

AMQP

  • an open ISO standard
  • peer-to-peer, optionally brokered
  • http://activemq.apache.org/amqp.html

MQTT

  • designed to be lightweight, e.g. for simple hardware like embedded, IoT stuff
  • brokered-only
  • http://mqtt.org/

Has some retransmission, useful on unreliable networks (though this is sometimes overstated).
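As a taste of how simple STOMP's wire format is - a frame is a command line, header lines, a blank line, then a NUL-terminated body - here is a sketch that builds a SEND frame (a real client would also handle content-length and header escaping; the destination name is made up):

```python
# Build a minimal STOMP SEND frame as a string.
def stomp_send_frame(destination, body, extra_headers=None):
    headers = {"destination": destination, **(extra_headers or {})}
    head = "\n".join(f"{k}:{v}" for k, v in headers.items())
    return f"SEND\n{head}\n\n{body}\x00"  # blank line, body, NUL terminator

frame = stomp_send_frame("/queue/jobs", "hello")
print(repr(frame))  # 'SEND\ndestination:/queue/jobs\n\nhello\x00'
```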



0MQ

While most of the below focus on being network daemons, 0MQ is a C library - giving you the basic conveniences you'd like while leaving the rest up to you - and can be a good choice e.g. to coordinate within a single app.

http://zeromq.org/

RabbitMQ

Speaks STOMP, AMQP, MQTT, and some things over HTTP. https://www.rabbitmq.com/


Kafka

Speaks its own protocol

https://kafka.apache.org/

Celery

Speaks its own protocol

http://www.celeryproject.org/


ActiveMQ

Speaks OpenWire (its own), STOMP, AMQP, MQTT. http://activemq.apache.org/

Qpid

Speaks AMQP https://qpid.apache.org/

Kestrel

https://github.com/twitter-archive/kestrel

Beanstalkd, beanstalkg

https://beanstalkd.github.io/

https://github.com/beanstalkg/beanstalkg

Darner

https://github.com/wavii/darner

Delayed_job

https://github.com/collectiveidea/delayed_job

Disque

https://github.com/antirez/disque


Faktory

https://github.com/contribsys/faktory


Gearman

http://gearman.org/


HornetQ

http://hornetq.jboss.org/

Huey

https://huey.readthedocs.io/en/latest/


IronMQ

https://www.iron.io/mq

Kue

https://github.com/Automattic/kue

Mappedbus

http://mappedbus.io/


nanomsg

https://nanomsg.org/


NATS

https://nats.io/

nsq

https://github.com/nsqio/nsq

Services

Amazon MQ

Amazon Simple Queue Service

Web-geared

REST


A convention/formalism with the idea of giving a very basic interface to state, in a per-object, I-verb-this-noun way.


In contrast, things like RPC tend to just expose underlying functions, which can easily be succinct and a little more efficient, but may require an understanding of the codebase, and may let you do incorrect things.

In part, REST tends to be cleaner because it's part of the design (sometimes a bit too architecty, even) rather than an afterthought. In part, its focus on resources, their identifiers, and operations on them forcefully separates things, so RESTful parts of a larger system are less likely to have strangely entangled levels of coupling.

That often makes accesses to and between RESTful systems easier to describe, more flexible, lets clients be mostly stateless, and eases caching.


Somewhat more technically:

  • figure out the resource types,
  • define an entry point per type, and then how to address instances
  • define how CRUD-style operations work on instances


For example, for HTTP you could decide to make it obvious in the URL what you are accessing, then do every useful action with a combination of GET, POST, and DELETE, say:

  • GET www.example.com/blogpost/2
  • POST www.example.com/blogpost/2
  • DELETE www.example.com/blogpost/2
  • GET www.example.com/user/fooser

HTTP lends itself to REST fairly well, but you can define your own variants.
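As a rough sketch of that resource/verb separation, here is a tiny in-memory dispatcher - the resource name, handlers, and status-code choices are illustrative, not any framework's API:

```python
# One "blogpost" resource type with CRUD-ish handlers keyed by HTTP method.
posts = {2: {"title": "old title"}}  # in-memory stand-in for real storage

def handle(method, resource, key, body=None):
    if resource != "blogpost":
        return 404, None
    if method == "GET":
        return (200, posts[key]) if key in posts else (404, None)
    if method == "POST":
        posts[key] = body
        return 200, body
    if method == "DELETE":
        return (200, posts.pop(key)) if key in posts else (404, None)
    return 405, None  # verb not supported on this resource

print(handle("GET", "blogpost", 2))     # (200, {'title': 'old title'})
handle("POST", "blogpost", 2, {"title": "new title"})
print(handle("DELETE", "blogpost", 2))  # (200, {'title': 'new title'})
print(handle("GET", "blogpost", 2))     # (404, None)
```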


SOAP

SOAP is a well-typed serialization format in XML, which can be a nice option between systems that are not easily binarily compatible but do want strict and predetermined restrictions on their interchange. (.NET uses it as one of its object serialization formats.)


SOAP also describes a protocol for exchanging SOAP data, often used to do remote procedure calls.

This use resembles XML-RPC in various ways (including that you can use it over HTTP to get through firewalls), and could be said to be an improvement over it in terms of typing.

However, you can argue that pragmatism was ignored when SOAP was defined; SOAP implementations may differ on which parts they implement/allow (meaning interoperation can be difficult), SOAPAction's format is a little underspecified, and its use forces you to hook into HTTP. Also, the number of XML namespaces involved is mildly ridiculous -- you won't write any SOAP content yourself, you will probably not manage a quick and dirty parser, and you will likely need a decent SOAP implementation.

Its verbosity also makes some implementations noticeably slower to parse than some other remote-API mechanisms (particularly things like binary RPC), so it is not as practical for latency-critical needs.


You can argue that this makes SOAP a good serialization format but not a very good RPC medium.

Unsorted

If you want to avoid a SOAP stack (I did because the options open to me at the time were flawed), the main and often only difference with a normal POST request (with the SOAP XML in the body) is that you need to add a SOAPAction header.

The SOAPAction header is used to identify the specific operation being accessed. The exact format of the value is chosen by the service provider.
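A sketch of such a hand-rolled call - the endpoint, namespace, and action value are made up, and nothing is actually sent here:

```python
# Build a SOAP call as a plain HTTP POST: the body is the SOAP envelope,
# plus the SOAPAction header. The request is constructed but not sent.
import urllib.request

envelope = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetPrice xmlns="http://example.com/stock"><Item>widget</Item></GetPrice>
  </soap:Body>
</soap:Envelope>"""

req = urllib.request.Request(
    "http://example.com/stockservice",
    data=envelope.encode("utf-8"),
    headers={
        "Content-Type": "text/xml; charset=utf-8",
        "SOAPAction": '"http://example.com/stock/GetPrice"',
    },
)
# urllib.request.urlopen(req) would perform the actual call.
print(req.get_method())              # POST
print(req.get_header("Soapaction"))  # "http://example.com/stock/GetPrice"
```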


See also


XML-DA


Seems to be a simpler alternative to SOAP for when all you want is some basic data access.

See also:


Describing things you can talk to

Often entangled with one of the above

WSDL

WSDL describes web services, often (but not necessarily) SOAP.

WSDL mostly just mentions:

  • the URL at which you can interact (often using SOAP),
  • The functions you can call there and the structure of the data you should send and will get (using XML-Schema)

WSDL allows you to bootstrap SOAP RPC based on a single URL, without writing code for the SOAP yourself. You can use a program that converts WSDL to code, or compiles it on the fly.

This doesn't always work as well as the theory says, though, mostly because of the complex nature of SOAP.


UDDI

UDDI (Universal Description, Discovery, and Integration) is a way to list exposed web services.

It is primarily useful as a local (publish-only) service registry, to allow clients to find services and servers: the idea is that it yields WSDL documents, via SOAP.

It was apparently intended to be a centralized, yellow-pages type of registry for business-to-business exchange, but is not often used this way.(verify)


See also:

Data and serialization

Serialization formats:

  • Serialization formats made for interchange such as
  • Serialization formats made for storage (and/or readability) are usually more detailed than they are efficient, but can still be handy. Consider
    • YAML
    • JSON (see also the idea of JSON-RPC)
    • XML (doubtful in this list, as you might as well use the better-defined, XML-based SOAP, unless you are communicating structure more easily expressed in nested nodes -- such as data already in that format).
  • Language-specific serialization implementations can also be convenient (e.g. in memcaches), but with obvious drawbacks when cooperating between multiple languages.


XML

While XML itself has upsides over self-cooked solutions (e.g. in that parsing, encoding, and error handling is well-defined and done for you), its delimited nature and character restrictions mean it is not an easy or efficient way to transport binary data without imposing some extra non-XML encoding/decoding step. (CDATA almost gets there, but requires that the string "]]>" never appears in your data.)

One trick I've seen used more than once (e.g. filenames in an XML-based database, favicons in bookmark formats) is to encode data in Base64 or URL encoding. This is fairly easy to encode and decode, and transforms all data into ASCII free of the delimiters XML uses (and byte values XML specifically disallows). It's safe, but does waste space/bandwidth.
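The base64 trick round-trips cleanly; a quick sketch with Python's stdlib (the element name is made up):

```python
# Round-trip arbitrary bytes through XML via base64: the encoded form is
# plain ASCII with none of XML's delimiters or disallowed byte values.
import base64
import xml.etree.ElementTree as ET

payload = bytes(range(256))  # includes bytes XML outright disallows

root = ET.Element("favicon")
root.text = base64.b64encode(payload).decode("ascii")
serialized = ET.tostring(root)  # safe: text is [A-Za-z0-9+/=] only

recovered = base64.b64decode(ET.fromstring(serialized).text)
print(recovered == payload)  # True
```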

Of course, storage of arbitrary binary data is often not strictly necessary, or a rare enough use that overhead is not a large problem.


Correctness

Well-formedness largely excludes things that a parser could trip over.

Structure constraints:

  • well-balanced
  • properly nested
  • contains one or more elements
  • contains exactly one root element

Syntax constraints

  • does not use invalid characters (see section below)
  • & is not used in a meaning other than starting a character reference (except in CDATA)
  • < is not used in a meaning other than starting a tag (except in CDATA)
  • entities
    • does not use undeclared named entities (such as HTML entities in non-XHTML XML)
  • attributes:
    • attribute names may not appear more than once per tag
    • attribute values do not contain <
  • comments:
    • <!-- ends with -->, and may not contain --, and does not end with --->
    • cannot be nested. The attempt leads to syntax errors.


Valid documents must:

  • be well-formed
  • contain a prologue
  • be valid to a DTD if they refer to one.

Well-formed documents are not necessarily valid.


Characters that you can use are:

  • U+09
  • U+0A
  • U+0D
  • U+20 to U+D7FF
  • U+E000 to U+FFFD
  • U+10000 to U+10FFFF

Another way to describe that is "All unicode under the current cap, except...":

  • ASCII control codes other than tab, newline, and carriage return - so 0x00-0x08, 0x0B, 0x0C, and 0x0E-0x1F are not allowed
  • Surrogates (U+D800 to U+DFFF)
  • U+FFFE and U+FFFF (semi-special non-characters) (perhaps to avoid BOM confusion?)


Note that this means binary data can't be stored directly. Percent-escaping and base64 are not unusual for this.
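The allowed ranges above can be written down as a predicate over code points (the function name is mine):

```python
# Is this character allowed in XML 1.0 content?
def xml_char_ok(ch):
    cp = ord(ch)
    return (cp in (0x09, 0x0A, 0x0D)       # tab, newline, carriage return
            or 0x20 <= cp <= 0xD7FF        # excludes other control codes
            or 0xE000 <= cp <= 0xFFFD      # excludes surrogates
            or 0x10000 <= cp <= 0x10FFFF)  # astral planes

print(xml_char_ok("\t"))      # True
print(xml_char_ok("\x00"))    # False
print(xml_char_ok("\ufffe"))  # False
```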


URIs have no special status, which means they are simply text and should be entity-escaped as such.

For example, & characters used in URLs should appear as &amp;.

(Note that w3 suggests ; as an alternative to & in URIs to avoid these escaping problems, but check that your server-side form parsing code knows about this before you actually use this)


As to special characters:

  • non-ASCII characters are UTF8 escaped (to ensure that only byte-range values appear)
  • disallowed characters in the result are %-hex-escaped, which includes:
    • 0x00 to 0x1F, 0x20 (note this includes newline, carriage return and tab)
    • 0x7F and non-ASCII (note this includes all UTF-8 bytes)
    • <>"}|`^[]\


JSON

JSON is the idea of using JavaScript data structures, serialized as text, as a data exchange format.

It can contain numbers, strings (without worrying about unicode characters), arrays, associative arrays (a.k.a. hashes, dictionaries).

(note that it is not that useful for binary data, as strings are expected to be decoded as unicode. Escaping or text coding (e.g. base85 is decently efficient) can work around that, but meh.)


JavaScript Object Notation writes data as strings that JavaScript itself could eval() directly (...but you should use a library; eval is potentially evil in terms of security. Not always a problem, but it's just a good habit to use a library).
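Since the syntax is language-independent, any JSON library can parse the same text; a quick sketch with Python's stdlib, no eval() involved (the sample document is made up):

```python
# Parse JSON text into native types, and serialize it back.
import json

text = '{"url": "http://example.com/post1", "id": 1, "tags": ["tag", "foo"]}'
post = json.loads(text)   # JSON object -> dict, array -> list, etc.
print(post["tags"])       # ['tag', 'foo']
print(json.dumps(post, sort_keys=True))
```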


It was initially primarily used to move data from a server into a JavaScript application, through XHR and similar.

You can also write JSON directly into a webpage, for example to push complex data from server-side code to data for javascript running in the browser.


Since javascript's syntax is easy enough to generate and parse, you don't need javascript itself to use it, and JSON has since become a more generic interchange format, in part because it is rather convenient in languages that share its basic types (hashmaps in particular), in part because it lets you easily push data to browser scripting from anywhere.


See RFC 4627.



JSONP

JSONP refers to letting the client specify a bit of text to wrap the JSON data it returns.


For example, say you have a web API that returns some blog post metadata:

([{url:'http://example.com/post1', id:'1', t:['tag','foo']},
  {url:'http://example.com/post2', id:'2', t:['bar','quu']}
])

If you want to just load that, you can feed that straight to an eval:

blog.posts = eval( fetch("http://api.example.com") )
// so that you then do  blog.posts[0].url


With JSONP you can often do something like:

eval( fetch("http://api.example.com?jsonp=callbackfuncname") )

The server would respond with something like:

callbackfuncname([{url:'http://example.com/post1', id:'1', t:['tag','foo']},
                  {url:'http://example.com/post2', id:'2', t:['bar','quu']}
])

...which, once you eval() it (which you already did in that same short line), calls a callback (there are parentheses around the object because they fall away in an eval on this, so you can use the same interface to do both data transfer and JSONP-style eval()s).
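What the server side of JSONP amounts to can be sketched in a few lines (the parameter and function names are illustrative; 'callback' is another common parameter name):

```python
# Server-side JSONP: if the client asked for a callback name,
# wrap the JSON body in a call to that function.
import json

def jsonp_response(data, callback=None):
    body = json.dumps(data)
    return f"{callback}({body})" if callback else body

posts = [{"url": "http://example.com/post1", "id": "1"}]
print(jsonp_response(posts))  # plain JSON
print(jsonp_response(posts, "callbackfuncname"))
# callbackfuncname([{"url": "http://example.com/post1", "id": "1"}])
```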


JSONP was/is regularly used to work around same-source restrictions for XHR: you're not allowed to execute JS code from other domains. Sure, there are ways around that - like using code to add a <script> tag, since such a script can load from anywhere - but it's somewhat clunky.

But since you can more freely XHR text from elsewhere, it feels a little less clunky to have some API send text (that happens to be JSON data or code), then eval() it on the client side. If you're going to hook this straight into your page's JS anyway, it makes sense that the client (the application) decides what that function call is.


It's a convenient but hackish fix, not a security feature - in fact your app is basically exploiting the browser to specifically get remote code execution from anywhere, so trust your sources or don't use it. (Note this decision can be easier and more sensible if you only ever load from the same site, or any other where only you control the content. Whether you consider your CDN that is, ehhh, up to you. Note that these days there are better solutions, e.g. for libraries on CDNs.)


Another way to defeat the XHR-same-source policy issue is to proxy that remote script via your own domain. The easy way is to let it through verbatim, which is just as much an insecure eval(). You could choose to sanitize it, but most don't (and it's not necessarily simple).


The more secure alternative is to transfer data, then parse it with a library (because the entire problem was avoiding eval() of remote data). (JSON is more popular than e.g. XML here because you get types out of the box, and the library will be simpler.)


JSON-LD

Encodes linked data in JSON, to allow some semantic metadata in your output.

Arguably primarily a way to get away from that godawful XML stuff.

See also:

"Invalid label" error


The usual cause is that the server side sends JSON that is not technically correct.

It seems that eval() is peculiar when you do not add parentheses around certain objects. Add them for robustness, preferably at the server side, but you could also do it in JavaScript like:

var o = eval( '('+instring+')' );

(My own mistake was a different, server-side one: forgetting to JSON-encode Python data. The way Python prints basic data looks very similar to JSON, but is often not valid JSON.)
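The Python-side version of that mistake is easy to demonstrate - repr output looks like JSON but isn't (single quotes, True/None), so encode explicitly:

```python
# Python's string form of basic data vs. actual JSON.
import json

data = {"ok": True, "name": None, "tags": ["a", "b"]}
print(str(data))         # {'ok': True, 'name': None, 'tags': ['a', 'b']}
print(json.dumps(data))  # {"ok": true, "name": null, "tags": ["a", "b"]}
```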


Unsorted

JSON is potentially less secure than pure-data exchange, in that you must trust the source of the JSON not to contain bad code that gets fed directly to the JS interpreter.

Even if JSON usually deals with data, it may also contain code which, if eval()ed, will get executed, and so can easily enable things like XSS exploits.

This is only a problem if the code can be tricked to either load from a different place than it ought to load from, or if the server side can be tricked into sending something potentially dangerous, neither of which is generally likely.


TypeError: 0 is not JSON serializable

Meaning Python's json library.

In general, just inspect the type()

If it's an integer, it's probably a numpy scalar / 0D array.
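A common workaround, sketched here without numpy itself: numpy scalars and 0-d arrays have an .item() method returning the plain Python equivalent, and default= is json.dumps's hook for otherwise unserializable values. FakeScalar below is a stand-in for e.g. numpy.int64:

```python
# Convert numpy-style scalars during JSON encoding via a default= hook.
import json

def to_plain(obj):
    if hasattr(obj, "item"):  # numpy scalars / 0-d arrays offer .item()
        return obj.item()
    raise TypeError(f"{obj!r} is not JSON serializable")

class FakeScalar:              # stand-in for e.g. numpy.int64(0)
    def item(self):
        return 0

print(json.dumps({"count": FakeScalar()}, default=to_plain))  # {"count": 0}
```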

See also

YAML


YAML (YAML Ain't a Markup Language) is a data serialization format intended to be readable and writable by both computers and humans, stored in plain text.

A lot of YAML values aren't delimited other than that they follow the basic YAML syntax. Only complex features and complex strings (binary data, unicode and such) will require more reading up.


YAML spends a lot of its language definition pushing complexity into the parser and away from the person writing YAML. Various things are possible in two styles (one more compact, the other a little more readable).

YAML is arguably a little handier for code-based manipulation than the likes of XML. The conversion between YAML and data and back is often simple and obvious, which makes the result more predictable.

It does rely on indenting, which some people don't like (seemingly mostly those with editors that don't care so much about whitespace)


Scalars

Null:

~
null

Integers (dec, hex, oct):

1234
0x4D2 
02322

Floats:

1.2
0.
1e3
-3.1e+10 
2.7e-3
.inf
-.inf
.nan

Booleans:

true
false

Basic syntax


Items are often split by explicit syntax or unindents, e.g.:

foo: [1,2]
bar: [3]

foo:
  - 1
  - 2
bar:
  - 3

quu: {
   r: 3.14,
   i: 0.
}


Splitting documents(/records):

---

Usually sits on its own line, though it doesn't have to. Not unusually followed by a #comment on what the next item is.

If you want to emit many records and mark start and end, use --- and ...



Composite/structure types

Lists

- milk
- pumpkin pie
- eggs 
- juice

or inline style:

[milk, pumpkin pie, eggs, juice]


Maps:

name: John Smith
age: 33

or inline style:

{name: John Smith, age: 33}


Comments

# comment

Strings, data

Strings

most any text
"quoted if you like it"

Also:

"unicode \u2222 \U00002222"
"bytestrings \xc2"

Note that strings are taken to be unicode strings, and there is no formal way of distinguishing them from bytestrings.

(If you want to distinguish them and/or want the exact string type detail to be preserved through YAML, you may want to use a tag (perhaps !!binary for base64 data), or perhaps code some schema/field-specific assumptions)

Mixing structures

Generally works how you would expect it, given that YAML is indent-sensitive.

For example, nested lists (note that the parent entries here are mapping keys - a plain list item can't directly contain an indented sublist):

- features:
    - feature 1
    - feature 2
- caveats:
    - caveat 1
    - caveat 2

...a list of hashes:

- {name: John Smith, age: 33}
- name: Mary Sue
  age: 27

...a hash containing lists:

men: [John Smith, Bill Jones]
women:
  - Mary Sue
  - Susan Williams

Tags


YAML has a duck-type sort of parser, which won't always do what you want.

Tags force parsing as a specific type, e.g.

not_a_date: !!str 2009-12-12
flt: !!float 123

The tags that currently have meaning in YAML include:

!!null
!!int
!!float
!!bool
!!str
!!binary
!!timestamp

!!seq
!!map
!!omap
!!pairs
!!set

!!merge
!!value 
!!yaml

You can also include your own.


See also:

Advanced features

Relations (anchor/alias)

Further notes

Note that the inline style for the basic data structures (strings, numbers, lists, hashes) is often close to JSON syntax, and may occasionally be valid JSON; JSON can be seen as a subset of YAML (apparently specifically as of YAML 1.2, because of Unicode handling details).

Given some constraints, you can probably produce text that can be parsed by both YAML and JSON parsers.


Netstrings

Netstrings are a way of transmitting bytestrings by prepending their length instead of delimiting them. This means they can be placed in general text files or other datastreams, and since it is a simple-to-implement spec, it can be used by many different programs.
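The spec is small enough to sketch completely - a netstring is "&lt;length&gt;:&lt;bytes&gt;,", e.g. b"hello" becomes b"5:hello," (function names are mine):

```python
# Encode and decode netstrings.
def netstring_encode(data: bytes) -> bytes:
    return str(len(data)).encode("ascii") + b":" + data + b","

def netstring_decode(stream: bytes):
    """Yield successive bytestrings from concatenated netstrings."""
    pos = 0
    while pos < len(stream):
        colon = stream.index(b":", pos)
        length = int(stream[pos:colon])
        start = colon + 1
        if stream[start + length:start + length + 1] != b",":
            raise ValueError("missing trailing comma")
        yield stream[start:start + length]
        pos = start + length + 1

encoded = netstring_encode(b"hello") + netstring_encode(b"world!")
print(encoded)                          # b'5:hello,6:world!,'
print(list(netstring_decode(encoded)))  # [b'hello', b'world!']
```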

See also:


MessagePack


Messagepack (or msgpack) is a binary serialization format meant for interchange.

The selling point seems to be "like JSON but more compact".
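
To give an idea of that compactness, a minimal sketch of a few of MessagePack's type codes (a real implementation, such as the msgpack library, covers many more types and sizes):

```python
def msgpack_encode_small(value):
    # Minimal sketch of MessagePack's compact coding, covering only
    # small non-negative ints, short strings, and short lists.
    if isinstance(value, int) and 0 <= value <= 0x7f:
        return bytes([value])                  # positive fixint: one byte
    if isinstance(value, str) and len(value.encode("utf-8")) < 32:
        b = value.encode("utf-8")
        return bytes([0xa0 | len(b)]) + b      # fixstr: length in the type byte
    if isinstance(value, list) and len(value) < 16:
        out = bytes([0x90 | len(value)])       # fixarray: count in the type byte
        for item in value:
            out += msgpack_encode_small(item)
        return out
    raise ValueError("not covered by this sketch")
```

For example, the list [1, 2] codes to three bytes, where the JSON text "[1,2]" takes five.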


There was some confusion around distinguishing binary strings and unicode strings, which matters to the practicalities in different languages.

https://msgpack.org/index.html

Bencode

Bencode is similar to netstring, but can also code numbers, lists, and dictionaries. Used in the BitTorrent protocol.

Apparently not really formally defined(verify), but summarized easily and well enough.
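
The whole format fits in a short sketch; bencode has exactly four types (the dictionary-keys-sorted detail is what BitTorrent relies on for canonical coding):

```python
def bencode(value):
    # Sketch of bencode's four types: integers, (byte)strings,
    # lists, and dictionaries (keys sorted).
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        out = b"d"
        for key in sorted(value):
            out += bencode(key) + bencode(value[key])
        return out + b"e"
    raise TypeError("bencode cannot code %r" % type(value))
```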

See also:


Others

See e.g.

Semi-sorted

CRUD

CRUD (Create, Read, Update and Delete) refers to the four basic operations on persistent storage. The term seems to come from data-accessing GUI design.


Things like HTTP, SQL, and filesystems have these as central operations, with various semantics attached.


CRUD is now often brought up in reference to these semantics.

...or sometimes to the fact that some of the four are prioritized over others, e.g. a timeseries database focusing on create and read, with update and delete being secondary.
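
The conventional rough correspondence between CRUD, SQL statements, and HTTP methods can be sketched as follows; real APIs vary (e.g. PATCH for partial updates, and HTTP method semantics carry more detail than this suggests):

```python
# Conventional (rough) correspondence of CRUD operations
# to SQL statements and HTTP methods:
crud_mapping = {
    "create": {"sql": "INSERT", "http": "POST"},
    "read":   {"sql": "SELECT", "http": "GET"},
    "update": {"sql": "UPDATE", "http": "PUT"},
    "delete": {"sql": "DELETE", "http": "DELETE"},
}
```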

REST

Protocol buffers

ASN.1 notes

Summary

ASN.1 (Abstract Syntax Notation One) is used to store and communicate structured data. It defines a number of types, and allows nesting.

The 'abstract' is there because the typed data structures by themselves do not imply the ability to communicate.


...though the ways to serialize the data structures are closely related. They include:

  • DER (Distinguished Encoding Rules) - seems a fairly simple byte coding
  • BER (Basic Encoding Rules) (flexible enough to be more complex to implement (fully) than most others(verify))
  • CER (Canonical Encoding Rules)
  • PER (Packed Encoding Rules) is more compact than DER, but requires that the receiving end knows the ASN.1 syntax that was used to encode the data
  • XER (XML Encoding Rules)

Most software uses just one of these. It seems DER and BER are most common(verify).
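
To give an idea of the byte coding: DER (like BER) is a tag-length-value scheme. A minimal sketch for two of the universal types, covering only short-form lengths and non-negative integers (helper names are my own):

```python
def der_tlv(tag, body):
    # DER/BER code values as tag byte, length, then the value bytes.
    # Short-form length is a single byte, so body must be < 128 bytes.
    if len(body) >= 128:
        raise ValueError("long-form lengths not covered by this sketch")
    return bytes([tag, len(body)]) + body

def der_integer(value):
    # INTEGER is universal tag 2; the body is big-endian two's complement.
    # This sketch handles non-negative values only, adding a leading zero
    # byte where needed to keep the sign bit clear.
    body = value.to_bytes(max(1, (value.bit_length() + 8) // 8), "big")
    return der_tlv(0x02, body)

def der_utf8string(text):
    # UTF8String is universal tag 12 (0x0c).
    return der_tlv(0x0c, text.encode("utf-8"))
```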


ASN.1 and its coding rules can be seen as an alternative to, for example, (A)BNF style data descriptions, XML, custom data packing, JSON, and such.

ASN.1 is useful in defining platform-independent standards, as it is fairly precise and still abstract. It does not seem very common, perhaps because its use and typing can be overly complex, and because codings like BER can be more bloated (amount-of-bytes-wise) than you might want in simple exchanges, though DER is efficient enough for a lot of things.

ASN.1 (and usually one of its encoding rules) is used in places like SSL, X.509, LDAP, H.323 (VoIP), SNMP, Z39.50, a number of pretty specific, niche-like uses (e.g. in medicine), and others.


Notes on data types

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

The tag numbers mentioned here enumerate ASN.1 types (Universal Tag), of which there are about 30.


Notes on strings

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

String types seem to define both encoding (byte coding unless otherwise mentioned) and allowed character set.

  • NumericString (tag 18) - 0-9 and space
  • GeneralString (tag 27) - ISO 2375 plus space and delete (encoding?(verify))
  • GraphicString (tag 25) -
  • VisibleString (tag 26) - Visible characters of IA5 (mostly / subset of ASCII?(verify) also called ISO646String?(verify))
  • IA5String (tag 22) - International (Reference) Alphabet 5, see ITU-T T.50[6]
  • UniversalString (tag 28) - ISO10646 (Unicode) character set (UCS4?(verify))
  • BMPString (tag 30) - (UCS-2 encoded characters from the ISO10646 Basic Multilingual Plane)
  • PrintableString (tag 19) - [A-Za-z0-9'()+,-./:=?] and space
  • VideotexString / T61String (tag 20) (CCITT Recommendation T.61[7]. Also ITU-T 100 plus ITU-T 101 characters?(verify) )
  • TeletexString - See ITU-T T.101 [8]
  • UTF8String (tag 12)

Notes on serialization

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

BER

See also

General information:

Standards, reference:

Unsorted:


Libraries

There are libraries that will turn the syntax into parsers, and/or convert data coded in specific encoding rules into data structures (often limited, e.g. to BER, DER and PER)

  • C, C++: [9] (most encoding rules)
  • Python: [10]
  • Java, .NET: [11]
  • Java: [12]
  • ...and others.

Some implementation notes

Python SOAP libraries

pywebsvcs

pywebsvcs ('Web Services for Python') can act as both the client side and the server side of SOAP.

The most interesting part of the package is usually ZSI, but in more detail it consists of:

  • ZSI: Zolera SOAP Infrastructure
  • wstools (WSDL tools)
  • SOAPPy (previously separate, now somewhat redundant. Long-term plan to integrate it into ZSI never quite executed?)(verify)
    • Docstring (API)
    • readme
    • depends on PyXML (one among various XML parsers)
    • depends on fpconst (IEEE754 floating point number stuff. Consists of a single pure-python file that can be installed, or just copied in)
    • not developed anymore, and apparently never finished: it fails to parse Amazon's (admittedly unusual) WSDL. Nice enough to use when it does work, though.



In ZSI, clients can be created

  • using ServiceProxy(wsdlfile) (WSDL-based)
  • Binding(baseurl) (self-defined / uses only simple types)
  • using WSDL-to-python code generation (wsdl2py). Note this code may be (very) bulky, still depends on ZSI, and still seems experimental (fails on Amazon ECS 4)

Unsorted

Note that trying to interface with Amazon ECS from scratch is probably rather complex. I suggest using something like PyAWS (for ECS4; pyamazon, which it is based on, seems to be for the now-defunct ECS3)

  • Soapy (not to be confused with SOAPpy)


Yet to read myself

See also