Locking, data versioning, concurrency, and larger-scale computing notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)



Glossary

Glossary of scaling

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Types of scaling (note that this has become a bit more of a continuum in the multi-core, multi-CPU, and have-GPU world)

  • horizontal scaling, a.k.a. scaling out:
more boxes: adding more computers to do parts of a larger job
...either all with the same information, and knowledge of who works on what,
...or with horizontal partitioning, a.k.a. sharding, to distribute distinct, self-contained jobs


  • vertical scaling, a.k.a. scaling up:
bigger boxes: adding more processing power, e.g. better/more CPUs or GPUs, to the same computer
In comparison to horizontal scaling, this tends to be
faster up to some point (due to local interconnects being faster than most networking)
pricier, so only worth it for some jobs (mostly those that require a lot of interaction between parts)
In practice, you'd do a cost-benefit estimation of how far to scale up each node before you start scaling out
At some point, speed of interconnection becomes the main issue, particularly when you need unusually high-speed/low-latency networking.



  • cluster computing versus grid computing is not a clear distinction, though in general the associations are:
a cluster is more homogeneous, i.e. mostly identical hardware; a grid is heterogeneous, i.e. more diverse
a cluster is often a tightly coupled, predetermined set of computers, and typically all in the same rack(s); a grid is often more loosely coupled and more widely spread
a cluster's jobs are often centralized and synchronized; those of a grid are more loose/flexible/dynamic/automatic (...by necessity)


Glossary of concurrent computation

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

A 'thread of execution' is a term often used in the widest sense, of "a distinct running of some code (with the necessary state for such)", and can include:


Glossary of exclusiveness

Atomic operations

Critical section

Races

Threads in a processor

Thread of execution

Thread-(un)safe

Deadlock, livelock

Deadlock and livelock are situations where the locking in two threads blocks both from getting work done.


A deadlock is when threads end up indefinitely waiting for each other, usually because of the order in which they took multiple locks.

The basic example is

two threads
two locks (protecting two resources)
a thread behaving like: lock one, lock the other, do work, unlock both.

If the second thread takes the locks in the reverse order from the first, they can end up each holding one lock, waiting until the other thread releases the other one. Which it won't.
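A minimal sketch of exactly that situation (using pthreads; the sleep just makes the unlucky interleaving near-certain):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* two locks protecting two resources */
pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

void *worker_one(void *arg) {
    pthread_mutex_lock(&lock_a);
    sleep(1);                      /* makes the bad interleaving near-certain */
    pthread_mutex_lock(&lock_b);   /* blocks forever: worker_two holds lock_b */
    /* ... do work ... */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

void *worker_two(void *arg) {
    pthread_mutex_lock(&lock_b);   /* opposite order from worker_one */
    sleep(1);
    pthread_mutex_lock(&lock_a);   /* blocks forever: worker_one holds lock_a */
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker_one, NULL);
    pthread_create(&t2, NULL, worker_two, NULL);
    pthread_join(t1, NULL);        /* never returns */
    pthread_join(t2, NULL);
    puts("done");                  /* never reached */
    return 0;
}

Having both workers take the two locks in the same agreed order removes the cycle.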

Deadlocks can often be avoided with some smart-enough checks before locking.

A given situation may be served better by a different approach to locking.


A livelock refers to situations where there is regular locking and unlocking going on, but the system is trapped in a limited subset of its states and it won't get out of this subset within acceptable time, or possibly even at all.

Often these states are functional/bookkeeping states between actual work, so it's not doing anything useful either. That means you can see it as a type of starvation, and arguably as practically equivalent to a deadlock. It can happen when the concurrency model is loose and/or complex, but exclusive locking is still an important part of it.

Livelocks may be harder to predict and detect, and may depend on secondary effects like timing and amount of load.
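To make the pattern concrete, a sketch of the classic 'polite back-off' case (pthreads): each worker grabs its own lock, tries the other's, and backs off if it can't get it. With unlucky, synchronized timing the two can keep locking, failing, and unlocking without ever getting to the work. In practice this usually resolves quickly, which is exactly why livelocks are hard to reproduce.

#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* Each worker politely backs off when it can't get its second lock.
   If both keep doing this in lockstep, they lock and unlock forever
   without ever getting to the work: a livelock. */
void *worker_one(void *arg) {
    bool done = false;
    while (!done) {
        pthread_mutex_lock(&lock_a);
        if (pthread_mutex_trylock(&lock_b) == 0) {  /* got both: do the work */
            /* ... work ... */
            done = true;
            pthread_mutex_unlock(&lock_b);
        }
        pthread_mutex_unlock(&lock_a);              /* back off and retry */
        usleep(1000);   /* identical back-off in both makes lockstep more likely */
    }
    return NULL;
}

void *worker_two(void *arg) {
    bool done = false;
    while (!done) {
        pthread_mutex_lock(&lock_b);                /* opposite order */
        if (pthread_mutex_trylock(&lock_a) == 0) {
            /* ... work ... */
            done = true;
            pthread_mutex_unlock(&lock_a);
        }
        pthread_mutex_unlock(&lock_b);
        usleep(1000);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker_one, NULL);
    pthread_create(&t2, NULL, worker_two, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}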


Starvation

(Lock) contention, false contention, lock convoys

Thundering herd

Locks and lock-like implementations

Re-entrant code

Glossary - further concepts

Coroutines

Continuations

Informally, continuations are the abstract concept of 'the rest of a calculation'.


Sometimes that is just a thought exercise, and sometimes it's done in a very duct-taped way, but when we give it the name continuation we usually mean a more generic language/library/interface framework that somehow allows a chunk of nontrivial code to be paused, and continued in some way or other.

Ideally in a way that is not clunky to write or understand.


Examples will typically also tie in practical execution details, like async, multithreading, and multiprocessing.


For those familiar with generators, those are effectively continuation-style iterables.

Generators demonstrate one of the things continuations can be useful for: you can get work on demand - just the work necessary for the next yield - and you get such state without threading.


Continuations can also be useful for concurrency-geared control structures.
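To make 'the rest of a calculation' a bit more concrete, here is a tiny sketch in continuation-passing style (plain C; the function names are made up for illustration): instead of returning, each step hands its result to the function that represents everything that should happen next.

#include <stdio.h>

/* continuation-passing style: each step hands its result to
   "the rest of the calculation" instead of returning it */
typedef void (*cont)(int result);

void add(int a, int b, cont k)  { k(a + b); }
void square(int x, cont k)      { k(x * x); }

void print_result(int r)        { printf("result: %d\n", r); }
void then_square(int sum)       { square(sum, print_result); }

int main(void) {
    /* (2 + 3) squared, with every "what happens next" made explicit */
    add(2, 3, then_square);
    return 0;
}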

Scaling, parallelizing and such

Some inherent limitations and choices

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Common fallacies of distributed computing (suggested by Deutsch and various others):

  • bandwidth is infinite
  • latency is zero (not the same, consider also switching)
  • transport cost is zero
  • the network is reliable
  • the network is secure
  • the network is homogeneous
  • topology doesn't change
  • there is one administrator

Basically, if you assume any of these, that will probably return as a fundamental limitation/annoyance. Some do have acceptable automatic solutions - but often at some cost, possibly in one of the other listed aspects.


There is the angle that some have called the mobility gap, based on the observations that:

Moore's law says processing power grows exponentially (currently it's often the one we have to think about least)
Kryder's law says storage grows exponentially (scaling takes some worry in relation to errors and failure, but we're starting to deal with that),
Nielsen's law says network speed grows considerably more slowly.

One implication is that we spend more and more time sending data around before we can process it.

This is why approaches like map-reduce became necessary when you have a lot of data you want to process as a whole: It is becoming significantly cheaper to move programs to the data than to move data to programs. Given the above patterns, this will continue to change how we solve problems.


There are many variant angles here, and some talk more specifically about how scalability, bandwidth, and such are involved in (modeled) practice.

Brewer's CAP theorem

One of the major ones is Brewer's CAP theorem[1], which basically says that no matter your answer, you are always choosing a tradeoff.

It refers to:

  • consistency - all nodes see data according to the system's own rules
often "do all nodes see the same data at the same time", and often via "once new state is successfully accepted, we expect no older state back"
  • availability - you will get a response, regardless of node failure
(that response may be an error; the point is that a failing node does not prevent the rest from continuing), and
  • partition tolerance - system continues even when parts didn't communicate perfectly
note that partitioning refers to parts that couldn't communicate (and so may no longer agree)
including "because a computer fell over due to hardware and then came back"
also including "because some messages got lost"
and tolerance usually means becoming a non-partitioned, agreeing whole again somehow


Note that all three are entangled, more so when you include "within reasonable time" in the description, and in terms of possible failures all are a matter of degree.

Also, "pick any two" does not mean that the combination of all three is an absolute no-go - it just tends towards that once you start thinking large scale and/or quite low latency. But even there - especially there - you're talking well-informed tradeoffs, not exclusion.






Eventual consistency

BASE

Scope of consistency

Consensus

Offline mode

Amdahl's law

See also:

Storage

Flavours
SAN versus NAS

Further considerations

On locks

Re-entrant lock

Critical section, mutex (mutual exclusion)

Futex

Spinlock, busy waiting

Semaphore

Higher-level constructs, communication models, and such

Granularity of locking

Opportunistic locking

Microsoft refers to this, mostly in SMB/CIFS context.


This seems to be a server-mediated lock that counts on there being one main user for a file, and signalling when there is another.

This lets the first user cache contents locally (and buffer changes) as if it had a read/write lock on the server, while adding some semantics for what happens once further users come in.

Those semantics are roughly that the second client requests a lock break, which means it asks the server to ask the first client to write its changes. (what's the state of the locks after that, though?)


https://msdn.microsoft.com/en-us/library/windows/desktop/aa365433(v=vs.85).aspx

https://www.samba.org/samba/docs/old/Samba3-HOWTO/locking.html

Optimistic locking, pessimistic locking

Optimistic locking means you

assume your work typically does not affect other work,
later check that applying your change is entirely correct.
usually that a data version or timestamp you collected before you started is still what's there
you explicitly handle the case where you need to redo it

Since no step of that was locking, this makes sense when locks are overly coarse, e.g. a table lock when most things operate on rows, and rarely on the same ones. It can also sometimes avoid things like lock convoys.

Depending on implementation, it may or may not introduce the possibility of dirty reads. Even where it does, there are various cases where that does not make much difference (e.g. a person sees their new score on the next pageload instead).


In contrast, pessimistic locking is the exclusive sort that avoids races, redoing work, and dirty reads where that applies. It's the right thing to do for strict correctness (personal information, financial transactions), but can kill performance, sometimes for no good reason.
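A minimal sketch of the optimistic style for a single record (the struct and names here are made up for illustration; real systems do this per row/document, usually inside a transaction): do the work against a snapshot, and only commit if the version you started from is still current.

#include <pthread.h>
#include <stdbool.h>

/* Hypothetical record: a value plus a version that bumps on every commit.
   The mutex only guards the brief commit step, not the work itself. */
struct record {
    pthread_mutex_t commit_lock;
    unsigned version;
    int value;
};

/* Optimistic update: snapshot, compute without holding the lock,
   then apply only if nobody else committed in the meantime. */
bool try_update(struct record *r, int (*compute)(int)) {
    pthread_mutex_lock(&r->commit_lock);
    unsigned seen = r->version;          /* remember what we started from */
    int snapshot = r->value;
    pthread_mutex_unlock(&r->commit_lock);

    int new_value = compute(snapshot);   /* the (possibly slow) work, without a lock */

    bool ok = false;
    pthread_mutex_lock(&r->commit_lock);
    if (r->version == seen) {            /* still the version we based our work on? */
        r->value = new_value;
        r->version++;
        ok = true;
    }
    pthread_mutex_unlock(&r->commit_lock);
    return ok;                           /* false: someone else won; redo from the new state */
}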

Advisory and mandatory locks

Advisory locks are those that are not enforced. Instead, it is up to each relevant application to observe them correctly.


This is opposed to mandatory locks, which are enforced for you.


Advisory locks are useful when

  • there is no easy way to implement proper locking, e.g. you are a random program wanting to exclude another copy of yourself (and you can't count on kernel or service features being there)
  • you want pessimistic locking in a context that allows optimistic locking


An advisory lock can be as simple as "a file existing here", though because this implies a no-transaction context, it allows races and other edge cases.

Lock files and file locks

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Lock files are a mechanism of advisory locks, often using files with predictable names as markers for what is in use.

For example, a program can create an empty file foo.lock to signal that foo is currently in use. Any other programs that wish to observe this convention can delay opening that file until that lock is gone.

This is not always free of race conditions, and since it's polled it's slower than it could be, but for a number of applications it's a lot better than no locking at all, when you just want to prevent multiple processes clobbering each other's work, editing the same file, or such.
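One common way to avoid the worst race (checking for the file, then creating it) is to let the create itself be the test: open() with O_CREAT|O_EXCL fails atomically if the file already exists. A sketch, assuming a local filesystem (O_EXCL was historically unreliable over NFS; the function names are just for illustration):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Try to take an advisory lock by creating foo.lock atomically.
   O_CREAT|O_EXCL fails if the file already exists, which avoids the
   check-then-create race of testing for the file first. */
int acquire_lockfile(const char *path) {
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0) {
        return -1;               /* someone else holds the lock (or an actual error) */
    }
    /* often you write your PID into it, so stale locks can be detected */
    dprintf(fd, "%d\n", (int)getpid());
    close(fd);
    return 0;
}

void release_lockfile(const char *path) {
    unlink(path);                /* removing the file releases the lock */
}

The usual caveat applies: a crashed process leaves the file behind, so you still need some policy for detecting stale locks.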


File locks usually refer to mandatory locks that you can make on filenames, which make certain file operations block (for other processes) until released.

Local file systems often can do this, managed by the system.

Doing this on networked filesystems requires distributed lock management, which is easily too inefficient or too fragile and doesn't scale well. There are cases in which it works, there are cases in which advisory locking is handier, and for applications you may want to build something cleverer.



  • *nix file locking (a short sketch of both flavours follows this list)
  • POSIX file locks
fcntl(F_SETLK)
upsides
portable
can work across NFS
byte-range locking
downsides
tied to processes, not file descriptors - meaning it doesn't solve much around threading
not secure, e.g. released on close(), which can mean embedded library code could break your locking logic
  • BSD flock
locks tied to the open file (description), not the process
upsides
safer around threads, useful around forks
downsides
not secure
not useful across NFS
less portable than POSIX
on Linux it may actually be implemented using POSIX locks (verify)
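A minimal sketch of both flavours (whole-file, exclusive, with error handling kept short):

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* POSIX advisory record lock via fcntl(): lock the whole file for writing.
   Tied to the process; released when the process closes any descriptor
   to this file, which is the close() gotcha mentioned above. */
int posix_lock(int fd) {
    struct flock fl = {0};
    fl.l_type   = F_WRLCK;   /* write (exclusive) lock */
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0;
    fl.l_len    = 0;         /* 0 means "to end of file", i.e. the whole file */
    return fcntl(fd, F_SETLKW, &fl);   /* F_SETLKW waits; F_SETLK would fail instead */
}

/* BSD-style lock via flock(): tied to the open file description,
   so it is shared by duplicated descriptors and survives fork(). */
int bsd_lock(int fd) {
    return flock(fd, LOCK_EX);         /* LOCK_EX|LOCK_NB for a non-blocking attempt */
}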

inter-process locks

Advisory locking is typically done via lock files, which are not very strict but may be good enough for purposes where apps want to cooperate.


Mandatory locks are possible but tend to be less portable (...fcntl and/or mount options).


Process-shared mutex


named mutex/semaphores


IPC semaphores


SHM mechanisms

flock notes
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

The flock command wraps flock() syscalls around commands that it runs, which means such commands on the same lockpath will run in series.

Example:

flock /tmp/sleepdemo sh -c 'echo Start;sleep 4;echo Done' &
flock /tmp/sleepdemo sh -c 'echo Start;sleep 4;echo Done' &
flock /tmp/sleepdemo sh -c 'echo Start;sleep 4;echo Done' &
flock /tmp/sleepdemo sh -c 'echo Start;sleep 4;echo Done' &

A more practical example might be making sure that cron jobs which are sometimes long-running do not overlap (though note that if they always run long, they would seriously queue up instead).



If you want to test for a lock being present (rather than unconditionally queue a command), then you want either:

  • -n means nonblocking - exit immediately, instead of waiting indefinitely
  • -w n is a timeout in seconds - after which to give up and exit
# in the background to have a thing to test against
flock      foo -c 'sleep 1h' &
 
# give up immediately
flock -n   foo -c true || echo nope
 
# give up after 5 seconds
flock -w 5 foo -c true || echo nope


Notes:

  • will create the lock file if it does not exist yet
  • Without a command, you'll get a "flock: bad file descriptor" - why? (verify)
  • many but not all filesystems support this
  • You can see what's holding a lock on
fuser -v foo

Non-blocking, lock-free, wait-free

On data

Communicating

With any sort of parallel communication, atomic communication of data can be a problem, particularly whenever threads are involved.

A now fairly common solution is to model demand/supply in some way, commonly via queues and/or generators. Such a setup forces the programmer to read and write packages of information at a time, and can allow lazy creation.

This may imply that you should always communicate copies of data, and not heap references.

When handling large amounts of data where you wish to avoid unnecessary copies (databases would, for example), you could probably get away with an MVCC-style setup, but that means a lot of management even if it avoids locking.


See how things are handled e.g. in

Versioning

lock-edit-commit-unlock

A resource is locked as a whole.

The easiest to code.


Tends to scale badly, because things are limited by the speed at which the lock can be locked and unlocked. Latency tends to reduce the maximum rate, and things will sit around waiting. (The way they wait can also be relevant, see e.g. polling/spinlocks versus event-driven things.)


Typically the same actor that locked it must unlock it.

...so in some cases this can leave the resource unusable - until a timeout (or even permanently).


Note that instead of an exclusive lock, you could use an exclusive write lock: others can read, but none should/can write without a lock. This is a little harder to code in that you may now have to deal with atomicity of the changes.
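A read/write lock is the usual primitive for that. A minimal pthreads sketch (the shared value here is just for illustration; in practice it is usually a larger structure):

#include <pthread.h>

pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
int shared_value = 0;

/* many readers may hold the lock at once... */
int read_value(void) {
    pthread_rwlock_rdlock(&rw);
    int v = shared_value;
    pthread_rwlock_unlock(&rw);
    return v;
}

/* ...but a writer gets it exclusively, so readers never see a half-made change */
void write_value(int v) {
    pthread_rwlock_wrlock(&rw);
    shared_value = v;
    pthread_rwlock_unlock(&rw);
}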

branch-and-merge, copy-edit-replace

Lockless, so people/processes don't lock each other while they work.

A copy(/branch; the difference is mostly in terminology and housekeeping) is edited, and the changes are re-committed.


Can be useful when different alterations are likely to be independent somehow - different files, different regions in text files, or such.

Often relies on (mostly-)automatic conflict resolution (...partly because manually resolving such problems is probably a time loss comparable to waiting for a lock).

'Just use the last version' may be good enough in some situations -- but that would mostly just replace with a single version and throw away all others, which is only a satisfying solution, well, when it makes sense in the particular application.


MVCC

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


tl;dr:

  • an alternative to aggressive locking - but cannot completely remove the need for locks
  • it tends to avoid reading clients blocking writing clients, and the other way around
so e.g. read-heavy workloads tend to wait less
  • it cannot prevent the occurrence of deadlocks
though you can informedly make them less likely


Multiversion Concurrency Control (MVCC) steps away from handling data both monolithically and purely sequentially (or from doing table locks, which basically means 'sequential' anyway).

Its idea is to isolate only distinct versions of data, and often works out as each client seeing/handling a consistent snapshot of a certain unit of data (where that unit may be a database row, a file, or such) within a certain transaction (e.g. a database transaction, a file operation, etc.).


Details vary with use case (e.g. databases, files, etc), and even with implementation within a use case (e.g. there are a few different MVCC systems in RDBMS systems)


In some applications, it can work out as a variant of copy-edit-replace which is aware of transactions and basic data modeling.


In some situations, for example where commits on the same database tables aren't on the same rows (e.g. different people updating their own profile text) or not conflicting in other ways (say, someone clicking 'set name' on their user profile twice), it can be a perfectly useful middle way between the slowness of locking the entire table for each such operation, and the possible inconsistency of dirty reads and such.


MVCC works well in mostly-read, occasionally-write situations, particularly when there is no necessity to see the very latest data at the earliest possible time - because you can serve the latest committed version without delay, even if there is a newer version being worked on.

From that perspective it provides lockless versioning. If properly set up, it need not break (m)any other guarantees.


When there are multiple writing clients at the same time, things depend on policy. You can say that the one that committed last becomes current, and the rest is old - it is usually fairly trivial to decide which of these is old, though you can make this cleverer and more complex if necessary.

This may sound like a bad idea, but consider that sequentially handling commits that were isolated alterations on the same basic data amounts to the same idea, with the latest commit time being decisive. In some situations you really want these operations to be serialized, in others it really doesn't matter.


MVCC is not a very good idea when there are dependencies you can break - but that's arguably a misapplication, or a fault in the modelling or code; not locking where locking is necessary is a problem under most every underlying model.

Note that it certainly matters what your unit of data encompasses (and doesn't), and also whether there is transactioning around it (which may make overall management and cleaning of old versions a little harder - see e.g. the guts of PostgreSQL's MVCC).


Larger-scale computing

The Seven Fallacies of Distributed Computing: Reliable delivery; Zero latency; Infinite bandwidth; Secure transmissions; Stable topology; Single administrator; Zero cost. --Peter Deutsch


There has recently been a general move to get more processing power by interconnecting many machines, either in your own cluster or in a cloud service, as opposed to putting more power into a single monolithic - and often disproportionately pricey - workhorse.


Which of the specific options is best depends on many things. You can think and read and write about this one for years, but a little common sense thinking about effective bottlenecks goes a long way.


Threading versus message passing

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Broadly speaking, there are two approaches to communication in parallel execution: threading + shared-memory techniques, and message passing.


Multiprocessing models:

  • UMA (Uniform Memory Access), commonly as SMP:
processor owns no memory (other than its cache), and all view the same memory
e.g. a single bus from CPU to memory controller
speed very predictable, no special planning necessary
...but doesn't scale well beyond a handful of cores
  • NUMA (Non-Uniform Memory Access): processor owns its own memory, and can view that of others more slowly
e.g. HT and QPI mean each CPU talks directly to its own physical memory (yes, the memory controller is in the CPU now), and they interchange data when necessary.
NUMA allows further (vertical) scaling than strict UMA
varying distance/time to different memory and dealing with ownership means access speeds vary more -- yet scale better with more cores
cache coherency is more complicated (and slower), which stops the scalability at some point
  • Cluster: nodes own their own memory, and cannot address that of others

Horizontal scaling inevitably slides down this scale.


To quickly contrast:

When the same memory is mapped by multiple threads of execution (which, via shared memory mechanisms, can include processes), it

  • allows no-copy sharing
  • requires explicit synchronization

On a single workstation/server (UMA/SMP, or small NUMA via HT/QPI), transfer costs are low and predictable, which makes this more efficient than a cluster.

It can be error-prone to start from a basis that never synchronizes/locks until you realize you need it (though there are now languages with direct awareness of thread safety, and better threading libraries for those without it).
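To illustrate that "requires explicit synchronization" point with a minimal pthreads sketch: two threads incrementing a shared counter. Without the mutex, the read-modify-write steps interleave and updates get lost.

#include <pthread.h>
#include <stdio.h>

long counter = 0;
pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

void *adder(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&counter_lock);    /* remove these two lines and the  */
        counter++;                            /* result is usually well short of */
        pthread_mutex_unlock(&counter_lock);  /* 2000000 (a lost-update race)    */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, adder, NULL);
    pthread_create(&t2, NULL, adder, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld\n", counter);   /* 2000000 with the mutex in place */
    return 0;
}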


Note that UMA and NUMA can be mixed, e.g. on double-CPU motherboards.


Some larger systems still manage to do NUMA fairly efficiently (e.g. by being aware of their own topology).

...but at some point you have to give up and break from it, and effectively go cluster instead.


At this point, message passing becomes preferable.

Data is always shared via a send/copy - which usually is the case anyway (and can sometimes be upgraded to no-copy within the same host under certain conditions)

Message passing is cooperative (a send has to be received), synchronization is implicit with complete sends (avoiding many race conditions).

Since data serialization is often part of the protocol, this is easily more portable.

...and many of these details can make the overhead higher. How much higher depends on how much you work with it.

Single-host computing

Distributed, parallel, cluster, grid

Memory issues

Service and business - the cloud thing

Parallel processing libraries

OpenMP

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

OpenMP targets shared-memory systems (only) - so in some ways it is just an alternative to pthreads/Windows threads.

A compiler-level (C, C++, Fortran) thing that abstracts decently, making it fairly easy to safely access data structures, sync, and do parallel sections, parallel loops, and such.
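To give the flavour, a minimal sketch (compile with something like gcc -fopenmp): the pragma asks the compiler to split the loop over the available threads, and the reduction clause takes care of combining the per-thread sums safely.

#include <stdio.h>
#include <omp.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    /* split the loop over threads; reduction(+:sum) handles the shared sum safely */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += 1.0 / (i + 1);
    }

    printf("threads available: %d, sum: %f\n", omp_get_max_threads(), sum);
    return 0;
}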


Not to be confused with (Open)MPI.

MPI

MPI is a spec. Implementations include OpenMPI, MPICH, HP-MPI, and others - see [2].


There are many things like it. MPI is a little more language/environment-agnostic than many. If a little overdesigned.


When/why MPI?

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Some notes on MPI structure and jargon

Examples

Hello world

A minimal hello world is:

#include <stdio.h>
#include <mpi.h>
 
int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);   // returns an error code; here we assume it won't have trouble
  printf("hello world\n");
  MPI_Finalize();           // ditto on the error code
  return 0;
}
As part of the init (which gets some details via the fact that you'll start it via mpirun with something like -np 4), the launcher takes care of starting all the processes, which means that each of them prints hello world.


When you want to tell the processes apart, or communicate, you'll probably want to fetch and report the rank (and often the size).
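For example, a minimal extension of the hello world above, fetching rank and size from MPI_COMM_WORLD:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
  int rank, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which process am I?
  MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many were launched?

  printf("hello from rank %d of %d\n", rank, size);

  MPI_Finalize();
  return 0;
}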

Distributing jobs

MPI config trouble

A requested component was not found, or was unable to be opened

At different levels, the lines

ompi_rte_init failed --> Returned "Not found" (-13) 
ORTE_ERROR_LOG: Not found in file runtime/orte_init.c

seem to mean the same thing.


ompi_info
should list at least a few dozen MCA lines.

If it shows only a few, and the following error, then you've probably got a mix of openmpi versions installed, and paths (binary/library) effectively mixing them.

A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      myhostname
Framework: crs
Component: none

However, this is not always a good test when specific software may do things in its own environment.

Keyval parser: error 1 reading file mpicc-wrapper-data.txt

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Seems to not be fatal (compilation works), but presumably it won't use the configuration.


Seems to typically be a bad openmpi config (possibly because of updates, mixing versions, etc). I have no idea in which way it's bad, and it seems that since the problem is rare no one has really figured that out.

A fairly thorough way to fix it is to uninstall MPI, clean any leftovers (e.g. anything that
locate -e openmpi
mentions that looks like config files), and reinstall(/rebuild).


Runtime MPI Errors

For help debugging programs using MPI, see http://www.open-mpi.org/faq/?category=debugging and similar, and some of the below.


mca_btl_tcp_endpoint_complete_connect] connect() to ipaddress failed: No route to host

Boils down to a generic TCP "No route to host" error (which is what errno 113 refers to).

Check NIC status and link lights (e.g. ethtool).

Check routing rules (e.g. netstat -r).


Error 137

Exit code 137 is 128+9, i.e. the process was killed with SIGKILL - in practice often a generic out-of-resource situation.

In our case it came from a process that allocated more memory than it should have, and was OOM-killed.


MPI_ABORT was invoked on rank 0 in communicator mpi_comm_world

Regularly


See also


Notes on threading implementations (lower levels)

tl;dr

On hardware

APIs and implementations

On threads and cores

Notes on thinking about threads and other parallelism (higher level)

Unsorted