Locking, data versioning, concurrency, and larger-scale computing notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)



Glossary

Glossary of scaling

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Types of scaling (note that this has become a bit more of a continuum in the multi-core, multi-CPU, and have-GPU world)

  • horizontal scaling, a.k.a. scaling out:
more boxes: adding more computers to do parts of a larger job
...either all with the same information, and knowledge of who works on what,
...or with horizontal partitioning, a.k.a. sharding, to distribute distinct, self-contained jobs


  • vertical scaling, a.k.a. scaling up:
bigger boxes: adding more processing power, e.g. better/more CPUs or GPUs, to the same computer
In comparison to horizontal scaling, this tends to be
faster up to some point (due to local interconnects being faster than most networking)
pricier, so only worth it for some jobs (mostly those that require a lot of interaction between parts)
In practice, you'd do a cost-benefit estimation of how far to scale up each node before you start scaling out
At some point, speed of interconnection becomes the main issue, particularly when you need unusually high-speed/low-latency networking.



  • cluster computing versus grid computing is not a clear distinction, though in general the associations are:
a cluster is more homogeneous, i.e. mostly identical hardware; a grid is heterogeneous, i.e. more diverse
a cluster is often a tightly coupled, predetermined set of computers, and typically all in the same rack(s); a grid is often more loosely coupled and more widely spread
a cluster's jobs are often centralized and synchronized; those of a grid are more loose/flexible/dynamic/automatic (...by necessity)


Glossary of concurrent computation

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

A 'thread of execution' is a term often used in the widest sense, of "a distinct running of some code (with the necessary state for such)", and can include:


Glossary of exclusiveness

Atomic operations

Critical section

Races

Threads in a processor

Thread of execution

Thread-(un)safe

Deadlock, livelock

Deadlock and livelock are situations where the locking in two threads blocks both from getting work done.


A deadlock is when threads end up indefinitely waiting for each other, usually because of the order in which they took multiple locks.

The basic example is

two threads
two locks (protecting two resources)
a thread behaving like: lock one, lock the other, do work, unlock both.

If the second thread takes the locks in the reverse order from the first, they can end up each holding one lock, waiting until the other thread releases the other one. Which it won't.
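A minimal sketch of exactly that situation (using pthreads; the sleep just makes the unlucky interleaving near-certain):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* two locks protecting two resources */
pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

void *worker_one(void *arg) {
    pthread_mutex_lock(&lock_a);
    sleep(1);                      /* makes the bad interleaving near-certain */
    pthread_mutex_lock(&lock_b);   /* blocks forever: worker_two holds lock_b */
    /* ... do work ... */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

void *worker_two(void *arg) {
    pthread_mutex_lock(&lock_b);   /* opposite order from worker_one */
    sleep(1);
    pthread_mutex_lock(&lock_a);   /* blocks forever: worker_one holds lock_a */
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker_one, NULL);
    pthread_create(&t2, NULL, worker_two, NULL);
    pthread_join(t1, NULL);        /* never returns */
    pthread_join(t2, NULL);
    puts("done");                  /* never reached */
    return 0;
}

Having both workers take the two locks in the same agreed order removes the cycle.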

Deadlocks can often be avoided with some smart-enough checks before locking.

A given situation may be served better by a different approach to locking.


A livelock refers to situations where there is regular locking and unlocking going on, but the system is trapped in a limited subset of its states and it won't get out of this subset within acceptable time, or possibly even at all.

Often these states are functional/bookkeeping states between actual work, so it's not doing anything useful either. That means you can see it as a type of starvation, and arguably as practically equivalent to a deadlock. It can happen when the concurrency model is loose and/or complex, but exclusive locking is still an important part of it.

Livelocks may be harder to predict and detect, and may depend on secondary effects like timing and amount of load.
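To make the pattern concrete, a sketch of the classic 'polite back-off' case (pthreads): each worker grabs its own lock, tries the other's, and backs off if it can't get it. With unlucky, synchronized timing the two can keep locking, failing, and unlocking without ever getting to the work. In practice this usually resolves quickly, which is exactly why livelocks are hard to reproduce.

#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* Each worker politely backs off when it can't get its second lock.
   If both keep doing this in lockstep, they lock and unlock forever
   without ever getting to the work: a livelock. */
void *worker_one(void *arg) {
    bool done = false;
    while (!done) {
        pthread_mutex_lock(&lock_a);
        if (pthread_mutex_trylock(&lock_b) == 0) {  /* got both: do the work */
            /* ... work ... */
            done = true;
            pthread_mutex_unlock(&lock_b);
        }
        pthread_mutex_unlock(&lock_a);              /* back off and retry */
        usleep(1000);   /* identical back-off in both makes lockstep more likely */
    }
    return NULL;
}

void *worker_two(void *arg) {
    bool done = false;
    while (!done) {
        pthread_mutex_lock(&lock_b);                /* opposite order */
        if (pthread_mutex_trylock(&lock_a) == 0) {
            /* ... work ... */
            done = true;
            pthread_mutex_unlock(&lock_a);
        }
        pthread_mutex_unlock(&lock_b);
        usleep(1000);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker_one, NULL);
    pthread_create(&t2, NULL, worker_two, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}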


Starvation

(Lock) contention, false contention, lock convoys

Thundering herd

Locks and lock-like implementations

Re-entrant code

Glossary - further concepts

Coroutines

Continuations

Informally, continuations are the abstract concept of 'the rest of a calculation'.


Sometimes that is just a thought exercise, and sometimes it's done in a very duct-taped way, but when we give it the name continuation we usually mean a more generic language/library/interface framework that somehow allows a chunk of nontrivial code to be paused, and continued in some way or other.

Ideally in a way that is not clunky to write or understand.


Examples will typically also tie in practical execution details, like async, multithreading, and multiprocessing.


For those familiar with generators, those are effectively continuation-style iterables.

Generators demonstrate one of the things continuations can be useful for: you can get work on demand - just the work necessary for the next yield - and you get such state without threading.


Continuations can also be useful for concurrency-geared control structures.
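To make 'the rest of a calculation' a bit more concrete, here is a tiny sketch in continuation-passing style (plain C; the function names are made up for illustration): instead of returning, each step hands its result to the function that represents everything that should happen next.

#include <stdio.h>

/* continuation-passing style: each step hands its result to
   "the rest of the calculation" instead of returning it */
typedef void (*cont)(int result);

void add(int a, int b, cont k)  { k(a + b); }
void square(int x, cont k)      { k(x * x); }

void print_result(int r)        { printf("result: %d\n", r); }
void then_square(int sum)       { square(sum, print_result); }

int main(void) {
    /* (2 + 3) squared, with every "what happens next" made explicit */
    add(2, 3, then_square);
    return 0;
}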

Scaling, parallelizing and such

Some inherent limitations and choices

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Common fallacies of distributed computing (suggested by Deutsch and various others):

  • bandwidth is infinite
  • latency is zero (not the same, consider also switching)
  • transport cost is zero
  • the network is reliable
  • the network is secure
  • the network is homogeneous
  • topology doesn't change
  • there is one administrator

Basically, if you assume any of these, that will probably return as a fundamental limitation/annoyance. Some do have acceptable automatic solutions - but often at some cost, possibly in one of the other listed aspects.


There is the angle that some have called the mobility gap, based on the observations that:

Moore's law says processing power grows exponentially (currently it's often the one we have to think about least)
Kryder's law says storage grows exponentially (scaling takes some worry in relation to errors and failure, but we're starting to deal with that),
Nielsen's law says network speed grows considerably more slowly.

One implication is that we spend more and more time sending data around before we can process it.

This is why approaches like map-reduce became necessary when you have a lot of data you want to process as a whole: It is becoming significantly cheaper to move programs to the data than to move data to programs. Given the above patterns, this will continue to change how we solve problems.


There are many variant angles here, and some talk more specifically about how scalability, bandwidth, and such are involved in (modeled) practice.

Brewer's CAP theorem

One of the major ones is Brewer's CAP theorem[1], which basically says that no matter your answer, you are always choosing a tradeoff.

It refers to:

  • consistency - all nodes see data according to the system's own rules
often "do all nodes see the same data at the same time", and often via "once new state is successfully accepted, we expect no older state back"
  • availability - you will get a response, regardless of node failure
(that response may be an error; the point is that a failing node does not prevent the rest from continuing), and
  • partition tolerance - system continues even when parts didn't communicate perfectly
note that partitioning refers to parts that couldn't communicate (and so may no longer agree)
including "because a computer fell over due to hardware and then came back"
also including "because some messages got lost"
and tolerance usually means becoming a non-partitioned, agreeing whole again somehow


Note that all three are entangled, more so when you include "within reasonable time" in the description, and in terms of possible failures all are a matter of degree.

Also, "pick any two" does not mean that the combination of all three is an absolute no-go - it just tends towards that once you start thinking large scale and/or quite low latency. But even there - especially there - you're talking well-informed tradeoffs, not exclusion.






Eventual consistency

BASE

Scope of consistency

Consensus

Offline mode

Amdahl's law

See also:

Storage

Flavours
SAN versus NAS

Further considerations

On locks

Re-entrant lock

Critical section, mutex (mutual exclusion)

Futex

Spinlock, busy waiting

Semaphore

Higher-level constructs, communication models, and such

Granularity of locking

Opportunistic locking

Microsoft refers to this, mostly in SMB/CIFS context.


This seems to be a server-mediated lock that counts on there being one main user for a file, and signalling when there is another.

This lets the first user cache contents locally (and buffer changes) as if it had a read/write lock on the server, while adding some semantics for what happens once further users come in.

Those semantics are roughly that the second client requests a lock break, which means it asks the server to ask the first client to write its changes. (what's the state of the locks after that, though?)


https://msdn.microsoft.com/en-us/library/windows/desktop/aa365433(v=vs.85).aspx

https://www.samba.org/samba/docs/old/Samba3-HOWTO/locking.html

Optimistic locking, pessimistic locking

Optimistic locking means you

assume your work typically does not affect other work,
later check that applying your change is entirely correct.
usually that a data version or timestamp you collected before you started is still what's there
you explicitly handle the case where you need to redo it

Since no step of that was locking, this makes sense when locks are overly coarse, e.g. a table lock when most things operate on rows, and rarely on the same ones. It can also sometimes avoid things like lock convoys.

Depending on implementation, it may or may not introduce the possibility of dirty reads. Even where it does, there are various cases where that does not make much difference (e.g. a person sees their new score on the next pageload instead).


In contrast, pessimistic locking is the exclusive sort that avoids races, redoing work, and dirty reads where that applies. It's the right thing to do for strict correctness (personal information, financial transactions), but can kill performance, sometimes for no good reason.
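A minimal sketch of the optimistic style for a single record (the struct and names here are made up for illustration; real systems do this per row/document, usually inside a transaction): do the work against a snapshot, and only commit if the version you started from is still current.

#include <pthread.h>
#include <stdbool.h>

/* Hypothetical record: a value plus a version that bumps on every commit.
   The mutex only guards the brief commit step, not the work itself. */
struct record {
    pthread_mutex_t commit_lock;
    unsigned version;
    int value;
};

/* Optimistic update: snapshot, compute without holding the lock,
   then apply only if nobody else committed in the meantime. */
bool try_update(struct record *r, int (*compute)(int)) {
    pthread_mutex_lock(&r->commit_lock);
    unsigned seen = r->version;          /* remember what we started from */
    int snapshot = r->value;
    pthread_mutex_unlock(&r->commit_lock);

    int new_value = compute(snapshot);   /* the (possibly slow) work, without a lock */

    bool ok = false;
    pthread_mutex_lock(&r->commit_lock);
    if (r->version == seen) {            /* still the version we based our work on? */
        r->value = new_value;
        r->version++;
        ok = true;
    }
    pthread_mutex_unlock(&r->commit_lock);
    return ok;                           /* false: someone else won; redo from the new state */
}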

Advisory and mandatory locks

Advisory locks are those that are not enforced. Instead, it is up to each relevant application to observe them correctly.


This is opposed to mandatory locks, which are enforced for you.


Advisory locks are useful when

  • there is no easy way to implement proper locking, e.g. you are a random program wanting to exclude another copy of yourself (and you can't count on kernel or service features being there)
  • you want pessimistic locking in a context that allows optimistic locking


An advisory lock can be as simple as "a file existing here", though because this implies a no-transaction context, it allows races and other edge cases.

Lock files and file locks

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Lock files are a mechanism of advisory locks, often using files with predictable names as markers for what is in use.

For example, a program can create an empty file foo.lock to signal that foo is currently in use. Any other programs that wish to observe this convention can delay opening that file until that lock is gone.

This is not always free of race conditions, and since it's polled it's slower than it could be, but for a number of applications it's a lot better than no locking at all, when you just want to prevent multiple processes clobbering each other's work, editing the same file, or such.
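One common way to avoid the worst race (checking for the file, then creating it) is to let the create itself be the test: open() with O_CREAT|O_EXCL fails atomically if the file already exists. A sketch, assuming a local filesystem (O_EXCL was historically unreliable over NFS; the function names are just for illustration):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Try to take an advisory lock by creating foo.lock atomically.
   O_CREAT|O_EXCL fails if the file already exists, which avoids the
   check-then-create race of testing for the file first. */
int acquire_lockfile(const char *path) {
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0) {
        return -1;               /* someone else holds the lock (or an actual error) */
    }
    /* often you write your PID into it, so stale locks can be detected */
    dprintf(fd, "%d\n", (int)getpid());
    close(fd);
    return 0;
}

void release_lockfile(const char *path) {
    unlink(path);                /* removing the file releases the lock */
}

The usual caveat applies: a crashed process leaves the file behind, so you still need some policy for detecting stale locks.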


File locks usually refer to mandatory locks that you can make on filenames, which make certain file operations block (for other processes) until released.

Local file systems often can do this, managed by the system.

Doing this on networked filesystems requires distributed lock management, which is easily too inefficient or too fragile and doesn't scale well. There are cases in which it works, there are cases in which advisory locking is handier, and for applications you may want to build something cleverer.



  • *nix file locking (a short sketch of both flavours follows this list)
  • POSIX file locks
fcntl(F_SETLK)
upsides
portable
can work across NFS
byte-range locking
downsides
tied to processes, not file descriptors - meaning it doesn't solve much around threading
not secure, e.g. released on close(), which can mean embedded library code could break your locking logic
  • BSD flock
locks tied to the open file (description), not the process
upsides
safer around threads, useful around forks
downsides
not secure
not useful across NFS
less portable than POSIX
on Linux it may actually be implemented using POSIX locks (verify)
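A minimal sketch of both flavours (whole-file, exclusive, with error handling kept short):

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* POSIX advisory record lock via fcntl(): lock the whole file for writing.
   Tied to the process; released when the process closes any descriptor
   to this file, which is the close() gotcha mentioned above. */
int posix_lock(int fd) {
    struct flock fl = {0};
    fl.l_type   = F_WRLCK;   /* write (exclusive) lock */
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0;
    fl.l_len    = 0;         /* 0 means "to end of file", i.e. the whole file */
    return fcntl(fd, F_SETLKW, &fl);   /* F_SETLKW waits; F_SETLK would fail instead */
}

/* BSD-style lock via flock(): tied to the open file description,
   so it is shared by duplicated descriptors and survives fork(). */
int bsd_lock(int fd) {
    return flock(fd, LOCK_EX);         /* LOCK_EX|LOCK_NB for a non-blocking attempt */
}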

inter-process locks

Advisory locking is typically done via lock files, which are not very strict but may be good enough for purposes where apps want to cooperate.


Mandatory locks are possible but tend to be less portable (...fcntl and/or mount options).


Process-shared mutex


named mutex/semaphores


IPC semaphores


SHM mechanisms

flock notes
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

The flock command wraps flock() syscalls around commands that it runs, which means such commands on the same lockpath will run in series.

Example:

flock /tmp/sleepdemo sh -c 'echo Start;sleep 4;echo Done' &
flock /tmp/sleepdemo sh -c 'echo Start;sleep 4;echo Done' &
flock /tmp/sleepdemo sh -c 'echo Start;sleep 4;echo Done' &
flock /tmp/sleepdemo sh -c 'echo Start;sleep 4;echo Done' &

A more practical example might be making sure that cron jobs which are sometimes long-running do not overlap (though note that if they always run long, they would seriously queue up instead).



If you want to test for a lock being present (rather than unconditionally queue a command), then you want either:

  • -n means nonblocking - exit immediately, instead of waiting indefinitely
  • -w n is a timeout in seconds - after which to give up and exit
# in the background to have a thing to test against
flock      foo -c 'sleep 1h' &
 
# give up immediately
flock -n   foo -c true || echo nope
 
# give up after 5 seconds
flock -w 5 foo -c true || echo nope


Notes:

  • will create the lock file if it does not exist yet
  • Without a command, you'll get a "flock: bad file descriptor" - why? (verify)
  • many but not all filesystems support this
  • You can see what's holding a lock on
fuser -v foo

Non-blocking, lock-free, wait-free

On data

Communicating

With any sort of parallel communication, atomic communication of data can be a problem, particularly whenever threads are involved.

A now fairly common solution is to model demand/supply in some way, commonly via queues and/or generators. Such a setup forces the programmer to read and write packages of information at a time, and can allow lazy creation.

This may imply that you should always communicate copies of data, and not heap references.

When handling large amounts of data where you wish to avoid unnecessary copies (databases would, for example), you could probably get away with an MVCC-style setup, but that means a lot of management even if it avoids locking.


See how things are handled e.g. in

Versioning

lock-edit-commit-unlock

A resource is locked as a whole.

The easiest to code.


Tends to scale badly, because things are limited by the speed at which the lock can be locked and unlocked. Latency tends to reduce the maximum rate, and things will sit around waiting. (The way they wait can also be relevant, see e.g. polling/spinlocks versus event-driven things.)


Typically the same actor that locked it must unlock it.

...so in some cases this can leave the resource unusable - until a timeout (or even permanently).


Note that instead of an exclusive lock, you could use an exclusive write lock: others can read, but none should/can write without a lock. This is a little harder to code in that you may now have to deal with atomicity of the changes.
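A read/write lock is the usual primitive for that. A minimal pthreads sketch (the shared value here is just for illustration; in practice it is usually a larger structure):

#include <pthread.h>

pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
int shared_value = 0;

/* many readers may hold the lock at once... */
int read_value(void) {
    pthread_rwlock_rdlock(&rw);
    int v = shared_value;
    pthread_rwlock_unlock(&rw);
    return v;
}

/* ...but a writer gets it exclusively, so readers never see a half-made change */
void write_value(int v) {
    pthread_rwlock_wrlock(&rw);
    shared_value = v;
    pthread_rwlock_unlock(&rw);
}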

branch-and-merge, copy-edit-replace

Lockless, so people/processes don't lock each other while they work.

A copy(/branch; the difference is mostly in terminology and housekeeping) is edited, and the changes are re-committed.


Can be useful when different alterations are likely to be independent somehow - different files, different regions in text files, or such.

Often relies on (mostly-)automatic conflict resolution (...partly because manually resolving such problems is probably a time loss comparable to waiting for a lock).

'Just use the last version' may be good enough in some situations -- but that would mostly just replace with a single version and throw away all others, which is only a satisfying solution, well, when it makes sense in the particular application.


MVCC

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


tl;dr:

  • an alternative to aggressive locking - but cannot completely remove the need for locks
  • it tends to avoid reading clients blocking writing clients, and the other way around
so e.g. read-heavy workloads tend to wait less
  • it cannot prevent the occurrence of deadlocks
though you can informedly make them less likely


Multiversion Concurrency Control (MVCC) steps away from handling data both monolithically and purely sequentially (or from doing table locks, which basically means 'sequential' anyway).

Its idea is to isolate only distinct versions of data, and often works out as each client seeing/handling a consistent snapshot of a certain unit of data (where that unit may be a database row, a file, or such) within a certain transaction (e.g. a database transaction, a file operation, etc.).


Details vary with use case (e.g. databases, files, etc), and even with implementation within a use case (e.g. there are a few different MVCC systems in RDBMS systems)


In some applications, it can work out as a variant of copy-edit-replace which is aware of transactions and basic data modeling.


In some situations, for example where commits on the same database tables aren't on the same rows (e.g. different people updating their own profile text) or not conflicting in other ways (say, someone clicking 'set name' on their user profile twice), it can be a perfectly useful middle way between the slowness of locking the entire table for each such operation, and the possible inconsistency of dirty reads and such.


MVCC works well in mostly-read, occasionally-write situations, particularly when there is no necessity to see the very latest data at the earliest possible time - because you can serve the latest committed version without delay, even if there is a newer version being worked on.

From that perspective it provides lockless versioning. If properly set up, it need not break (m)any other guarantees.


When there are multiple writing clients at the same time, things depend on policy. You can say that the one that committed last becomes current, and the rest is old - it is usually fairly trivial to decide which of these is old, though you can make this cleverer and more complex if necessary.

This may sound like a bad idea, but consider that sequentially handling commits that were isolated alterations on the same basic data amounts to the same idea, with the latest commit time being decisive. In some situations you really want these operations to be serialized, in others it really doesn't matter.


MVCC is not a very good idea when there are dependencies you can break - but that's arguably a misapplication, or a fault in the modelling or code; not locking where locking is necessary is a problem under most every underlying model.

Note that it certainly matters what your unit of data encompasses (and doesn't), and also whether there is transactioning around it (which may make overall management and cleaning of old versions a little harder - see e.g. the guts of PostgreSQL's MVCC).


Larger-scale computing

The Seven Fallacies of Distributed Computing: Reliable delivery; Zero latency; Infinite bandwidth; Secure transmissions; Stable topology; Single administrator; Zero cost. --Peter Deutsch


There has recently been a general move to get more processing power by interconnecting many machines, either in your own cluster or in a cloud service, as opposed to putting more power into a single monolithic - and often disproportionately pricey - workhorse.


Which of the specific options is best depends on many things. You can think and read and write about this one for years, but a little common sense thinking about effective bottlenecks goes a long way.


Threading versus message passing

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Broadly speaking, there are two approaches to communication in parallel execution: threading + shared-memory techniques, and message passing.


Multiprocessing models:

  • UMA (Uniform Memory Access), commonly as SMP:
processor owns no memory (other than its cache), and all view the same memory
e.g. a single bus from CPU to memory controller
speed very predictable, no special planning necessary
...but doesn't scale well beyond a handful of cores
  • NUMA (Non-Uniform Memory Access): processor owns its own memory, and can view that of others more slowly
e.g. HT and QPI mean each CPU talks directly to its own physical memory (yes, the memory controller is in the CPU now), and they interchange data when necessary.
NUMA allows further (vertical) scaling than strict UMA
varying distance/time to different memory and dealing with ownership means access speeds vary more -- yet scale better with more cores
cache coherency is more complicated (and slower), which stops the scalability at some point
  • Cluster: nodes own their own memory, and cannot address that of others

Horizontal scaling inevitably slides down this scale.


To quickly contrast:

When the same memory is mapped by multiple threads of execution (which, via shared memory mechanisms, can include processes), it

  • allows no-copy sharing
  • requires explicit synchronization

On a single workstation/server (UMA/SMP, or small NUMA via HT/QPI), transfer costs are low and predictable, which makes this more efficient than a cluster.

It can be error-prone to start from a basis that never synchronizes/locks until you realize you need it (though there are now languages with direct awareness of thread safety, and better threading libraries for those without it).
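To illustrate that "requires explicit synchronization" point with a minimal pthreads sketch: two threads incrementing a shared counter. Without the mutex, the read-modify-write steps interleave and updates get lost.

#include <pthread.h>
#include <stdio.h>

long counter = 0;
pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

void *adder(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&counter_lock);    /* remove these two lines and the  */
        counter++;                            /* result is usually well short of */
        pthread_mutex_unlock(&counter_lock);  /* 2000000 (a lost-update race)    */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, adder, NULL);
    pthread_create(&t2, NULL, adder, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld\n", counter);   /* 2000000 with the mutex in place */
    return 0;
}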


Note that UMA and NUMA can be mixed, e.g. on double-CPU motherboards.


Some larger systems still manage to do NUMA fairly efficiently (e.g. by being aware of their own topology).

...but at some point you have to give up and break from it, and effectively go cluster instead.


At this point, message passing becomes preferable.

Data is always shared via a send/copy - which usually is the case anyway (and can sometimes be upgraded to no-copy within the same host under certain conditions)

Message passing is cooperative (a send has to be received), synchronization is implicit with complete sends (avoiding many race conditions).

Since data serialization is often part of the protocol, this is easily more portable.

...and many of these details can make the overhead higher. How much higher depends on how much you work with it.

Single-host computing

Distributed, parallel, cluster, grid

Memory issues

Service and business - the cloud thing

Parallel processing libraries

OpenMP

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

OpenMP targets shared-memory systems (only) - so in some ways it is just an alternative to pthreads/Windows threads.

A compiler-level (C, C++, Fortran) thing that abstracts decently, making it fairly easy to safely access data structures, sync, and do parallel sections, parallel loops, and such.
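To give the flavour, a minimal sketch (compile with something like gcc -fopenmp): the pragma asks the compiler to split the loop over the available threads, and the reduction clause takes care of combining the per-thread sums safely.

#include <stdio.h>
#include <omp.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    /* split the loop over threads; reduction(+:sum) handles the shared sum safely */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += 1.0 / (i + 1);
    }

    printf("threads available: %d, sum: %f\n", omp_get_max_threads(), sum);
    return 0;
}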


Not to be confused with (Open)MPI.

MPI

MPI is a spec. Implementations include OpenMPI, MPICH, HP-MPI, and others - see [2].


There are many things like it. MPI is a little more language/environment-agnostic than many. If a little overdesigned.


When/why MPI?

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Some notes on MPI structure and jargon

Examples

Hello world

A minimal hello world is:

#include <stdio.h>
#include <mpi.h>
 
int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);   // returns an error code; here we assume it won't have trouble
  printf("hello world\n");
  MPI_Finalize();           // ditto on the error code
  return 0;
}
As part of the init (which gets some details via the fact that you'll start it via mpirun with something like -np 4), the launcher takes care of starting all the processes, which means that each of them prints hello world.


When you want to tell the processes apart, or communicate, you'll probably want to fetch and report the rank (and often the size).
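For example, a minimal extension of the hello world above, fetching rank and size from MPI_COMM_WORLD:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
  int rank, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which process am I?
  MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many were launched?

  printf("hello from rank %d of %d\n", rank, size);

  MPI_Finalize();
  return 0;
}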

Distributing jobs

MPI config trouble

A requested component was not found, or was unable to be opened

At different levels, the lines

ompi_rte_init failed --> Returned "Not found" (-13) 
ORTE_ERROR_LOG: Not found in file runtime/orte_init.c

seem to mean the same thing.


ompi_info
should list at least a few dozen MCA lines.

If it shows only a few, and the following error, then you've probably got a mix of openmpi versions installed, and paths (binary/library) effectively mixing them.

A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      myhostname
Framework: crs
Component: none

However, this is not always a good test when specific software may do things in its own environment.

Keyval parser: error 1 reading file mpicc-wrapper-data.txt

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Seems to not be fatal (compilation works), but presumably it won't use the configuration.


Seems to typically be a bad openmpi config (possibly because of updates, mixing versions, etc). I have no idea in which way it's bad, and it seems that since the problem is rare no one has really figured that out.

A fairly thorough way to fix it is to uninstall MPI, clean any leftovers (e.g. anything that
locate -e openmpi
mentions that looks like config files), and reinstall(/rebuild).


Runtime MPI Errors

For help debugging programs using MPI, see http://www.open-mpi.org/faq/?category=debugging and similar, and some of the below.


mca_btl_tcp_endpoint_complete_connect] connect() to ipaddress failed: No route to host

Boils down to a generic TCP "No route to host" error (which is what errno 113 refers to).

Check NIC status and link lights (e.g. ethtool).

Check routing rules (e.g. netstat -r).


Error 137

Exit code 137 is 128+9, i.e. the process was killed with SIGKILL - in practice often a generic out-of-resource situation.

In our case it came from a process that allocated more memory than it should have, and was OOM-killed.


MPI_ABORT was invoked on rank 0 in communicator mpi_comm_world

Regularly


See also


Notes on threading implementations (lower levels)

tl;dr

On hardware

APIs and implementations

On threads and cores

Notes on thinking about threads and other parallelism (higher level)

Unsorted