Locking, data versioning, concurrency, and larger-scale computing notes



<!--
The term "thread" is used in a handful of senses.


The widest sense is probably "executing some code in a way that is distinct from others", often referred to as a '''thread of execution'''.

Or "task" - itself a sometimes-generic, sometimes-specific term (depending on context).






===="Thread"====
===="Thread"====




=====Thread of execution=====
<!--
'Thread of execution', in the loosest sense, means "a task that is separate from another task"
: and generally, "...that probably has enough of its own state that it can do work independent of another one of these".




This is still abstract, rather than a specific implementation.

We're getting closer, but we're still pointing out the abstraction, loudly, because
how much of its own state it has, and how much state it shares,
is half the difference between a lot of implementations.

So, just to be excruciatingly clear:
whether that is a hardware-level thing, or a completely in-software abstraction, is up to context.


This makes it a term that has been abused, and is imprecise in many contexts.
Consider the following, and their differences:
* hardware interrupts
: represent "stop everything NOW (well, save what you were working on), run this code, (restore what you were working on), continue".
: (interrupts have, over time, adopted more shades, but they retain this sense of ''now'' - hence the name)


* processes
: many programs you run become one process (sometimes a few)
: in the sense of hardware-based/supported pre-emptive multitasking
: exist by merit of
:: opcodes like "save CPU state to memory"
:: the OS modeling these for you to use


* threads within a process
: exist by merit of
:: libraries
:: OS support does not have to be very smart about threads to allow them to be used at all -- but it ''helps'', so most do.
: often software-only, though in ''some'' cases they are handled by the same scheduler code that schedules processes


* micro-threads[https://en.wikipedia.org/wiki/Micro-thread_(multi-core)]


* [[cooperative multitasking]], [[coroutines]], and frameworks based on them (e.g. greenlets)
: purely software-based - the OS does ''not'' know about these, and can neither get in the way nor help you

* you ''could'' implement your own execution model inside your program.
: In many cases this is more work than it's worth.
: Yet some basic constructions qualify -- say, an '''[[event loop]]''' need not be anything more than a queue of tasks that you pick things off first-come-first-serve (see the sketch after this list).


* anything an emulator is doing
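To make that concrete, here is a minimal sketch of such an event loop (illustrative Python, not from the article): nothing more than a queue of callables handled first-come-first-serve.

 # minimal event loop sketch: a FIFO queue of callables (assumes nothing beyond the stdlib)
 from collections import deque
 
 def run_event_loop(tasks):
     queue = deque(tasks)           # pending tasks, first-come-first-serve
     while queue:
         task = queue.popleft()     # take the oldest task
         more = task()              # run it; it may return follow-up tasks
         if more:
             queue.extend(more)     # schedule any new work it produced
 
 run_event_loop([lambda: print("task 1"), lambda: print("task 2")])

Handlers that keep each call short are what keeps such a loop responsive.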




Event loops are funny.

When intended to be responsive, they are often ''used'' as a sort of manually written cooperative multitasking system: you try to create events that can be handled in small amounts of time, or handlers that do only small chunks at a time.

One upside of event loops is that they are not special, so do not need a threading API or such.
The one-thing-at-a-time approach often makes it simpler to see when things are or are not conflicting.

And sometimes they can be a cheap way of starting a ''buttload'' of jobs (because a queued job takes only storage - whereas when each job must be a thread or process, that takes finite OS kernel resources).

The typical downside is that, as cooperative multitasking, ''you'' are doing most of the scheduling.




Processors with multiple cores have been the standard for quite a while now.

If you squint, this just adds more of the same and interconnects them, and they're just more "execution units" or something that (seen at low enough level) run one task each, (relatively) independent of other cores, and sometimes they want the same resource and it's not quite twice as fast?

That would be glossing over a lot of details that matter, but ''broadly'' yes.


'''Low level stuff that you may want to know about -- just enough to know what to ignore'''

There are a few footnotes to that:


Let's start with '''instruction-level parallelism'''
: based on the observation that a CPU will contain a few distinct parts (instruction decoding, integer work, floating-point work)
: What in assembly looks like a single instruction is sometimes a complex combination of things at low enough level, and we tend to draw on only one at a time
: you can imagine that the CPU might be creative enough to overlap work that happens to draw on distinct parts
: all this scheduling and creativity is entirely internal to the CPU - you don't get control over this; the best you can do is know roughly what is there


Instruction pipelines are a useful abstraction around this - basically the idea that there would be ways of overlapping execution:
if the CPU can order instructions so that they draw on distinct functional units, they can be overlapped during execution
: which is something the CPU can (and ''should'') do entirely by itself.


They are related to the technologies of superscalar execution, operand forwarding, speculative execution, and out-of-order execution.

* the '''superscalar rather than scalar'''
: which is about executing more than one instruction per cycle


So what is the difference between
having multiple cores,
the superscalar view of multiple execution units,
and the instruction-level view of multiple pipelines?

What is the difference between
a pipeline and an execution unit?

What is the difference between superscalar and multicore?


Because in a way,
both mean a processor can execute more than one instruction per cycle
:: so where a scalar processor does at most one instruction per clock cycle, a superscalar one can do more


An execution unit is something that does the calculations forwarded from the instruction unit.
An EU tends to have its own registers.

When you start making them relatively independent, a core comes to have multiple parallel functional units within its execution units, which is referred to as superscalar design. The simplest arrangement is to use a single bus manager unit to manage the memory interface, and the others to perform calculations. Additionally, modern CPUs' execution units are


: so if
:: we just ''pretend'' there are multiple entirely distinct pipelines that actually share almost all of those parts
:: each accepts entirely distinct tasks and does its own thing
:: the degree to which they draw on only one part at a time is the degree you get parallelism for sort-of-free
:: (well, free for us - there's a lot of work for the engineers in figuring out how those related pipelines should cooperate best)
:: so where a scalar processor does one instruction per clock cycle, a superscalar processor can execute more than one per cycle
: it turns out that in a lot of general-purpose processing{{verify}}, this gives quickly diminishing returns, and we stop at two

So CPU makers figured that if you can make the CPU pretend each core is actually two, then the OS will schedule two processes on it.


This is where the terms are ''immediately'' confusing.
What's the difference between a core, an execution unit, and a pipeline?
Is each a thread of execution, or a pipeline?


{{comment|(Terms like 'execution units' are clearer and preferred, but the naming of these extra fake cores isn't very consistent, even in more technical use and less so in everyday use. You'll probably still know what's going on when e.g. something says there are 6 cores and 12 threads, or 4 physical cores and 8 virtual cores.

Also, once we get to scheduling, you might point out you can run one task in an execution unit - or perhaps more precisely, a pipeline, or perhaps more precisely, a pipeline ''stage''. Optimizations make this a little messy)}}


Now one core has two tasks.

It can figure out which parts of the core are idle and, say, do instruction decoding for one task while doing integer math for the other.


More than that, it can figure out which parts of even ''one'' instruction stream do not depend on each other,
so can be ''reordered'' to more efficiently do such unrelated things - while still being fully correct.
(Look for terms like instruction pipelining and out-of-order execution if you care what it ''actually'' does)


And it can go into a lot more micro-managed detail - and the point is that this instruction-level parallelism is ''purely'' managed by the CPU itself.


This is [https://en.wikipedia.org/wiki/Simultaneous_multithreading Simultaneous MultiThreading (SMT)], better known as its most typical implementation, Intel's HyperThreading (on the silicon it implies things like instruction reordering, speculative execution, register renaming -- but this is transparent to the OS). There are a few different stances on it:
* for everyday task-switching work, you may see a 20% increase - not bad for basically no extra silicon.
* if you do floating-point number crunching, it typically makes no difference (you typically have as many FPU/vector units as you have ''real'' cores)
* it will probably trash your cache faster
* if you can feed your CPU data fast enough, you don't need it (so: depends on main workload)
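A tiny, hedged way to see this from Python: os.cpu_count() reports ''logical'' CPUs (so it counts SMT/HyperThreading threads), while the physical-core count below relies on the third-party psutil package, if it happens to be installed.

 import os
 print("logical CPUs:", os.cpu_count())        # includes SMT/HyperThreading 'threads'
 
 try:
     import psutil                              # third-party, optional
     print("physical cores:", psutil.cpu_count(logical=False))
 except ImportError:
     pass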


-->


======The sidequest of superscalar details======
<!--

-->

====Threading within OS concepts====
<!--

More complex operating systems tend to model
* processes, which are well isolated from each other, and generally long lived
* threads, which easily interact, and may be short-lived


And then there are things like FreeRTOS, which is mostly just an embedded thing,


There is a difference between


...but it shouldn't be surprising that CPU design has grown along with the concepts

-->






=====Multithreading=====
<!--

Multithreading has a meaning in hardware.

Two completely distinct ones, in fact - so the term alone is ambiguous.

And then, in software, multithreading has a different meaning to ''those''.


Multithreading usually means "one process has instantiated multiple distinct threads of execution".

And then we usually mean the more mainstream ones that the OS scheduler knows about and can basically schedule individually.


In hardware, we originally had temporal multithreading (a.k.a. super-threading),
where only one thread of instructions can exist in a given execution unit pipeline at a time.

This matches the classical "multitasking only ''looks'' parallel, but it's just switching quickly enough".

(this also happens to match how, on the software side, we explain threads within a process: concurrent, but not parallel)


But then we figured we can have multiple pipelines, which development-wise was a ''relatively'' easy step to take from CPUs that never thought of this before.  We can add more, with minimal changes except in how things get fetched into each pipeline.

It happens to not scale very well, because that fetching contends for resources.
This is why ''most'' CPUs that implement this stop at just two, or soon after.


It turns out there's a smarter way.


In basis
: one process runs on one execution unit at a time.
: only one thread in the process can run at a time.

'''How does this help us at all?''' (...assuming for a moment instruction level parallelism doesn't help anything)


The first answer to that is scheduling.

Another part of the answer lies in the difference between '''multithreading''' and '''Simultaneous MultiThreading (SMT)'''.


When you have unrelated tasks - say, distinct jobs, e.g. an IO thread and a CPU thread -
then you know that each often has something to do that is not blocked by the other.

Meaning that they can both be scheduled, and if one says 'nothing to do' then the other will probably be scheduled more often.

Even on a single CPU, this is convenient because those two threads don't have to know about each other for this to work.


Because without threads, we can still do this.

One way to deal with that without threads is event loops: split things into small chunks,
do one at a time, hand back control.
Done right, that works quite well and incurs ''none'' of the potential issues of threads {{comment|(This is e.g. how javascript says things should be done - its execution model doesn't allow it to be multithreaded due to its original conception)}}


The event loop is ''cooperative multitasking'', meaning it is the running task's job to hand back control sooner rather than later, and it's ''really'' easy to forget to do that quickly enough.

But doing that right means you need to think about how you may block things, and how long you might do so.

In some ways, the largest (and arguably only) upside of a thread over an event loop is that it's more guaranteed to get scheduled -- because it's the OS rather than you.


Again, you can do this right - a lot of oldschool games are testament to this, because the hardware was so basic that the only way games ''could'' work was to write code as a sort of realtime OS.

Things like node.js do this too - because JS cannot do multithreading.

But when doing cooperative multitasking, guaranteeing you ''never'' hand over control later than planned is damn hard. And when we do, things hiccup, possibly hang entirely.

{{comment|(For a real example, esp8266 lets you define high level functions as distinct tasks (a little more easily in nodemcu/lua than in arduino/C), and notes that you ''must'' yield control every so-many milliseconds so that its wifi code can get scheduled -- or wifi breaks.)}}


Once we came up with the idea of multitasking, that quickly became an unacceptable state of affairs.

So we want every task to only take so long, always.

And if we want to guarantee that greedy tasks can never block your ability to finish quickly even accidentally, then instead of cooperative multitasking, we want '''pre-emptive''' multitasking, meaning we can take a task off the CPU ''without its cooperation''.


Now that we have a scheduler throwing things on and off, we can both keep it busy and guarantee every process gets a turn soon.


Back to the question '''why do threads help us at all?'''

We can e.g. '''put CPU-work and IO work in distinct threads''', meaning that if the IO thread blocks, it doesn't need to be scheduled.

You ''can'' do the same with cooperative multitasking and polling, and some programs do.
But it's more work, more bookkeeping, and the code often isn't as clear.
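To illustrate the CPU-work/IO-work split above, a minimal Python sketch (illustrative functions; a sleep stands in for blocking IO): while the IO thread waits, the scheduler is free to run the CPU thread.

 import threading, time
 
 def io_task():
     time.sleep(1.0)                                # stands in for a blocking IO wait
     print("IO done")
 
 def cpu_task():
     total = sum(i * i for i in range(1_000_000))   # stands in for CPU-bound work
     print("CPU done:", total)
 
 t_io, t_cpu = threading.Thread(target=io_task), threading.Thread(target=cpu_task)
 t_io.start(); t_cpu.start()
 t_io.join(); t_cpu.join()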


There is a similar argument for distinct instances of the same sort of job, which need to do their own IO (or hand it off to an IO thread) - there's a lot of 'what can happen when' that is handled for you.

...or any other events that are largely or entirely asynchronous, like distinct connections in a web server (and the management of the connection pool).


That, and you can add priorities - some jobs may be more important, and you can just tell the scheduler a single number. Doing that yourself is possible, but more work.


Threads are tasks that are known to the OS scheduler, which basically means pre-emptive multitasking in a more fine-grained way.

And because they share most resources within the process, generally only one of them should run at the same time.

Well, you can add locking to prevent problems, but requiring doing that extensively defeats part of the point of using threads (but only part).


Note that compared to code that cooperates ideally, multithreading is never any faster.

However, it's often a lot less entangled to put things into a few different threads.


tl;dr: threads are ''concurrency'' in the sense of 'interleaving things we can know are unrelated', not ''parallelism'' in the sense of 'doing things at the same time for more speed'


'''Can threads run concurrently on distinct cores?'''

Before multicore, this wasn't done.

The abstraction itself doesn't necessarily keep it from happening (they are entirely distinct, after all)
{{comment|They do share resources, not least of which memory, and even though trampling over shared resources is fundamentally a bug, sharing memory is not. Explicit synchronization slows things down, but that's to be expected, and probably required.}}

And multi-CPU computers were a thing before multicore.
But it didn't make sense to use threads this way,
roughly because it only makes potential sense on tightly related CPUs: even though you can get the hardware to figure out ''all'' of the memory coherence of two processors accessing the same memory space,
the delay it may incur is not good. The more interaction there is, the less efficient this is.

This is also why you still probably wouldn't try this on NUMA (less-related CPUs), or GPUs (which are vector processors), today.


So while you can do it, chances are that ''explicitly'' parallelizing distinct jobs into distinct processes for distinct cores would easily make you design a better-performing system. And potentially a more correct one.
Both because you're now thinking in terms of (minimal) message passing.


https://stackoverflow.com/questions/34889362/does-linux-user-level-pthread-thread-run-on-multiple-cores

https://scalibq.wordpress.com/2012/06/01/multi-core-and-multi-threading/


"Kernel threads" are another source of confusion

https://stackoverflow.com/questions/34889362/does-linux-user-level-pthread-thread-run-on-multiple-cores


[https://en.wikipedia.org/wiki/Simultaneous_multithreading]

-->


====Concurrency and parallelism====
<!--


'''Concurrent''', '''Concurrency''', suggests that (if you were to take a snapshot) multiple tasks are allowed to be in the middle of work.
: it says nothing about whether they are being run at the same point in time or not.

Concurrency can help even with zero parallelism
: consider that if one task has to wait for any reason, then instead of twiddling our thumbs, another task can probably run.
:: whether that is two completely separate processes
:: or two pieces of code, given the right abstractions.
::: For example, [[generators]] and other lazy evaluation imply concurrency


'''Parallel''', '''Parallelism''', means getting ahead with multiple tasks ''at the same time''
: for some value of 'at the same time', but often amounts to 'with no explicit regard for other processes', meaning hardware is free to do it at the exact same time in different silicon, if it can manage to do so. {{comment|(If not, it's probably effectively just concurrent - but often with the same API that might be parallel elsewhere, which is worth something for developers)}}
: this usually implies both hardware that can actively do things at the same time,
: and a way of exposing that to programmers in a sane way


'''Embarrassingly parallel''' - describes workloads that would take little to no effort to split into parallel tasks
: often because the tasks need no communication, because they have no relation to each other.
: embarrassing in the sense that we didn't do so already, or make it easy to in code{{verify}} [https://en.wikipedia.org/wiki/Embarrassingly_parallel]
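As a minimal sketch of exploiting an embarrassingly parallel workload from Python (illustrative work function; assumes the items really are independent), processes sidestep the shared-state questions entirely:

 from multiprocessing import Pool
 
 def work(n):
     return sum(i * i for i in range(n))        # each item is independent, no communication needed
 
 if __name__ == "__main__":                     # guard needed for process-based pools on some platforms
     with Pool() as pool:                       # defaults to one worker per (logical) core
         print(pool.map(work, [100_000] * 8))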


Multiprocessing is concurrent first, parallel ''maybe''-but-now-frequently.
(Well, if you have multiple cores or multiple CPUs. If you have just a single core, it can still only get scheduled sequentially, but given an architecture that ''allows'' multiple, the scheduler will be thinking in a potentially-parallel way)

Multithreading is ''usually'' non-parallel concurrency within a process.
A thread is still defined to


One of the main reasons these are fuzzy to us is that the OS abstracts this out in some way.
You give it unique tasks, it makes evaluation happen.

The abstraction at the OS level is, in broad terms, absolutely ''great'', because you then don't have to learn what exactly [[pre-empting]] and [[cooperative multitasking]] and [[NUMA]] and [[SMP]] {{comment|(and the vague, incorrect use of the last)}} means before you can use the PC.


A long time ago, parallelism on a single PC wasn't even a thing.
There was quickly interleaved time-sharing, but if you wanted parallelism, you had better have multiple computers.

Yet if
: tasks were made to be independent anyway,
: and you have independent execution cores (and this has been reality for two dozen years now),
then things can very much be run in parallel - and running them ''at the same point in time'' means things can get done in less wallclock time.

That said, parallelism takes some work to chunk up the task,
and if the chunks actually rely on each other, returns diminish ''fast''.

We hope for [[embarassingly parallel]] tasks: completely independent.

No matter the year, no matter how many cores.
In fact, designs that have a ''lot'' of cores tend to have each core be slower,
so a sequential task will execute more slowly.


A bit more practically, take Javascript - it does concurrency, but the language definition does not allow parallel execution at all.
It can't be hacked in without violating what the programs expressed in the language even ''mean''.
The best we can do is back to the embarrassingly parallel: put a completely distinct job in a web worker,
which the browser runs so independently that it ''can'' actually go to a different core.


'''Relation to multitasking'''

In concurrency without parallelism, only one task runs, and the switching is usually a [[cooperative multitasking]] thing: the task gives up control explicitly.

Largely because that's easier to implement.

It also makes it easier for the tasks to stop only in specific places where they know state is safe.
Yet that means they'll postpone that, meaning there is no guarantee of how snappy anything is on the system
unless you control all its parts.


General purpose PCs care more about snappiness of everything,
and we end up with [[pre-emptive multitasking]], which means tasks can be switched ''at any time'',
and which is more complex in that you need a separate scheduler to do that.


For example,
* ''Multiprocessing'' started out on single-core, and was more about concurrency.
: As we grew more cores, it was also useful for parallelism.
: today, cores don't really get faster, we get more cores

* ''Multithreading'' is to this day still mostly about concurrency, much less about parallelism.
: except that [[win32 threads]] act more parallel than [[posix threads]] {{verify}}
: is asynchronous as well (though you can break)

* ''Async IO'' approaches are used only for concurrency, not parallelism
: because it can be lighter e.g. when you have tiny tasks, or potentially ''very many''


For example,
* JavaScript gives you zero parallelism (it's single threaded),
: which means there is a lot of focus on concurrent ways of programming
: there are extensions that allow parallelism, added by node, and the browser (though they are isolated to not break JS's language model)


'''So what is this 'async IO' in that context, and when is it useful?'''
 
See the next section.
 
 
-->
 
====Coroutines and cooperative multitasking====
<!--
 
tl;dr:
* coroutines are distinct code within a task-as-in-process-or-even-thread
 
* coroutines give you less than even threading - but are also typically cheaper
 
* if a language adds abstractions related to coroutines, this tends to mean less typing and/or thinking for you
 
 
Coroutines tend to refer to a language-assisted way of doing [[cooperative multitasking]]:
more than one task can be considered to be in the middle of work,
and there is some way of handing control to them, and for them to give it back.
 
 
While pre-emptive multitasking by the OS is an easy way to get fair chunks of CPU to
hundreds to thousands of tasks, it does not scale efficiently beyond that due to the
overhead of switching different tasks onto the CPU.
 
Also, if you do multithreading instead of multiprocessing,
the process memory is shared between threads,
but threads are still unique threads of execution, so must have their own stack.
You want to start a million threads? Better have on the order of a gigabyte of unused mappable memory.
 
Also, the "we don't know where in our work we got paused" of the mix of pre-emptive + shared
means that whenever you touch shared memory, you end up making things wait on each other to ensure correctness.
Shared memory without that is... hard, or insane. At the very least a field of study.
 
IO creates a similar problem, but because it's provided by the kernel, and fairly sequential in nature,
it does such locking-like things for you, meaning it makes ''you'' wait.
That may be correct, but you can imagine your program gets held up a bunch,
and without knowledge of what bits can continue, the ''entire'' program does
(threading helps a little, as having a thread currently wait on a blocking syscall
is something that the scheduler can notice. But if you had other threads be locking,
you may actually be indirectly blocking work from getting done in other threads ''anyway'').
 
 
Cooperative multitasking is in part about defining tasks in a way where
it is clearer what can continue, and why.
 
Not in a complex declarative way. Just by the nature of thinking about tasks a little differently.
Differently enough that it goes against your habits at first,
so it's a change in pace and thinking,
but once you get used to the difference, it's not very scary.
 
But frankly, people suck at explaining the difference.
 
 
 
 
Cooperative multitasking tends to mean just a single queue of tasks.
If there's a million on there, that's just data. It'll take longer to finish,
but assuming it's in a single process, or a few, there is almost no task switching overhead.
 
Additionally, IO can be intuitively clearer, in that a task tends not to yield until it finished a well-defined piece of work,
leaving things in a state that, even if other tasks touch the same thing, is easier to handle.
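As a minimal sketch of that single-queue idea (illustrative Python, with generators standing in for coroutines): each task yields wherever it is happy to pause, and a trivial loop round-robins them.

 from collections import deque
 
 def ticker(name, steps):
     for i in range(steps):
         print(name, "step", i)
         yield                                  # cooperative: explicitly hand control back
 
 def run(tasks):
     queue = deque(tasks)
     while queue:
         task = queue.popleft()
         try:
             next(task)                         # let the task run until its next yield
             queue.append(task)                 # not finished: back to the end of the queue
         except StopIteration:
             pass                               # finished tasks simply drop out
 
 run([ticker("A", 3), ticker("B", 2)])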
 
 
We tend to use cooperative multitasking when
* we expect a lot of very short-term tasks to happen
:: whether that is a very large queue of tasks
:: ...or just a game loop
 
And more so when parts of that do IO.
 
 
One reason it's not the standard way of coding is that it takes a little more thinking and work,
and that there are more consequences if a task is not well behaved.
 
 
 
 
 
Because the idea of handing control over to each other is great from a theoretic distance,
but reality also has to contend with IO.
{{comment|(...in fact with ''any'' blocking that happens outside our control, but in practice you can design most of that away -- except IO and some other syscalls)}}
 
If you do coroutines inside the way OSes typically do processes/tasks,
then ''any'' blocking IO would still halt the current process/thread,
and thereby ''everything'' inside it
(remember what cooperative means in [[cooperative multitasking]]).
 
 
From a wider view, you ''have'' to have an IO API that works asynchronous-style
for cooperative multitasking to make, well, ''any'' practical sense to even attempt.
 
The rest is sort of up to you to do correctly, but this part is non-optional.
 
 
 
 
Also note that in itself, '''cooperative multitasking''' ''only'' gives you concurrency, not parallelism.
 
And that multiprocessing also gives you concurrency, but whether it can get evaluated in parallel depends on the hardware (and OS).
 
And that multithreading also gives you concurrency, but whether it can get evaluated in parallel depends on both the threading model and the hardware (and how (in)efficient it might be further depends on the specific hardware).
 
 
'''So, practically why choose cooperative multitasking?'''
 
When you will typically have a ''very large'' amount (thousands+) of all-tiny tasks, then
: you probably need only concurrency, not parallelism
: starting that many processes, or even threads, incurs prohibitive amounts of overhead (via at least kernel bookkeeping)
 
Cooperative multitasking is, in the end, just data in a process that a loop looks at (quite comparable to an event loop and possibly implemented that way),
so while the abstraction gives you less, it also costs less.
 
 
So yeah, use async only when you know it's a good solution.
 
 
-->
=====async and await=====
 
{{inlinecode|async}} and {{inlinecode|await}} are syntactic sugar present in some languages,
that often means you're doing cooperative multitasking -- coroutine-style programming.
 
...where the interpreter/compiler will be helping you judge correctness.
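A minimal Python sketch of what that tends to look like (illustrative; the sleeps stand in for real non-blocking IO):

 import asyncio
 
 async def fetch(name, delay):
     await asyncio.sleep(delay)                 # yields control to the event loop while 'waiting'
     return name + " done"
 
 async def main():
     # both coroutines are mid-work at the same time (concurrency),
     # even though only one of them is running at any given moment
     print(await asyncio.gather(fetch("a", 0.2), fetch("b", 0.1)))
 
 asyncio.run(main())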
 
<!--
Things get a little more interesting if you mix these functions into code that is ''not'' - which often comes down to something like 'use await to make non-blocking calls act like a blocking call anyway'.
-->
 
 
See e.g.
* [[Python_usage_notes_-_concurrency#async.2C_await]]
 
* [[Javascript_notes_-_syntax_and_behaviour#async.2C_await]]
 
===Glossary of exclusiveness and messiness===
 
====Atomic operations====
 
<!--
Atomic means something is done in one uninterruptible step, so
: given an operation to change state, either all of the changes contained in that operation will happen, or none of them.
The change it makes is somehow made indivisible and irreducible.
 
 
Few things are atomic (even at machine code level, because most useful code is still read-modify-write) so atomicity is often introduced alongside isolation.
Isolation of a particular operation means that when one actor alters state, another actor will only ever be able to see the old state and the finished state, nothing inbetween.
 
 
If they sound very similar, they are.
It's useful to introduce both because while there are solutions that guarantee both at the same time, others may do only one.
 
''Both'' matter to safety, and avoiding race conditions between threads/processes, database [http://en.wikipedia.org/wiki/ACID ACID]ity), and more.
 
 
 
Atomicity can be applied at various levels of complexity. For example
* A single CPU operation is atomic by nature
 
* assignments of a variable usually are
 
* medium-level data structures may want to add some guarantees, e.g. that 'append this element to the list' is atomic
:: That append will involve multiple smaller operations and you don't want another thread to read the list in a half-updated state
 
* many filesystem ''metadata'' alterations are atomic updates (filenames, etc.)
 
* a database system's transaction (multiple operations) either all completes, or none of it completes.
: Those who work on databases can relate this to ACIDity, specifically to atomicity ''and'' isolation. Consider how transactions appear to others only when you commit, and are not seen when you roll back.
: Say that buying a plane ticket online consists entirely of paying for it and reserving the seat. If you effectively make this change atomic, you won't have phone calls dealing with all the edge cases.
 
Making higher-level code atomic often relies on lower-level atomicity.
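As a small illustration of that last point (a sketch, not from the article): a lock is the lower-level primitive that makes a two-part update look indivisible to other threads.

 import threading
 
 class CountedList:
     def __init__(self):
         self._lock  = threading.Lock()
         self._items = []
         self._count = 0
 
     def append(self, item):
         # without the lock, another thread could see the list updated but the count not yet
         with self._lock:
             self._items.append(item)
             self._count += 1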
 
-->
 
====Critical section====
{{stub}}
<!--
A critical section is a piece of code that should be executed by at most one thread of execution at a time, because it accesses a resource that is effectively shared (data, IO, whatnot).
 
 
Critical sections are an ''idea''.
 
How exactly this is done (...without structural inefficiencies...) varies per architecture - of hardware, OS-level abstractions, and more.
 
The simplest corresponding implementation is probably a mutex.
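For instance, a minimal sketch (illustrative Python) of a mutex guarding a critical section around a shared file handle:

 import threading
 
 log_lock = threading.Lock()
 
 def log_line(f, text):
     # the write-and-flush pair is the critical section: at most one thread inside at a time
     with log_lock:
         f.write(text + "\n")
         f.flush()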
 
 
Monitor structures can sometimes be more convenient.
-->
 
====Races====
{{stub}}
 
 
A '''race condition''', a.k.a. a race, is when the behaviour of software (or electronics) depends on the timing of events outside of your control.
 
Often, "it depends on which part got there first, and there's no telling which one that will be".
 
 
For example, when multiple concurrent tasks (where having tasks be independent is part of the ''point'') are accessing the same data (be it files, shared memory, or whatnot),
in a non-isolated / non-atomic way, what will happen depends highly on the timing (and thereby ordering) of the operations of both.
 
 
Most code is written to assume it is alone and that the data can't change while it works on it. 
 
 
Particularly around [[pre-emptive multi-tasking]], that is an incorrect assumption. There is little telling where in the sequence of operations a task switch occurs, so there are ''many'' variants of interactions, making the possible errors hard to predict, or test for.
 
 
Depending on what is actually getting done, the result is anything from
the result being incorrect (e.g. two threads adding 1 to a counter just means the count will be too low), to
a total system crash.
 
 
<!--
* the result may be merely somewhat incorrect, say, two threads incrementing a counter:
Thread A reads counter C, value is 20
Thread B reads counter C, value is 20
Thread A adds 1 locally (=21)
Thread B adds 1 locally (=21)
Thread A stores back value 21 into C
Thread B stores back value 21 into C
So, 20+1+1 = 21. Great.
 
 
...to the program crashing (due to state now violating [[invariants]])
after corrupting stored data (e.g. because two processes wrote data to the same shared file handle, or it was corrupted in memory and then written out).
-->
 
 
Race conditions are most commonly solved by making a piece of code, or a resource, usable by only one thread at a time.
 
This is usually done with mutexes or semaphores. (...or critical sections, which suspend all threads but the executing one regardless of what they are doing. Which in ways is more of a hack than a solution).
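A hedged Python sketch of both halves of that: the unlocked version can lose increments (how often depends on interpreter and timing), the locked one cannot.

 import threading
 
 counter = 0
 lock = threading.Lock()
 
 def unsafe_add(n):
     global counter
     for _ in range(n):
         counter += 1              # read-modify-write: a task switch in the middle can lose updates
 
 def safe_add(n):
     global counter
     for _ in range(n):
         with lock:                # the lock makes the read-modify-write effectively atomic
             counter += 1
 
 def run(worker):
     threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
     for t in threads: t.start()
     for t in threads: t.join()
 
 run(unsafe_add); print("unlocked:", counter)   # may come out below 400000
 counter = 0
 run(safe_add);   print("locked:", counter)     # always 400000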
 
 
If you do a lot of state sharing, note that there are ways to optimize this. For example, [[lock-free data structures]] are often also race-free in ways that have less effect on performance than "lock everything" style solutions.
 
 
 
There are some libraries that can help you make code free from race conditions even with threading.
There are even programming languages with semantics that do this (mostly).
They still tend to have some tradeoff between simple to understand / fast / other things you may care about.
 
 
 
 
https://en.wikipedia.org/wiki/Race_condition
 
https://wiki.c2.com/?RaceCondition
 
====Thread-(un)safe====
{{stub}}
 
Thread safety, intuitively, asks "can you access the same thing from multiple threads without that being the reason things break?".
 
 
It is technically more precise to call this concurrency-safe, because this can happen anytime you have more than one 'thread of execution', in the generic-rather-than-just-multithreading sense.
 
That said, it is ''so'' much easier to break things in more magical ways with threads than with most any other type of concurrency.
Not because of threads per se, but because of what you do: start with shared everything, zero safety, and begrudgingly add a bit of protection.
Other methods share less, or tell you what good style is. Threads just throw you in the deep end.
 
 
 
This is primarily about ensuring that
: state belonging to each thread of execution stays isolated to that thread (as much as you sensibly can)
: any state you ''must'' share between threads is accessed in a sequential way (or is otherwise made safe)
 
 
This includes, but is not limited to,
cases where two threads call the same code, and that code keeps its own state - see [[re-entrant]] code for relevant issues.
 
<!--
 
The mechanisms of making things thread-safe vary a little.
 
For example, the way object/state accesses can be made thread-safe
are practically often different from the way a library's API can be.
 
 
Whenever it is logical that something is not unique (two threads intentionally sharing object/state), it tends to refer to whether
: other threads act properly when shared state is changed
:
 
 
Because various types of threading can be context-switched ''at any place'' in their code,
it is often relevant whether two different threads can access the same data.
If this is not atomic,
 
 
 
 
And sometimes the
 
 
This is in part about sufficiently serialized / atomized operations.
 
 
For example, MATLAB's plotting API (and the part of [[matplotlib]] that imitates that API) can never be thread-safe,
not because of complex hidden race conditions, but because it intentionally keeps "plot we are currently working on" in global state to make the commands a little less typing.
 
Matplotlib adds an object-oriented API that is thread-safe in design, because you explicitly alter only the state in that object {{comment|(though it still has some thread-unsafe edges you should take care of)}}.
 
 
 
For a variable, a lack of thread safety can mean it acts something like a cross-thread global.
 
For example, the module-level default behind python's <tt>socket.setdefaulttimeout()</tt> is not thread-safe, in that it is effectively a volatile global: a thread that sets the value can't assume it won't change. In general this will not cause race conditions, only connection attempts that take a shorter or longer time than expected, which may look somewhat non-deterministic - but that's the extent of the problem. {{comment|(and this itself was only really a problem while you could only get timeouts into a socket via this default value. You can now hand timeouts to specific sockets)}}
 
 
Often, the problem is more serious.
-->
 
====Deadlock, livelock====
 
Deadlock and livelock are both situations where the locking happening in two or more threads is the reason things are blocked from getting done.
 
 
A '''deadlock''' is when threads end up waiting for the other(s), indefinitely.
 
The typical example is one with two locks, where the order of locking them is the problem:
 
Consider
: two threads, call them P1 and P2
: two locks (protecting two resources), call them A and B
: the code in both threads is "lock one, lock the other, do work, unlock both"


Because the timing of what the threads do is unrelated, a lot of the possible orderings can happen. You could write them all out if you want a sense of completeness, but perhaps the most interesting is one of the problem cases:
: P1 locks A, nothing wrong with that
: P2 locks B, nothing wrong with that
: P2 locks A - which is currently already locked so we start waiting (for P1) to release it
: P1 locks B - which is currently already locked so we start waiting (for P2) to release it


There is nothing wrong with each individual action,
except that in this specific sequence of events,
: both are now waiting indefinitely on the other, before they can start work.
: neither will release locks until they're done with that work (which we just mentioned neither will start with)
: This can never be resolved, so both will wait indefinitely - this is a deadlock.
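
The same scenario as a short Python sketch (standard library only; the sleep is just there to make the unlucky interleaving likely). One common fix is also shown: have every thread take the locks in one agreed-upon global order.

<syntaxhighlight lang="python">
import threading, time

lock_a = threading.Lock()
lock_b = threading.Lock()

def p1():
    with lock_a:
        time.sleep(0.1)          # makes the unlucky interleaving likely
        with lock_b:
            pass                 # "do work"

def p2_deadlocky():
    with lock_b:                 # opposite order: B, then A
        time.sleep(0.1)
        with lock_a:
            pass

def p2_ordered():
    with lock_a:                 # same order as p1: A, then B
        time.sleep(0.1)
        with lock_b:
            pass

# Running p1 together with p2_deadlocky will usually hang forever;
# p1 together with p2_ordered always completes.
t1 = threading.Thread(target=p1)
t2 = threading.Thread(target=p2_ordered)
t1.start(); t2.start()
t1.join(); t2.join()
print("done - consistent lock ordering avoided the deadlock")
</syntaxhighlight>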






There are ways to detect deadlocks ahead of time.




<!--
There are other ways to avoid them that are sometimes simpler.

...including ''less'' finely grained locks.

Consider a single lock that protects ''both'' mentioned resources at the same time.

For the above case ("lock one, lock the other, do work, unlock both") this is actually equivalent -- the threads' work becomes purely sequential.

In the general case, it is easily less efficient to use one big lock, because things that don't even touch the resources may now be waiting on each other.

Yet it's better than hanging forever.

And often it's not ''too'' hard to think something up that is safe but doesn't defeat the concurrency ''entirely''.


Other deadlocks are resolved for you. Say, relational databases will typically contain a lot of logic that basically means that if two transactions happen to deadlock, one will be rolled back.


''Specific'' deadlock situations may be easy enough to detect ahead of time, but this requires knowledge of the situation, its locks, and how to write such checks without introducing race conditions.

Sometimes there's a cheat in using locks with timeouts and making your code retry ...but this doesn't scale well at all, and even at smaller scale can cause other problems, like livelocks instead.


When you get to design the locking (e.g. in your own programs, rather than in a given database),
consider different approaches to locking.

There has been plenty of research into lock-free data structures, cleverer lock relief, and such.
-->


A '''livelock''' refers to situations where there ''is'' regular locking and unlocking going on, but the system is trapped in a limited subset of its states, and may not get out of this subset within acceptable time, or possibly not at all.

Often these states are functional/bookkeeping states between actual work, so nothing useful is getting done either.
Meaning you can see it as a type of starvation, and arguably ''practically'' equivalent to a deadlock.

It can happen when the concurrency model is loose and/or complex, but exclusive locking is still an important part of it.

Livelocks may be harder to predict and detect, and may depend on secondary effects like timing and amount of load.

<!--
For example, consider trying to fix the above deadlock situation with "if you can't lock within 1 second, then unlock and try the whole thing again later". This may work, but in unlucky situations (mostly relating to timing and load) you may see a lot of giving up and retrying without much actual progress.
-->


<!--
The common example for deadlocks is two resources, call them R1 and R2, each with their own lock.
In itself this is a fine construction to safely work on each, effectively atomically.

The potential problem comes in when a thread needs both resources in combination (say, because they are related).
Another thread will probably want the same. Call these two threads T1 and T2.

Say that
: T1 locks R1
: T2 locks R2
and both are happily working -- independent and simple so far.

Next
: T1 wants R2, tries to lock it, and therefore starts waiting for T2 to release it.
Still fine - it's not doing anything, so it may be idle CPU, but it's safe.

Now
: T2 was working until now, wants to lock R1, and therefore waits for T1 to release it.

If the lock-acquiring call waits until the lock is released, both threads are now in a state in which they will never do anything else as long as they live.


The [http://en.wikipedia.org/wiki/Dining_philosophers dining philosophers] problem is another common illustration of deadlock, starvation, and some possible solutions, depending on how it is explained.


A big problem (on the theory side) is that deadlocks are a property of the whole system,
not of any individual worker in it.

A big problem on the practical side is that we can often get away with not caring, especially in systems that detect them, like RDBMSes - and things are faster that way until you hit the problem (and even then it's sometimes faster just to make that client retry).

You can get an intuition for both with the dining philosophers:
you can only see the problem when you have a complete overview, not while you're fighting with your direct neighbour.
And assuming they'll grab for the other side if one side is taken, you won't have a problem even without that overview, until absolutely all of them want to eat at the same time. So yes, it's tempting to ignore them until they're a real problem.


When deadlocks can happen, you have a few options:
* avoid separate locks on things that depend on each other - in the example above, use a single lock meaning 'R1 ''or'' R2 is being used'
: doing this can mean a lot of overzealous locking
: ...and lead to the system not using its resources

* allow your lock() call to time out
: often means you'll want code like 'while the lock was not acquired, wait and try again'
: basically backoff of the simplest type (arbitrary timing/criterion)
: may lead to livelocks
: may lead to idle (unlocked) time
: ...and may mean the system does not use its resources very efficiently

* add checks to make a thread back off if it thinks it may cause a deadlock
: can be a decent solution for two threads, but not easily for more (easily livelocks)

* add checks to make sure deadlocks can never happen
: which sometimes is easy enough, but sometimes isn't - you need to predict what would happen, which with circular dependencies can be complex

* use a lock manager that can detect deadlocks (and resolve them by denying threads - which your threads must be able to deal with)
: often means overhead on locking, and that you have to use this system for locking everywhere
: for important/central resources it can certainly be worth it
-->


====Starvation====
<!--
A situation where, pragmatically speaking, work ''could'' get done, but it isn't, or it is happening at an unreasonably slow rate.

You may be able to see that a thread is having trouble getting or starting a task, or sometimes registering it as complete.

The reason behind that is typically the perpetual denial of a resource - for whatever reason. This is usually due to resource locking - sometimes too coarse, sometimes fine-but-interdependent, sometimes bad timing or a livelock-like situation - to the point where it can freeze a program almost indefinitely.

See the [http://en.wikipedia.org/wiki/Dining_philosophers dining philosophers]. Starvation is probably named after what would happen to them.


Starvation is often avoidable, given some consideration of what an implementation does in different situations - in particular under load.

For example, there are several semaphore implementations; some will never cause starvation, in others it is possible.
-->


====(Lock) contention, false contention, lock convoys====
{{stub}}
<!--
'''Contention''' in general refers to conflict over the use of a resource.


'''Lock contention''' refers to cases where a lock protects use of a resource, and the potential users of the resource are mostly waiting on the lock, rather than doing something useful with the resource.


A '''lock convoy''' refers to situations where lock contention has caused a lot of workers to block on a lock, meaning there is a long line of halted workers.

It points specifically at cases where processes do get scheduled, but when they do they can only conclude they can't get the lock and have to give up -- the fact that you may have hundreds of threads getting scheduled with no result adds a lot of extra load that slows things down (...while still collectively progressing, just very little, so not quite a livelock situation).


''If'' the resource must have at most one user for safety reasons,
then this is basically what you want, or at least expect,
and you just have to accept the maximum rate you can get with this precise setup.


Lock contention is something you can improve when
* the overhead of locking (or the scheduling around it) is larger than necessary, also meaning the resource behind the lock isn't actually being saturated
: on linux, this is a reason to look at using a [[futex]] over a mutex

* acquiring a lock earlier than necessary / releasing it later than necessary makes that rate lower than it could be

* a program is blocking on this resource, but actually ''could'' have been doing other, unrelated work instead


Lock contention is a serious problem when
* it leads to / causes a livelock

* it leads to / causes a deadlock


'''False contention''' usually refers to ''unnecessary'' contention, say, two processes locking an overall resource when they access nonconflicting parts and ''could'' know about that.

It can mean the locking is too coarse-grained, although the solution isn't always simple: false contention is not always predictable, and fine-grained locking may lead to different and sometimes much more significant problems, including deadlocks.


For example, a table-based database locking an entire table instead of the one row it will change is likely to lead to contention on the table lock and the table itself - and, on busy tables, a lock convoy.

If there were a mechanism to lock rows instead of tables, you could also call this false contention (except that there isn't - and row locking is often implied).


Whether this is avoidable or not depends on the specific system, its overall design, its mechanism of locking, whether your code has other things to do, and more.


See also:
* http://en.wikipedia.org/wiki/Lock_convoy
-->
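
As a small structural illustration of the granularity point (a sketch, not a benchmark - the names are made up): two unrelated counters behind one shared lock contend with each other for no good reason, while giving each its own lock removes that false contention.

<syntaxhighlight lang="python">
import threading

a = 0
b = 0

# Coarse: one lock protects both counters, so updates to a also block updates to b,
# even though they never touch the same data.
coarse_lock = threading.Lock()

def bump_a_coarse():
    global a
    with coarse_lock:
        a += 1

def bump_b_coarse():
    global b
    with coarse_lock:
        b += 1

# Finer-grained: unrelated state gets unrelated locks, so the two never wait on each other.
a_lock = threading.Lock()
b_lock = threading.Lock()

def bump_a_fine():
    global a
    with a_lock:
        a += 1

def bump_b_fine():
    global b
    with b_lock:
        b += 1
</syntaxhighlight>

The usual trade-off applies: more and finer locks reduce (false) contention, but make it easier to create ordering problems like the deadlocks described earlier.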


====Thundering herd====
{{stub}}
<!--
In the widest sense: issues that arise when everyone tries to do something at exactly the same time.


For example, if you restart a database, it may be that a thousand clients want to reconnect within the same second or so - that's likely to make the server churn very hard for a few seconds until all are handled.


Network servers may use select() to check "is there at least one new connection on a socket?" ''and'' may use multiple worker threads, meaning each will do that select().


A more common example is a large number of clients waiting on the same event, all of which are woken at the same time.

This is potentially ''worse'' than the previous example, in that often only one of them is allowed to handle the resource at a time, meaning that all clients will ''continue'' to badger you until they've all had a turn.

For example, say that on one host, many processes are all waiting on the same file, mutex, or such.
That event will make the kernel schedule all listening processes, more or less at once.

Depending on their code, they may be bottlenecked all trying to get access to the same thing,
creating a large amount of CPU work and a large number of system calls,
which in most processes will lead to doing nothing ''yet'',
and will probably only calm down after each is finished (or decides to give up).

Also note that most of these processes might be entirely blocked from being scheduled (because of blocking IO).
This is often the right thing to do, but also a perceivable and often-avoidable slowdown.


A similar issue happens in distributed systems - varying a lot with how you do serialization (and locking, if applicable).

Consider a popular social media account getting hundreds of thousands of likes within a minute. Both writing and reading out the current like count would create a pileup of connections towards the database (a classical database will serialize each addition, as part of its consistency management).


Alleviations:
* backoff - if you know you are easily part of a thundering herd, consider exponential backoff and/or random backoff, which means clients themselves spread their accesses over a reasonable time
: crude, but moderately effective

* caches
: if you can use a cached result rather than going to the source, that can offload a lot of work
: however
:: consider that if 'going to db, then storing in cache' takes a second, then after a restart, all requests within the first second are the worst case you will see.
::: [https://instagram-engineering.com/thundering-herds-promises-82191c8af57d this] seems to suggest having an in-backend Promise so that many workers can wait on the same code to complete and return.
:: consider that if a cache fails and the fallback is to go straight to the database, then this is almost guaranteed to overload your db. This is a design pitfall that means some also call this a '''cache stampede'''.

* specific-case solutions
: in the many-likes example above, you could instead gather likes in memory, and e.g. send them to the database only every few seconds. This reduces connection overhead to "not the issue anymore", and it's perfectly acceptable for that count to only catch up within half a minute or so.
:: note this means you can no longer really acknowledge the action, but for likes, that's acceptable.
: some creative solutions go further, like "normally I'd give you the live version, but load on this item is currently high, I'll give you a stale version instead"
-->
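
One common alleviation for herd-like retry storms is exponential backoff with random jitter: each client waits an exponentially growing, randomized time before retrying, so a crowd of clients spreads itself out instead of retrying in lockstep. A minimal Python sketch (standard library only; <tt>try_connect</tt> and all delays/limits are made-up placeholders):

<syntaxhighlight lang="python">
import random
import time

def with_backoff(operation, max_attempts=6, base_delay=0.1, max_delay=10.0):
    """Retry 'operation' with exponential backoff and full jitter.

    'operation' should return a truthy value on success; the attempt count
    and delays here are arbitrary placeholders.
    """
    for attempt in range(max_attempts):
        try:
            result = operation()
            if result:
                return result
        except Exception:
            pass                                  # treat an exception as a failed attempt
        cap = min(max_delay, base_delay * (2 ** attempt))
        time.sleep(random.uniform(0, cap))        # full jitter: anywhere up to the growing cap
    raise RuntimeError("gave up after %d attempts" % max_attempts)

# hypothetical use, where try_connect is whatever operation the herd is retrying:
#   with_backoff(lambda: try_connect("db.example.internal"))
</syntaxhighlight>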


====Locks and lock-like implementations====
{{stub}}
<!--
The following are mostly synchronization primitives, though their exact purpose and use differs.


* Semaphore - basically, a counter which has some special behaviour at call time


* Mutex - a lock struct
:: in many ways can be seen as a binary semaphore, although the exact details differ. In particular, a mutex is a locking mechanism, whereas a semaphore is more of a signaling mechanism


* Recursive mutex - a variant where a thread locking the mutex more than once does something useful (a simpler mutex would just deadlock, or maybe refuse)


* Futex - a mutex-like construction which does (almost) all operations in userspace.
:: Useful when you need fine-grained locking and want to avoid the overhead of the system calls implied in kernel-space mutexes.


* Spinlock


* Wait queue


Patterns, language constructs

* Monitor - a set of routines of which only one can execute at a time.
Often means a mutex plus some gluing code.

* Completion


Lower level:
*

The kernel is highly asynchronous - that is, its state can be touched from many contexts at once.
Things done in interrupts may cause races, so the simplest fix is <tt>cli(), your code, sti()</tt>.
However, that basically reduces your CPU to a single-core one for that time.
-->
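
For a rough feel of a few of these, here is a minimal Python sketch using the standard library's threading module (which provides Lock, RLock - a recursive mutex - and Semaphore); the limit of three concurrent workers is arbitrary.

<syntaxhighlight lang="python">
import threading

plain = threading.Lock()         # mutex: one holder at a time; the same thread
                                 # locking it twice would deadlock itself
recursive = threading.RLock()    # recursive mutex: the owning thread may lock it again
slots = threading.Semaphore(3)   # counter: at most three holders at the same time

def uses_recursive():
    with recursive:
        with recursive:          # fine: same thread, count goes to 2 and back down
            pass

def limited_worker(i):
    with slots:                  # blocks while three other workers hold a slot
        print("worker", i, "running")

threads = [threading.Thread(target=limited_worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
uses_recursive()
</syntaxhighlight>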
=====Lock hierarchies=====
<!--
Lock hierarchies are one way of thinking about related locks,
and can be one way of reasoning about how to avoid deadlocks.

You may be doing this already, with coarse-grained locks protecting fine-grained locks.
-->


====Re-entrant code====
{{stub}}

'''Re-entrant code''' may be defined most by the idea that
code can be called when another call on that code is already underway
: ...re-entrant literally just refers to entering that code again.


This can still be meant in a few ways, and you usually hear a slightly more specific term/property (though they are closely related), like:
* is safe to pause and continue execution later
* can be safely called before a previous invocation has completed
* code that can be called by multiple threads of execution
* execution that keeps only local, static, non-shared state

...most of which don't so much define what re-entrancy ''is'' {{comment|(these are e.g. also properties you care about around [[recursive functions]])}} as describe specific ways to (not) ''violate'' it.


And yes, in most cases, most of what ensures re-entrancy is about not sharing state with other execution (e.g. keeping your state to parameters and local variables). But
: the case-specific details are also important for a more complete grasp.
: you sometimes ''have'' to share state (a lot of heap allocation is shared), and it is your code that should think about ensuring re-entrancy


'''Recursion''' may be the easiest introduction,
because it illustrates that the state of each call into a function can (and should) be kept entirely distinct from every other (each with its own scope based on its own stack frame),
and that you don't need concurrency (that you don't control) to worry about re-entrancy.


Another example is the C library function '''strtok''', which lets you split a string by a delimiter.
It was written so that repeat calls give you just the next token (until there are no more),
and the original version would basically remember ''itself'' where it was in the string, making repeat calls easier.
That memory is however in global state, which means that you cannot start tokenizing a new string until you're done with the last. The variant '''strtok_r''' basically makes ''you'' remember the state (it has an extra parameter). It's a little more typing, but this makes it re-entrant.
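
The same idea in Python terms - a sketch with made-up names, not a real API: a tokenizer that keeps its position in module-global state cannot have two strings in progress at once, while a variant that hands the state to the caller (strtok_r-style) can.

<syntaxhighlight lang="python">
# Not re-entrant: the current position lives in module-global state,
# so starting a second tokenization clobbers the first (strtok-style).
_state = None

def next_token(s=None, sep=" "):
    global _state
    if s is not None:
        _state = (s, 0)
    s, pos = _state
    if pos >= len(s):
        return None
    end = s.find(sep, pos)
    end = len(s) if end == -1 else end
    _state = (s, end + len(sep))
    return s[pos:end]

# Re-entrant: the caller owns the state (strtok_r-style), so any number of
# tokenizations can be in progress at once, even within a single thread.
def next_token_r(state, sep=" "):
    s, pos = state
    if pos >= len(s):
        return None, state
    end = s.find(sep, pos)
    end = len(s) if end == -1 else end
    return s[pos:end], (s, end + len(sep))

state = ("split this string", 0)
token, state = next_token_r(state)
while token is not None:
    print(token)
    token, state = next_token_r(state)
</syntaxhighlight>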


But beyond that, re-entrancy is also a quality you care about
in a system that does [[concurrency]] (multitasking, threading, but also concurrency that is purely sequential and cooperative),
and is also relevant around hardware interrupts,
callbacks that may fire quickly (e.g. on a timer) and may overlap,
and some other things.

Each of those is its own topic,
because what makes code unsafe in each context is specific, as are the solutions (e.g. hardware interrupts have some unique worries).


Around concurrency, your largest worries are often
* not using shared or global state
* atomic changes, because races are an easy way to make something violate re-entrancy


<!--
There are other things that are likely to lead to trouble, like locking.
Locking (or other exclusiveness) may be one way to work around these issues - but also an easy way to introduce deadlocks instead.

[[#Re-entrant lock|Re-entrant locks]] are a related but different concept.
They apply the 'safe to X' idea to locks, but it just ends up meaning that re-entrant locks can be acquired by the same thread of execution more than once without deadlocking itself.
-->


'''Re-entrant code is intuitively related to thread safety''',
because both are about doing things safely with probably-shared state.

However,
* code can be re-entrant but not thread-safe
: in part because re-entrancy can be about safety within just one thread.
* code can be thread-safe without being re-entrant
: e.g. when the violation doesn't matter - use of random_r() is important for fully reproducible results, but in a bunch of cases random() will be acceptable

...or both, or neither.

There are some good examples here:
https://stackoverflow.com/questions/856823/threadsafe-vs-re-entrant


That said, a lot of code either
: doesn't have to worry about either, or
: makes it easy, or hard-but-important, to do both

<!--
It is very easy to have code that is neither re-entrant nor thread-safe,
by having a function make a non-atomic change to state external to it, meaning it can race with another copy of itself.
-->

<!--
'''Other properties associated with re-entrancy'''

Code that calls a non-re-entrant function is unlikely to be re-entrant.
Code that modifies itself is much less likely to be re-entrant.

Some people associate idempotency with this, but that isn't a clear connection.

Re-entrancy is one reason for encapsulation,
in that having references to other objects' state is an easier way to violate re-entrancy.
-->

<!--
'''Probably, most code in general is not re-entrant'''

However, it only matters if execution makes it a problem.

Often, partitioning off your code somehow avoids re-entrancy problems.
-->


===Split-brain===

Split-brain describes a situation of inconsistencies between parts of a larger system.


It is ''sometimes'' used for intentional design, e.g. private-network DNS and outside-facing DNS giving different answers,

...but much more usually for cases like
* distributed services failing to keep synchronized
* synchronization not happening properly around times of failover
* problems in failover logic, like multiple servers believing they should be the new master (often takes ''relatively'' pathological cases)


Dealing with split-brain often involves a tradeoff between availability and consistency.


https://en.wikipedia.org/wiki/Split-brain_(computing)


===Glossary of scaling===
{{stub}}

Types of scaling {{comment|(note that this has become a bit more of a continuum in the multi-core, multi-CPU, and have-GPU world)}}
* '''horizontal scaling''', a.k.a. '''scaling out''':
: more boxes: adding more computers to do parts of a larger job
: ...either all with the same information, and a knowledge of who works on what,
: ...or '''horizontal partitioning''', a.k.a. '''sharding''', to distribute into entirely distinct, self-contained jobs <!--
: 'scaling in' is a term used in automatic partitioning, where you stop VMs
-->


* '''vertical scaling''', a.k.a. '''scaling up''':
: bigger boxes: adding more processing power, e.g. better/more CPUs or GPUs, to the same computer
: In comparison to horizontal scaling, this tends to be
:: faster up to some point (due to local interconnects being faster than most networking - but this is also why specialized networking blurs this line)
:: often pricier for the same compute, so only worth it for some types of jobs (e.g. those that require a lot of computation ''and'' near-zero latency between parts)
::: In practice, you'd do a cost-benefit estimation of how far to scale up each node before you start scaling out
::: sometimes a little more stable, in that a single node or single network connection can't trip up large jobs
: At some point, speed of interconnection becomes the main issue,
:: particularly when you ''need'' unusually-high-speed/low-latency networking.

<!--
* You also see horizontal partitioning versus vertical partitioning, though relating more to storage, and often explained in terms of databases and/or relational algebra (SELECT versus PROJECT)
: say you have a lot of users and a lot of columns and decide you want that in smaller chunks somehow
:: horizontal partitioning might split into lots of tables with the same schema, and split e.g. based on the first letter of the username
:: vertical partitioning would be splitting the columns between tables -- i.e. have lots of tables all with the same users, but each with specific columns
::: This ''can'' be done in the form of normalization, but can be simpler splitting (which can be more finicky)
::: can help performance when it splits off rarely-queried data
-->


* '''cluster computing''' versus '''grid computing''' is not a clear distinction, though in general the associations are:
: a cluster is more homogeneous, i.e. mostly identical hardware
: a grid is heterogeneous, "whatever you have"

: a cluster is often a tightly coupled, predetermined set of computers, and typically all in the same rack(s)
: a grid is often more loosely coupled and more widely spread

: a cluster's jobs are often centralized and synchronized
: a grid's jobs may be more loose/flexible/dynamic/automatic (...by necessity)

: a cluster is more likely to have a central node give jobs to the workers
: a grid is more likely to have each client ask for work from the central node

<!--
* '''cloud computing''' is more of a management term than a technical one.
: sometimes means cluster
: but the way these are scaled means you shouldn't count on very-low latencies (e.g. AWS/Azure) (should be stable, but not as low as it could be)
: then again, high-performance computing may well be set up as a cluster, because some people need the fast synchronization


* '''Centralized''' tends to refer to setups where a lot of nodes report to their master - usually a single master
: e.g. scientific grids like PrimeNet, Seti@Home

* '''Decentralized''' may mean a more hierarchical, or many-master (rather than one-master-many-workers) setup, or sometimes cases with no real master nodes
: e.g. DNS is mostly centralized, but with some hierarchical-style caching
-->

* '''Distributed''' refers to a group of nodes that know about each other and communicate as they see fit (details varying by algorithm, agreement, protocol)
: bittorrent's data transfer is distributed (good for most nodes). Knowledge of peers in the swarm starts centralized, and is updated in a distributed way.


<!--
'''Scalability'''

On scalability in broad terms:
* linear scalability only works when things are not dependent steps. The handing-out-distinct-jobs approach is the simplest model - if it applies to the given problem.
* halfway decent scalability only works when there is not a shared resource that has to be locked and waited for, or a bottleneck because of bandwidth
* scalability is hurt by locks, though not necessarily that much unless you have central / slow ones

On scalability on a single computer:
* scalability is hurt somewhat by context switches
* brief but frequent locks may interact badly with certain types of scheduling
* timeslicing schedulers may not scale well
* Classical threading doesn't actually scale well beyond a few hundred threads{{verify}}


* '''Scaling down''': not letting individual processing instances clog up the system when there is no work for them
-->


===="Decentralized"====
ELSEWHERE


===Glossary - further concepts===

====Coroutines====
{{stub}}
<!--
Coroutines are things that can react to each other, yet not in a fundamentally subordinate way.

For context, ''sub''routines are things where one is under the control of the other.

This is simple, but in some concurrency that's an annoyingly restrictive model,
so people went more abstract and thought up the concept of coroutines.


Implementation details and uses vary.
They may work out as communicating equals, and occasionally as related subordinates.


When you write functions that are functionally ''co''routines, you may wish to look for an alternate model for their execution, often to get more done with less bookkeeping (and potentially more efficiently and scalably).

When looking around, you'll find that models that include coroutines often come with [[continuations]] too, because that's often useful {{comment|(though it does make the distinction between coroutine and continuation fuzzier)}}.


Examples may focus on the functional model, or on execution - in various contexts, things that are coroutines at their core are wrapped into subroutines, multithreading, multiprocessing, or such, because that makes more sense in the immediate context.

Examples could be said to include
* many things that feel natural to place in separate threads
* pipes
* producer-consumer style communication (useful with threading, but also in non-threaded processes)


For contrast:
When you want two pieces of code to cooperate, basic imperative programming gives you roughly three types of options:
* mutual recursion (may be problematic because of stack size),
* sequential calls with some agreed state (a lot of bookkeeping)
* and threading (needs bookkeeping and locking, so is often overkill for the task, and hard to get race-free).
-->
====Continuations====
: e.g. DNS is mostly centralized, but with some hierarchical-style cacheing


* '''Distributed''' refers to a group of nodes that knows about each other and communicate as they see fit (details varying by algorithm, agreement, protocol)
Informally, continuations are the abstract concept of 'the rest of a calculation'.
: bittorrent's data transfer is distributed (good for most nodes). Knowledge of peers in the swarm starts centralized, and is updated in a distributed way.  




'''Scalability'''
Sometimes as just a thought exercise, sometimes in a very duct-taped way,
but when we give it the name continuation we often have a more generic language/library/interface framework that somehow allows a chunk of nontrivial code to be chunked/paused, and continued in some way or other.


On scalability in broad terms:
Ideally in a way that is ''not'' clunky to write or understand.
* linear scalability only works when things are not dependent steps. The handing-out-distinct-jobs approach is the simplest model - if it applies to the given problem.
* halfway decent scalability only works when there is not a shared resource that has to be locked and waited for, or a bottleneck because of bandwidth
* scalability is hurt by locks, though not necessarily that much unless you have central / slow ones


On scalability on a single computer:
* scalability is hurt somewhat by context switches
* brief but frequent locks may interact badly with certain types of scheduling
* timeslicing schedulers may not scale well
* Classical threading doesn't actually scale well beyond a few hundred threads{{verify}}


Examples will typically also tie in practical execution details,
like async, multithreading, multiprocessing,




* '''Scaling down''': not letting individual processing instances clog up the system when there is no work for them


For those familiar with [http://en.wikipedia.org/wiki/Generator_%28computer_science%29 generators], those are effectively continuation-style iterables.


-->
Generators demonstrate one of the things continuations can be useful for: You can get work on demand - just the work necessary for the next yield, and, you get such state without theading.


===="Decentralized"====
ELSEWHERE


Continuations can also be useful for concurrency-geared control structures.


===Glossary - further concepts===


====Coroutines====
<!--
Coroutines are things that can react to each other, yet not in a fundamentally subordinate way.

For context, ''sub''routines are things where one is under the control of the other.
This is simple, but in some concurrency that's an annoyingly restrictive model,
so people went more abstract and thought up the concept of coroutines.


Implementation details and uses vary.
They may work out as communicating equals, and occasionally as related subordinates.


When you write functions that are functionally ''co''routines, then you may wish to look for an alternate model for their execution, often to get more done with less bookkeeping (and potentially more efficiency and scalability).


When looking around, you'll find that models that include coroutines often come with [[continuations]] too, because that's often useful. {{comment|(though it does make the distinction between coroutine and continuation fuzzier)}}


Examples may focus on functional model, or on execution - in various contexts, things that are coroutines at their core are wrapped into subroutines, multithreading, multiprocessing, or such because that makes more sense in the immediate context.


Examples could be said to include
* many things that feel natural to place in separate threads
* pipes
* producer-consumer style communication (useful with threading, but also in non-threaded processes)


For contrast: when you want two pieces of code to cooperate, basic imperative programming gives you roughly three types of options:
* mutual recursion (may be problematic because of stack size)
* sequential calls with some agreed state (a lot of bookkeeping)
* threading (needs bookkeeping and locking, so is often overkill for the task, and hard to get race-free)
-->


====Continuations====

Informally, continuations are the abstract concept of 'the rest of a calculation'.


Sometimes as just a thought exercise, sometimes in a very duct-taped way,
but when we give it the name continuation we often have a more generic language/library/interface framework that somehow allows a chunk of nontrivial code to be chunked/paused, and continued in some way or other.
Ideally in a way that is ''not'' clunky to write or understand.


Examples will typically also tie in practical execution details,
like async, multithreading, multiprocessing.


For those familiar with [http://en.wikipedia.org/wiki/Generator_%28computer_science%29 generators], those are effectively continuation-style iterables.
Generators demonstrate one of the things continuations can be useful for: you can get work on demand - just the work necessary for the next yield - and you get such state without threading.


Continuations can also be useful for concurrency-geared control structures.
==Scaling, parallelizing and such==
<!--
Done for one or more of:
* performance of a single host
* scale a system to handle more data (e.g. get more (distributed) memory)
* scale a system to handle more work


It's basically never cheaper in terms of hardware,
but can be worth it for speed of tasks - that can be centrally important to you.


Performance on a single host:
* actually use multiple processors/cores available to you (under SMP or such). Can scale well, particularly if software is aware of it.
* Even on a single-core, single-CPU system, if a single process involves points of idleness that mean idle resources - even just two or three threads can mean resources are saturated


Also, tuning the amount of threads lets you balance response latency with response speed/throughput (assuming the CPU is the largest bottleneck).
-->

===Some inherent limitations and choices===
{{stub}}

Common fallacies of distributed computing (suggested by Deutsch and various others):
* bandwidth is infinite
* latency is zero<!--{{comment|(not the same, consider also switching)}}-->
* transport cost is zero
* the network is reliable
* the network is secure
* the network is homogeneous
* topology doesn't change
* there is one administrator

If you ''assume'' any of these, that may return as a fundamental limitation.

''Some'' have fairly easy, acceptable solutions - but often at some cost, possibly in one of the other listed aspects.


====Truism: Nothing saves you from not enough resources for your load====

When you are writing a low latency server, web- or otherwise:
: If you are accepting enough requests for more than one second of CPU work per second, nothing will save you.
: If you are accepting enough requests for more than one second of IO work per second, nothing will save you.


Varied cleverness is tradeoffs only, and often exceedingly basic ones.

If you have multiple cores, you probably want as many workers, to, you know, ''actually use them''.

Say, memcaches are primarily about storing what you don't need to calculate each time - because it's redundant to start with.

memcached may stick around 1ms ...but only because its operations are restricted to extremely basic things.


A lot of subtleties help some solid but small percentage.


====Serial processing can always be a limit====
<!--
There are dumber examples. If you want to delete a million files on a classical filesystem,
1000 times maybe 50ms for each is... almost a minute.

Not because it's hard,
but because that filesystem has probably defined that each queued operation must happen before the next,
to keep potential interactions well defined.


Say, listing 26 million things from a database will not likely be fast.

Not because moving an existing list of 26 million takes that long, not because it's 10 seconds of moving or thinking.

Because if you do 26 million things,
and they have to happen one after the other,
then each taking a respectable 0.4 microseconds still adds up to that much.


And no, parallelizing that doesn't necessarily help.
Distributing it doesn't necessarily help.

Both of those have more overhead.

That overhead stays relatively constant for ''even more'',
which means both that
it's probably always faster for billions of items,
and probably always slower for just tens of thousands.
-->


====The mobility gap====

Based on the observations that:

: Moore's law - processing growing exponentially {{comment|(currently it's often the one we have to think about least)}}

: Kryder's law - storage growing exponentially {{comment|(scaling takes some worry in relation to error+failure, but we've started dealing with that)}}

: Nielson's law - network speed growing more or less linearly.


Those combine to mean that over time, the way we are likely to design larger systems means
we spend more time ''sending'' data around instead of ''processing'' it.

We have called this the '''mobility gap'''.

Given the above patterns, this will probably continue to change how we solve problems.


One implication is that it is preferable to process data where it is currently stored.

Put another way, move programs to the data, rather than data to programs, which is why approaches like map-reduce became useful, even necessary:
it is becoming ''significantly'' cheaper to move programs to the data than to move data to programs.


====Brewer's CAP theorem====
See [[Data_consistency_and_versioning,_and_its_concepts#CAP]]


====Scope of consistency====


Example of a mutex, via POSIX threads:
<syntaxhighlight lang="c">
#include <pthread.h>

/* ... */
     return (ret);
}
</syntaxhighlight>
[https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutex_lock.html POSIX threading] actually defines some variants, including one that won't deadlock when you try to lock twice.
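
For example, a minimal sketch of that recursive variant (just an illustration of the attribute setup; error checking omitted):
<syntaxhighlight lang="c">
#include <pthread.h>

/* A recursive mutex: the same thread may lock it more than once without
   deadlocking itself (it must unlock it the same number of times).       */
static pthread_mutex_t rlock;

static void helper(void) {
    pthread_mutex_lock(&rlock);      /* already held by this thread - fine for a recursive mutex */
    /* ... */
    pthread_mutex_unlock(&rlock);
}

int main(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&rlock, &attr);

    pthread_mutex_lock(&rlock);
    helper();                        /* locks again, does not deadlock */
    pthread_mutex_unlock(&rlock);

    pthread_mutex_destroy(&rlock);
    pthread_mutexattr_destroy(&attr);
    return 0;
}
</syntaxhighlight>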


* <tt>-w ''n''</tt> is a timeout in seconds - after which to give up and exit

<syntaxhighlight lang="bash">
# in the background to have a thing to test against
flock      foo -c 'sleep 1h' &

# give up immediately
flock -n   foo -c true || echo nope

# give up after 5 seconds
flock -w 5 foo -c true || echo nope
</syntaxhighlight>




====Hello world====
A minimal hello world is: 
<syntaxhighlight lang="c">
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
   int ierr;

   ierr = MPI_Init(&argc, &argv);   // returns an error code; here we assume it won't have trouble
   printf("hello world\n");
   MPI_Finalize();                  // ditto on the error code
   return 0;
}
</syntaxhighlight>
As part of the init (which gets some details via the fact that you'll start it via {{inlinecode|mpirun}} with something like <tt>-np 4</tt>), one process takes care of starting the others, which means that all of them print <tt>hello world</tt>.



Latest revision as of 13:04, 16 May 2024

Locking, data versioning, concurrency, and larger-scale computing notes
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Glossary

Glossary of (concurrent) computation

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.





"Thread"

Thread of execution
Threads in a processor
The sidequest of superscalar details

Threading within OS concepts

Multithreading

Concurrency and parallelism

Coroutines and cooperative multitasking

async and await

async and await are syntactic sugar present in some languages, that often mean you're doing cooperative multitasking -- coroutine style programming.

...where the interpreter/compiler will be helping you judge correctness.


See e.g.

Glossary of exclusiveness and messiness

Atomic operations

Critical section

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Races

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


A race condition, a.k.a. a race, is when the behaviour of software (or electronics) depends on the timing of events outside of your control.

Often, "it depends on which part got there first, and there's no telling which one that will be".


For example, when multiple concurrent tasks (where having tasks be independent is part of the point) are accessing the same data (be it files, shared memory, or whatnot), in a non-isolated / non-atomic way, what will happen depends highly on the timing (and thereby ordering) of the operations of both.


Most code is written to assume it is alone and that the data can't change while it works on it.


Particularly around pre-emptive multi-tasking, that is an incorrect assumption. There is little telling where in the sequence of operations a task switch occurs, so there are many variants of interactions, making the possible errors hard to predict, or test for.


Depending on what is actually getting done, the result is anything from an incorrect result (e.g. two threads adding 1 to a counter can lose an update, so the count ends up too low), to a total system crash.
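
As a minimal sketch of that counter example (hypothetical code, not from any particular project; compile with -pthread): counter++ is a read-modify-write, so two unsynchronized threads can overwrite each other's update.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                      // shared, and not protected

static void *add_a_lot(void *arg) {
    for (int i = 0; i < 1000000; i++)
        counter++;                            // read, add, write back - not atomic
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, add_a_lot, NULL);
    pthread_create(&b, NULL, add_a_lot, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    // frequently prints less than 2000000, and a different number each run,
    // because updates that interleave get lost
    printf("%ld\n", counter);
    return 0;
}

Wrapping the increment in a mutex (or using an atomic add) makes the result deterministic again.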



Race conditions are most commonly solved by making a piece of code, or a resource, usable by only one thread at a time.

This is usually done with mutexes or semaphores. (...or critical sections, which suspend all threads but the executing one regardless of what they are doing. Which in ways is more of a hack than a solution).


If you do a lot of state sharing, note that there are ways to optimize this. For example, lock-free data structures are often also race-free, in ways that have less effect on performance than "lock everything" style solutions.


There are some libraries that can help you make code free from race conditions even with threading. There are even programming languages with semantics that do this (mostly). They still tend to have some tradeoff between simple to understand / fast / other things you may care about.



https://en.wikipedia.org/wiki/Race_condition

https://wiki.c2.com/?RaceCondition

Thread-(un)safe

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Thread safety, intuitively, asks "can you access the same thing from multiple threads without that being the reason things break?".


It is technically more precise to call this concurrency-safe, because this can happen anytime you have more than one 'thread of execution', in the generic-rather-than-just-multihreading sense.

That said, it is so much easier to break things in more magical ways with threads than with most any other type of concurrency. Not because of threads per se, but because of what you do: start with shared everything, zero safety, and begrudgingly add a bit of protection. Other methods share less, or tell you what good style is. Threads just throw you in the deep end.


This is primarily about

state belonging to each thread of execution should stay isolated to that thread (as much as you sensibly can)
any state you must share between threads will be accessed in a sequential way (or otherwise safe)


This includes, but is not limited to, cases where two threads call the same code, and that code keeps its own state - see re-entrant code for relevant issues.


Deadlock, livelock

Deadlock and livelock are both situations where the locking happening in two or more threads is the reason things are blocked from getting done.


A deadlock is when threads end up waiting for the other(s), indefinitely.

The typical example is one with two locks, where the order of locking them is the problem:

Consider

two threads, call them P1 and P2
two locks (protecting two resources), call them A and B
the code in both threads is "lock one, lock the other, do work, unlock both"

Because the timing of what the threads do is unrelated, a lot of the possible orders can happen. You could write them out of you want a sense of completeness, but perhaps most interesting is one of the problem cases:

P1 locks A, nothing wrong with that
P2 locks B, nothing wrong with that
P2 locks A - which is currently already locked so we start waiting (for P1) to release it
P1 locks B - which is currently already locked so we start waiting (for P2) to release it

There is nothing wrong with each individual action, except that in this specific sequence of events,

both are now waiting indefinitely on the other, before they can start work.
neither will release locks until they're done with that work (which we just mentioned neither will start with)
This can never be resolved, so both will wait indefinitely - this is a deadlock.
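
A sketch of that sequence with two pthread mutexes (hypothetical; the sleep just makes the unlucky interleaving likely rather than occasional):

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;

static void *p1(void *arg) {
    pthread_mutex_lock(&A);     // P1 locks A
    sleep(1);                   // give P2 time to lock B
    pthread_mutex_lock(&B);     // waits forever: P2 holds B and is waiting for A
    pthread_mutex_unlock(&B);
    pthread_mutex_unlock(&A);
    return NULL;
}

static void *p2(void *arg) {
    pthread_mutex_lock(&B);     // P2 locks B
    sleep(1);
    pthread_mutex_lock(&A);     // waits forever: P1 holds A and is waiting for B
    pthread_mutex_unlock(&A);
    pthread_mutex_unlock(&B);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);     // never returns - both threads are stuck
    pthread_join(t2, NULL);
    return 0;
}

Having both threads take the locks in the same order (always A before B) removes this particular deadlock.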


There are ways to detect deadlocks ahead of time.


There are other ways to avoid them that are sometimes simpler.

...including less finely grained locks.

Consider a single lock that protects both mentioned resources at the same time.

For the above case ("lock one, lock the other, do work, unlock both") this is actually equivalent -- the processes make their work purely sequential

In the general case, it is easily less efficient to use one big lock, because things that don't even touch the resources may now be waiting on each other.

Yet it's better than hanging forever.

And often it's not too hard to think something up that is safe but doesn't defeat the concurrency entirely.


Other deadlocks are resolved for you. Say, relational databases will typically contain a lot of logic that basically means that if two transactions happen to deadlock, one will be rolled back.


Specific deadlock situations may be easy enough to detect happening ahead of time, but this requires knowledge of the situation, its locks, and how to write such checks without race conditions.


Sometimes there's a cheat: use locks with timeouts and make your code retry. ...but this doesn't scale well at all, and even at smaller scale it can cause other problems, like livelocks instead.


When you get to design the locking (e.g. in your own programs, rather than in a given database), consider different approaches to locking.


There has been plenty of research into lock-free data structures, cleverer lock relief, and such.



A livelock refers to situations where there is regular locking and unlocking going on, but the system is trapped in a limited subset of its states and it may not get out of this subset within acceptable time, or possibly even at all.

Often these states are functional/bookkeeping states between actual work, so it's not doing anything useful either. Meaning you can see it as a type of starvation, and arguably practically equivalent to a deadlock.

It can happen when the concurrency model is loose and/or complex, but exclusive locking is still an important part of it.

Livelocks may be harder to predict, detect, and may depend on secondary effects like timing and amount of load.


Starvation

(Lock) contention, false contention, lock convoys

Thundering herd

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Locks and lock-like implementations

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
Lock hierarchies

Re-entrant code

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Re-entrant code may be defined most by the idea that code can be called when another call on that code is already underway

...re-entrant literally just refers to entering that code again.


This can still be meant in a few ways, and you usually hear a slightly more specific term/property (though they are closely related), like:

  • is safe to pause and continue execution later
  • can be safely called before a previous invocation has completed
  • code that can be called by multiple threads of execution
  • execution that keeps only local, static, non-shared state

...most of which don't so much define what re-entrancy is (these are e.g. also properties you care about around recursive functions), as it is thinking about a specific ways to (not) violate it.


And yes, in most cases, most of what ensures re-entrancy is about not sharing state with other execution (e.g. keeping your state to parameters and local variables). But

the case-specific details are also important for a more complete grasp.
you sometimes have to share state (a lot of heap-allocated state is shared), and it is your code that should think about ensuring re-entrancy


Recursion may be the easiest introduction, because it illustrates that the state of each call into a function can (and should) be kept entirely distinct from every other (each with its own scope based on its own stack frame), and that you don't need concurrency (that you don't control) to worry about re-entrancy.


Another example is the C library function strtok, which lets you split a string by a delimiter. It was written so that repeat calls give you just the next token (until there are no more), and the original version would basically remember where it was in the string by itself, making repeat calls easier. That memory is however global state, which means that you cannot start tokenizing a new string until you're done with the last. The variant strtok_r basically makes you remember the state (it has an extra parameter). It's a little more typing, but this makes it re-entrant.
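
A small sketch of the difference: the outer and inner loops below each keep their own save pointer, which plain strtok's single hidden state could not do.

#include <stdio.h>
#include <string.h>

int main(void) {
    char lines[] = "a,b;c,d";
    char *outer_save, *inner_save;

    // tokenize on ';', and within each piece on ',' - two tokenizations in flight at once
    for (char *part = strtok_r(lines, ";", &outer_save); part != NULL;
               part = strtok_r(NULL,  ";", &outer_save)) {
        for (char *field = strtok_r(part, ",", &inner_save); field != NULL;
                   field = strtok_r(NULL, ",", &inner_save)) {
            printf("%s\n", field);
        }
    }
    return 0;
}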


But beyond that, re-entrancy is also a quality you care about in a system that does concurrency (multitasking, threading, but also concurrency that is purely sequential and cooperative), and is also relevant around hardware interrupts, callbacks that may fire quickly (e.g. on a timer) and may overlap, and some other things.

Each of those are their own topic, because what makes code unsafe in each context is specific, as are the solutions (e.g. hardware interrupts have some unique worries).


Around concurrency, your largest worries are often

  • not using shared or global state
  • atomic changes, because races are an easy way to make something violate re-entrancy





Re-entrant code is intuitively related to thread safety, because both are about doing things safely with probably-shared state.

However,

  • code can be re-entrant but not thread-safe
in part because re-entrant can be about safety within just one thread.
  • code can be thread-safe without being re-entrant
e.g. when the violation doesn't matter - e.g. use of random_r() is important for fully reproducible results, but in a bunch of cases random() will be acceptable

...or both, or neither.

There are some good examples here: https://stackoverflow.com/questions/856823/threadsafe-vs-re-entrant


That said, a lot of code either

doesn't have to worry about either, or
makes it easy, or hard-but-important, to do both




Split-brain

Split-brain describes a situation of inconsistencies between parts of a larger system.


It is sometimes used for intentional design, e.g. private-network DNS and outside-facing DNS giving different answers.

...but much more usually to cases like

  • distributed services failing to keep synchronized
  • synchronization not happening properly around times of failover.
  • problems in failover logic, like multiple servers believing they should be the new master (often takes relatively pathological cases)


Dealing with split-brain often involves a tradeoff between availability and consistency.


https://en.wikipedia.org/wiki/Split-brain_(computing)

Glossary of scaling

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Types of scaling (note that this has become a bit more of a continuum in the multi-core, multi-CPU, and have-GPU world)

  • horizontal scaling, a.k.a. scaling out:
more boxes: adding more computers to do parts of a larger job
...either all with the same information, and a knowledge of who works on what,
...or horizontal partitioning , a.k.a. sharding, to distribute into entirely distinct, self-contained jobs


  • vertical scaling, a.k.a. scaling up:
bigger boxes: adding more processing power, e.g. better/more CPUs or GPUs, to the same computer
In comparison to horizontal scaling, this tends to be
faster up to some point (due to local interconnects being faster than most networking - but this is also why specialized networking blurs this line)
often pricier for the same compute, so only worth it for some types of jobs (e.g. those that require a lot of computation and near-zero latency between parts)
In practice, you'd do a cost-benefit estimation of how far to scale up each node before you start scaling out
sometimes a little more stable, in that a single node or single network connection can't trip up large jobs
At some point, speed of interconnection becomes the main issue.
particularly when you need unusually-high-speed/low-latency networking.



  • cluster computing versus grid computing is not a clear distinction, though in general the associations are:
a cluster is more homogeneous, i.e. mostly identical hardware
a grid heterogeneous, "whatever you have"
a cluster is often a tightly coupled, predetermined set of computers, and typically all in the same rack(s)
a grid is often more loosely coupled and more widely spread
a cluster's jobs are often centralized and synchronized,
a grid's jobs may be more loose/flexible/dynamic/automatic (...by necessity)
a cluster is more likely to have a central node give jobs to the workers
a grid is more likely to have each client ask for work from the central node


"Decentralized"

ELSEWHERE

Glossary - further concepts

Coroutines

Continuations

Informally, continuations are the abstract concept of 'the rest of a calculation'.


Sometimes as just a thought exercise, sometimes in a very duct-taped way, but when we give it the name continuation we often have a more generic language/library/interface framework that somehow allows a chunk of nontrivial code to be chunked/paused, and continued in some way or other.

Ideally in a way that is not clunky to write or understand.


Examples will typically also tie in practical execution details, like async, multithreading, multiprocessing,


For those familiar with generators, those are effectively continuation-style iterables.

Generators demonstrate one of the things continuations can be useful for: you can get work on demand - just the work necessary for the next yield - and you get such state without threading.


Continuations can also be useful for concurrency-geared control structures.

Scaling, parallelizing and such

Some inherent limitations and choices

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Common fallacies of distributed computing (suggested by Deutsch and various others):

  • bandwidth is infinite
  • latency is zero
  • transport cost is zero
  • the network is reliable
  • the network is secure
  • the network is homogeneous
  • topology doesn't change
  • there is one administrator

If you assume any of these, that may return as a fundamental limitation.

Some fairly easy acceptable solutions - but often at some cost, possibly in one of the other listed aspects.


Truism: Nothing saves you from not enough resources for your load

When you are writing a low latency server, web- or otherwise

If you are accepting enough requests for more than one second of CPU work per second, nothing will save you.
If you are accepting enough requests for more than one second of IO work per second, nothing will save you.


Varied cleverness is tradeoffs only, and often exceedingly basic ones.

If you have multiple cores, you probably want as many workers, to you know, actually use them.

Say, memcaches are primarily about storing what you don't need to calculate each time - because it's redundant to start with.

memcached may stick around 1ms ...but only because its operations are restricted to extremely basic things.


A lot of subtleties help some solid but small percentage


Serial processing can always be a limit

The mobility gap

Based on the observations that:

Moore's law - processing growing exponentially (currently it's often the one we have to think about least)
Kryder's law - storage growing exponentially (scaling takes some worry in relation to error+failure, but we've started dealing with that),
Nielson's law - network speed growing more or less linearly.


Those combine to mean that over time, the way we are likely to design larger systems means we spend more time sending data around instead of processing it.


We have called this the mobility gap.

Given the above patterns, this will probably continue to change how we solve problems.


Which implies it is preferable to process it where it is currently stored.

Put another way, move programs to the data, rather than data to programs, which is why approaches like map-reduce became useful, even necessary.

Brewer's CAP theorem

See Data_consistency_and_versioning,_and_its_concepts#CAP

Scope of consistency

Consensus

Offline mode

Amdahl's law

See also:

Storage

Flavours

Further considerations

On locks

Re-entrant lock

Critical sections

Mutex
Futex

'Fast Userspace Mutex' (appeared in linux 2.6 kernels)

The idea behind futexes is that when there is no contention on the futex, things can be resolved in userspace library code rather than in the kernel, so in that case it is faster than a mutex, which always incurs a system call.


In the case there is contention on the futex, arbitration is assisted by a kernel-space wait queue like any mutex.

In other words, for low-volume locking it is often more efficient.

Spinlock, busy waiting

Semaphore

higher-level constructs, communication models, and such

Granularity of locking

Opportunistic locking

A Microsoft name, mostly for how it does client locking in SMB/CIFS context.


This seems to be a server-mediated lock that counts on there being one main user for a file, and does some signalling whenever there are more.

This lets the first user cache contents locally (and buffer changes) as if it had a read/write lock on the server. More complex lock/cache semantics only happens when additional users come in on the same file.

Those semantics are roughly that the second client requests a lock break, which means it asks the server to ask the first client to write its changes. (what's the state of the locks after that, though?)


https://web.archive.org/web/20190110123229/https://docs.microsoft.com/en-us/windows-hardware/drivers/ifs/overview

https://msdn.microsoft.com/en-us/library/windows/desktop/aa365433(v=vs.85).aspx

https://www.samba.org/samba/docs/old/Samba3-HOWTO/locking.html

Optimistic locking, pessimistic locking

Optimistic locking means you start working under the assumption that most of the time, there is no conflict.


For example, you might

  • fetch a record (plus something that lets you check correctness later, like a version or timestamp),
  • figure out what the new record would be
  • check the changes are valid and save
If that checkable thing checks out (e.g. that record did not change its version or timestamp), you can just replace it.
If not, you decide what you should do instead -- and this may be as complex as any rollback or merge logic.


The difference is that instead of putting a lock before all that and an unlock after, now only that last 'check and save' needs to be locked.

If it is rare that this checkable thing fails, you can avoid a lot of locking, and you might avoid lock convoys (e.g. when operations are quick and small, compared to the locking operation).


That said, in uses where you typically do run into conflicts, this will be no better (still creating lock convoys), and potentially worse (e.g. when there's repeated redoing).

This does not distribute or scale very well. There are similar tricks to this in scaling, but their implementation is specific.


Optimistic locking is a good idea if you know the access patterns well. For example, "update our search index every hour" is likely to have few or no such collisions because it's essentially a sequential batch job.
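
As a rough sketch of the check-and-save step (hypothetical record layout; a database would do this for you, e.g. via an UPDATE that matches on the old version):

#include <pthread.h>

struct record {
    pthread_mutex_t lock;     // only held briefly, for the check-and-save
    unsigned        version;  // bumped on every successful save
    int             value;
};

// Optimistic update: 'new_value' was computed outside the lock, based on a read
// that also noted 'version_we_read'.  Returns 1 if saved, 0 on conflict.
int try_save(struct record *r, unsigned version_we_read, int new_value) {
    int ok = 0;
    pthread_mutex_lock(&r->lock);            // short critical section
    if (r->version == version_we_read) {     // nobody changed it since we read it
        r->value = new_value;
        r->version++;
        ok = 1;
    }
    pthread_mutex_unlock(&r->lock);
    return ok;
}

int main(void) {
    struct record r = { PTHREAD_MUTEX_INITIALIZER, 0, 41 };

    // a client read version 0 and value 41, and computed 42 outside any lock...
    if (!try_save(&r, 0, 42)) {
        // ...and would re-fetch, redo or merge, and retry here if someone got in between
    }
    return 0;
}

The point is that only the short check-and-save is locked; computing the new value happened outside any lock, and a return of 0 is the cue to decide what to do instead.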



Pessimistic locking is, then, roughly the assumption that collisions are either common or troublesome enough (consider also races, dirty reads) that locking always is the better choice, leading to behaviour that may not be the fastest, but is easy to prove the correctness of.

It's the right thing to do for strict consistency management (personal information, financial transactions), but can kill performance for sometimes no good reason.

Advisory and mandatory locks

Advisory locks are those that are not enforced.

Each relevant application chooses to observe them correctly - or may not.

Whereas mandatory locks are enforced for you.


Advisory locks are useful when

  • there is no easy way to implement proper locking, e.g. you are a random program wanting to exclude another copy of yourself (and you can't count on kernel or services features being there)
...but you do want to try your best regardless
  • you want pessimistic locking in a context that allows optimistic locking


An advisory lock can be as simple as "a file existing here", though because this implies no-transaction context, this allows races and other edge cases.

Lock files and file locks

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
Some package/project managers use a something.lock file not in the sense of exclusive access as detailed below, but to mean "we would like you to install this version of a dependent package" - to lock you to that version (e.g. Pipfile.lock, poetry.lock). This is entirely unrelated.


Lock files are a mechanism of advisory locks, often using files with predictable names as markers for what is in use.

For example, a program can create a file foo.lock to signal that foo is currently in use. (such a file does not need any contents, though sometimes it might contain a process ID to get an idea of who is using it)

Any other programs that wish to observe this convention can detect this, and e.g. delay opening that file until that lock is gone.

Since it's typically polled, it is not free of race conditions, nor very fast, but for a number of applications it's already a lot better than just assuming you're the only instance (and clobbering each other's work, editing the same file, or such).
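
A minimal sketch of taking such an advisory lock, assuming the convention is simply 'a file named foo.lock exists' (O_EXCL makes create-if-absent atomic on a local filesystem):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // O_CREAT|O_EXCL fails if foo.lock already exists, so creation doubles as the check
    int fd = open("foo.lock", O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0) {
        fprintf(stderr, "foo appears to be in use, try again later\n");
        return 1;
    }
    dprintf(fd, "%ld\n", (long)getpid());   // optional: record who holds it
    close(fd);

    // ... work on foo ...

    unlink("foo.lock");                     // release: remove the marker
    return 0;
}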


File locks usually refer to mandatory locks that you can make on filenames, which make certain file operations block (for other processes) until released.

This is often done at OS level, potentially only on specific file systems.

Doing this on networked filesystems requires distributed lock management, which is easily too inefficient or too fragile and doesn't scale well -- so is often not supported until you do extra work.

There are cases in which it works well, there are cases in which it really amounts to advisory locking and you might be better off using actual advisory locking like lock files, there are cases for which you want to build something cleverer.



  • *nix file locking
  • POSIX file locks: fcntl(F_SETLK) (see the sketch below)
      upsides:
          portable
          can work across NFS
          byte-range locking
      downsides:
          tied to processes, not file descriptors - meaning it doesn't solve much around threading
          not secure, e.g. released on close(), which can mean embedded code could break your locking logic
  • BSD flock
      locks tied to file descriptors, not processes
      upsides:
          safer around threads, useful around forks
      downsides:
          not secure
          not useful across NFS
          less portable than POSIX
          on linux may actually be implemented using POSIX(verify)
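
For reference, a sketch of the POSIX (fcntl) variant mentioned above, taking a write lock on a whole file (error handling trimmed, and assuming somefile exists):

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("somefile", O_RDWR);

    struct flock fl = {0};
    fl.l_type   = F_WRLCK;    // exclusive (write) lock; F_RDLCK would be shared
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0;
    fl.l_len    = 0;          // 0 means "to end of file" - this is the byte-range part

    fcntl(fd, F_SETLKW, &fl); // the W variant waits; F_SETLK would fail immediately instead

    // ... exclusive access to that range ...

    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &fl);  // release (also released implicitly on close/exit)
    close(fd);
    return 0;
}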

inter-process locks

Advisory locking is typically done via lock files; these are not very strict, but may be good enough for any purpose where apps want to cooperate.


Mandatory locks are possible but tend to be less portable (...fcntl and/or mount options).


Process-shared mutex


named mutex/semaphores


IPC semaphores


SHM mechanisms

flock notes
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

The flock command wraps flock() syscalls around commands that it runs, which means such commands on the same lockpath will run in series.

Example:

flock /tmp/sleepdemo sh -c 'echo Start;sleep 4;echo Done' &
flock /tmp/sleepdemo sh -c 'echo Start;sleep 4;echo Done' &
flock /tmp/sleepdemo sh -c 'echo Start;sleep 4;echo Done' &
flock /tmp/sleepdemo sh -c 'echo Start;sleep 4;echo Done' &

A more practical example might be to make cron jobs that are sometimes long-running not overlap (though note that if they always are, they would seriously queue instead).



If you want to test for a lock being present (rather than unconditionally queue a command), then you want either:

  • -n means nonblocking - exit immediately, instead of waiting indefinitely
  • -w n is a timeout in seconds - after which to give up and exit
# in the background to have a thing to test against
flock      foo -c 'sleep 1h' &
 
# give up immediately
flock -n   foo -c true || echo nope
 
# give up after 5 seconds
flock -w 5 foo -c true || echo nope


Notes:

  • will create nonexisting lock files
  • Without a command, you'll get a "flock: bad file descriptor" - why? (verify)
  • many but not all filesystems support this
You can see what's holding a lock on a file with something like fuser -v foo

Non-blocking, lock-free, wait-free

On data

Communicating

With any sort of parallel communication, atomic communication of data can be a problem, particularly whenever threads are involved.

A now fairly common solution is to model demand/supply in some way, commonly via queues and/or generators. Such a setup forces the programmer to read and write packages of information at a time, and can allow lazy creation.

This may imply that you should always communicate copies of data, and not heap references.

When handling large amounts of data and you wish to avoid unnecessary copies (databases would, for example) you could probably get away with MVCC-style setup, but it does mean a lot of management even if it avoids locking.


See how things are handled e.g. in

Versioning

lock-edit-commit-unlock

A resource is locked as a whole.


The easiest to code.


Also tends to scale badly, because things are limited by the speed the lock can be locked and unlocked.


Latency reduces the maximum rate, and things will sit around waiting, leading to lock contention.

(The way they wait can also be relevant, see e.g. polling/spinlocks versus event-driven things)


Typically the same actor that locked it must unlock it.

...so without timeouts, this can leave the resource unusable.


There are also potential deadlocks if actors want to lock multiple things.


Note that instead of an exclusive lock, you could use an exclusive write lock: others can read, but none should/can write without a lock. This is a little harder to code in that you may now have to deal with atomicity of the changes.
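
A sketch of that read/write split with a pthreads rwlock (hypothetical shared value; many readers may hold the lock at once, writers get it exclusively):

#include <pthread.h>

static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
static int shared_value = 0;

int read_value(void) {
    pthread_rwlock_rdlock(&rw);      // many readers can be inside at the same time
    int v = shared_value;
    pthread_rwlock_unlock(&rw);
    return v;
}

void write_value(int v) {
    pthread_rwlock_wrlock(&rw);      // waits until there are no readers or writers
    shared_value = v;                // the change is now trivially atomic as seen by readers
    pthread_rwlock_unlock(&rw);
}

int main(void) {
    write_value(1);
    return read_value() == 1 ? 0 : 1;
}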

branch-and-merge, copy-edit-replace

Lockless, so people/processes don't lock each other while they work.

A copy(/branch; the difference is mostly in terminology and housekeeping) is edited, and the changes are re-committed.


Can be useful when different alterations are likely to be independent somehow - different files, different regions in text files, or such.

Often relies on (mostly-)automatic conflict resolution (...partly because manually resolving such problems is probably a time loss comparable to waiting for a lock).

'Just use the last version' may be good enough in some situations -- but that would mostly just replace with a single version and throw away all others, which is only a satisfying solution, well, when it makes sense in the particular application.


MVCC

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


tl;dr:

  • an alternative to aggressive locking - but cannot completely remove the need for locks
  • tends to avoid reading clients blocking writing clients, and the other way around
so e.g. read-heavy workloads tend to wait less
  • can not prevent the occurence of deadlocks
though you can informedly make them less likely


Multiversion Concurrency Control (MVCC) steps away from handling data both monolithically and purely sequentially.

Its focus is on isolating distinct versions of data, and often works out as each client seeing/handling a consistent snapshot of a certain unit of data (where that unit may be a database row, a file, or such) within a certain transaction (e.g. a database transaction, a file operation, etc.).

Conceptually, one client may be working on creating a new version while everything else still sees the current/old version, and only at commit time do we need some detailed handling, not during that transaction.


Details vary with use case (e.g. databases, files, etc), and even with implementation within a use case (e.g. there are a few different MVCC systems in RDBMS systems)


In some applications, it can work out as a variant of copy-edit-replace which is aware of transactions and basic data modeling.


In some situations, for example where commits on the same database tables aren't on the same rows (e.g. different people updating their own profile text) or not conflicting in other ways (say, someone clicking 'set name' on their user profile twice), it can be a perfectly useful middle way between the slowness of locking the entire table for each such operation, and the possible inconsistency of dirty reads and such.


MVCC works well in mostly-read, occasional-write, particularly when there is no strict necessity to see the very latest data at the earliest possible time, as you can serve the latest committed version without delay without it being potentially inconsistent.

From that perspective it provides lockless versioning. If properly set up, it need not break (m)any other guarantees.


When there are multiple writing clients at the same time, things depend on decisions. You can say that the one that committed last becomes current, and the rest is old - it is usually fairly trivial to decide which of these is old, though you can make this cleverer and more complex if necessary.

This may sound like a bad idea, but consider that sequentially handling commits that were isolated alterations on the same basic data amounts to the same idea but with the latest commit time being decisive. In some situations you really want these operations to be serialized, in others it really doesn't matter.


MVCC is not a very good idea when there are dependencies you can break - but that's arguably a misapplication or a fault in modelling, or code - not locking where locking is necessary under most every underlying model.

Note that it certainly matters what your unit of data encompasses (and doesn't), and also whether there is transactioning around it (which may make overall management and cleaning of old versions a little harder - see e.g. the guts of Postgresql's MVCC).


APIs

Larger-scale computing

The Seven Fallacies of Distributed Computing:  Reliable delivery;  Zero latency; Infinite bandwidth;  Secure transmissions;  Stable topology;  Single administrator;  Zero cost. --Peter Deutsch


There recently has been a general move to get more processing power by interconnecting many machines, either in your own cluster or in a cloud service. As opposed to putting more power into a single monolithic - and often disproportionately pricy - workhorse.


Which of the specific options is best depends on many things. You can think and read and write about this one for years, but a little common sense thinking about effective bottlenecks goes a long way.


Threading versus message passing

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Broadly speaking, there are two approaches to communication in parallel execution: threading + shared-memory techniques, and message passing.


Multiprocessing models:

  • UMA (Uniform Memory Access), commonly as SMP:
processor owns no memory (other than its cache), and all view the same memory
e.g. a single bus from CPU to memory controller
speed very predictable, no special planning necessary
...but doesn't scale well beyond a handful of cores
latency ~10ns(verify)
  • NUMA (Non-Uniform Memory Access): processor owns its own memory, and can view that of others more slowly
e.g. HT and QPI mean each CPU talks directly to its own physical memory (yes, the memory controller is in the CPU now), and interchanges internally when necessary.
NUMA allows further (vertical) scaling than strict UMA
but varying distance/time to different memory and dealing with ownership means access speeds vary more. The averages are higher than in UMA/SMP (though in theory scales somewhat better)
cache coherency is more complicated (and slower), which stops the scalability at some point
latency ~100ns(verify)
  • Cluster: nodes owns their own memory, cannot address that of others
(at opcode level; sofware does it instead)
latency ~1000ns or a bunch of multiples more(verify)

Horizontal scaling inevitably slides down this scale.


To quickly contrast:

When the same memory is mapped by multiple threads of execution (which here includes processes), it

  • allows no-copy sharing
  • requires explicit synchronization

On a single workstation/server (UMA/SMP, or small NUMA via HT/QPI), transfer costs are low and predictable, which makes this more efficient than cluster.

It can be error-prone to have a basis that never synchronizes/locks until you realize you need it (though there are now languages with direct awareness of threading safeness, and better threading libraries for those that are not).


Note that UMA and NUMA can be mixed, e.g. on double-CPU motherboards.


Some larger systems still manage to be NUMA fairly efficiently (e.g. being aware of its own topology).

...but at some point you have to give up and break from it, and effectively go cluster instead.


At this point, message passing becomes preferable.

Data is always shared via a send/copy - which usually is the case anyway (and can sometimes be upgraded to no-copy within the same host under certain conditions)

Message passing is cooperative (a send has to be received), synchronization is implicit with complete sends (avoiding many race conditions).

Since data serialization is often part of the protocol, this is easily more portable.

...and many of these details can make the overhead higher. How much higher depends on how much you work with it.

Single-host computing

Distributed, parallel, cluster, grid

Memory issues

Commoditized computing (a.k.a. cloud)

See Cloudy notes

Parallel processing libraries

OpenMP

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

OpenMP targets shared-memory systems (only) - so in some ways is just a portable alternative, or wrapper, to pthreads / windows threads.

A compiler-level (C, C++, FORTRAN) thing that decently abstracts, so makes it fairly easy to safely access data structures, make syncing easier, do parallel sections, parallel loops, and such.
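
A minimal sketch of the parallel-loop case (built with the compiler's OpenMP flag, e.g. -fopenmp; the reduction clause is what keeps the shared sum safe):

#include <stdio.h>
#include <omp.h>

int main(void) {
    double sum = 0.0;

    // the compiler splits the iterations over a team of threads,
    // and merges the per-thread partial sums at the end
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000000; i++)
        sum += i * 0.001;

    printf("%f (computed with up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}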


Not to be confused with (Open)MPI.

MPI

MPI is a spec, aimed at coordinating a bunch of processes, and optionally any transfers between them.

More technically, MPI is just one of many message passing specifications/implementations.


Implementations include OpenMPI, MPICH, HP-MPI, and others - see [1].

MPI is a little more language/environment-agnostic than some alternatives - if a little overdesigned (to the point that sometimes makes it a bit hairy to use, though usually just in terms of config that are for your sysadmin to worry about).


When/why MPI?

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Some notes on MPI structure and jargon

Examples

Hello world

A minimal hello world is:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
  int ierr;

  ierr = MPI_Init(&argc, &argv);   // returns an error code; here we assume it won't have trouble
  printf("hello world\n");
  MPI_Finalize();                  // ditto on the error code
  return 0;
}

As part of the init (which gets some details via the fact that you'll start it via mpirun with something like -np 4), one process takes care of starting the others, which means that all of them print hello world.


When you want to tell the processes apart, or communicate, you'll probably want to fetch and report the rank (and often size)
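
For example, a sketch along the same lines as the hello world above:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which process am I?
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many of us are there?

    printf("hello from process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}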

Distributing jobs

MPI config trouble

A requested component was not found, or was unable to be opened

At different levels, the lines

ompi_rte_init failed --> Returned "Not found" (-13) 
ORTE_ERROR_LOG: Not found in file runtime/orte_init.c

seem to mean the same thing.


ompi_info should list at least a few dozen MCA lines. If it shows only a few, and the following error, then you've probably got a mix of openmpi versions installed, and paths (binary/library) effectively mixing them.

A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      myhostname
Framework: crs
Component: none

However, this is not always a good test when specific software may do things in its own environment.

Keyval parser: error 1 reading file mpicc-wrapper-data.txt

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Seems to not be fatal (compilation works), but presumably it won't use the configuration.


Seems to typically be a bad openmpi config (possibly because of updates, mixing versions, etc). I have no idea in which way it's bad, and it seems that since the problem is rare no one has really figured that out.

A fairly thorough way to fix it is to uninstall MPI, clean any leftovers (e.g. anything that locate -e openmpi mentions that looks like config files), and reinstall(/rebuild).


Runtime MPI Errors

For help debugging programs using MPI, see http://www.open-mpi.org/faq/?category=debugging and similar, and some of the below.


mca_btl_tcp_endpoint_complete_connect] connect() to ipaddress failed: No route to host

Boils down to a generic TCP "No route to host" error (which is what errno 113 refers to).

Check NIC status and link lights (e.g. ethtool).

Check routing rules (e.g. netstat -r).


Error 137

Seems to be a generic out-of-resource error.

In our case it came from a process that allocated more than it should have, and was oom_killed


MPI_ABORT was invoked on rank 0 in communicator mpi_comm_world

Regularly


See also

Notes on threading implementations (lower levels)

tl;dr

windows and posix threads

On hardware

APIs and implementations

On threads and cores

Notes on thinking about threads and other parallelism (higher level)

Unsorted