Benchmarking, performance testing, load testing, stress testing, etc.

Some names

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Load testing is a name that groups various others, broadly meaning 'put load on a system and see what happens'. Often your goal is more specific, and so is the name you use for it.


  • stress testing: find if it stays well behaved under heavy load
e.g. discovering issues related to resource contention, memory leaks, timeouts, overflows, races, inelegant failure, etc.
  • longevity/endurance testing: find if it stays well-behaved over time, probably with a constant load of jobs
(doesn't necessarily test stress conditions)
should e.g. reveal memory leaks, things like counter overflow bugs
  • soak testing: similar to longevity testing, but with a realistic production load
see if that load reveals bugs that targeted tests do not - like resource leaks
  • performance testing: push it until it won't go faster, to see how fast that is, often related to:
finding the minimum expectable time to handle any unit of work (e.g. with near-empty tasks)
finding whether there are unusual variations in that time, and if so where they come from
finding the maximum rate the system handles for real-world load
check whether you get reasonable response under expectable average load (black box style - not yet caring why or why not)
finding the maximum rate the system handles under unusually complex load
finding which parts are the bottleneck
also to eliminate or relieve them where possible (not so black-box; may involve a lot of detailed system diagnosis)
finding a baseline speed to compare against in future tests, to see changes in speed
  • see what effect code/feature changes have on any of this


  • One useful distinction:
performance testing is seeing how fast it can go,
stress testing is seeing whether it does something unexpected while doing that.
  • The "how fast can it go" part is, compared to most other tests, more of an artform to do meaningfully, because specific benchmarks represent specific uses, carry hidden assumptions, are often unrealistic workloads, and don't test degradation under load, integrated parts, etc.
(In some cases done primarily to please your PR department - "Lies, damned lies, and benchmarks")

Longevity/endurance testing, soak testing

Longevity testing means running the system for a longer time, to evaluate system stability under production use.

It often means giving it a basic load to deal with during that time, not to stress it but enough to e.g. reveal memory leaks within reasonable time.


  • Endurance testing, capacity testing, and some other names are basically the same thing.
  • in modern release-often cycles, this is often replaced with some extrapolation and assumptions
or, if the only purpose is finding memory leaks, a test that tries to evoke them specifically

Soak testing is very similar in that it also asks you to run it for a long time if you can, but specifically asks you to make the load simulate real production load: not necessarily a stress test, but enough load to reveal issues like memory leaks and hidden problems in transaction processing.

The term seems borrowed from electronics, where it means running a part beyond e.g. its temperature ratings for a while to see when and how it fails.

You might go further and think about the contents of the tasks you submit, e.g. add some repetition and some fuzz testing, to reveal hotspots and such.

Stress testing, recovery testing

Stress testing is basically seeing if you can break the system by attempting to overwhelm its resources.

Often done to see whether a system will merely slow, fail on specific tasks, or fall over in general.

Recovery testing often amounts to stress testing plus seeing what happens when resources are taken away, simulating partial failure. Consider not only CPU but also IO, services, connectivity, memory.

Useful in general but perhaps more so in cloudy arrangements.

Often done to see

  • whether it will recover from specific resource failures at all - seeing if it becomes slower (preferable) or just falls over.
  • whether stress during recovery will lead to repeated failure

Common pitfalls in benchmarking

Measuring latency / client rate when you wanted to measure system rate

Network clients that do things sequentially are implicitly limited by network latency, because that is the amount of time they will sit around just waiting.

So a client's latency is often the limiting factor to a single client's interaction rate, but in a (properly written) server this says nothing about what the server can do.

It may be that any single client can get no more than 50 req/s (because it was elsewhere on the internet), but the system you're requesting things from could reach 5000 req/s if only you pointed enough distinct clients at it.
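As a back-of-napkin sketch of that arithmetic: the 20ms round trip below is an assumed figure (picked because it gives the 50 req/s mentioned above), and both function names are made up for illustration.

```python
import math

def sequential_client_rate(round_trip_ms):
    """Upper bound on requests/second for one client that waits for each reply."""
    return 1000 / round_trip_ms

def clients_needed(target_rate, round_trip_ms):
    """How many such sequential clients it takes to offer target_rate req/s."""
    return math.ceil(target_rate * round_trip_ms / 1000)

round_trip_ms = 20  # assumed: 20ms between this client and the server

print(sequential_client_rate(round_trip_ms))   # what one client can ever see
print(clients_needed(5000, round_trip_ms))     # clients needed to offer 5000 req/s
```

The point is only that the first number describes the network path, not the server. To learn the server's limit you need enough concurrent clients that their combined offered rate exceeds it.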

Measuring rate when you wanted to measure latency

Say you do 20 small writes to a hard drive.

End-to-end, the act of going to platter disk might take 10ms for each operation.

Yet if your set of writes was optimizable (e.g. very sequential in nature so they could get merged) it's possible that the whole set of operations takes only a little more than 10ms overall.

If you then divide that 10ms by 20 writes and say each operation took 0.5ms, you will be very disappointed when doing each individually takes 10ms.

That 0.5ms figure may numerically be the average, but it is almost meaningless in modelling operations and their performance.
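The same numbers written out, as a small sketch. It assumes the merged batch takes exactly 10ms (the text says "a little more"), and the variable names are illustrative only.

```python
per_write_latency_ms = 10.0   # one isolated write going to platter
writes = 20
merged_batch_ms = 10.0        # assumed: all 20 writes got merged into roughly one operation

# Throughput-style figure: total time divided by operation count.
average_ms = merged_batch_ms / writes

# What you pay when each write has to finish before the next is issued.
one_at_a_time_ms = per_write_latency_ms * writes

print(average_ms)        # the "average" per write in the merged case
print(one_at_a_time_ms)  # the total when nothing can be merged
```

The first figure is a statement about rate under favourable batching; the second is what latency-bound code actually experiences.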

Measuring the overhead


  • Overhead is very important only when it's on the same order of magnitude as the real work.
  • Overhead is probably insignificant if it's an order of magnitude lower

Let's say you are testing how fast your web server requests can go.

One framework points out it can do 15000 req/sec, which is clearly multiples faster than the other one that goes at a mere 5000 req/sec. For shame!

...except via some back-of-napkin calculations, I would actually expect that to be only about a percent faster in any real-world application.


Because the difference between those two rates is on the order of 0.1 milliseconds per request.

Maybe those microseconds are spent doing some useful utility work that your hello world test doesn't need, but almost every real app does. Maybe the faster one is a bare-bones, you-have-to-do-absolutely-everything-yourself framework, and when you do, there is no longer a difference. In fact, the slower one might even be faster at that extra work, and the only criticism is that it apparently puts it in there by default. You don't know.

But even if we assume it is 100 microseconds of pure waste - consider that many real web requests take on the order of 10ms, because that's how long the calculation or IO for something remotely useful tends to take.

So now we're talking about the difference between responding in 10ms and in 10.1ms. That difference is barely measurable; it falls away in the latency jitter.

If I cared about speed, it would be my code I should be examining, because I can almost certainly improve it by a lot more than 100 microseconds.
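The napkin calculation, spelled out. The 10ms of real work per request is the assumption made in the text above, and the helper name is invented for this sketch.

```python
def per_request_ms(requests_per_second):
    """Time spent per request if requests are handled back to back."""
    return 1000.0 / requests_per_second

fast_ms = per_request_ms(15000)      # roughly 0.067 ms of framework time
slow_ms = per_request_ms(5000)       # 0.2 ms of framework time
difference_ms = slow_ms - fast_ms    # roughly 0.13 ms

work_ms = 10.0  # assumed: time the actual application work takes per request

fast_total_ms = work_ms + fast_ms
slow_total_ms = work_ms + slow_ms
speedup_percent = (slow_total_ms / fast_total_ms - 1) * 100

print(f"framework difference: {difference_ms:.3f} ms per request")
print(f"with {work_ms:.0f} ms of real work: {speedup_percent:.1f}% faster")
```

So a headline "3x faster" on an empty request shrinks to roughly a percent once each request does a typical amount of work.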

Measuring the startup, hardware tuning, etc.

The more complex the thing being executed, the more likely it is that something is being loaded dynamically.

Say, if there's something as complex as tensorflow under the covers, then maybe the first loop spends 20 seconds loading a hundreds-of-MByte model into the GPU, maybe doing some up-front autotuning, and all successive loops might take 10ms to run.

Even if you pre-loaded the model, the first loop may still be 100ms..1s because of the autotuning or just because some further things weren't loaded.

Or even during the first iterations - some libraries might optimize for your hardware, at a one-time cost.

Or even during later ones - JIT compilers will often do some analysis and tweaking beyond the first iterations.

The low-brainer workaround is often just to run a bunch of loops on real-ish data, before you start actually timing things.
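A minimal sketch of that workaround, assuming the thing you want to time can be wrapped in a zero-argument callable. The names time_after_warmup and do_stuff are hypothetical, and the run counts are arbitrary defaults rather than recommendations.

```python
import time

def time_after_warmup(func, warmup_runs=10, timed_runs=100):
    """Call func warmup_runs times untimed, then return per-call times in seconds."""
    for _ in range(warmup_runs):
        func()  # lets lazy loading, autotuning, JIT warm-up etc. happen first

    timings = []
    for _ in range(timed_runs):
        started = time.perf_counter()
        func()
        timings.append(time.perf_counter() - started)
    return timings

def do_stuff():
    # stand-in for the real work; use real-ish data in practice
    return sum(i * i for i in range(1000))

timings = time_after_warmup(do_stuff)
median_s = sorted(timings)[len(timings) // 2]
print(f"median: {median_s * 1e3:.3f} ms over {len(timings)} runs")
```

Keeping the individual timings rather than only a total also lets you see whether the first few timed runs are still slower than the rest, which would suggest the warm-up was too short.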

Timing inaccuracies

It's easy to do:

time_started = now()
do_stuff()
time_taken = now() - time_started

...and this is perfectly valid.

Yet the shorter that do_stuff takes, the more it matters how fine-grained this now() function is, and how much overhead there is in calling that now().

The cruder and the more overhead, the less that time_taken means anything at all.

For example

  • python's time.time() was meant to give the seconds-since-epoch timestamp; it just happens to do so roughly as precisely as it can on a given platform
...but assume that it's no better than ~10ms on windows (it can be better, but don't count on it), and ~1ms on most linux and OSX
there is a _ns variant of time() (and of various other functions), introduced in py3.7
introduced to deal with the precision loss due to time() returning a float, by returning an int instead
note that while it gives nanosecond resolution, it does not promise that precision
the precision might be down to the 100ns..250ns range on linux (verify), and may still be around 1ms on windows (also note that at this scale, python execution overhead matters a lot)
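Rather than trusting figures like these, you can ask the platform you are on. The sketch below uses time.get_clock_info() for what Python reports, and a hypothetical helper, smallest_observed_tick_ns, to see what steps you actually observe. The observed step includes the cost of the call itself, so treat it as an upper bound on granularity, not a measurement of the clock.

```python
import time

# What Python reports about each clock on this platform.
for name in ("time", "monotonic", "perf_counter"):
    info = time.get_clock_info(name)
    print(f"{name:13s} implementation={info.implementation} "
          f"resolution={info.resolution}")

def smallest_observed_tick_ns(clock_ns=time.perf_counter_ns, samples=100):
    """Smallest nonzero step seen between consecutive readings of clock_ns."""
    smallest = None
    for _ in range(samples):
        first = clock_ns()
        second = clock_ns()
        while second == first:      # spin until the clock visibly advances
            second = clock_ns()
        step = second - first
        if smallest is None or step < smallest:
            smallest = step
    return smallest

print("perf_counter_ns step:", smallest_observed_tick_ns(), "ns")
print("time_ns step:        ", smallest_observed_tick_ns(time.time_ns), "ns")
```

For timing code (as opposed to getting a timestamp), time.perf_counter() is generally the better fit, since it is meant for measuring short durations and is not affected by system clock adjustments.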

Measuring your cache more than your code

Okay, you've read about the timing inaccuracies, and wrote "repeat this small code fragment a million times and/or for at least a few seconds, and divide the time taken by the number of runs"

That is better, in that it makes the inaccuracies in the timing around that code negligible.
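In Python, the standard library's timeit module does this repeat-and-divide for you. A small sketch, where fragment is a made-up stand-in for the code under test and the run counts are arbitrary:

```python
import timeit

def fragment():
    # stand-in for the small piece of code being measured
    return "-".join(str(i) for i in range(10))

number = 100_000  # calls per repeat
totals = timeit.repeat(fragment, repeat=5, number=number)

# Each entry in totals is the time for `number` calls; the timeit docs
# suggest looking at the minimum, since higher values mostly reflect
# interference from other things running on the machine.
per_call_s = min(totals) / number
print(f"{per_call_s * 1e9:.0f} ns per call")
```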

It is also potentially worse, in that you now easily run into another issue:

The smaller that code fragment is, the more likely it is that running it as a tight loop means it gets the bonus speed from various caches (L1, L2, RAM, page cache, memcache) - a boost that you would never get if this is occasionally-run code rather than an inner loop.

When optimizing inner-loop number crunching, this may be sensible, but if this code is never run that way in practice, you are now really testing your cache, and not even realistically at that (because of access patterns).

When your aim was to get an idea of the speed of running it once in everyday use, you just tested basically nothing.

Also, shaving a millisecond off something that rarely gets run probably isn't worth your time.

This isn't even a fault in the execution of a test, it's a fault in the setup of that test, and in thinking about what that test represents.
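Hardware and OS caches are hard to show in a few lines, so the sketch below uses an application-level cache (functools.lru_cache) as an analogy. It counts how often the real work runs when a benchmark repeats one input versus when it uses varied inputs. All names here are hypothetical.

```python
import functools

real_computations = 0

@functools.lru_cache(maxsize=None)
def lookup(key):
    """Stand-in for something expensive that sits behind a cache."""
    global real_computations
    real_computations += 1
    return sum(ord(c) for c in key)

def run_benchmark(keys):
    """Call lookup for each key; return how many calls did the real work."""
    global real_computations
    lookup.cache_clear()
    real_computations = 0
    for key in keys:
        lookup(key)
    return real_computations

same_key = ["user:42"] * 1000                      # tight loop on one input
varied_keys = [f"user:{i}" for i in range(1000)]   # closer to occasional, varied use

print(run_benchmark(same_key))     # → 1
print(run_benchmark(varied_keys))  # → 1000
```

A timing taken over the first loop would describe the cache almost exclusively. The same effect happens less visibly with CPU caches and the page cache, where you cannot simply count the misses from Python.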


See also