Programming in teams, working on larger systems, keeping code healthy
Keeping codebases healthy
Testing in teams and larger scales
Refactoring
Tech debt and code debt
Tech debt (/Technical debt) refers to any decision that is easier now but means doing the same work, and more, later.
Tech debt usually comes up when you decide to apply a simplified, quick-fix, incomplete solution now.
Often, but not necessarily, there is a sense that the future (re)work
will be harder and more time-consuming than doing it properly now would be
(because more things will rely on it by then).
So this lies on a scale between
- quick fix now, polish it up later
- some double work between now and then
- the work later will probably be a complete restructure
Whether committing to tech debt is a good or bad idea depends on context.
Sometimes, the ability to get other people started is worth some extra hours spent overall.
Yet there is a very good argument for more effort up front when:
- postponing means you know that the later change will have to be a complete redesign,
- and/or it will be a core part that other things build on,
- and/or the complexity now is not actually much lower
Economic comparison
Code debt
Permeation debt
Software rot and refactoring
Everything is experimental
Limits of automated testing
Which tests are most interesting, for which type of projects, and why?
Types of tests
There are more possible names for tests than most of us can remember or define on the spot.
And some are fuzzily defined, or treated as interchangeable in some contexts.
Here is some context for some of them.
Code testing (roughly smaller to larger)
Narrow-scope tests (unit test)
"is this piece of code behaving sanely on its own, according to the tests I've thought up?"
Typically for small pieces of code, often functions, behaviour within classes, etc.
Note that the narrowness is about the code being validated, not about the extent of code being executed to do so (this also relates to the 'how much to mock' discussion).
Upsides:
- can be a form of self-documentation
- example cases are often unit tests as well
- and say, if I want to dig into details, a function description saying "percent-escapes for URI use" may tell me less than assert uri_component('http://example.com:8080/foo#bar') == 'http%3A%2F%2Fexample.com%3A8080%2Ffoo%23bar' (a fuller sketch of this follows after the downsides below)
- can be particularly helpful in helper/library functions, because more other code relies on it
- a subset of unit tests are part of regression testing - "this part is fragile, and we expect future tweaks may break this again"
- forces you to think about edge cases at the most basic level
- ...around more dynamic programming
- ...around more dynamic typing
- doing this sooner rather than later avoids some predictable mistakes
- sometimes you discover edge cases you didn't think of, didn't implement correctly, and/or didn't describe very precisely, just by writing tests
- easily overstated, but at the same time, probably everyone has done this
Arguables:
- if you didn't think of a problem when writing your code, you may also not think of it in your tests.
- alleviated by shared tests
- the thoroughness you have gives you more confidence -- the thoroughness you forget will give you a false sense of confidence
- The more OO code dabbles in layers of abstractions, the more black-box it is, and the harder it is to say what or how much a test really does
- a lot of real-world bugs sit in interactions of different code, and unit tests do not test that at all
- sure, that's not their function; the point is that 'write tests' often ends up meaning 'write unit tests', with the focus not really on finding bugs
- while on paper the "try your hardest to think of everything that would break it" idea is great
- ...if you look around, a buttload of unit tests are of the "think of things you know probably work anyway" sort
- ...because a lot of people write unit tests only because someone told them sternly (often by someone who barely understands when they are useful and when not)
Downsides:
- the more it involves locking, IPC, network communication, or concurrency, or interacts with other parts of the program that have state (think OO), or with other programs that have state, the less you really test - or can even say what you have tested or not.
- such things are hard to test even with much fancier techniques
- there is no good measure of how thorough your unit tests are
- if you think code coverage is that thing, you are probably a manager, not a programmer.
- the less dynamic the behaviour, the more that unit testing converges on testing if 1 is still equal to 1
- which wastes time
- which can give a false sense of security
- the more dynamic the behaviour (in the 'execution depends on the actual input' sense), the less that adding a few tests actually prove correctness at all
- In fact, tests rarely prove correctness to begin with (because this is an extremely hard thing to do), even in the forms of TDD that TDDers would find overzealous
- most of the time, they only prove you didn't make the more obvious mistakes that you thought of
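To pick up the self-documentation example from the upsides above, here is a minimal sketch of what such narrow tests can look like. uri_component is a stand-in here (just wrapping the stdlib's urllib.parse.quote), and pytest-style bare asserts are only one of several ways to write these:

    # a sketch; uri_component is a stand-in wrapping the stdlib's quote()
    from urllib.parse import quote

    def uri_component(s):
        return quote(s, safe='')

    def test_basic_escaping():
        # arguably tells you more than a docstring saying "percent-escapes for URI use"
        assert uri_component('http://example.com:8080/foo#bar') == 'http%3A%2F%2Fexample.com%3A8080%2Ffoo%23bar'

    def test_edge_cases():
        assert uri_component('') == ''
        # already-escaped input gets escaped again - a behaviour choice worth pinning down in a test
        assert uri_component('already%20escaped') == 'already%2520escaped'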
Regression testing
"Is this still working as it always was / doing what it always did / not doing a bad thing we had previously patched up?"
Refers to any kind of test that should stay true over time.
- particularly when you expect that code to be often touched/altered,
- particularly when that code is used (implicitly) by a lot of the codebase, so bugs (or breaking changes, or false assumptions) are far reaching
Yes, any test that you do not throw away acts as a sort of regression test,
but when we call it this, it more specifically often means "a test we wrote when we fixed a nasty bug, to ensure we won't regress to that bug later" - hence the name.
Regression tests are often as simple as they need to be, and frequently a smallish set of unit tests is enough.
Upsides:
- should guard well against regressing to that specific bug
- may also help avoid emergence of similar bugs.
Arguables / downsides:
- the same specificity that avoids that regression means it's covering very little else
- ...even similar issues in the same code
- which can lead to a false sense of security
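As a sketch of what "a test we wrote when we fixed a nasty bug" often looks like in practice - the function and the bug here are made up for illustration:

    # hypothetical example: parse_size('') used to raise, which once broke config loading
    def parse_size(s):
        # stand-in for the real function that was patched
        s = s.strip()
        return int(s) if s else 0

    def test_empty_size_regression():
        # added when the bug was fixed, kept so a later refactor can't quietly bring it back
        assert parse_size('') == 0
        assert parse_size('  ') == 0
        assert parse_size(' 42 ') == 42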
Integration testing
"Does this code interact sanely with the other code / parts of the program?"
This is one or more steps up from unit tests - it takes units and smashes them together, to see if they keep working properly and interact sensibly -- rather than just work in complete isolation.
Integration tests are often the medium-sized tests you can do during general development.
You can argue they are more interesting than unit tests at finding broad problems quickly, but at the same time they may not be great at isolating them, or at finding edge cases.
In any case, they usually stop far before testing the product as a whole. (that tends to be later in the process, if at all -- these continuous-delivery days sometimes mean "user tests means deploying to users and see if they complain, right?")
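A small sketch of the idea: two made-up components that might each have their own unit tests, exercised together to check that they actually agree on what they exchange:

    import json, os, tempfile

    def parse_order(text):                      # hypothetical component one
        name, qty = text.split(',')
        return {'name': name.strip(), 'qty': int(qty)}

    class OrderStore:                           # hypothetical component two
        def __init__(self, path):
            self.path = path
        def save(self, order):
            with open(self.path, 'w') as f:
                json.dump(order, f)
        def load(self):
            with open(self.path) as f:
                return json.load(f)

    def test_parse_then_store_roundtrip():
        path = os.path.join(tempfile.mkdtemp(), 'order.json')
        store = OrderStore(path)
        store.save(parse_order('widget, 3'))
        # the interesting part is not either piece alone, but that they agree
        assert store.load() == {'name': 'widget', 'qty': 3}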
Fuzz testing
Fuzz testing, a.k.a. fuzzing, feeds in what is often largely random data, or random variation of existing data.
If the software does anything other than complain about bad input, that may reveal border cases you're not considering, and e.g. the presence of exploitable buffer overflows, injection vectors, ability to DoS, bottlenecks, etc.
Perhaps used more in security reviews, but also in some tests for robustness.
This can apply
- for relatively small bits of code, e.g. "add a random number generator to unit tests and see if it breaks",
- up to "feed stuff into the GUI field and see if it breaks".
See also:
Acceptance testing
"Are we living up to the specific list of requirements in that document over there?"
Said document classically said 'functional design' at the top.
In agile, it's probably the collection of things labeled 'user stories'.
Which can involve any type of test, though in many cases it is a fairly minimal set of tests of its overall function
and basic user interaction, and is largely unrelated to bug testing, security testing, or such.
These draw in some criticism, for various reasons.
A design document tends to have an overly narrow view of what really needs to be tested. You're not necessarily testing whether the whole actually... functions, or even acts as people expect.
The more formally it's treated, the less value is placed on people doing their own useful tests.
End-to-End testing
Basically, testing whether the flow of an application works, with a 'simulated user'.
The goal is still to test the application at mostly functional level - whether information is passed between distinct components, whether database, network, hardware, and other dependencies act as expected.
End-to-end testing is often still quite mechanical, and you might spend time specifying a bunch of test cases expected to cover likely uses and likely bugs.
This is in some ways an extension of integration testing, at a whole-application and real-world interaction level.
While you are creating a setup much as real users might see it,
it's not e.g. letting users loose to see what they break.
Tests in short release cycles
Sanity testing, Smoke testing
Caring about users
Usability testing
Accessibility testing
Load, performance, stress testing
Broadly:
- Load testing is a name that groups various others, broadly meaning 'put load on a system and see what happens'.
- Performance testing:
- seeing how fast it goes
- Stress and endurance testing:
- seeing whether it does something unexpected while doing that.
In any of these areas, as our goals get more specific, the names do too.
Longevity/endurance testing, soak testing
Longevity testing checks whether it stays well-behaved over time, to evaluate system stability under production use.
It doesn't necessarily test stress conditions, yet it gives the system a reasonable load to deal with and runs it long enough to reveal things like memory leaks, hidden problems in transaction processing, locking issues, counter overflows, and whatever other things a sequence of unit tests can't or won't easily test for.
Notes:
- Endurance testing, capacity testing, and some other names are flavour variations of this
- in modern release-often cycles, longevity testing takes too long, and is often replaced with some extrapolation and assumptions
- or, if the only purpose is finding memory leaks, a test that tries to evoke issues more specifically
Soak testing is similar in that it also asks you to run it for a long time if you can, yet specifically asks to simulate a real production load,
figuring that this may reveal issues that targeted (and usually short-term) tests do not test for.
Some people add a sense of "make the tasks realistic rather than just numerous enough", but people vary on the details.
You might go further and think about the contents of the tasks you submit, e.g. adding some repetition and some fuzz testing, to reveal hotspots and such.
The term seems borrowed from electronics, where soak testing means testing it outside of ratings (e.g. temperature) for a while to see when and how it fails.
Stress testing, recovery testing
Stress testing is basically seeing if you can break the system by attempting to overwhelm its resources.
Such conditions may bring up issues related to resource contention, memory leaks, timeouts, overflows, races, non-elegant failure, etc. that a relatively idle system may never meet.
Often done to see whether a system will merely slow down (which is to be expected), or will also start to fail on specific tasks, or become less stable in other ways.
Recovery testing often amounts to simulating limits on certain resources - CPU, but also IO, services, connectivity, memory - often to see
- whether it will recover from specific resource failures at all - seeing if it becomes slow or just falls over.
- whether stress during recovery will lead to repeated failure
- whether it will go back to a useful state afterwards
Useful in general, and perhaps even more so in modern cloudy arrangements, because on someone else's virtualized, shared infrastructure it is harder to estimate resource limits.
Performance testing
So, there is a whole area of "how fast will it go" testing.
This is, more than many other kinds of tests, an artform to do meaningfully.
In part because every benchmark represents specific uses and hidden assumptions. Many performance tests are completely unrealistic workloads, don't test degradation under load, and are in some cases done primarily to please your PR department - you know, "lies, damned lies, and benchmarks".
Performance testing: push it until it won't go faster, to see how fast that is, often related to:
- finding the minimum expectable time to handle any unit of work (e.g. with near-empty tasks)
- finding whether there are unusual variations in that time, and if so where they come from
- finding the maximum rate the system handles for real-world load
- checking whether you get a reasonable response under expectable average load (black box style - not yet caring why or why not)
- finding the maximum rate the system handles under unusually complex load
- finding which parts are the bottleneck
- also to eliminate or relieve them where possible (not so black-box; may involve a lot of detailed system diagnosis)
- finding a baseline speed to compare against in future tests, to see changes in speed
Common pitfalls in benchmarking
Measuring latency / client rate when you wanted to measure system rate
Network clients that do things sequentially are implicitly limited by network latency, because that is the time they spend just sitting around waiting.
So a client's latency is often the limiting factor for a single client's interaction rate, but in a (properly written) server this says nothing about what the server can do.
It may be that any single client can get no more than 50 req/s (because it was elsewhere on the internet), but the system you're requesting things from could reach 5000 req/s if only you pointed enough distinct clients at it.
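As a back-of-the-envelope illustration (all numbers made up):

    round_trip = 0.020                    # say, 20 ms of network latency per request
    one_client_rate = 1 / round_trip      # a sequential client tops out around 50 req/s
    server_rate = 5000                    # what the server might sustain given enough concurrency
    print(one_client_rate)                # ~50 - really a statement about the client's latency
    print(server_rate / one_client_rate)  # ~100 concurrent clients needed to approach server capacity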
Measuring rate when you wanted to measure latency
Say you do 20 small writes to a hard drive.
End-to-end, the act of going to a platter disk might take 10ms for each operation.
Yet if your set of writes was optimizable (e.g. very sequential in nature so they could get merged) it's possible that a set of operations take only a little more than 10ms overall.
If you then divide that 10ms by 20 writes and say each operation took 0.5ms, you will be very disappointed when doing each individually takes 10ms.
That 0.5ms figure may numerically be the average, but it is almost meaningless in modelling operations and their performance.
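The same example as rough arithmetic (numbers illustrative):

    writes = 20
    single_write = 0.010          # ~10 ms for one isolated small write to a platter disk
    merged_batch = 0.010          # the merged, sequential-ish batch as a whole - little more than one seek

    print(merged_batch / writes)  # 0.0005 s - numerically the average, useless as a latency estimate
    print(single_write)           # 0.010 s  - what a single write on its own actually costs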
Measuring the overhead
tl;dr:
- Overhead is very important only when it's on the same order of magnitude as the real work.
- Overhead is probably insignificant if it's an order of magnitude lower
Let's say you are testing how fast your web server requests can go.
One framework points out it can do 15000 req/sec, which is clearly multiples faster than the other one that goes at a mere 5000 req/sec. For shame!
...except via some back-of-napkin calculations, I would actually expect that to be less than a percent faster in any real-world application.
Why?
Because the difference between those two rates is on the order of 0.1 milliseconds.
Maybe those microseconds are spent doing some useful utility work that your hello world test doesn't need, but almost every real app typically does.
Maybe the faster one is a bare-bones, you-have-to-do-absolutely-everything-yourself framework, and once you do all that yourself, there is no longer a difference.
In fact, it might even be faster at that extra work, and the only criticism is that it apparently puts it in there by default.
You don't know.
But even if we assume it is 100 microseconds of pure waste - consider that many real web requests take on the order of 10ms,
because that's how long the calculation or IO for something remotely useful tends to take.
So now we're talking about the difference between responding in 10ms and 10.1ms. That difference is barely measurable - it falls away in the latency jitter.
If I cared about speed, it would be my own code I should be examining, because I can almost certainly improve it by a lot more than 100 microseconds.
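That back-of-napkin calculation, spelled out (the 10 ms of 'real work' per request is the assumption here):

    per_req_fast = 1 / 15000                      # ~0.067 ms of framework time per hello-world request
    per_req_slow = 1 / 5000                       # ~0.2 ms
    overhead_diff = per_req_slow - per_req_fast   # ~0.13 ms difference

    real_work = 0.010                             # assume ~10 ms of actual calculation/IO per real request
    relative = overhead_diff / (real_work + per_req_fast)
    print(overhead_diff * 1000)                   # ~0.13 (ms)
    print(relative)                               # ~0.013 - on the order of a percent, not "three times faster"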
Measuring the startup, hardware tuning, etc.
The more complex the thing being executed, the more likely it is that something is being loaded dynamically.
Say, if there's something as complex as tensorflow under the covers, then maybe the first loop spends 20 seconds loading a hundreds-of-MByte model into the GPU, maybe doing some up-front autotuning, and all successive loops might take 10ms to run.
Even if you pre-loaded the model, the first loop may still be 100ms..1s because of the autotuning or just because some further things weren't loaded.
Or even during the first iterations - some libraries might optimize for your hardware, at a one-time cost.
Or even during later ones - JIT compilers will often do some analysis and tweaking beyond the first iterations.
The low-brainer workaround is often just to run a bunch of loops on real-ish data, before you start actually timing things.
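A minimal sketch of that workaround - keep the warm-up iterations out of the timed loop (the workload here is just stand-in busywork):

    import time

    def predict(batch):
        # stand-in for the real work (model inference, query, ...)
        return sum(x * x for x in batch)

    batches = [list(range(10000)) for _ in range(60)]   # stand-in for realistic input data

    for b in batches[:10]:       # warm-up: pay loading/autotuning/JIT-style one-time costs here
        predict(b)

    t0 = time.perf_counter()
    for b in batches[10:]:       # so that this part measures steady-state behaviour
        predict(b)
    print((time.perf_counter() - t0) / len(batches[10:]))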
Timing inaccuracies
It's easy to do:
    time_started = now()
    do_stuff()
    time_taken = now() - time_started
...and this is perfectly valid.
Yet the shorter that do_stuff takes, the more it matters how fine-grained this now() function is, and how much overhead there is in calling that now().
The cruder and the more overhead, the less that time_taken means anything at all.
For example
- python's time.time() was meant to give the seconds-since-epoch timestamp, it just happens to do that roughly as precisely as it can on a given platform
- ...but assume that it's no better than ~10ms on windows (it can be better, but don't count on it), and ~1ms on most linux and OSX
- there is a _ns variant of time() (and various others), introduced around py3.7
- introduced to deal with precision loss due to time() returning a float, by returning an int instead
- note that while it gives nanosecond resolution, it does not promise that precision
- the precision might be down to the 100ns..250ns range on linux(verify). It's still 1ms on windows. (also note that at this scale, python execution overhead matters a lot)
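For timing code specifically, python also has time.perf_counter() / time.perf_counter_ns(), which use the highest-resolution clock available and make no promise of being a wall-clock timestamp - a better default than time.time() for this purpose. A minimal sketch:

    import time

    t0 = time.perf_counter_ns()
    result = sum(range(100_000))              # whatever you're measuring
    elapsed_ms = (time.perf_counter_ns() - t0) / 1e6
    print(elapsed_ms)                         # for very short work, call overhead and jitter still dominate

For very short fragments, the timeit module exists for exactly this job - though see the next pitfall.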
Measuring your cache more than your code
Okay, you've read the timing inaccuracies above, and wrote "repeat this small code fragment a million times and/or for at least a few seconds, and divide the time taken by the number of runs".
That is better, in that it makes the inaccuracies in the timing around that code negligible.
It is also potentially worse, in that you now easily run into another issue:
The smaller that code fragment is, the more likely it is that running it as a tight loop means it gets the bonus speed from various caches (L1, L2, RAM, page cache, memcache) - a boost that you would never get if this is never actually run in an inner loop.
When optimizing inner-loop number crunching, this may be sensible,
but if this code is never run that way in practice,
you are now testing your cache more than your code
(and not even realistically at that, because of access patterns).
When your aim was to get an idea of the speed of running it once in everyday use, you just didn't do that.
Also, shaving milliseconds off something that rarely gets run
probably isn't worth your time in the first place.
This isn't even a fault in the execution of a test, it's a fault in the setup of that test,
and in thinking about what that test represents.
Micro-benchmarking
See also
- http://agiletesting.blogspot.com/2005/02/performance-vs-load-vs-stress-testing.html
- http://www.soft.com/News/QTN-Online/qtnsep02.html
http://en.wikipedia.org/wiki/System_testing
Also relevant
Black-box versus white-box
Self-testing code
Self-testing code is code that includes some checks inside its own code.
This often amounts to
- assert() statements within a function, e.g.
- testing important invariants
- doing your own regression checks
- intentionally borking out earlier rather than later when a bug could have wide-reaching implications (e.g. around concurrency)
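A minimal sketch of what that can look like - the function and its invariants are made up for illustration:

    def rebalance(weights):
        """Scale a list of non-negative weights so they sum to 1."""
        assert all(w >= 0 for w in weights), "negative weight - likely an upstream bug"
        total = sum(weights)
        assert total > 0, "all-zero weights - bork early, while the cause is still nearby"
        result = [w / total for w in weights]
        assert abs(sum(result) - 1.0) < 1e-9   # invariant the rest of the code relies on
        return result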
https://en.wikipedia.org/wiki/Self-testing_code