Programming in teams, working on larger systems, keeping code healthy
Keeping codebases healthy
Testing in teams and larger scales
Integration
Refactoring
Code smells
Tech debt and code debt
Technical debt (a.k.a. tech debt) refers to any decision that is easier now, but probably means more work in the future to do it right.
Tech debt usually comes up when you decide to apply a simplified, quick-fix, incomplete solution now.
This is not about optimizations or things that could be considered black box implementations.
It's about things that will have a structural effect for some reason, often because other code that interacts with this will have to follow along.
This is related to the sense that the inevitable future (re)work will be harder,
and more time-consuming, than it is to do it properly now (because more things will rely on it by then).
So this lies on a scale between
- quick fix now, polish it up later,
- some double work between now and then,
- the work later will probably be a complete restructure, and
- that's so much work it will never happen
Sometimes,
the ability to get other people started
is worth some extra hours spent overall.
Yet there is a very good argument for more effort up front when:
- postponing means you know the later change will have to be a complete redesign,
- and/or the code will be a core part other things build on,
- and/or the complexity now is not actually much lower
Whether committing to tech debt is a good or bad idea depends on context.
Where that tipping point lies depends as well. In startup world, it arguably doesn't even exist, in that not committing to rewriting next year may mean you won't make it to next year.
Economic comparison
Code debt
Permeation debt
Software rot and refactoring
Everything is experimental
Limits of automated testing
Which tests are most interesting, for which type of projects, and why?
Types of tests
There are more possible names for tests than most of us can remember or define on the spot.
And some are fuzzily defined, or treated as interchangeable in some contexts.
Here is some context for some of them.
Code testing (roughly smaller to larger)
Narrow-scope tests (unit test)
"is this piece of code behaving sanely on its own, according to the tests I've thought up?"
Typically for small pieces of code, often functions, behaviour within classes, etc.
Note that the narrowness is about the code being validated, not about the extent of code being executed to do so (this also relates to the 'how much to mock' discussion).
Upsides:
- can be a form of self-documentation
- example cases are often unit tests as well
- and say, if I want to dig into details, a function description saying "percent-escapes for URI use" may tell me less than assert uri_component('http://example.com:8080/foo#bar') == 'http%3A%2F%2Fexample.com%3A8080%2Ffoo%23bar'
- forces you to think about edge cases - and what to do with them
- ...useful around more dynamic programming
- ...useful around more dynamic typing
- doing this sooner rather than later avoids some predictable mistakes
- can be particularly helpful in helper/library functions, because more other code relies on it
- a subset of unit tests are part of regression testing - "this part is fragile, and we expect future tweaks may break this again"
- sometimes you discover edge cases you didn't think about, didn't implement correctly, and/or didn't describe very precisely, just by writing tests
- easily underestimated and easily overestimated
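As a sketch of what such narrow-scope tests can look like, here is the uri_component example from above fleshed out into a few tests; the implementation is a stand-in built on the stdlib's urllib.parse.quote, purely so the sketch runs.

```python
# uri_component() here is a stand-in implementation (urllib.parse.quote
# with no safe characters), so that these example tests are runnable.
from urllib.parse import quote

def uri_component(s):
    return quote(s, safe='')  # percent-escape everything except unreserved chars

# the happy-path case doubles as documentation-by-example:
assert uri_component('http://example.com:8080/foo#bar') == \
       'http%3A%2F%2Fexample.com%3A8080%2Ffoo%23bar'

# edge cases you are forced to decide on while writing the tests:
assert uri_component('') == ''
assert uri_component('a b') == 'a%20b'
assert uri_component('100%') == '100%25'
```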
Arguables:
- if you didn't think of a problem when writing your code, you may also not think of it in your tests.
- alleviated by shared tests
- the thoroughness you actually have gives you more confidence -- the thoroughness you forget will make that a false sense of confidence
- The more that OO code dabbles in layers of abstractions, the more black-box it is, and the harder it is to say what or how much a test really does
- a lot of real-world bugs sit in interactions of different code, and unit tests do not test that at all
- sure, that's not their function; the point is that 'write tests' in practice often leads to writing unit tests, not to finding bugs
- while on paper the "try your hardest to think of everything that would break it" idea is great
- ...if you look around, a buttload of unit tests are of the "think of things you know probably work anyway" sort
- ...because a lot of people write unit tests only because someone told them sternly (often by someone who barely understands when they are useful and when not)
Downsides:
- if it involves locking,
- or IPC,
- or network communication,
- or concurrency,
- or interacts with other stateful parts of the program,
- or other programs that have state,
- ...then you are not necessarily doing much of a test, and it makes it hard to say what you have even tested
- such things are hard to test even with much fancier techniques
- there is no good measure of how thorough your unit tests are
- if you think code coverage is that thing, you are probably a manager, not a programmer.
- the less dynamic the behaviour, the more that unit testing converges on testing if 1 is still equal to 1
- which wastes time
- which can give a false sense of security
- the more dynamic the behaviour (in the 'execution depends on the actual input' sense), the less that adding a few tests actually prove correctness at all
- In fact, tests rarely prove correctness to begin with (because this is an extremely hard thing to do), even in the forms of TDD that TDDers would find overzealous
- most of the time, they only prove you didn't make the more obvious mistakes that you thought of
Regression testing
"Is this still working as it always was / doing what it always did / not doing a bad thing we had previously patched up?"
Refers to any kind of test that should stay true over time.
- particularly when you expect that code to be often touched/altered,
- particularly when that code is used (implicitly) by a lot of the codebase, so bugs (or breaking changes, or false assumptions) are far reaching
Yes, any test that you do not throw away acts as a sort of regression test,
but when we call it this, it more specifically often means "a test we wrote when we fixed a nasty bug, to ensure we won't regress to that bug later" - hence the name.
Regression tests are often as simple as they need to be, and frequently a smallish set of unit tests is enough.
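A sketch of what that often looks like in practice; the function and the bug here are hypothetical, purely for illustration. Say we fixed a bug where splitting an empty tag string returned [''] instead of [] - we pin the fixed behaviour down so a future tweak can't silently reintroduce it:

```python
def split_tags(s):
    # the naive s.split(',') returns [''] for empty input -- the old bug
    return [t for t in s.split(',') if t]

# the regression test itself: as simple as it needs to be
assert split_tags('') == []             # the exact case that was broken
assert split_tags('a,b') == ['a', 'b']  # and behaviour that already worked
```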
Upsides:
- should guard well against that specific bug regressing
- may also help avoid emergence of similar bugs.
Arguables / downsides:
- the same specificity that avoids that regression means it's covering very little else
- ...even similar issues in the same code
- which can lead to a false sense of security
Integration testing
"Does this code interact sanely with the other code / parts of the program?"
This is one or more steps up from unit tests - it takes units and smashes them together, to see if they keep working properly and interact sensibly -- rather than just work in complete isolation.
Integration tests are often the medium-sized tests you can do during general development.
You can argue they are more interesting than unit tests at finding broad problems quickly, but at the same time they may not be great at isolating them, or at finding edge cases.
In any case, they usually stop far before testing the product as a whole. (that tends to be later in the process, if at all -- these continuous-delivery days sometimes mean "user tests means deploying to users and see if they complain, right?")
Fuzz testing
Fuzz testing, a.k.a. fuzzing, feeds in what is often largely random data, or random variation of existing data.
If software does anything other than complain about bad input, it may reveal bordercases you're not considering, and e.g. the presence of exploitable buffer overflows, injection vectors, ability to DoS, bottlenecks, etc.
Perhaps used more in security reviews, but also in some tests for robustness.
Can apply
- for relatively small bits of code, e.g. "add random number generator to unit tests and see if it breaks",
- up to "feed stuff into the GUI field and see if it breaks".
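A minimal sketch of the small-scale end of that, using the stdlib's int() as a stand-in for the code under test: feed it random strings and check that it either returns a result or raises the error it documents - any other outcome (crash, hang, wrong exception type) is a find.

```python
import random, string

random.seed(0)   # reproducible runs make failures easier to chase down

attempts = 0
for _ in range(1000):
    s = ''.join(random.choice(string.printable)
                for _ in range(random.randint(0, 20)))
    attempts += 1
    try:
        int(s)
    except ValueError:
        pass   # complaining about bad input is the acceptable outcome
    # any other exception type would propagate and fail the fuzz run
```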
See also:
Acceptance testing
"Are we living up to the specific list of requirements in that document over there?"
Said document classically said 'functional design' at the top.
In agile, it's probably the collection of things labeled 'user stories'.
Which can involve any type of test, though in many cases it is a fairly minimal set of tests of overall function
and basic user interaction, and is largely unrelated to bug testing, security testing, or such.
These draw in some criticism, for various reasons.
A design document tends to have an overly narrow view of what really needs to be tested. You're not necessarily testing whether the whole actually... functions, or even acts as people expect.
The more formally it's treated, the less valued it is when people do their own useful tests.
End-to-End testing
Basically, testing whether the flow of an application works, with a 'simulated user'.
The goal is still to test the application at mostly functional level - whether information is passed between distinct components, whether database, network, hardware, and other dependencies act as expected.
End-to-end testing is often still quite mechanical, and you might spend time specifying a bunch of test cases expected to cover likely uses and likely bugs.
This is in some ways an extension of integration testing, at a whole-application and real-world interaction level.
While you are creating a setup as real users might see it,
it's not e.g. letting users loose to see what they break.
Tests in short release cycles
Sanity testing, Smoke testing
Caring about users
Usability testing
Accessibility testing
Load, performance, stress testing
Broadly:
- Load testing is a name that groups various others, broadly meaning 'put load on a system and see what happens'.
- Performance testing:
- seeing how fast it goes
- Stress and endurance testing:
- seeing whether it does something unexpected while doing that.
In any of these areas, as our goals get more specific, the names do too.
Longevity/endurance testing, soak testing
Longevity testing finds whether a system stays well-behaved over time, to evaluate its stability under production use - acknowledging e.g. that targeted tests are often short-term.
Doesn't necessarily test stress conditions, yet gives it a reasonable load to deal with, and run it long enough to reveal things like memory leaks, hidden problems in transaction processing, locking issues, counter overflows, and whatever other things a sequence of unit tests can't or won't easily test for.
Longevity testing is moderately specific, but can also be considered an umbrella of some other more specific things.
For example, soak testing can be considered a type of longevity testing, focused more on stability and less on e.g. performance - though applying above-average stress is probably useful to reveal some kinds of stability issues. (The term seems borrowed from electronics, where soak testing means running a device somewhat outside its ratings (e.g. temperature) for a longer while, to see when and how it fails.)
For any flavour, it is potentially useful to simulate a realistic production load, rather than just a numerous-enough load.
You might go further and think about the contents of the tasks you submit - e.g. some repetition, some fuzzing, some interrelated tasks - to reveal hotspots, locking issues, and such.
Depending on the kind of system you're testing, that may actually be nontrivial to do well.
Notes:
- Endurance testing, capacity testing, and some other names are flavour variations of this
- in modern release-often cycles, longevity testing takes too long, and is often replaced with some extrapolation and assumptions
- or, if the only purpose is finding memory leaks, a test that tries to evoke issues more specifically
Stress testing, recovery testing
Stress testing is basically seeing if you can break the system by attempting to overwhelm its resources.
Such conditions may bring up issues related to resource contention, memory leaks, timeouts, hardcoded limits, overflows, races, non-elegant failure, etc. that a relatively idle system may never meet.
Often done to see whether a system will merely slow down (which is to be expected), or will also start to fail on specific tasks, or become less stable in other ways.
Recovery testing often amounts to stress tests that simulate limits on certain resources - CPU, but also IO, services, connectivity, memory - often to see...
- ...whether it will recover from specific resource failures at all - seeing if it becomes slow or just falls over.
- ...whether stress during recovery will lead to repeated failure
- ...whether the system as a whole will go back to a useful state afterwards
- (...and sometimes whether e.g. tasks active at the time will be forgotten in the process)
Useful in general, and perhaps even more so in modern cloudy arrangements because on someone else's virtualized, shared, infrastructure it is harder to estimate the actual resource limits.
Performance testing
So, there is a whole area of "how fast will it go" testing.
This is more of an artform to do meaningfully than many other kinds of tests.
In part because every benchmark represents specific uses and hidden assumptions. Many performance tests are completely unrealistic workloads, don't test degradation under load, and are in some cases done primarily to please your PR department - you know, "lies, damned lies, and benchmarks".
Performance testing: push it until it won't go faster, to see how fast that is. Often related to:
- finding the minimum expectable time to handle any unit of work (e.g. with near-empty tasks)
- finding whether there are unusual variations in that time, and if so where they come from
- finding the maximum rate the system handles for real-world load
- check whether you get reasonable response under expectable average load (black box style - not yet caring why or why not)
- finding the maximum rate the system handles under unusually complex load
- finding which parts are the bottleneck
- also to eliminate or relieve them where possible (not so black-box; may involve a lot of detailed system diagnosis)
- finding a baseline speed to compare against in future tests, to see changes in speed
Common pitfalls in benchmarking
Measuring latency / client rate when you wanted to measure system rate
Network clients that do things sequentially are implicitly limited by network latency, because that is the amount they will sit around just waiting.
So a single client's latency is often the limiting factor for that client's interaction rate, but says almost nothing about what the server can do.
It may be that any single client can get no more than 50 req/s (purely because it was elsewhere on the internet), while that same system could reach 5000 req/s if you pointed enough distinct clients at it.
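Written out as back-of-napkin arithmetic; the 20ms round trip is an assumed figure, picked to match the example numbers.

```python
rtt = 0.02                        # seconds of round-trip latency per request
per_client_rate = 1 / rtt         # a strictly sequential client waits out
print(per_client_rate)            #   every round trip: 50.0 req/s, tops

clients = 100                     # ...but the server may happily serve many
print(clients * per_client_rate)  #   such clients: 5000.0 req/s offered
```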
Measuring rate when you wanted to measure latency
Say you do 20 small writes to a hard drive.
End-to-end, the act of going to the platter might take 10ms for each operation.
Yet if your set of writes was optimizable (e.g. very sequential in nature, so they could get merged), it's possible that the whole set takes only a little more than 10ms overall.
If you then divide that 10ms by 20 writes and say each operation took 0.5ms, you will be very disappointed when doing just one still takes 10ms, and doing them in verified sequence takes 200ms.
That 0.5ms figure may numerically be the average, but is almost meaningless in modelling operations and their performance.
Measuring the overhead
tl;dr:
- Overhead is very important only when it's on the same order of magnitude as the real work.
- Overhead is probably insignificant if it's an order of magnitude lower
Let's say you are testing how fast your web server requests can go.
One framework points out it can do 15000 req/sec.
Which is clearly multiples faster than the other one that goes at a mere 5000 req/sec. For shame!
...except via some back-of-napkin calculations, I would actually expect that to be less than a percent faster in almost all real-world applications.
Wait, why?
Because the difference between those two rates is on the order of 0.1 milliseconds.
Maybe those microseconds are spent doing some useful utility work, that your hello world test doesn't need, but almost every real app typically will.
Maybe the faster one is a bare-bones you have to do absolutely everything yourself framework, and when you do, there is no longer a difference.
In fact, it might even be faster at that extra work.
You don't know.
But even if we assume those 100 microseconds of pure waste - consider that many real web requests take on the order of 10ms,
because that's how long the calculation or IO for something remotely useful tends to take.
So now we're talking about the difference in responding in 10ms and 10.1ms. That difference is barely measurable, it falls away in the latency jitter.
If I cared about speed, it would be my code I should be examining, because I can almost certainly improve it by a lot more than 100 microseconds.
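The back-of-napkin arithmetic behind that estimate, written out; the ~10ms of real work per request is the assumed figure from above.

```python
t_fast = 1 / 15000                 # ~67 microseconds per hello-world request
t_slow = 1 / 5000                  # 200 microseconds per hello-world request
overhead = t_slow - t_fast         # ~133 microseconds of difference

real_work = 0.010                  # ~10ms of useful calculation or IO
relative = overhead / (real_work + t_fast)
print(f'{relative:.1%}')           # on the order of a percent, at best
```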
Measuring the startup, hardware tuning, etc.
The more complex the thing being executed, the more likely it is that something is being loaded dynamically.
Say, if there's something as complex as tensorflow under the covers, then maybe the first loop spends 20 seconds loading a hundreds-of-MByte model in to the GPU, maybe doing some up-front autotuning, and all successive loops might take 10ms to run.
Even if you pre-loaded the model, the first loop may still be 100ms..1s because of the autotuning or just because some further things weren't loaded.
Or even during the first iterations - some libraries might optimize for your hardware, at a one-time cost.
Or even during later ones - JIT compilers will often do some analysis and tweaking beyond the first iterations.
The low-effort workaround is often just to run a bunch of loops on real-ish data before you start actually timing things.
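That workaround, sketched out: run the workload a few times untimed, so one-time costs (loading, autotuning, JIT work) stay out of the measurement. The workload() function here is a stand-in for whatever you are actually timing.

```python
import time

def workload():
    return sum(i * i for i in range(10000))   # stand-in for real work

for _ in range(5):                 # warm-up runs, deliberately not timed
    workload()

runs = 100
start = time.perf_counter()
for _ in range(runs):
    workload()
per_run = (time.perf_counter() - start) / runs
print(f'{per_run * 1000:.3f} ms per run (after warm-up)')
```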
Timing inaccuracies
It's easy to do:
time_started = now()
do_stuff()
time_taken = now() - time_started
...and this is perfectly valid.
Yet the shorter that do_stuff takes, the more it matters how fine-grained this now() function is, and how much overhead there is in calling that now().
The cruder and the more overhead, the less that time_taken means anything at all.
For example
- python's time.time() is meant to give the seconds-since-epoch timestamp; it just happens to do so roughly as precisely as it can on a given platform
- ...but assume that it's no better than ~10ms on windows (it can be better, but don't count on it), and ~1ms on most linux and OSX
- there is a _ns variant of time() (and various others), introduced around py3.7
- introduced to deal with precision loss due to time() returning a float, by returning an int instead
- note that while it gives nanosecond resolution, it does not promise that precision
- the precision might be down to the 100ns..250ns range on linux(verify). It's still 1ms on windows. (also note that at this scale, python execution overhead matters a lot)
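For measuring intervals specifically, time.perf_counter() / time.perf_counter_ns() (the latter also from py3.7) are the intended tools: monotonic, and using the highest-resolution clock available, unlike time.time(), which is a wall-clock timestamp. A minimal sketch:

```python
import time

t0 = time.perf_counter_ns()
total = sum(range(1_000_000))          # the thing being timed
elapsed_ns = time.perf_counter_ns() - t0
print(f'{elapsed_ns / 1e6:.2f} ms')    # not affected by clock adjustments
```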
Measuring your cache more than your code
Okay, you've read about the timing inaccuracies, and wrote "repeat this small code fragment a million times and/or for at least a few seconds, and divide the time taken by the number of runs"
That is better, in that it makes the inaccuracies in the timing around that code negligible.
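Note that the stdlib's timeit module packages up exactly this repeat-and-divide approach (and so also shares the warm-cache caveat this section is about):

```python
import timeit

runs = 1_000_000
# total time for `runs` executions, divided back out to time per run
per_run = timeit.timeit('x * x', setup='x = 12345', number=runs) / runs
print(f'{per_run * 1e9:.1f} ns per run')
```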
It is also potentially worse, in that you now easily run into another issue:
The smaller that code fragment is, the more likely it is that running it as a tight loop means it gets the bonus speed from various caches (L1, L2, RAM, page cache, memcache) - a boost that you would never get if this is never actually run in an inner loop.
When optimizing inner-loop number crunching, this may be sensible,
but if this code is never run that way in practice,
you are now testing your cache more than your code
(and not even realistically at that, because of access patterns).
When your aim was to get an idea of the speed of running it once in everyday use, you just didn't measure that.
Also, shaving milliseconds off something that rarely gets run
probably isn't worth your time in the first place.
This isn't even a fault in the execution of a test, it's a fault in the setup of that test,
and in thinking about what that test represents.
Micro-benchmarking
See also
- http://agiletesting.blogspot.com/2005/02/performance-vs-load-vs-stress-testing.html
- http://www.soft.com/News/QTN-Online/qtnsep02.html
- http://en.wikipedia.org/wiki/System_testing
Some named attitudes to code design
Software architectural styles
Design patterns
People like acronyms
DRY
Don't Repeat Yourself.
If you find yourself writing basically the same code twice,
ask yourself why that is, and whether making code reusable is worth the tradeoff in time spent,
structuring over time, avoiding future bugs from mismatches, etc.
People have thought of more complex abstractions to avoid repeating code,
but functions have always been the most consistent way to avoid repeating code.
OO classes are sometimes good for avoiding repeated code and/or repeated state management
(though not always as good at encapsulation as we like to pretend they are).
YAGNI
You Aren't Gonna Need It basically says "don't add functionality until you actually need it"
SOLID
SOLID seems to acknowledge that object orientation gives you incentives to do some things in a messy way.
So it points out some things to avoid.
Some parts of SOLID are things we've been saying since OO was introduced, but somehow seem to keep forgetting every couple of years.
Other parts are focused on explicit contract - sometimes overbearingly so.
The acronym itself is:
- Single-responsibility principle
- A class should have a single responsibility
- one way to make sure this stays true is to update code after updating specs/design on paper
- Open–closed principle
- "Software entities ... should be open for extension, but closed for modification."
- sort of an extension of the previous point
- Liskov substitution principle
- Basically that subtypes should act like their parents
- ...enough that an instance should be replaceable with a subclass instance without breaking correctness.
- Interface segregation principle
- "Many client-specific interfaces are better than one general-purpose interface."
- Dependency inversion principle
- One should "depend upon abstractions, [not] concretions."
https://en.wikipedia.org/wiki/SOLID
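The Liskov point is the least self-explanatory of these; the classic textbook illustration (not from the list above, and deliberately simplified) is the Square/Rectangle pair: a square "is-a" rectangle mathematically, but as a subclass it breaks code written against Rectangle's contract.

```python
class Rectangle:
    def __init__(self, w, h):
        self.w, self.h = w, h
    def set_width(self, w):
        self.w = w
    def area(self):
        return self.w * self.h

class Square(Rectangle):
    def set_width(self, w):
        self.w = self.h = w   # keeps squareness, breaks the contract

def stretch(rect):
    # written against Rectangle's contract: width changes, height doesn't
    rect.set_width(10)
    return rect.area()

assert stretch(Rectangle(2, 5)) == 50
assert stretch(Square(2, 2)) == 100   # a substituted Square surprises callers
```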
GRASP
General Responsibility Assignment Software Patterns
https://en.wikipedia.org/wiki/GRASP_(object-oriented_design)
Also relevant
Black box versus white-box
Self-testing code
Self-testing code is code that includes some checks inside its own code.
This often amounts to
- assert() statements within a function, e.g.
- testing important invariants
- doing your own regression checks
- intentionally borking out earlier rather than later when a bug could have wide-reaching implications (e.g. around concurrency)
https://en.wikipedia.org/wiki/Self-testing_code
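A minimal sketch of the idea; allocate() is a hypothetical example function, and the asserts are the self-tests.

```python
def allocate(items, buckets):
    assert buckets > 0, 'invariant: need at least one bucket'
    result = [items // buckets] * buckets
    for i in range(items % buckets):
        result[i] += 1
    # postcondition check: bork out early and close to the cause of a bug
    assert sum(result) == items, 'postcondition: allocated all items'
    return result

print(allocate(10, 3))   # [4, 3, 3]
```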
Mocking, monkey patching, fixtures
Mocking, fixtures, and monkey patching are all related to how we might prepare for a test to do what it is intended for.
Context for mocking:
Even many narrow-scope tests will rely on some other code, and that other code wasn't the point of the test - and may fail the test for reasons beyond what we were testing.
The point of mocking, then, is to make that other code look as if it is present -- but to not have it do anything that isn't required for the code under test.
If that other code is extensive, and we're not really using any of it, we might create a mock object - something with all the interface of the real thing, that doesn't actually do anything.
Mock objects / stubs / fake objects all help with that.
- (When those replacements do nothing, they are often called stubs. Or fakes. But as wikipedia notes, the use of these terms is highly inconsistent)
Mocking also refers to the process of putting them in place.
- Mocking also sometimes refers to setting up or faking other parts of the environment to make it possible to do the tests you need - sometimes in the wider sense, such as installing into an isolated environment.
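A minimal sketch using the stdlib's unittest.mock; the client and its get() method are hypothetical stand-ins for 'that other code'.

```python
from unittest.mock import Mock

def fetch_and_summarize(client):
    # the code under test -- our logic, not the client's
    data = client.get('/stats')
    return max(data) if data else None

fake_client = Mock()                  # has whatever interface we poke at
fake_client.get.return_value = [3, 1, 4]

assert fetch_and_summarize(fake_client) == 4
fake_client.get.assert_called_once_with('/stats')  # interaction check
```

Note this was easy largely because the client is handed in as an argument, rather than reached through a shared environment.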
Monkey patching is any alteration done at runtime, whether that is for tests or not.
In the context of tests, a lot of monkey patching comes down to mocking done at runtime.
While not at all unique to testing, it is more common there,
...and has a sense of "maybe done later than would be the cleanest/most sensible design in regular use" - like setting an environment variable just before the test, or, in dynamic languages, some late rebinding - sometimes so that it applies to only some uses of the code.
Fixtures can be almost anything that makes tests easier.
A lot of the time that more specifically means making mocking easier, but it could be anything else that helps as well.
Mocking and fixtures are often easier to do if you have inversion of control as part of your design
- roughly because that often means dependencies are things you can hand in
- ...rather than having to become part of a shared environment by more creative means.
- so that you don't have to resort to monkey patching
That often makes for a more structured, less awkward way of doing mocking, monkey patching, and other such things.
And again, mostly for tests.