The cost of (not) testing software

by jesse


As a long-time automation-engineer/test-focused guy, I've pondered the great existential question of "how much testing is enough?" for a while. More recently, I've started focusing on the cost of not testing a product.

Take, for example, Figure 1:

[Figure 1: the tiered test flow, from unit tests out to the user (initial_flow.png)]

Let's take a second for terminology:

  • (A) Unit tests: Tests focused on developer and maintainer productivity. These are "close to the code" tests that run in mostly simulated environments. Unit tests are a cornerstone of Agile methodology - generally speaking, you write them before your code. (A minimal sketch of one follows this list.)
  • (B) Smoke/Simulation: These are the "next layer up" - they use partial systems (e.g. your code plus the module from the guy next to you) to run more integration-style testing. Smokes are normally run on every compilation of the product, along with the unit tests. They do not require a fully deployed, functioning system - only a small group of parts.
  • (C) Acceptance/Functional/Regression:
    • Acceptance Tests: These normally make up a large share of the tests in an organization. Acceptance tests prove that a specific component/feature is sane in the context of the fully deployed product - you might require them to be fully developed, executed, and passing before a component or feature is merged to trunk. They verify that the feature/component works as intended (not merely as programmed), and they should be short in execution time.

    • Functional Tests: Functional tests are "larger" and should exercise as much of the feature/component's functionality as possible; they should also be written with an eye toward other parts of the product and system (e.g. integration). Functional tests should be as expansive and detailed as possible. These are also sometimes called regression tests.
  • (D) Stress/Scalability Tests: This should be self-evident. Stress tests build on the functional areas to push the product to its limits - how many files can it hold, how many connections can it withstand, and so on.

  • (E) Performance Tests: Characterization of key performance stats: objects per second, records parsed per second, and so on.
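
To make level (A) concrete, here's a minimal sketch of a unit test in Python with pytest - the parse_record() helper and its behavior are invented purely for illustration:

```python
# Hypothetical example: a level (A) unit test for an imagined
# parse_record() helper. It runs in milliseconds, needs no deployed
# system, and pinpoints the exact function when it fails.

def parse_record(line):
    """Split a 'key=value' record into a (key, value) tuple."""
    key, _, value = line.partition("=")
    return key.strip(), value.strip()


def test_parse_record_basic():
    assert parse_record("user = jesse") == ("user", "jesse")


def test_parse_record_missing_value():
    # Edge case the developer can catch before the code ever leaves
    # their machine.
    assert parse_record("user") == ("user", "")
```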

Now, I want to point out: these definitions are part agile and part continuous integration. They won't wholly mesh with the terminology used at your workplace, or in the agile literature. I know definitions are a holy war, but they're secondary to what I want to talk about. I've also deliberately left out exploratory testing.

What the hell *am* I talking about?

If you look at Figure 1, you'll note I put "Test" (test engineering) off to the side to represent their particular ownership in this model. Unit tests (and, by most measures, smoke and simulation tests) are under the ownership of the core developers.

The other test areas are owned by test engineering - obviously that doesn't exclude Dev from helping (after all, they win as a team and fail as a team), but Test is focused on verifying that the product is as tested as possible before it reaches stage F: the hands of the user.

Ok, this is all fine and good - but hear me out.

This diagram is about cost - with each layer the code/feature passes through on its way out from the developer, the cost to the team and the difficulty of identifying and resolving a problem climb.

This is why developers write a lot of unit tests and check them in so they run with every check-in. Right? You're doing that, right?! The cost for a developer to find a bug with a unit test - and to fix a bug introduced through new code, refactoring, etc. - is essentially 1.
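
If you want the "runs with every check-in" part without relying on memory, here's a hedged sketch of one way to wire it up: a git pre-commit hook that refuses the commit when the unit tests fail. The pytest invocation and the tests/unit path are illustrative assumptions, not a prescription.

```python
#!/usr/bin/env python3
# Sketch of a .git/hooks/pre-commit script: run the unit tests and
# abort the commit if any of them fail. Assumes pytest is installed
# and the unit tests live under tests/unit - adapt to your own layout.

import subprocess
import sys

result = subprocess.run(["python", "-m", "pytest", "-q", "tests/unit"])
if result.returncode != 0:
    print("Unit tests failed - commit aborted.")
    sys.exit(1)
```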

Here's a new diagram with some straw-man costs:

[Figure 2: the same flow annotated with straw-man costs (cost_flow.png)]

Essentially, it is in your best interest - as a developer and as a team - to encourage lots and lots of tests lower in the stack shown here. It starts with comprehensive, checked-in unit tests. It continues with a strong, repeatable testing discipline (for which I recommend test automation).

Why? Because as you move higher in the stack, that damned bug someone checked in is hidden behind layer upon layer of code. The further from the unit level a bug gets, the more components and environmental variables get involved. The more of those that get involved, the harder the bug is to identify and fix, and the higher the cost.
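
To make that concrete with an invented example (the function and the failure story are hypothetical):

```python
# Purely illustrative: the same small typo, seen from two layers.

def batch_count(total_records, per_batch):
    # Bug: floor division silently drops the final partial batch;
    # it should round up instead.
    return total_records // per_batch


def test_batch_count_includes_partial_batch():
    # At the unit level (A), this fails immediately and the failure
    # names the exact function - cost is roughly 1.
    assert batch_count(10, 3) == 4

# At levels (C) through (E), the same typo surfaces days later as
# "some records never arrived" in a fully deployed system, with
# nothing pointing back at batch_count().
```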

Now, your bug (our bug) hasn't just wasted your time - it's holding up a release, and test engineers' time (albeit, this is our job) is wasted too. The higher in the stack a bug gets, the higher the cost in wasted man-hours, release hours, and test hours.

For example - your typo in some messaging code manages to sneak its way through to the (E) Performance level. Let's say your performance tests take, oh, a week to run to completion. For some reason, this sneaky beast only pops up when your system's clocks resync after 6 days of runtime.

So, 6 days into a 7-day test - ding, fries are done - the entire system poops itself. You now have to triage the crash, fix it once you've identified it (which is probably going to be hard - given it's a performance test, you shut off non-essential logging), and then re-run the test.

You lost 6 days at a minimum - plus triage, the fix, and another multi-day run. More than likely, that's time you didn't allocate for when you promised the fruits of this iteration/release to those wealthy Swedish bankers, eh?

God help you if your bug gets to level (F). This is called the "aversion level" because after a few of these sneak out and the CEO of the company starts getting phone calls at 4am from those Swedish bankers, you're either going to get a stern talking-to or some time in "the box" (all CEOs have a punishment box).

Your goal is to avert bugs from reaching Level F. F stands for F'ed in the literal sense.

My point isn't just about cost. Given this tiered approach and the need to find as many bugs as possible, you're going to end up with some amount of code duplication between the higher levels of testing and the unit/smoke level - after all, most of the tests above that level are external, system-level tests.

Some code - or logic - duplication at a higher level isn't always bad, given the context of where the code is running. Not to mention that, frequently, the code within the product isn't even in the same language as the code automating the tests. Duplication of unit-test logic at the system-test level is always going to happen.

Yes, you can and should reuse code as much as possible, but you can also rein in the duplication through grey-box testing approaches (e.g. exposing APIs into system internals you would not normally have access to).
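
As a rough sketch of what that can look like - the MessageQueue class and its inspection hook below are invented for illustration, not taken from any particular product:

```python
# Grey-box sketch: the system under test exposes a small, test-only
# hook so a higher-level test can assert on internal state instead of
# re-implementing unit-level checks.

class MessageQueue:
    def __init__(self):
        self._pending = []

    def enqueue(self, msg):
        self._pending.append(msg)

    def drain(self):
        sent, self._pending = list(self._pending), []
        return sent

    # Deliberately exposed for tests; not part of the production API.
    def debug_pending_count(self):
        return len(self._pending)


def test_enqueue_is_visible_before_drain():
    q = MessageQueue()
    q.enqueue("hello")
    # A pure black-box test could only observe drain(); the grey-box
    # hook lets this test see inside without duplicating unit tests.
    assert q.debug_pending_count() == 1
    assert q.drain() == ["hello"]
    assert q.debug_pending_count() == 0
```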

Also - this means you have to give your teams time to test. You need to give them ample time to automate what is reasonable, and you need to be willing to not ship a component or feature that simply isn't ready - much less one that hasn't been tested.

The last thing you want is to have a bug - no matter what it is - hit level F. Your job - our job - on a software engineering team is to put out the absolute best product possible, and you can't do that without filling in all of the magical testing boxes. You need to understand that with every step a bug takes away from the code, the cost climbs.

Letting bugs get into the hands of users is never completely avoidable - but the risk can be mitigated, and many of the bugs that do end up in users' hands are preventable. The more (and sooner) you test, the less wealth you expend, and the happier you will be. And the more profits you will reap. We like money.