The cost of (not) testing software

As a long-time automation engineer and test-focused guy, I’ve pondered the great existential question of “how much testing is enough” for a while.

More recently, I’ve started focusing on the cost of not testing a product.

Take, for example, Figure 1:

initial_flow.png

Let’s take a second for terminology:

  • (A) Unit tests: These are tests focused on developer and maintainer productivity. These are “close to the code” tests that run in mostly simulated environments. Unit tests are a cornerstone of Agile methodology – generally speaking, you make these before your code.
  • (B) Smoke/Simulation: These are the “next layer up” – they use partial systems (e.g. your code + the module written by the person sitting next to you) to run more integration-style testing. Smokes are normally run on every compilation of the product along with unit tests. They do not require a fully deployed, functioning system – only a small group of parts.
  • (C) Acceptance/Functional/Regression:
    • Acceptance Tests: These normally comprise a large number of your tests
      in an organization. Acceptance tests prove that the specific
      component/feature is sane in the context of the fully deployed product.
      You might require these to be fully developed, executed and passing
      before a specific component or feature is merged to trunk. Acceptance
      tests prove that the feature/component works as intended (not
      merely as programmed). They should be short in execution time.

    • Functional Tests: Functional tests are “larger” and should test as
      much of the functionality of the feature/component as possible. They
      should also test with an eye towards other parts of the product and
      system (e.g. integration). Functional tests should be as expansive and
      detailed as possible. These can also be called Regression tests.
  • (D) Stress/Scalability Tests: This should be self-evident. Stress tests
    build on functional areas to push the product to its limits – how
    many files can it hold, how many connections can it withstand, etc.

  • (E) Performance Tests: Characterization of key performance stats:
    objects/second, records parsed/second, and so on.
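
To make layers (A) and (B) concrete, here is a minimal sketch in Python. The names `parse_record`, `ingest`, `RecordStore` and `FakeStore` are hypothetical, invented purely for illustration; the only point is how much real code each kind of test touches.

```python
import unittest


def parse_record(line):
    """Split a 'key=value' line into a (key, value) tuple."""
    key, _, value = line.partition("=")
    return key.strip(), value.strip()


class RecordStore:
    """Hypothetical neighboring module: keeps parsed records in memory."""

    def __init__(self):
        self.records = {}

    def save(self, key, value):
        self.records[key] = value


def ingest(line, store):
    """Parse a line and hand the result to whatever store is supplied."""
    key, value = parse_record(line)
    store.save(key, value)


class FakeStore:
    """Stand-in used by the unit test so no real store is needed."""

    def __init__(self):
        self.calls = []

    def save(self, key, value):
        self.calls.append((key, value))


class UnitTests(unittest.TestCase):
    # (A) Close to the code: one function, simulated collaborator.
    def test_parse_record(self):
        self.assertEqual(parse_record(" name = widget "), ("name", "widget"))

    def test_ingest_calls_store(self):
        fake = FakeStore()
        ingest("name=widget", fake)
        self.assertEqual(fake.calls, [("name", "widget")])


class SmokeTests(unittest.TestCase):
    # (B) Next layer up: the parser plus the real neighboring module.
    def test_ingest_into_real_store(self):
        store = RecordStore()
        ingest("name=widget", store)
        self.assertEqual(store.records, {"name": "widget"})
```

Saved to a file, these run with `python -m unittest <file>`. Neither class needs a deployed system, which is what keeps both layers cheap enough to run on every build.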

Now, I want to point out: these definitions are part Agile and part continuous integration. They may not wholly mesh with the terminology used at your workplace, or with Agile itself. I also know definitions are a holy war, but the definitions are secondary to what I want to talk about. I also deliberately left out exploratory testing.

What the hell *am* I talking about?

If you look at Figure 1, you’ll note I put “Test” (test engineering) off to the side to represent its particular ownership in this model. Unit tests (and by most measures, smoke and simulation tests) are under the ownership of the core developers.

The other test areas are the ownership of test engineering – obviously they would not exclude Dev from helping though (after all, they win as a team, and fail as a team) but Test is focused on verification that the product is as tested-as-possible before it gets into stage F – the hands of the user.

Ok, this is all fine and good – but hear me out.

This diagram is about cost – for each layer the code/feature passes through emanating from the developer, the cost to the team, and the difficulty in identification and resolution climbs.

This is why developers write a lot of unit tests and check them in so they run with every check-in. Right? You’re doing that, right?! The cost for a developer to find a bug with a unit test, and the cost to fix that bug introduced through new code/refactoring/etc, is essentially 1.

Here’s a new diagram with some straw-man costs:

cost_flow.png

Essentially, it is in your best interest, as a developer, as a team, to encourage lots and lots of tests lower in the stacks shown here. It starts with comprehensive, checked in unit tests. It continues with having a strong, repeatable testing discipline (for which I recommend test automation).

Why? Because – as you move higher in the stack, that damned bug someone checked in is hidden behind layer upon layer of code. The further from the unit level a bug gets, the more components and environment variables get involved. The more of these that get involved, the harder it is to identify and fix, and the higher the cost.
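
As a rough illustration of how quickly this adds up, here is a small sketch. The tenfold multiplier per layer and the bug counts are assumptions made up for this example; they are not the numbers from the diagram and not measured data.

```python
# Hypothetical straw-man numbers: each layer costs ten times the one
# before it. Illustrative only, not measured data.
STAGES = ["A: unit", "B: smoke", "C: acceptance/functional",
          "D: stress", "E: performance", "F: user"]
MULTIPLIER = 10


def cost_to_fix(stage_index, base_cost=1):
    """Relative cost of a bug first caught at the given stage."""
    return base_cost * MULTIPLIER ** stage_index


def total_cost(bugs_caught_per_stage):
    """Sum the relative cost for a list of bug counts, one per stage."""
    return sum(count * cost_to_fix(i)
               for i, count in enumerate(bugs_caught_per_stage))


# The same 20 bugs, caught in different places (one count per stage).
caught_early = [15, 4, 1, 0, 0, 0]
caught_late = [2, 3, 5, 5, 4, 1]

for label, counts in (("early", caught_early), ("late", caught_late)):
    print(f"{label}: {total_cost(counts)}")
# prints "early: 155" then "late: 145532"
```

Whatever multiplier you believe in, the shape is the same: the one bug that reaches the user outweighs everything caught at the unit level.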

Now, your bug (our bug) has not only wasted your time, it’s holding up a release, and test engineers’ time (albeit – this is our job) is wasted. The higher in the stack a bug gets, the higher the cost in wasted developer, release and test hours.

For example – your typo in some messaging code manages to sneak its way through to the (E) Performance level. Let’s say your performance tests take, oh, a week to run to completion. For some reason, this sneaky beast only pops up when your system’s clocks resync after 6 days of runtime.

So, 6 days into a 7 day test – ding, fries are done – the entire system poops itself. You now have to triage the crash, you have to fix it after you identify it (which is probably going to be hard – given it’s a performance test, you shut off non-essential logging) and then you need to rerun the test.

You lost 6 days. More than likely, those are 6 days of lost time you didn’t allocate for when you promised the fruits of this iteration/release to those wealthy Swedish bankers, eh?

God help you if your bug gets to level (F). This is called the “aversion level” because after a few of these sneak out, and the CEO of the company starts getting phone calls at 4am from those Swedish bankers – you’re either going to get a stern talking to, or some time in “the box” (all CEOs have a punishment box).

Your goal is to avert bugs from reaching Level F. F stands for F’ed in the literal sense.

My point isn’t just about cost. Given this tiered approach, and the need to find as many bugs as possible, you’re going to end up having some amount of code duplication between the higher levels of testing and the unit/smoke level – after all, most of the tests above that level are external-system level tests.

Some code – or logic – duplication on a higher level isn’t always bad, given the context of where the code is running. Not to mention, frequently, the code within the product may not be in the same language as the code that’s automating the tests. Duplication of unit test logic on a system-test-level is always going to happen.

Yes, you can and should reuse code as much as possible, but you can also do this through grey-box testing approaches (e.g. exposing APIs into system internals you would not normally have access to).
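
A grey-box hook might look something like the sketch below. `MessageQueue`, `debug_stats` and the "drop empty messages" rule are all hypothetical, made up for illustration; the idea is only that the system-level test asks the product for its own counters rather than duplicating the product's logic.

```python
class MessageQueue:
    """Hypothetical product component with a small public surface."""

    def __init__(self):
        self._pending = []
        self._dropped = 0

    def send(self, message):
        # Product rule assumed for this sketch: empty messages are dropped.
        if not message:
            self._dropped += 1
            return False
        self._pending.append(message)
        return True

    def debug_stats(self):
        """Grey-box hook: exposes internals that users never see.

        In a real product this would likely be gated behind a test-only
        flag or port; that gating is omitted here.
        """
        return {"pending": len(self._pending), "dropped": self._dropped}


def system_test_drops_empty_messages(queue):
    """System-level check that reuses the product's own counters
    instead of re-implementing the drop logic in the test."""
    queue.send("hello")
    queue.send("")
    return queue.debug_stats() == {"pending": 1, "dropped": 1}
```

The trade-off is that the hook becomes part of what you maintain, so it pays to keep it small and read-only.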

Also – this means you have to give your teams time to test. You need to give them ample time to automate what is reasonable, and you need to be willing to not ship a component or feature that simply isn’t ready. Much less one that hasn’t been tested.

The last thing you want is to have a bug – no matter what it is – hit level F. Your job, our job, on a software engineering team is to put out the absolute best product possible – and you can’t do that without filling in all of the magical testing boxes. You need to understand that every step away from the code raises the cost.

Letting bugs get into the hands of users is not entirely avoidable – but the risk can be mitigated, and many bugs that do end up in the hands of users are preventable. The more (and sooner) you test, the less wealth you expend, and the happier you will be. And the more profits you will reap. We like money.

  • Patrick Maupin
    Years ago (1984 to be exact), the company I was working for bought a "quality" course to give to their employees. It was complete with workbooks, videos with 70's hairstyles, etc. One of the main things that stuck with me in that course was what they called the "1-10-100 rule". (Interestingly, I just failed in trying to google this rule. Maybe they predated the net by too far.) Anyway, the crux of this rule was equivalent to one of your main points. If the developer finds and fixes something before anybody else is involved, it costs "1". At the next level of testing, it costs "10", Then "100", "1000", and if the customer gets it, it could be up in the millions.

    Fast forward a decade to 1994. The company I just started working for used exactly the same video course. The only things that had changed were that I had 10 years of experience since the last time I took the course, and the hairstyles in the videos looked even more ridiculous. When the instructor asked if there were any questions after going over the 1-10-100 rule, I raised my hand and asked "What does (the company) do to recognize and reward employees who find and fix things at the 1 or 10 level?" He seemed taken aback, and mumbled something about God and Country and the Right Thing To Do. I persisted: "Obviously, for (the company), it's good when their employees take this to heart, but how does the company prove that they like this to the employees?" More blathering from the instructor including "Look, OBVIOUSLY you will be rewarded better if you find and fix things earlier." to which I replied "Really? In my experience, companies recognize and reward employees who fix things at AT LEAST the 10000 level. Those are the employees whose names are known by the CEO of the company, and they get a pat on the back, and money, for putting out fires at the customer, even if they are the SOB who screwed it up in the first place."

    Since that time, I've spent most of my time working for chip companies. They seem to do slightly better than at least a few of the other places I have worked at the whole "nail it early" thing, partly because of the cost involved in making a mask set, and partly because of the opportunity cost of having to wait many weeks for a chip re-spin. Unfortunately, there is still an "assumed competence" that people are allowed, which is often far too generous, and lets really simple things slip through, simply because a very few developers are truly incompetent, and relying on them to do any coding, much less to unit test their own code, is a recipe for disaster. (This is a different category of people than the junior developers who just need a bit of mentoring. These are people who have managed to careen through a career with a thin veneer of being at the right place at the right time.) I wish I knew of a better answer to this than simply "document all the screwups as thoroughly as possible, and maybe one day they will fire the bastard."
  • Ali
    Thanks for your article. We use unit testing here to keep ourselves (developers) happy, but our "real" tests are functional/acceptance.

    It might just be the industry that we are in, but when buying our software, most clients do their own acceptance testing on it for their own audit processes. We actually ship a copy of the forms for doing the acceptance testing with our software (just in case anyone gets any bright ideas and does something fancy!).

    Additionally, I find nothing beats long-term production use as a test tool, but this takes years.
  • Oh, for sure customers have acceptance testing - in a perfect world, your company/team would be more than open to handing the customer any tests which are pure black-box tests (automated and manual) to showcase the testing you have performed, and to assist them in their own testing.
  • Seems like such common sense; why don't more people get it?

    One thing missing from this post - before you get to Step A, how about testing the spec/requirements to flush out ambiguities and inconsistencies? How much does it cost to read a document and test it without a line of code being written?

    Seems to be some good stuff here so adding it to my feeds
  • As for testing the spec and the requirement - see this post:

    http://jessenoller.com/2008/08/12/steve-yegge-h...
  • It's frightening how uncommon common sense is.
  • It may be really uncool to say it, but I'm getting back into the old-fashioned "play with it until it breaks" testing strategy, with one special exception. I also use a whole bunch of super-fast doctests.

    I spent a lot of time trying to be a good TDDer, writing stuff with twill and selenium, making elaborate automated walkthroughs of my web-app, but those tests constantly required rewrites, and they ran really slow. So I'm scrapping that approach, at least for a while.

    So now I try to write lots of tiny functions that I can easily test with doctests, and then (right now, anyway), I skip over all that middle stuff of integration testing and spend a fair amount of time just manually poking around.

    So far, it works pretty well.

    I think of stress-testing as something completely different, because that kind of test usually reveals flaws in my design, rather than just the execution of my design. What I mean is that my unit test verifies that a function does what I want it to do. But when I run a stress test, I find the algorithms that have some unexpected polynomial runtimes, rather than O(1) like I thought.
  • Aaron Oliver
    I'm starting to feel the test re-writing pain myself. I think this overhead is a secret, second wave of opposition to testing. Much more insidious than the initial "Wha? I have to write more code?" barrier.
  • See, I too have problems "adopting" test driven development - I fully believe in unit tests, and some amount of test-first development, but doing full blown "design the tests before the code" doesn't mentally mesh with me (as a developer) given I want to prototype prior to writing tests just so I know where I'm basically going.

    That said - I firmly believe in a strong test suite of both units and functional tests once you've got something that's slightly firmer than quick sand.
  • nadnerb
    If you are writing a prototype (or spike) you shouldn't write tests anyway. You are writing throwaway code so I don't see your point.

    I still spike a solution to see that it is feasible and also use TDD. I think you are (as does just about everybody) missing the point of TDD. You don't write the complete test then the code. You use the test to drive out the design. You might, say, write a test for a game that has a score and it can be zero. When you run this test it should fail. You then implement the code and the test passes. You might then write the next test saying the game can only have a maximum score of 100... and so on.

    You use the test to drive out the design. The tests that you have in the end are a bonus. Also you will not suffer from YAGNI.

    I don't always TDD (testing legacy code for example), but I have found it useful when you understand how and why it should be used.
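
The test-first loop described in the last comment above might look roughly like this in Python. `Game`, `add_points` and `MAX_SCORE` are hypothetical names chosen for this sketch; the comments mark the order in which each piece would have been written.

```python
import unittest


class Game:
    """Hypothetical class grown one failing test at a time."""

    MAX_SCORE = 100

    def __init__(self):
        # Step 2: written only after test_new_game_starts_at_zero failed.
        self.score = 0

    def add_points(self, points):
        # Step 4: written only after test_score_is_capped failed.
        self.score = min(self.score + points, self.MAX_SCORE)


class GameTests(unittest.TestCase):
    def test_new_game_starts_at_zero(self):
        # Step 1: the first test, written before Game existed.
        self.assertEqual(Game().score, 0)

    def test_score_is_capped(self):
        # Step 3: the next test, which drives out add_points and the cap.
        game = Game()
        game.add_points(150)
        self.assertEqual(game.score, 100)
```

Each test is small enough to fail for exactly one reason, and the tests that remain at the end are the by-product rather than the goal.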