The cost of (not) testing software

September 17th, 2008 § 12 comments

As a long-time automation-engineer/test-focused guy, I’ve pondered the great existential question of “how much testing is enough” for a while.

More recently, I’ve started focus­ing on the cost of not test­ing a product.

Take for exam­ple, Fig­ure 1:

initial_flow.png

Let’s take a sec­ond for terminology:

  • (A) Unit tests: These are tests focused on devel­oper and main­tainer pro­duc­tiv­ity. These are “close to the code” tests that run in mostly sim­u­lated envi­ron­ments. Unit tests are a cor­ner­stone of Agile method­ol­ogy — gen­er­ally speak­ing, you make these before your code.
  • (B) Smoke/Simulation: These are the “next layer up”: they use partial systems (e.g. your code plus your neighbor’s module) to run more integration-style testing. Smokes are normally run on every compilation of the product, along with the unit tests. They do not require a fully deployed, functioning system, only a small group of parts.
  • (C) Acceptance/Functional/Regression:
    • Acceptance Tests: These normally comprise a large number of the tests
      in an organization. Acceptance tests prove that the specific
      component/feature is sane in the context of the fully deployed product.
      You might require these to be fully developed, executed and passing
      before a specific component or feature is merged to trunk. Acceptance
      tests prove that the feature/component works as intended (not merely
      as programmed). They should be short in execution time.

    • Functional Tests: Functional tests are “larger” and should test as
      much of the functionality of the feature/component as possible. They
      should also test with an eye towards other parts of the product and
      system (e.g. integration). Functional tests should be as expansive and
      detailed as possible. These can also be called Regression tests.
  • (D) Stress/Scalability Tests: This should be self-evident. Stress tests
    build on functional areas to push the product to its limits: how
    many files can it hold, how many connections can it withstand, etc.

  • (E) Performance Tests: Characterization of key performance stats:
    objects/second, records parsed/second, and so on.

Now, I want to point out: these definitions are part agile and part continuous integration. They may not wholly mesh with the terminology used at your workplace, or with agile. I also know definitions are a holy war, but they are secondary to what I want to talk about. I also deliberately left out exploratory testing.

What the hell *am* I talk­ing about?

If you look at Figure 1, you’ll note I put “Test” (test engineering) off to the side to represent its particular ownership in this model. Unit tests (and by most measures, smoke and simulation tests) are under the ownership of the core developers.

The other test areas are owned by test engineering. Obviously that does not exclude Dev from helping (after all, they win as a team and fail as a team), but Test is focused on verifying that the product is as tested as possible before it gets to stage F: the hands of the user.

Ok, this is all fine and good — but hear me out.

This dia­gram is about cost — for each layer the code/feature passes through ema­nat­ing from the devel­oper, the cost to the team, and the dif­fi­culty in iden­ti­fi­ca­tion and res­o­lu­tion climbs.

This is why developers write a lot of unit tests and check them in so they run with every check-in. Right? You’re doing that, right?! The cost for a developer to find a bug with a unit test, and the cost to fix that bug introduced through new code/refactoring/etc., is essentially 1.

Here’s a new dia­gram with some straw-man costs:

cost_flow.png

Essen­tially, it is in your best inter­est, as a devel­oper, as a team, to encour­age lots and lots of tests lower in the stacks shown here. It starts with com­pre­hen­sive, checked in unit tests. It con­tin­ues with hav­ing a strong, repeat­able test­ing dis­ci­pline (for which I rec­om­mend test automation).

Why? Because — as you move higher in the stack, that damned bug some­one checked in is hid­den behind layer upon layer of code. The fur­ther from the unit level a bug gets, the more com­po­nents and envi­ron­ment vari­ables get involved. The more of these that get involved, the harder it is to iden­tify and fix, and the higher the cost.
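To make the climb concrete, here is a minimal sketch. The multipliers are made-up straw-man numbers, not the figures from the diagram and certainly not measurements, and the names `STAGE_COST` and `cost_of_bugs` are purely illustrative.

```python
# Straw-man relative cost of finding and fixing one bug at each level.
# These numbers are assumptions for illustration only, not measured data.
STAGE_COST = {
    "A": 1,      # unit tests
    "B": 2,      # smoke/simulation
    "C": 10,     # acceptance/functional/regression
    "D": 50,     # stress/scalability
    "E": 100,    # performance
    "F": 1000,   # the user
}


def cost_of_bugs(found_at):
    """Sum the relative cost of bugs, given a {stage: bug_count} mapping."""
    return sum(STAGE_COST[stage] * count for stage, count in found_at.items())


# The same ten bugs, caught early versus caught late.
caught_early = {"A": 8, "B": 2}
caught_late = {"C": 5, "E": 4, "F": 1}

print(cost_of_bugs(caught_early))  # 8*1 + 2*2 = 12
print(cost_of_bugs(caught_late))   # 5*10 + 4*100 + 1*1000 = 1450
```

Whatever numbers you plug in for your own team, the shape stays the same as long as each level costs more than the one below it: the same bugs get more expensive the later they are found.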

Now your bug (our bug) has not only wasted your time, it’s holding up a release and wasting test engineers’ time (albeit, this is our job). The higher in the stack a bug gets, the higher the cost in wasted man, release and test hours.

For exam­ple — your typo in some mes­sag­ing code man­ages to sneak its way through to the (E) Per­for­mance level. Let’s say your per­for­mance tests take, oh, a week to run to com­ple­tion. For some rea­son, this sneaky beast only pops up when your system’s clocks resync after 6 days of runtime.

So, 6 days into a 7-day test (ding, fries are done) the entire system poops itself. You now have to triage the crash, identify the bug (which is probably going to be hard: given it’s a performance test, you shut off non-essential logging), fix it, and then re-run the test.

You lost 6 days. More than likely, those are 6 days you didn’t allocate for when you promised the fruits of this iteration/release to those wealthy Swedish bankers, eh?
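For what it’s worth, here is a hedged sketch of how a clock-resync bug like this might be caught at level (A) instead. If the messaging code takes its clock as a parameter, a unit test can simulate the resync in milliseconds rather than waiting six days. Everything here (`FakeClock`, `MessageStamper`, the idea that the clock steps backwards) is hypothetical; it only illustrates the technique of injecting a simulated clock.

```python
class FakeClock:
    """Test double standing in for the system clock (hypothetical)."""

    def __init__(self, now=0.0):
        self.now = now

    def __call__(self):
        return self.now


class MessageStamper:
    """Hypothetical messaging helper that assigns timestamps to messages.

    It guarantees timestamps never go backwards, even if the clock does.
    """

    def __init__(self, clock):
        self._clock = clock
        self._last = float("-inf")

    def stamp(self, payload):
        # Clamp to the last timestamp so a clock resync cannot reorder messages.
        self._last = max(self._last, self._clock())
        return (self._last, payload)


def test_clock_resync_does_not_reorder_messages():
    clock = FakeClock(now=6 * 24 * 3600.0)  # pretend six days have passed
    stamper = MessageStamper(clock)
    first = stamper.stamp("before resync")
    clock.now -= 5.0  # simulate the clock stepping back during a resync
    second = stamper.stamp("after resync")
    assert second[0] >= first[0]


test_clock_resync_does_not_reorder_messages()
```

The design choice that matters is passing the clock in rather than reading the system time directly; that is what lets the test fast-forward to day six.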

God help you if your bug gets to level (F). This is called the “aversion level” because after a few of these sneak out, and the CEO of the company starts getting phone calls at 4am from those Swedish bankers, you’re either going to get a stern talking-to, or some time in “the box” (all CEOs have a punishment box).

Your goal is to avert bugs from reach­ing Level F. F stands for F’ed in the lit­eral sense.

My point isn’t just about cost. Given this tiered approach, and the need to find as many bugs as pos­si­ble, you’re going to end up hav­ing some amount of code dupli­ca­tion between the higher lev­els of test­ing and the unit/smoke level — after all, most of the tests above that level are external-system level tests.

Some code — or logic — dupli­ca­tion on a higher level isn’t always bad, given the con­text of where the code is run­ning. Not to men­tion, fre­quently, the code within the prod­uct may not be in the same lan­guage as the code that’s automat­ing the tests. Dupli­ca­tion of unit test logic on a system-test-level is always going to happen.

Yes, you can and should reuse code as much as pos­si­ble, but you can also do this through grey-box test­ing approaches (e.g. expos­ing APIs into sys­tem inter­nals you would not nor­mally have access to).
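A minimal sketch of what such a grey-box hook could look like, assuming a made-up `ConnectionPool` component; the `debug_stats` method is the hypothetical API exposed for tests and is not taken from any real product.

```python
class ConnectionPool:
    """Hypothetical product component with a small pool of connections."""

    def __init__(self, size):
        self._free = list(range(size))
        self._in_use = set()

    def acquire(self):
        conn = self._free.pop()
        self._in_use.add(conn)
        return conn

    def release(self, conn):
        self._in_use.remove(conn)
        self._free.append(conn)

    def debug_stats(self):
        """Grey-box hook: read-only view of internals, meant for tests only."""
        return {"free": len(self._free), "in_use": len(self._in_use)}


# A higher-level test can now check for leaks without duplicating pool logic.
pool = ConnectionPool(size=3)
conn = pool.acquire()
assert pool.debug_stats() == {"free": 2, "in_use": 1}
pool.release(conn)
assert pool.debug_stats() == {"free": 3, "in_use": 0}
```

The test asks the product for its own view of its internals instead of re-implementing the bookkeeping on the test side.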

Also — this means you have to give your teams time to test. You need to give them ample time to auto­mate what is rea­son­able, and you need to be will­ing to not ship a com­po­nent or fea­ture that sim­ply isn’t ready. Much less one that hasn’t been tested.

The last thing you want is to have a bug, no matter what it is, hit level F. Your job, our job, on a software engineering team is to put out the absolute best product possible, and you can’t do that without filling in all of the magical testing boxes. You need to understand that every step a bug takes away from the code raises the cost.

Letting bugs get into the hands of users is not entirely avoidable, but the risk can be mitigated, and many of the bugs that do end up in the hands of users are avoidable. The more (and sooner) you test, the less wealth you expend, and the happier you will be. And the more profits you will reap. We like money.

  • http://blog.tplus1.com Matt Wil­son

    It may be really uncool to say it, but I’m get­ting back into the old-fashioned “play with it until it breaks” test­ing strat­egy, with one spe­cial excep­tion. I also use a whole bunch of super-fast doctests.

    I spent a lot of time try­ing to be a good TDDer, writ­ing stuff with twill and sele­nium, mak­ing elab­o­rate auto­mated walk­throughs of my web-app, but those tests con­stantly required rewrites, and they ran really slow. So I’m scrap­ping that approach, at least for a while.

    So now I try to write lots of tiny func­tions that I can eas­ily test with doctests, and then (right now, any­way), I skip over all that mid­dle stuff of inte­gra­tion test­ing and spend a fair amount of time just man­u­ally pok­ing around.

    So far, it works pretty well.

    I think of stress-testing as some­thing com­pletely dif­fer­ent, because that kind of test usu­ally reveals flaws in my design, rather than just the exe­cu­tion of my design. What I mean is that my unit test ver­i­fies that a func­tion does what I want it to do. But when I run a stress test, I find the algo­rithms that have some unex­pected poly­no­mial run­times, rather than O(1) like I thought.

  • http://expectedresults.blogspot.com phil kirkham

    Seems like such com­mon sense why don’t more peo­ple get it ?

    One thing miss­ing from this post — before you get to Step A, how about test­ing the spec/requirements to flush out ambi­gu­i­ties and incon­sis­ten­cies ? How much does it cost to read a doc­u­ment and test it with­out a line of code being written ?

    Seems to be some good stuff here so adding it to my feeds

  • jnoller

    See, I too have prob­lems “adopt­ing” test dri­ven devel­op­ment — I fully believe in unit tests, and some amount of test-first devel­op­ment, but doing full blown “design the tests before the code” doesn’t men­tally mesh with me (as a devel­oper) given I want to pro­to­type prior to writ­ing tests just so I know where I’m basi­cally going.

    That said — I firmly believe in a strong test suite of both units and func­tional tests once you’ve got some­thing that’s slightly firmer than quick sand.

  • jnoller

    It’s frightening how uncommon common sense is.

  • jnoller

    As for test­ing the spec and the require­ment — see this post:

    http://jessenoller.com/2008/08/12/steve-yegge-h

  • http://unpythonic.blogspot.com Ali

    Thanks for your arti­cle. We use unit test­ing here to keep our­selves (devel­op­ers) happy, but our “real” tests are functional/acceptance.

    It might just be the indus­try that we are in, but when buy­ing our soft­ware, most clients do their own accep­tance test­ing on it for their own audit processes. We actu­ally ship a copy of the forms for doing the accep­tance test­ing with our soft­ware (just in case any­one gets any bright ideas and does some­thing fancy!).

    Addi­tion­ally, I find noth­ing beats long-term pro­duc­tion use as a test tool, but this takes years.

  • jnoller

    Oh, for sure customers have acceptance testing. In a perfect world, your company/team would be more than open to handing the customer any tests which are pure black-box tests (automated and manual) to showcase the testing you have performed, and to assist them in their own testing.

  • nad­nerb

    If you are writ­ing a pro­to­type (or spike) you shouldn’t write tests any­way. You are writ­ing throw­away code so I don’t see your point.

    I still spike a solution to see that it is feasible and also use TDD. I think you are (as does just about everybody) missing the point of TDD. You don’t write the complete test then the code. You use the test to drive out the design. You might, say, write a test for a game that has a score and it can be zero. When you run this test it should fail. You then implement the code and the test passes. You might then write the next test saying the game can only have a maximum score of 100… and so on.

    You use the test to drive out the design. The tests that you have in the end are a bonus. Also you will not suf­fer from YAGNI.

    I don’t always do TDD (testing legacy code for example), but I have found it useful when you understand how and why it should be used.

  • Patrick Maupin

    Years ago (1984 to be exact), the company I was working for bought a “quality” course to give to their employees. It was complete with workbooks, videos with ’70s hairstyles, etc. One of the main things that stuck with me in that course was what they called the “1-10-100 rule”. (Interestingly, I just failed in trying to google this rule. Maybe they predated the net by too far.) Anyway, the crux of this rule was equivalent to one of your main points. If the developer finds and fixes something before anybody else is involved, it costs “1”. At the next level of testing, it costs “10”, then “100”, “1000”, and if the customer gets it, it could be up in the millions.

    Fast forward a decade to 1994. The company I just started working for used exactly the same video course. The only things that had changed were that I had 10 years of experience since the last time I took the course, and the hairstyles in the videos looked even more ridiculous. When the instructor asked if there were any questions after going over the 1-10-100 rule, I raised my hand and asked “What does (the company) do to recognize and reward employees who find and fix things at the 1 or 10 level?” He seemed taken aback, and mumbled something about God and Country and the Right Thing To Do. I persisted: “Obviously, for (the company), it’s good when their employees take this to heart, but how does the company prove that they like this to the employees?” More blathering from the instructor including “Look, OBVIOUSLY you will be rewarded better if you find and fix things earlier.” to which I replied “Really? In my experience, companies recognize and reward employees who fix things at AT LEAST the 10000 level. Those are the employees whose names are known by the CEO of the company, and they get a pat on the back, and money, for putting out fires at the customer, even if they are the SOB who screwed it up in the first place.”

    Since that time, I’ve spent most of my time working for chip companies. They seem to do slightly better than at least a few of the other places I have worked at the whole “nail it early” thing, partly because of the cost involved in making a mask set, and partly because of the opportunity cost of having to wait many weeks for a chip re-spin. Unfortunately, there is still an “assumed competence” that people are allowed, which is often far too generous, and lets really simple things slip through, simply because a very few developers are truly incompetent, and relying on them to do any coding, much less to unit test their own code, is a recipe for disaster. (This is a different category of people than the junior developers who just need a bit of mentoring. These are people who have managed to careen through a career with a thin veneer of being at the right place at the right time.) I wish I knew of a better answer to this than simply “document all the screwups as thoroughly as possible, and maybe one day they will fire the bastard.”

  • Aaron Oliver

    I’m start­ing to feel the test re-writing pain myself. I think this over­head is a secret, sec­ond wave of oppo­si­tion to test­ing. Much more insid­i­ous than the ini­tial “Wha? I have to write more code?” barrier.

  • Pingback: The Early Bird Catches The Worm « Software Quality Matters
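For readers following the TDD exchange in the comments, the game-score example nadnerb describes could look roughly like this after two test-first cycles. This is only a sketch under assumptions: the `Game` class, its `add_points` method and the clamping behaviour are all invented here to illustrate the cycle, not taken from anyone’s real code.

```python
import unittest


class Game:
    """Hypothetical Game class, grown one failing test at a time."""

    MAX_SCORE = 100

    def __init__(self):
        self.score = 0

    def add_points(self, points):
        # Added only after the second test demanded a cap on the score.
        self.score = min(self.score + points, self.MAX_SCORE)


class GameTest(unittest.TestCase):
    def test_new_game_has_zero_score(self):
        # Cycle 1: written first, and failing until Game.score existed.
        self.assertEqual(Game().score, 0)

    def test_score_is_capped_at_100(self):
        # Cycle 2: written next, and failing until add_points clamped the value.
        game = Game()
        game.add_points(250)
        self.assertEqual(game.score, 100)


# Run the two tests and keep the result object around for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(GameTest)
result = unittest.TextTestRunner().run(suite)
```

Each test was (in this imagined history) written before the code that makes it pass, which is the “drive out the design” part; the finished suite is the bonus.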


You are currently reading The cost of (not) testing software at jessenoller.com.
