The cost of (not) testing software

by jesse

As a long-time automation-engineer/test-focused guy, I've pondered the great existential question of how much testing is "enough" for a while. More recently, I've started focusing on the cost of not testing a product.

Take for example, Figure 1:


Let's take a second for terminology:

  • (A) Unit tests: These are tests focused on developer and maintainer productivity. These are "close to the code" tests that run in mostly simulated environments. Unit tests are a cornerstone of Agile methodology - generally speaking, you write these before your code.
  • (B) Smoke/Simulation: These are the "next layer up" - they use partial systems (e.g. your code plus the module of the person sitting next to you) to run more integration-style testing. Smokes are normally run on every compilation of the product, along with unit tests. They do not require a fully deployed, functioning system - only a small group of parts.
  • (C) Acceptance/Functional/Regression:
    • Acceptance Test: These normally comprise a large share of an organization's tests. Acceptance tests prove that a specific component/feature is sane in the context of the fully deployed product - you might require these to be fully developed, executed and passing before a specific component or feature is merged to trunk. Acceptance tests prove that the feature/component works as intended (not merely as programmed). They should be short in execution time.

    • Functional Tests: Functional tests are "larger" and should test as much of the functionality of the feature/component as possible. They should also test with an eye towards other parts of the product and system (e.g. integration). Functional tests should be as expansive and detailed as possible. These can also be called regression tests.
  • (D) Stress/Scalability Tests: This should be self-evident. Stress tests build on functional areas to push the product to its limits - how many files can it hold, how many connections can it withstand, etc.

  • (E) Performance Tests: Characterization of key performance statistics: objects per second, records parsed per second, and so on.

Now, I want to point out: these definitions are part Agile and part continuous-integration. They won't wholly mesh with the terminology used at your workplace, or with strict Agile usage. I also know definitions are a holy war, but the definitions are secondary to what I want to talk about. I've also deliberately left out exploratory testing.

What the hell *am* I talking about?

If you look at Figure 1, you'll note I put "Test" (test engineering) off to the side to represent its particular ownership in this model. Unit tests (and, by most measures, smoke and simulation tests) are under the ownership of the core developers.

The other test areas are owned by test engineering - obviously that wouldn't exclude Dev from helping (after all, they win as a team and fail as a team), but Test is focused on verifying that the product is as tested as possible before it reaches stage F - the hands of the user.

Ok, this is all fine and good - but hear me out.

This diagram is about cost - with each layer the code/feature passes through as it travels out from the developer, the cost to the team and the difficulty of identifying and resolving a problem climb.

This is why developers write a lot of unit tests and check them in so they run with every check-in. Right? You're doing that, right?! The cost for a developer to find a bug with a unit test, and the cost to fix that bug introduced through new code/refactoring/etc., is essentially 1.
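To make that cost of 1 concrete, here's a minimal sketch of the kind of checked-in unit test I mean - the parse_record function and its contract are hypothetical, purely for illustration:

```python
import unittest

def parse_record(line):
    """Hypothetical production code: parse a 'key = value' record."""
    key, _, value = line.partition("=")
    return key.strip(), value.strip()

class ParseRecordTest(unittest.TestCase):
    """Runs on every check-in; a regression here costs minutes, not days."""

    def test_basic_record(self):
        self.assertEqual(parse_record("user = jesse"), ("user", "jesse"))

    def test_record_without_value(self):
        # If a refactor breaks this contract, the failure surfaces now -
        # not six days into a performance run.
        self.assertEqual(parse_record("user"), ("user", ""))
```

Run it with python -m unittest (or nosetests) on every build; the point is that the failure shows up at the moment the offending change lands.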

Here's a new diagram with some straw-man costs:


Essentially, it is in your best interest, as a developer and as a team, to encourage lots and lots of tests lower in the stack shown here. It starts with comprehensive, checked-in unit tests. It continues with a strong, repeatable testing discipline (for which I recommend test automation).

Why? Because - as you move higher in the stack, that damned bug someone checked in is hidden behind layer upon layer of code. The further from the unit level a bug gets, the more components and environment variables get involved. The more of these that get involved, the harder it is to identify and fix, and the higher the cost.

Now, your bug (our bug) has not only wasted your time, it's holding up a release, and test engineers' time (albeit - this is our job) is wasted. The higher in the stack a bug gets, the higher the cost in wasted man-hours, release hours and test hours.

For example - your typo in some messaging code manages to sneak its way through to the (E) Performance level. Let's say your performance tests take, oh, a week to run to completion. For some reason, this sneaky beast only pops up when your system's clocks resync after 6 days of runtime.

So, 6 days into a 7-day test - ding, fries are done - the entire system poops itself. You now have to triage the crash, you have to fix it after you identify it (which is probably going to be hard - given it's a performance test, you shut off non-essential logging) and then you need to re-run the test.

You lost 6 days. More than likely, those are 6 days you didn't allocate for when you promised the fruits of this iteration/release to those wealthy Swedish bankers, eh?

God help you if your bug gets to level (F). This is called the "aversion level" because after a few of these sneak out, and the CEO of the company starts getting phone calls at 4am from those Swedish bankers - you're either going to get a stern talking-to, or some time in "the box" (all CEOs have a punishment box).

Your goal is to avert bugs from reaching Level F. F stands for F'ed in the literal sense.

My point isn't just about cost. Given this tiered approach, and the need to find as many bugs as possible, you're going to end up having some amount of code duplication between the higher levels of testing and the unit/smoke level - after all, most of the tests above that level are external-system level tests.

Some code - or logic - duplication at a higher level isn't always bad, given the context of where the code is running. Not to mention that, frequently, the code within the product may not be in the same language as the code automating the tests. Some duplication of unit-test logic at the system-test level is always going to happen.

Yes, you can and should reuse code as much as possible, but you can also do this through grey-box testing approaches (e.g. exposing APIs into system internals you would not normally have access to).
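As a sketch of the grey-box idea - all the names here are invented for illustration - the product might expose a test-only hook so a system-level test can assert on internal state instead of re-deriving it:

```python
class MessageQueue:
    """Toy stand-in for a product component (name and API are invented).

    The deployment flag decides whether the grey-box test hook is live.
    """

    def __init__(self, enable_test_api=False):
        self._items = []
        self._enable_test_api = enable_test_api

    def publish(self, msg):
        self._items.append(msg)

    def _test_internal_depth(self):
        # Grey-box hook: lets a system-level test assert on internal
        # state directly instead of duplicating the product's logic.
        if not self._enable_test_api:
            raise RuntimeError("test API disabled in this deployment")
        return len(self._items)

# A system test deploys with the hook enabled and checks internals directly.
queue = MessageQueue(enable_test_api=True)
queue.publish("hello")
```

The same binary ships with the hook off, so production deployments don't expose internals.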

Also - this means you have to give your teams time to test. You need to give them ample time to automate what is reasonable, and you need to be willing to not ship a component or feature that simply isn't ready. Much less one that hasn't been tested.

The last thing you want is to have a bug - no matter what it is - hit level F. Your job - our job - on a software engineering team is to put out the absolute best product possible, and you can't do that without filling in all of the magical testing boxes. You need to understand that the cost climbs with every step a bug gets away from the code.

Letting bugs get into the hands of users is not entirely avoidable - but the risk can be mitigated, and many of the bugs that do end up in users' hands are preventable. The more (and sooner) you test, the less wealth you expend, and the happier you will be. And the more profits you will reap. We like money.

Welcome to TestButler, a rudimentary test case management app.

by jesse

... Or, learn to laugh at my total inability to do web design, and my lack of django-fu. So, following up (albeit slowly) on my "Decent test case tracking/registration" post, I've actually managed to cobble together a Google Code project and a rudimentary Django application.

Right now, it's in sub-prototype stages. I've done a semi-production deployment internally to get feedback/usage information and suggestions. All the code is checked in and now I need to begin cleaning things up from my rather random "pooping of code".

Not only am I learning Django while I am doing this - I'm catching up on 6+ years of changes in the web development community. The last time I was involved in any sort of web-work was when I worked for Allaire/Macromedia - and even then that was primarily on the back end to ColdFusion, not end-user interfaces.

Writing user interfaces above the level of a command-line utility is not exactly my strong suit. But hell, Django made it wicked easy to start hacking things together. I had the rough backend done in less than 2 hours, which let me spend the next few days pondering schemas, mucking with many-to-many fields and playing with other Django plugins.

If you go and look at the Google Code site, you'll see I've started fleshing out the bits needed to outline the path of the project, and the general reasoning behind it.

Not only do I want feedback - I want to let anyone who wants to join, join. Contribute ideas, tell me I'm doing it wrong. I already know my Django code is messy (I'm working on it) - but most of all I want to help build something useful for the testing community, so if something doesn't mesh, I want to know.

Now, I just need to read my copy of James Bennett's "Practical Django Projects" book. And make a vector image of a cartoony Roomba, or find a better image of a robot butler.

A Peer to Peer test distribution system (TestBot)?

by jesse

Peer-to-peer systems aren't anything new. Things like BitTorrent, AllMyData Tahoe, and others have been using them for file storage for some time. Still others use distributed-worker methodologies to do work parceling - clients register with the system, and the system hands out chunks of work without factoring in client speed, etc. (e.g. SETI@home).

What if you combined the two - something like BitTorrent, which does peer selection and allocation intelligently, with a large distributed architecture to manage large-scale test execution?

Let's think about a common problem with test engineering. Start with a simple version - you're designing a load test app, this app needs to generate large amounts of load against a target system.

In a normal test environment in a lab - this is "easy" - you simply make sure you have a lab with a bunch of clients, all on the same LAN and you run a test client from all of them that generate load against the system under test.

Now, let's complicate the problem: You don't have enough "same same" test clients. You may have some "close enough" but dang - they're not on the same subnet, or you don't know about them. Not having enough clients in a lab is more common than you'd think.

So how do you make a test that can take advantage of those test clients, factor in their "differences" and still make a relevant test?

Next problem. You have an application you want to run a battery of tests against. You don't have a dedicated client, but you have the possibility of "borrowing time" from some idle machines to run those tests.

The "idle machines" all have different RAM and CPUs, and are varying distances from the system under test on the network. You need to 1) find them, 2) figure out which of the available test clients is the most desirable, and 3) be able to figure out the main differences between the clients to factor them into results.

You simply want the more capable clients to get more of the "important" tests, and the less capable ones to run the lesser tests. Just to add to it, you want them to possibly be capable of being slaved to a given test to help it along (i.e. a performance or generalized load generation test).

Getting back to the original thought about peer-to-peer systems, I started considering the possibility of applying the peer to peer paradigm/weighted selection to test distribution.

You have a series of clients who volunteer to participate in the swarm. The client responsible for submitting the job (a test) to the swarm would use a Weighted Voting algorithm to rank, sort and choose the "most desirable" clients to distribute a test to.

Each client would respond to a submitted request with various attributes (weights) based on OS type, the number of hops between the client submitting the job and the system under test, the amount of RAM, network speed and so on.

In the case of performance based tests, you would be able to factor these attributes into the results of the test (e.g. latency) - in other tests, you only need to gather the results.
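A toy sketch of what that weighted ranking could look like - the attribute names and weights are invented for illustration, not taken from any real peer-to-peer framework:

```python
# Hypothetical attributes and weights. The negative weight on hop count
# penalizes peers that are far from the system under test.
WEIGHTS = {"ram_gb": 1.0, "network_mbps": 2.0, "hops_to_target": -3.0}

def score(client):
    """Weighted desirability score for a peer; higher is better."""
    return sum(weight * client.get(attr, 0) for attr, weight in WEIGHTS.items())

def pick_clients(swarm, n):
    """Rank the swarm by score and return the n most desirable peers."""
    return sorted(swarm, key=score, reverse=True)[:n]

swarm = [
    {"name": "fast-lan", "ram_gb": 16, "network_mbps": 1000, "hops_to_target": 1},
    {"name": "slow-wan", "ram_gb": 4,  "network_mbps": 10,   "hops_to_target": 12},
    {"name": "mid-box",  "ram_gb": 8,  "network_mbps": 100,  "hops_to_target": 3},
]
best = pick_clients(swarm, 2)  # the two peers the submitter would choose
```

The same scores could be recorded alongside performance results, so the differences between clients can be factored back in later.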

Of course, the concept of a "use idle machines to do something" isn't exactly new - things like, seti@home and others do this all the time as I mentioned before.

Then you have things like buildbot - buildbot uses a dedicated (or partially dedicated) pool of machines to compile a target and execute the local unit tests against the compiled thing.

Why not make the two go hand in hand and make an intelligent weighted selection for test distribution? Let's go back to the localized example. You have a continuous build system which compiles and run units. It then looks at a pool of test-peers who have volunteered to be part of the test-swarm and fires off the functional/regression tests (as needed, it can locally deploy or remotely deploy to a test-server).

The buildbot reports the steps as compile: pass, units: pass, and then regression: pending - the buildbot passes out the various tests to the swarm, where they can be executed asynchronously until all tests are completed (or error'd, at which point they're passed back out to another client in the swarm).

The nice thing is that this works on both a local LAN, and a globally distributed series of test swarm participants. All you do is weight in favor of the closer clients. (oh, and your application has to be available on the network).

Over time, peers participating in the swarm can be "pushed out" - meaning they have error'd out too many times, have been caught "lying" and so on. The swarm can adapt - clients can come and go as long as a given passed-out suite eventually completes. If a client fails/drops, the test is simply re-passed out.
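That re-pass-out loop might be sketched like this - dispatch, run, and the failure threshold are hypothetical stand-ins for the real swarm protocol:

```python
from collections import Counter, deque

def dispatch(tests, peers, run, max_failures=3):
    """Pass tests out to swarm peers, re-queuing any test whose peer fails.

    `run(peer, test)` is a hypothetical stand-in for remote execution;
    it returns True on success. A peer that errors out `max_failures`
    times is pushed out of the swarm entirely.
    """
    queue = deque(tests)
    pool = deque(peers)
    failures = Counter()
    results = {}
    while queue and pool:
        test = queue.popleft()
        pool.rotate(-1)  # crude round-robin; real selection would be weighted
        peer = pool[0]
        if run(peer, test):
            results[test] = peer
        else:
            failures[peer] += 1
            queue.append(test)  # re-pass the test out to another peer
            if failures[peer] >= max_failures:
                pool.remove(peer)  # peer has error'd out too many times
    return results

# Toy run: "peer-b" always fails, so its tests get re-passed out and it
# is eventually pushed out of the swarm.
results = dispatch(["t1", "t2", "t3"], ["peer-a", "peer-b"],
                   run=lambda peer, test: peer != "peer-b")
```

The point is that the swarm makes progress as long as one healthy peer remains, without any test being lost.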

On a localized (meaning, internal-to-your-company) level, this means you can make any client on your network a peer on the system, and the weight-based selection system still applies and you can use any type of system on your LAN - desktops, servers, highly intelligent coffee makers - anything with a network drop.

Additionally, you could point test slaves at a cluster of installed systems-under-test - individual nodes in a web farm, or your application installed on various web hosts. Or a larger system installed in various data centers. This removes the bottleneck of testing a single system at a time (but requires a lot of intelligence at the managerial level).

It's an idea. Something of a disconnected series of thoughts - maybe it's silly. I like the idea of being able to intelligently leverage a series of test peers distributed anywhere and everywhere. Having a peer-to-peer testing system would be neat-o.

It's a zombie army used for testing -Anon :)

edit: Yes, a loosely coupled, highly distributed load test could be construed as a DDoS... But that's semantics, right?

YAML question, and a nose-testconfig thought

by jesse

So, I find myself using more and more YAML lately via the pyyaml package. When I was writing nose-testconfig my "preferred" format was/is YAML. Now, an interesting thing I've noticed about all of the test configurations I am developing/working with is that they have a lot of "shared" attributes (that change infrequently) and a good number of things which change all the time.

This is the perfect spot for something like a dictionary merge. If you have a test config like this:

    application:
      capability: 1
      url: http://foo
    subsystem:
      max_users: 20

For each of your configuration files, you might only override something like max_users. For cases like this, it makes sense to load the template document (the file above), then load the second document and merge the two dicts, with the second overriding values in the first (note that a plain dict.update() clobbers whole nested sections, so you really want a recursive merge).
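A recursive merge along those lines might look like this - deep_merge is a hypothetical helper (neither dict nor PyYAML provides one out of the box), with the configs written out as plain dicts, exactly what yaml.safe_load() would return for the files above:

```python
def deep_merge(base, override):
    """Recursively merge `override` into a copy of `base`.

    Values from `override` win; nested dicts are merged key-by-key
    instead of being clobbered wholesale.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# The parent template and a child config as yaml.safe_load() would
# return them for the files above.
parent = {"application": {"capability": 1, "url": "http://foo"},
          "subsystem": {"max_users": 20}}
child = {"subsystem": {"max_users": 50}}

config = deep_merge(parent, child)
# config inherits application.* from the parent; only max_users is overridden.
```

Because the merge works on plain dicts, nothing here cares whether the documents started life as YAML, JSON or anything else.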

This is where my mental dilemma comes in. I could in theory, add a custom !!tag to the yaml which would take a /path/to/file.yaml and load it first, then load the second document or I could do it within nose-testconfig where you might run:

nosetests . --tc-file=myconfig.yaml --tc-rootconfig=parent.yaml

And then I would jump through the hoops (with a merge, probably) within the plugin. The problem with that is that I'm worried about coupling the plugin too closely to YAML.
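For what it's worth, the custom-tag option could be sketched with PyYAML along these lines - the !include tag name and IncludeLoader class are made up for this sketch:

```python
import os
import yaml  # PyYAML

class IncludeLoader(yaml.SafeLoader):
    """SafeLoader subclass, so the custom tag doesn't leak into other loads."""

def _include(loader, node):
    # Resolve the included path relative to the file doing the including,
    # when the stream came from open() and therefore has a name.
    base = os.path.dirname(getattr(loader, "name", "")) or "."
    path = loader.construct_scalar(node)
    with open(os.path.join(base, path)) as handle:
        return yaml.load(handle, IncludeLoader)

IncludeLoader.add_constructor("!include", _include)
```

A child config could then start with `defaults: !include parent.yaml` and the plugin would merge the result - at the price of exactly the YAML coupling I'm worried about.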

Now, the plugin already supports overriding multiple values. However, this doesn't scale if you have to override a lot of them.

The most common reason I've found for this so far is adding new parameters and values to the YAML files - not all child configurations need to override/define the new values, instead they could just inherit from the parent.

So, the question is - how would you do this so that the solution:

  • Doesn't sacrifice clarity/readability
  • Scales
  • Doesn't require the root document to be in the same location, or a hard-coded path in the child document
  • Doesn't couple the loader (nose-testconfig) tightly to the file format

Right now, it's copy, paste, and edit every configuration file I know about.