A Peer to Peer test distribution system (TestBot)?

by jesse


Peer-to-peer systems aren't new. Things like BitTorrent, AllMyData Tahoe, and others have used the model for file sharing and storage for some time. Still others use a distributed-worker model to parcel out work - clients register with the system, and the system hands out chunks of work without factoring in client speed and the like (e.g. distributed.net).

What if you combined the two: something like BitTorrent, which does peer selection and allocation intelligently, with a large distributed-worker architecture, to manage large-scale test execution?

Let's think about a common problem in test engineering. Start with a simple version - you're designing a load test app that needs to generate large amounts of load against a target system.

In a normal lab test environment this is "easy" - you make sure you have a lab with a bunch of clients, all on the same LAN, and you run a test client on each of them to generate load against the system under test.

Now, let's complicate the problem: You don't have enough "same same" test clients. You may have some "close enough" but dang - they're not on the same subnet, or you don't know about them. Not having enough clients in a lab is more common than you'd think.

So how do you make a test that can take advantage of those test clients, factor in their "differences" and still make a relevant test?

Next problem. You have an application you want to run a battery of tests against. You don't have a dedicated client, but you have the possibility of "borrowing time" from some idle machines to run those tests.

The "idle machines" all have different amounts of RAM, different CPUs, and sit at varying network distances from the system under test. You need to (1) find them, (2) figure out which of the available test clients are the most desirable, and (3) work out the main differences between the clients so you can factor them into the results.

You simply want the more capable clients to get more of the "important" tests, and the less capable ones to run the lesser tests. Just to add to it, you want them to possibly be capable of being slaved to a given test to help it along (i.e. a performance or generalized load generation test).
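To make that concrete, here's a minimal sketch of one way to hand out tests so the stronger clients end up with more of the important ones. Everything in it is an assumption for illustration: the `assign_tests` name, the idea that each client already has a single positive "capability" number, and the greedy rule itself (give the next most important test to whichever client is least loaded relative to its capability).

```python
def assign_tests(tests, clients):
    """Greedy allocation: important tests go out first, capable clients fill up first.

    tests:   list of (test_name, importance) tuples; higher importance = more important
    clients: dict of client_name -> capability score (hypothetical, must be > 0)
    """
    load = {c: 0 for c in clients}
    plan = {c: [] for c in clients}
    for name, _importance in sorted(tests, key=lambda t: t[1], reverse=True):
        # Cost of giving this client one more test, relative to what it can handle.
        # Ties go to the more capable client.
        best = min(clients, key=lambda c: ((load[c] + 1) / clients[c], -clients[c]))
        plan[best].append(name)
        load[best] += 1
    return plan
```

With a client rated 4.0 and another rated 1.0, the first four tests by importance land on the stronger client and the least important one lands on the weaker client. A real system would want something smarter than a single number, but the shape is the same.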

Getting back to the original thought about peer-to-peer systems, I started considering the possibility of applying the peer to peer paradigm/weighted selection to test distribution.

You have a series of clients who volunteer to participate in the swarm. The client responsible for submitting the job (a test) to the swarm would use a Weighted Voting algorithm to rank, sort and choose the "most desirable" clients to distribute a test to.

Each client would respond to a submitted request with various attributes (weights) based on OS type, number of hops from both the client submitting the job and the system under test, amount of RAM, network speed and so on.
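Here's a rough sketch of what ranking those responses might look like. It uses a plain weighted sum as a stand-in for the weighted voting idea, not a real voting protocol. The `PeerOffer` fields, the `WEIGHTS` values and the normalization caps are all made up for illustration, and it only looks at hops to the system under test to keep things short.

```python
from dataclasses import dataclass


@dataclass
class PeerOffer:
    """Attributes a volunteering peer reports back (hypothetical fields)."""
    name: str
    os_type: str
    hops_to_target: int
    ram_gb: float
    link_mbps: float


# Hypothetical weights: how much each attribute matters to the submitter.
WEIGHTS = {"os_match": 3.0, "hops": 2.0, "ram": 1.0, "link": 1.5}


def score(offer, wanted_os, max_hops=30, max_ram_gb=64.0, max_link_mbps=1000.0):
    """Weighted sum of attributes, each normalized to the range 0..1."""
    os_match = 1.0 if offer.os_type == wanted_os else 0.0
    closeness = 1.0 - min(offer.hops_to_target, max_hops) / max_hops
    ram = min(offer.ram_gb, max_ram_gb) / max_ram_gb
    link = min(offer.link_mbps, max_link_mbps) / max_link_mbps
    return (WEIGHTS["os_match"] * os_match
            + WEIGHTS["hops"] * closeness
            + WEIGHTS["ram"] * ram
            + WEIGHTS["link"] * link)


def choose_peers(offers, wanted_os, count):
    """Rank the offers and keep the `count` most desirable peers."""
    ranked = sorted(offers, key=lambda o: score(o, wanted_os), reverse=True)
    return ranked[:count]
```

The submitting client would collect one `PeerOffer` per responding peer and call `choose_peers(offers, "linux", 2)` or similar. Tuning the weights per test type (latency-sensitive tests care about hops, memory-hungry ones about RAM) is where the interesting work would be.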

In the case of performance based tests, you would be able to factor these attributes into the results of the test (e.g. latency) - in other tests, you only need to gather the results.

Of course, the concept of "use idle machines to do something" isn't exactly new - things like distributed.net, SETI@home and others do this all the time, as I mentioned before.

Then you have things like buildbot - buildbot uses a dedicated (or partially dedicated) pool of machines to compile a target and execute the local unit tests against the compiled thing.

Why not make the two go hand in hand and add intelligent weighted selection for test distribution? Let's go back to the localized example. You have a continuous build system which compiles and runs the unit tests. It then looks at a pool of test-peers who have volunteered to be part of the test-swarm and fires off the functional/regression tests (as needed, it can deploy locally or remotely to a test-server).

The buildbot reports the steps as compile: pass, units: pass, and then regression: pending - the buildbot passes out the various tests to the swarm which can be executed asynchronously until all tests are completed (or error'd at which point they're passed back to another client in the swarm).
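A minimal sketch of that hand-out-and-retry loop, under some loud assumptions: `run_suite` and `run_on_peer` are hypothetical names (this is not buildbot's API), peers are picked round-robin instead of by weight, and everything runs sequentially to keep the example short, where a real swarm would run tests asynchronously.

```python
from collections import deque


def run_suite(tests, peers, run_on_peer, max_attempts=3):
    """Hand tests to peers; a test whose peer errors out goes back in the queue.

    run_on_peer(peer, test) is a caller-supplied callable (hypothetical) that
    returns True/False for pass/fail and raises if the peer itself breaks.
    """
    if not peers:
        raise ValueError("no peers in the swarm")
    pending = deque((test, 0, None) for test in tests)  # (test, attempts, last peer)
    results = {}
    turn = 0
    while pending:
        test, attempts, last_peer = pending.popleft()
        # Prefer a peer other than the one that just errored on this test.
        candidates = [p for p in peers if p != last_peer] or list(peers)
        peer = candidates[turn % len(candidates)]
        turn += 1
        try:
            passed = run_on_peer(peer, test)
            results[test] = ("pass" if passed else "fail", peer)
        except Exception:
            if attempts + 1 >= max_attempts:
                results[test] = ("error", peer)  # give up on this test
            else:
                pending.append((test, attempts + 1, peer))  # pass it to another peer
    return results
```

The regression step stays "pending" until `run_suite` returns, at which point every test has a pass, fail or error status plus the peer that produced it.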

The nice thing is that this works on both a local LAN, and a globally distributed series of test swarm participants. All you do is weight in favor of the closer clients. (oh, and your application has to be available on the network).

Over time, peers participating in the swarm can be "pushed out" - meaning they have error'd out too many times, have been caught "lying" and so on. The swarm can adapt - clients can come and go as long as a given passed-out suite eventually completes. If a client fails or drops, the test is simply passed out again.
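One simple way to picture the "pushed out" part is a strike counter per peer. This is only a sketch: the `SwarmRoster` class, the three-strike limit and the rule that a good result forgives one strike are all assumptions, and a peer caught "lying" is treated the same as one that error'd.

```python
class SwarmRoster:
    """Tracks which peers are still welcome in the swarm (hypothetical helper)."""

    def __init__(self, max_strikes=3):
        self.max_strikes = max_strikes
        self.strikes = {}  # peer name -> current strike count

    def join(self, peer):
        self.strikes.setdefault(peer, 0)

    def leave(self, peer):
        # Peers can come and go freely; leaving is not an error.
        self.strikes.pop(peer, None)

    def report(self, peer, ok):
        """Record a result; ok=False covers both errors and bad (lying) results."""
        if peer not in self.strikes:
            return
        if ok:
            self.strikes[peer] = max(0, self.strikes[peer] - 1)
        else:
            self.strikes[peer] += 1
            if self.strikes[peer] >= self.max_strikes:
                del self.strikes[peer]  # pushed out of the swarm

    def active(self):
        return sorted(self.strikes)
```

The submitter would only hand tests to `active()` peers, so a misbehaving client stops getting work without anyone having to intervene.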

On a localized (meaning, internal-to-your-company) level, this means you can make any client on your network a peer on the system, and the weight-based selection system still applies and you can use any type of system on your LAN - desktops, servers, highly intelligent coffee makers - anything with a network drop.

Additionally, you could point test slaves at a cluster of installed systems under test - individual nodes in a web farm, or your application installed on various web hosts. Or a larger system installed in various data centers. This removes the bottleneck of only a single system being tested at once (but requires a lot of intelligence at the management level).

It's an idea. Something of a disconnected series of thoughts - maybe it's silly. I like the idea of being able to intelligently leverage a series of test peers distributed anywhere and everywhere. Having a peer-to-peer testing system would be neat-o.

It's a zombie army used for testing -Anon :)

edit: Yes, a loosely coupled, highly distributed load test could be construed as a DDoS... But that's semantics, right?

References/Interesting Reading: