A Peer to Peer test distribution system (TestBot)?

Peer-to-peer systems aren’t anything new. BitTorrent, AllMyData Tahoe, and others have been using the model for file distribution and storage for some time.

Still others use the distributed-worker methodologies to do work parceling – they register with the system, and the system hands out chunks of work without factoring in client speed/etc (e.g. distributed.net).

What if you combined the two – you used something like Bittorrent which does peer-selection and allocation intelligently, with a large distributed architecture to manage large scale test execution?

Let’s think about a common problem in test engineering. Start with a simple version: you’re designing a load test app that needs to generate large amounts of load against a target system.

In a normal lab test environment this is “easy”: you make sure you have a lab with a bunch of clients, all on the same LAN, and you run a test client on each of them to generate load against the system under test.

Now, let’s complicate the problem: You don’t have enough “same same” test clients. You may have some “close enough” but dang – they’re not on the same subnet, or you don’t know about them. Not having enough clients in a lab is more common than you’d think.

So how do you make a test that can take advantage of those test clients, factor in their “differences” and still make a relevant test?

Next problem. You have an application you want to run a battery of tests against. You don’t have a dedicated client, but you have the possibility of “borrowing time” from some idle machines to run those tests.

The “idle machines” all have different amounts of RAM, different CPUs, and sit at varying network distances from the system under test. You need to (1) find them, (2) figure out which of the available test clients are the most desirable, and (3) work out the main differences between the clients so you can factor them into the results.

You simply want the more capable clients to get more of the “important” tests, and the less capable ones to run the lesser tests. Just to add to it, you want them to possibly be capable of being slaved to a given test to help it along (i.e. a performance or generalized load generation test).

Getting back to the original thought about peer-to-peer systems, I started considering the possibility of applying the peer to peer paradigm/weighted selection to test distribution.

You have a series of clients who volunteer to participate in the swarm. The client responsible for submitting the job (a test) to the swarm would use a Weighted Voting algorithm to rank, sort and choose the “most desirable” clients to distribute a test to.

Each client would respond to a submitted request with various attributes (weights) based on OS type, number of hops from both the client submitting the job and the system under test, amount of RAM, network speed and so on.
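
Here is a minimal sketch of what that ranking could look like. Everything in it is an assumption made for illustration: the PeerReport fields, the weight table, the 16 GB and 1 Gbps caps, and the simple weighted sum standing in for a real weighted voting algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerReport:
    """Attributes a peer sends back when asked to bid on a job (hypothetical)."""
    name: str
    os_type: str
    hops_to_target: int   # network hops to the system under test
    ram_gb: float
    link_mbps: float


# Hypothetical per-attribute weights. A test could supply its own table to
# say which attributes it cares about most.
DEFAULT_WEIGHTS = {"hops": 3.0, "ram": 1.0, "link": 2.0}


def score(peer, weights=DEFAULT_WEIGHTS, wanted_os=None):
    """Return a desirability score; higher is better."""
    if wanted_os is not None and peer.os_type != wanted_os:
        return 0.0  # wrong platform: never selected
    closeness = 1.0 / (1 + peer.hops_to_target)   # 1.0 when zero hops away
    ram = min(peer.ram_gb / 16.0, 1.0)            # capped at an assumed 16 GB
    link = min(peer.link_mbps / 1000.0, 1.0)      # capped at an assumed 1 Gbps
    return (weights["hops"] * closeness
            + weights["ram"] * ram
            + weights["link"] * link)


def choose_peers(reports, count, **kwargs):
    """Rank the reports and keep the `count` most desirable eligible peers."""
    scored = [(score(p, **kwargs), p) for p in reports]
    eligible = [(s, p) for s, p in scored if s > 0]
    eligible.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in eligible[:count]]
```

Calling choose_peers(reports, 2, wanted_os="linux") would keep the two best Linux peers. Passing a different weights table lets an individual test say that it cares more about, say, RAM than network distance.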

In the case of performance based tests, you would be able to factor these attributes into the results of the test (e.g. latency) – in other tests, you only need to gather the results.
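
As a rough illustration of factoring a client attribute into a result, the sketch below subtracts each peer's own network round trip from its raw latency samples before merging them. The function names, the input shape, and the idea that a peer can report a baseline round-trip time to the target are all assumptions. Simple subtraction is a crude correction, not a proper statistical model.

```python
from statistics import median


def adjusted_latencies(samples_ms, baseline_rtt_ms):
    """Subtract a peer's own network round trip from each raw sample.

    Values are clamped at zero so a noisy baseline cannot produce a
    negative latency.
    """
    return [max(s - baseline_rtt_ms, 0.0) for s in samples_ms]


def combine(results):
    """Merge per-peer results into one median, after adjustment.

    `results` maps peer name -> (raw samples in ms, baseline RTT in ms).
    """
    merged = []
    for samples, baseline in results.values():
        merged.extend(adjusted_latencies(samples, baseline))
    return median(merged)
```

With this kind of adjustment, a peer twelve hops away and a peer on the same LAN can contribute to the same latency figure without the distant peer's network path dominating the result.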

Of course, the concept of a “use idle machines to do something” isn’t exactly new – things like distributed.net, seti@home and others do this all the time as I mentioned before.

Then you have things like buildbot – buildbot uses a dedicated (or partially dedicated) pool of machines to compile a target and execute the local unit tests against the compiled thing.

Why not make the two go hand in hand and make an intelligent weighted selection for test distribution? Let’s go back to the localized example. You have a continuous build system which compiles and runs the unit tests. It then looks at a pool of test-peers who have volunteered to be part of the test-swarm and fires off the functional/regression tests (as needed, it can locally deploy or remotely deploy to a test-server).

The buildbot reports the steps as compile: pass, units: pass, and then regression: pending – the buildbot passes out the various tests to the swarm, which can execute them asynchronously until all tests are completed (or error’d, at which point they’re passed on to another client in the swarm).
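
The bookkeeping for "pass it to another client on error" can be sketched as below. This is a sequential simulation under stated assumptions, not how buildbot works: run_on is a hypothetical callable that returns normally when a peer finishes a test and raises when the peer fails or drops, and max_attempts is an invented limit. A real swarm would run the tests concurrently.

```python
from collections import deque


def run_suite(tests, peers, run_on, max_attempts=3):
    """Hand tests out to peers, re-queueing any that error out.

    Returns (completed, abandoned), where completed maps test -> peer.
    """
    if not peers:
        raise ValueError("the swarm has no peers")
    pending = deque((t, 0, None) for t in tests)  # (test, attempts, last peer)
    completed, abandoned = {}, []
    turn = 0
    while pending:
        test, attempts, last_peer = pending.popleft()
        # Round-robin over peers, skipping the one that just failed this test
        # unless it is the only peer available.
        candidates = [p for p in peers if p != last_peer] or list(peers)
        peer = candidates[turn % len(candidates)]
        turn += 1
        try:
            run_on(peer, test)
            completed[test] = peer
        except Exception:
            if attempts + 1 >= max_attempts:
                abandoned.append(test)  # give up after too many errors
            else:
                pending.append((test, attempts + 1, peer))
    return completed, abandoned
```

Note that an error here means the peer could not run the test at all. A test that runs and reports a failure is still a completed test.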

The nice thing is that this works on both a local LAN, and a globally distributed series of test swarm participants. All you do is weight in favor of the closer clients. (oh, and your application has to be available on the network).

Over time, peers participating in the swarm can be “pushed out” – meaning they have error’d out too many times, have been caught “lying” and so on. The swarm can adapt – clients can come and go as long as a given passed-out suite eventually completes. If a client fails/drops, the test is simply handed out again.
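
A minimal sketch of the "pushed out" bookkeeping might look like this. The class name, the max_errors threshold, and the choice to count consecutive errors (so that one clean run resets the count) are all assumptions. How a peer would actually be caught "lying" is left open.

```python
class SwarmMembership:
    """Track peer failures and evict peers that exceed a limit (hypothetical)."""

    def __init__(self, max_errors=3):
        self.max_errors = max_errors
        self.errors = {}        # peer name -> consecutive error count
        self.evicted = set()

    def join(self, peer):
        # Evicted peers are not allowed back in.
        if peer not in self.evicted:
            self.errors.setdefault(peer, 0)

    def record_success(self, peer):
        if peer in self.errors:
            self.errors[peer] = 0   # a clean run resets the count

    def record_error(self, peer):
        if peer not in self.errors:
            return
        self.errors[peer] += 1
        if self.errors[peer] >= self.max_errors:
            self.evict(peer)

    def evict(self, peer):
        """Remove a peer at once, e.g. when caught reporting false attributes."""
        self.errors.pop(peer, None)
        self.evicted.add(peer)

    def active(self):
        """Return the names of peers still in the swarm, sorted."""
        return sorted(self.errors)
```

The submitting client would call record_error or record_success as results come back, and only consider peers returned by active() when handing out the next test.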

On a localized (meaning, internal-to-your-company) level, this means you can make any client on your network a peer on the system, and the weight-based selection system still applies and you can use any type of system on your LAN – desktops, servers, highly intelligent coffee makers – anything with a network drop.

Additionally, you could point test slaves at a cluster of installed systems under test – individual nodes in a web farm, or your application installed on various web hosts. Or a larger system installed in various data centers. This removes the bottleneck of a single system being tested at once (but requires a lot of intelligence on the managerial level).

It’s an idea. Something of a disconnected series of thoughts – maybe it’s silly. I like the idea of being able to intelligently leverage a series of test peers distributed anywhere and everywhere. Having a peer-to-peer testing system would be neat-o.

It’s a zombie army used for testing -Anon :)

edit: Yes, a loosely coupled, highly distributed load test could be construed as a DDoS… But that’s semantics, right?

  • We have something like that in Resolver Systems - we call it "distributed build".

    It builds your working copy, copies it to the machines you specify in the LAN and publishes the list of tests it wants to be run.

    A central server assigns tests to machines and gathers the result back. It's not clever, but it does its job. I've always wanted to make it more smart so that new machines could be added transparently when they are idle, but it was always too much work...
  • Yeah - I'm taking the distributed build thing a bit farther. I want a (globally) disparate series of test clients available to test any application, where those clients might test the app "locally" - in the case of desktop apps for instance - or they might execute a test passed to them which uses the client as CPU to bind to a larger test "in the swarm".

    I've written three different "manager passes tests to clients and gets results back" systems - those are easy, and relatively dumb.

    I would much rather have the slaves register with the server and provide the weight number which indicates the "desirability" of the client - a simple way of doing this would be to pass an object across the wire containing the .network_hops_from_target, .ram, .cpu and so on attributes, and do the calculations on manager-side.

    Doing it the simple way has the additional benefit of allowing a test-in-the-queue to dictate what attribute it's more interested in. Of course, if you calculate the weight numbers the clients pass back correctly - you don't need the tests to dictate what it "wants".

    Argh, it's still a jumble of ideas.
  • terry peppers
    Jesse -

    You mention Skoll. And I'm sure you've already seen Adam Porter and Atif Memon's talk from GTAC last year. Very interesting concepts.

    http://www.youtube.com/watch?v=OiE9zRPD6ps
  • Yup. I wanted to go to GTAC last year (and this year too) but didn't have the chance. Skoll is interesting for a variety of reasons, and there are some parallels to what I am talking about in the distributed-slave-sense. I'm gleaning what seems to be the more meaty parts of Skoll from the papers and publications.