Google Testing Blog: "There, but for the grace of testing, go I"

by jesse

The Google Testing Blog has a post up right now by James Whittaker called "There, but for the grace of testing, go I" - it's a good read, and a pertinent one for any of you/us who feel strongly about quality.

Even though I've spent more time than not on "the other side" of the table (Developer, noun - "focuses on making software (ergo, bugs)"), I find that James' words ring pretty loudly for me still, especially his part on risk analysis:

I am thankful that the vast majority of bugs that affect entire user populations are generally nuisance-class issues. These are typically bugs concerning awkward UI elements or the occasional misfiring of some feature or another where workarounds and alternatives will suffice until a minor update can be made. Serious bugs tend to have a more localized effect. True recall class bugs, serious failures that affect large populations of users, are far less common. Testers can take advantage of the fact that not all bugs are equally damaging and prioritize their effort to find bugs in the order of their seriousness. The futility of finding every bug can be replaced by an investigation based on risk.

I'd recommend James' post amongst the others there on that blog - it reminded me of an old rant of mine "The cost of (not) testing software". Anyone in the business of making something is also in the business of making bugs. It's important for us to keep that in mind when we deal with our day to day job - and when we think about our customers. It's also important for us to keep that in mind when criticizing or dragging any person or company or code through the muck.

Generating re-creatable random files...

by jesse

... And the case of obsessive optimization. A little while ago, I posted a small snippet of code that was designed to generate data files of a given size, based on a seed, very quickly (article here). The goals of this code were the following:

  • Generate large amounts of semi-random data quickly
  • Data generation can not use /dev/urandom or other system entropy buckets. These are too slow, and having hundreds of threads pulling from these buckets is a bad idea. Oh - and it needs to work on Windows.
  • The data must never be sync'ed to disk: when you're generating a large data set, on the scale of hundreds of millions of files, storing it on disk sucks, and the disk becomes the bottleneck.
  • Creation of the files must be at least 1 gigabit/second - this means a single thread passing one of these generators to say, a pycurl handle could "in theory" hit line speed: the generator can not be the bottleneck
  • The data in these files must be able to be recreated at any time, provided you have the seed.
  • Setting a seed in python's random() has side-effect issues, and can not be used. Besides, lots of random calls are expensive.
  • I need the ability to swap out the data source, I use a lorem file here, but a different type will be needed later.
  • The data source should only be parsed once for the import (singleton, ho!)
  • The name, and the file data must be unique - they must hash differently (to prevent de-dupers from, well, de duping them)

I am revisiting this code as we found out the original version could only generate file data at around 500 megabits/second. This is much too slow for my tastes, as I might as well be reading it from disk. We can make it faster.

After cleaning things up, removing some overly complex logic (and several moments of "what the hell was I thinking"), I came up with this:

import collections
import os

LOREM = os.path.join(os.path.dirname(__file__), "datafiles", "lorem.txt")
WORDS = open(LOREM, "r").read().split()

def chunker(size, seed, chunksize=1000):
    word_q = collections.deque(WORDS)
    seed_q = collections.deque(int(i) for i in str(seed))
    current_size = size
    while current_size > 0:
        # Rotate word_q by the next seed digit so that small files are
        # unique per-seed, yet reproducible given the same seed.
        word_q.rotate(seed_q[0])
        seed_q.rotate(1)
        data = ' '.join(word_q)
        if chunksize > current_size:
            chunksize = current_size
        chunksize = (yield data[0:chunksize]) or chunksize
        current_size -= chunksize

class SyntheticFile(object):
    """ File-like object backed by the ``chunker`` function. Allows the
    construction of an object which can be passed to something like a pycurl
    handle streaming data to a server """
    def __init__(self, size, seed):
        """
        **size**: integer, bytes
        **seed**: integer
        """
        self.chunker = None
        self.size = size
        self.seed = seed

    def write(self, data):
        """ unsupported, throw an error if called """
        raise IOError('write() not supported')

    def read(self, readsize):
        """ Support read() - **readsize** is in bytes. """
        try:
            if not self.chunker:
                self.chunker = chunker(self.size, self.seed, readsize)
                # A fresh generator must be started with next(), not send().
                return next(self.chunker)
            return self.chunker.send(readsize or 1000)
        except StopIteration:
            return ""
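The send()-based resizing is easier to see in miniature. Here's a stripped-down chunker (my own toy, not the class above) showing how a consumer can change the chunk size per read - note that a just-created generator has to be primed with next() before send() will accept a value:

```python
def mini_chunker(data, chunksize=4):
    # Yield slices of ``data``; the consumer may send() a new chunk size,
    # which takes effect on the next yield.
    pos = 0
    while pos < len(data):
        newsize = (yield data[pos:pos + chunksize]) or chunksize
        pos += chunksize
        chunksize = newsize

gen = mini_chunker("abcdefghijklmnop")
first = next(gen)       # prime the generator: "abcd"
small = gen.send(2)     # read small: "ef"
big = gen.send(8)       # read big: "ghijklmn"
```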

This version hit around 618 megabits/second, and it uses the generator's send() capability to let readers using the SyntheticFile implementation alter the chunk size they're reading on the fly - important if you have a consumer that wants to read small/read big/read small. Well, that's fine and all, but I was stymied - I wanted to make this thing fly: at least 1 gigabit/second, if not faster.

Astute readers may point out that there's other ways of doing this - mmap, simply embedding the unique seed or a uuid - well, this story isn't about that, is it?

In any case, I suspected the "data = ' '.join(word_q)" line was the culprit - deque is pretty optimized, and I had removed a massive chunk of code which didn't make sense, and in fact, cProfile showed I was right:

woot:synthfiles jesse$ python -m cProfile
         2441475 function calls in 203.791 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   610352  180.282    0.000  180.282    0.000 {method 'join' of 'str'  objects}

180 out of 203 CPU seconds, on the join alone. Curses! So this is when I really went mental (this is what happens when you're too close to something). I decided that I needed to find some magical way of skipping the join and only reading what I needed. I ran down that rathole for a bit, until a friend of mine pointed out "just make the words bigger".

Full stop. I had initially discounted it - I was zeroed in on that join - but wait. The text in the lorem file, when split on whitespace, is 4368 words. Joining those back together within the loop is expensive - that much I knew. I hit on the idea that instead of considering them words, I could treat them as chunks (which is how I was treating them anyway).

I added a method (process_chunks) which treated the data source as chunks of bytes and made the WORDS variable a list of those chunks. Initially, I set the chunk size to 100 (bytes) and here's the cProfile output:
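The post doesn't show process_chunks itself, but the idea is simple enough to sketch (names and details here are mine): slice the raw text into fixed-size byte chunks instead of whitespace-split words, so the list joined inside the loop has far fewer items.

```python
def process_chunks(text, chunk_bytes=100):
    # Slice the raw lorem text into fixed-size chunks. 4368 words collapse
    # to a few dozen chunks, making the per-iteration ' '.join() cheap.
    return [text[i:i + chunk_bytes] for i in range(0, len(text), chunk_bytes)]

# Illustrative stand-in for the lorem file contents:
sample = "lorem ipsum dolor sit amet " * 50
CHUNKS = process_chunks(sample)
```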

woot:synthfiles jesse$ python -m cProfile
         2441766 function calls in 51.549 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   610352   29.712    0.000   29.712    0.000 {method 'join' of 'str' objects}

And now the generator is kicking data out at 2.34 gigabits/second. Huge success. Obviously, if you increase the chunk size, it speeds up a bit more (e.g. 300 byte chunks is about 2.5 gigabits/second). I cleaned it up a bit and here is the code. Note that the speeds I'm discussing come from passing the SyntheticFile object to a pycurl handle and streaming it across the wire: not to disk.
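If you want to reproduce numbers like these yourself, a rough harness (my own, not from the original code) that drains any file-like object's read() and reports megabits/second looks like:

```python
import io
import time

def throughput_megabits(filelike, readsize=100):
    # Drain read() until it returns an empty value; report megabits/second.
    total_bytes = 0
    start = time.time()
    while True:
        data = filelike.read(readsize)
        if not data:
            break
        total_bytes += len(data)
    elapsed = (time.time() - start) or 1e-9  # guard against a zero interval
    return (total_bytes * 8) / (elapsed * 1e6)

# Works against anything with read(), e.g. an in-memory buffer:
rate = throughput_megabits(io.BytesIO(b"x" * 100000))
```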

All told, it was a fun little jaunt, and I've succeeded in making something which I considered "throwaway" into something that's a lot more useful, clean and fast. I've added a handful of unit tests to my sandbox, and I might make this a real module if anyone wants it. I want to rework the _process_chunk/globals stuff, but I've farted around with this long enough for now. I also want to add the ability to remove the chunking altogether and simply insert the seed into the data response, and not mess with the lorem text.

edit: I just checked in a new version which removes the _process_chunks function and other globals and moves them into a class. I hate globals.

TestButler update (updated)

by jesse

With the much-appreciated help of Brandon Barry (with whom I just happen to work), there's been a needed update to the testbutler code base that I couldn't get to myself - some highlights:

  • Cleaned up the CSS, moved to blueprint for the larger portion of the CSS and the start of jQuery usage for the javascript portions
  • Templates have been cleaned up/gotten a major facelift
  • site-media has been cleaned up
  • Deleted unused code I had in the prototype
  • Models cleaned up

Overall, it's looking much better. Yes, it's in production use now, and I love Markdown syntax. You can see some screen shots here.

We ditched the roomba-picture; I need to find someone handy with artwork to maybe make some custom icons/pics for us (I really want a cartoony-robot-butler).

There's a lot I'd like to do, obviously - but first I have to get started on the results trackers and the corresponding nose plugin to feed the results to the system. I figure I am going to use the nose-xunit plugin and some custom XSLT for now.

One thing I need to figure out is whether the django-markdown plugin allows for relative %url% links within a block of text so we can cross-link testcases; if it doesn't, I may write a custom template tag.
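If it comes to a custom tag, the core of it is just a text rewrite before the markdown pass. A hypothetical helper (the [TC:123] token syntax and URL layout are mine, not TestButler's) could look like:

```python
import re

TESTCASE_TOKEN = re.compile(r"\[TC:(\d+)\]")

def link_testcases(text, url_prefix="/testcases/"):
    # Rewrite [TC:123]-style tokens into relative markdown links, so that
    # blocks of testcase text can cross-link each other. This could be
    # wrapped as a Django template filter or custom tag.
    return TESTCASE_TOKEN.sub(
        lambda m: "[TC %s](%s%s/)" % (m.group(1), url_prefix, m.group(1)),
        text)

linked = link_testcases("See [TC:42] for the smoke test.")
```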

Edit: Additionally, I just committed a change to remove all notion of "component" from the system. We decided that a test case could have any number of components, or none at all, and that it was more logically consistent to track those as tags-in-the-cloud. For example, a given test case might be tagged "gui, regression, smoketest, performance" or "smoke, storage, gui" etc. Being more flexible with sorting and organization was our goal.