Generating re-creatable random files...

by jesse


... And the case of obsessive optimization. A little while ago, I posted a small snippet of code designed to generate data files of a given size, based on a seed, very quickly (article here). The goals of this code were the following:

  • Generate large amounts of semi-random data quickly
  • Data generation cannot use /dev/urandom or other system entropy buckets. These are too slow, and having hundreds of threads pulling from these buckets is a bad idea. Oh - and it needs to work on Windows.
  • The data must never be sync'ed to disk: when you're generating a large data set, on the scale of hundreds of millions of files, storing it on disk sucks, and the disk becomes the bottleneck.
  • Creation of the files must be at least 1 gigabit/second - this means a single thread passing one of these generators to, say, a pycurl handle could "in theory" hit line speed: the generator cannot be the bottleneck.
  • The data in these files must be able to be recreated at any time, provided you have the seed.
  • Setting a seed in Python's random module has side-effect issues, and cannot be used (see the sketch after this list). Besides, lots of random calls are expensive.
  • I need the ability to swap out the data source; I use a lorem file here, but a different type will be needed later.
  • The data source should only be parsed once for the import (singleton, ho!)
  • The name and the file data must be unique - they must hash differently (to prevent de-dupers from, well, de-duping them)
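
A quick note on the random() point above: calling random.seed() mutates the module-level generator that every other consumer in the process shares, which is the side effect referred to. A dedicated random.Random(seed) instance avoids the shared state, but it still pays one PRNG call per byte, which is exactly the cost that needs avoiding. A minimal sketch, purely to illustrate the point (this is not part of the generator code):

import random

def naive_random_bytes(size, seed):
    # Isolated generator state: re-creatable from the seed, and it does not
    # touch the module-level generator the way random.seed() would.
    rng = random.Random(seed)
    # Still one randint() call per byte - far too slow for gigabit speeds.
    return ''.join(chr(rng.randint(0, 255)) for _ in xrange(size))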

I am revisiting this code as we found out the original version could only generate file data at around 500 megabits/second. This is much too slow for my tastes, as I might as well be reading it from disk. We can make it faster.

After cleaning things up, removing some overly complex logic (and several moments of "what the hell was I thinking"), I came up with this:

import collections
import os

LOREM = os.path.join(os.path.dirname(__file__), "datafiles", "lorem.txt")
# Parse the data source once, at import time.
WORDS = open(LOREM, "r").read().split()

def chunker(size, seed, chunksize=1000):
    """ Generator yielding ``size`` bytes of lorem text in ``chunksize``-byte
    pieces (the consumer can change the piece size via send()). The word deque
    is rotated by ``seed``, so the output is re-creatable from the seed alone. """
    word_q = collections.deque(WORDS)
    seed_q = collections.deque(int(i) for i in str(seed))
    # Rotate the word_q by the seed so that small files are unique.
    word_q.rotate(seed)
    current_size = size
    while current_size > 0:
        data = ' '.join(word_q)
        if chunksize > current_size:
            chunksize = current_size
        # Yield the chunk; if the consumer send()s a new chunk size, apply it
        # only after deducting the bytes that were actually just yielded.
        sent = (yield data[0:chunksize])
        current_size -= chunksize
        chunksize = sent or chunksize
        word_q.rotate(seed_q[0])
        seed_q.rotate(1)

class SyntheticFile(object):
    """ File-Like object backed by the ``chunker`` function. Allows the
    construction of an object which can be passed to something like a pycurl
    handle streaming data to a server """
    def __init__(self, size, seed):
        """ 
        **size**: integer, bytes
        **seed**: integer
        """
        self.chunker = None
        self.size = size
        self.seed = seed

    def write(self, data=None):
        """ unsupported, throw an error if called """
        raise Exception('not supported')

    def read(self, readsize):
        """ Support read() - **readsize** is in bytes. """
        if not self.chunker:
            self.chunker = chunker(self.size, self.seed, readsize)
            return self.chunker.next()
        try:
            return self.chunker.send(readsize or 1000)
        except StopIteration:
            pass
        return ""

This version hit around 618 megabits/second, and it uses the generator's send() capability to let readers of the SyntheticFile implementation alter the chunk size they're reading on the fly - important if you have a consumer that wants to read small/read big/read small. Well, that's fine and all, but I was stymied - I wanted to make this thing fly. I want to be able to generate this data at 1 gigabit/second at the very least, if not faster.
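
To make the read()/send() interaction concrete, here's a hypothetical usage sketch - the sizes, URL, and pycurl options are illustrative choices, not taken from the original code:

import pycurl

# A consumer can vary its read size from call to call; read() forwards the
# new size to the generator via send().
synth = SyntheticFile(size=1024 * 1024, seed=42)
first = synth.read(500)      # start with 500-byte chunks
middle = synth.read(4096)    # switch to 4 KB chunks mid-stream
rest = synth.read(500)       # and back down again

# Streaming a fresh synthetic file to a server: pycurl calls READFUNCTION with
# the number of bytes it wants, which maps directly onto read(readsize).
upload = SyntheticFile(size=1024 * 1024, seed=42)
c = pycurl.Curl()
c.setopt(pycurl.URL, "http://example.com/upload")   # placeholder URL
c.setopt(pycurl.UPLOAD, 1)
c.setopt(pycurl.INFILESIZE, upload.size)
c.setopt(pycurl.READFUNCTION, upload.read)
c.perform()
c.close()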

Astute readers may point out that there's other ways of doing this - mmap, simply embedding the unique seed or a uuid - well, this story isn't about that, is it?

In any case, I suspected the "data = ' '.join(word_q)" line was the culprit - deque is pretty optimized, and I had already removed the massive chunk of code which didn't make sense. Sure enough, cProfile showed I was right:

woot:synthfiles jesse$ python -m cProfile synthfilegen.py
         2441475 function calls in 203.791 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
...snip...
   610352  180.282    0.000  180.282    0.000 {method 'join' of 'str'  objects}
...snip...

180 out of 203 CPU seconds on the join alone. Curses! So this is when I really went mental (this is what happens when you're too close to something). I decided that I needed to find some magical way of skipping the join and only reading what I needed. I ran down that rathole for a bit, until a friend of mine pointed out: "just make the words bigger".

Full stop. I initially discounted it - I was zeroed in on that join - but wait. The text in the lorem file, when split on whitespace, is 4368 words. Joining those back together within the loop is expensive - that much I knew. Then it hit me: instead of considering them words, I could treat them as chunks (which is how I was treating them anyway).

I added a method (process_chunks) which treats the data source as chunks of bytes and makes the WORDS variable a list of those chunks (sketched after the results below). Initially, I set the chunk size to 100 bytes, and here's the cProfile output:

woot:synthfiles jesse$ python -m cProfile synthfilegen.py
         2441766 function calls in 51.549 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
...snip...
   610352   29.712    0.000   29.712    0.000 {method 'join' of 'str' objects}
...snip...

And now the generator is kicking data out at 2.34 gigabits/second. Huge success. Obviously, if you increase the chunk size, it speeds up a bit more (e.g. 300-byte chunks gets you about 2.5 gigabits/second). I cleaned it up a bit and here is the code (thanks, bitbucket.org!). Note that the speeds I'm discussing come from passing the SyntheticFile object to a pycurl handle and streaming it across the wire: not to disk.
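
For anyone who doesn't want to chase the linked repository, here's a minimal sketch of the chunk-splitting idea. The _process_chunks name comes from this post; the body is a reconstruction under that assumption, not the exact code that was checked in:

CHUNK_SIZE = 100  # bytes per chunk; bigger chunks mean fewer items to join per yield

def _process_chunks(path, chunk_size=CHUNK_SIZE):
    """ Read the data source once and slice it into fixed-size byte chunks. """
    data = open(path, "r").read()
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# WORDS is now a short list of 100-byte chunks instead of 4368 individual
# words, so the ' '.join(word_q) inside chunker() touches far fewer items.
WORDS = _process_chunks(LOREM)

chunker() itself doesn't need to change: the deque just holds bigger items, so each join call has far less work to do.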

All told, it was a fun little jaunt, and I've succeeded in turning something I considered "throwaway" into something that's a lot more useful, clean and fast. I've added a handful of unit tests to my sandbox, and I might make this a real module if anyone wants it. I want to rework the _process_chunks/globals stuff, but I've farted around with this long enough for now. I also want to add the ability to remove the chunking altogether and simply insert the seed into the data response, without messing with the lorem text.

edit: I just checked in a new version which removes the _process_chunks function and other globals and moves them into a class. I hate globals.
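
Roughly, that refactor looks like this (a sketch, not the checked-in code):

class ChunkSource(object):
    """ Illustrative sketch: parse the backing file once and hold the chunk
    list, replacing the LOREM/WORDS module globals. """
    def __init__(self, path, chunk_size=100):
        data = open(path, "r").read()
        self.chunks = [data[i:i + chunk_size]
                       for i in range(0, len(data), chunk_size)]

# chunker() would then build its deque from ChunkSource(...).chunks instead of
# reaching for the module-level WORDS.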