Generating re-creatable random files…

February 27th, 2009 § 7 comments

… And the case of obses­sive opti­miza­tion. A lit­tle while ago, I posted a small snip­pet of code that was designed to gen­er­ate data files of a given size, based off a seed very quickly (arti­cle here). The goals of this code is/was the following:

  • Gen­er­ate large amounts of semi-random data quickly
  • Data gen­er­a­tion can not use /dev/urandom or other sys­tem entropy buck­ets. These are to slow, and hav­ing hun­dred of threads pulling from these buck­ets is a bad idea. Oh — and it needs to work on windows.
  • The data must never be sync’ed to disk: when you’re gen­er­at­ing a large data set, on the scale of hun­dreds of mil­lions of files, stor­ing it on disk sucks, and the disk becomes the bottleneck.
  • Cre­ation of the files must be at least 1 gigabit/second — this means a sin­gle thread pass­ing one of these gen­er­a­tors to say, a pycurl han­dle could “in the­ory” hit line speed: the gen­er­a­tor can not be the bottleneck
  • The data in the­ses files must be able to be recre­ated at any time pro­vided you have the seed.
  • Set­ting a seed in python’s ran­dom() has side-effect issues, and can not be used. Besides, lots of ran­dom calls are expensive.
  • I need the abil­ity to swap out the data source, I use a lorem file here, but a dif­fer­ent type will be needed later.
  • The data source should only be parsed once for the import (sin­gle­ton, ho!)
  • The name, and the file data must be unique — they must hash dif­fer­ently (to pre­vent de-dupers from, well, de dup­ing them)

I am revis­it­ing this code as we found out the orig­i­nal ver­sion could only gen­er­ate file data at around 500 megabits/second. This is much too slow for my tastes, as I might as well be read­ing it from disk. We can make it faster.

After clean­ing things up, remov­ing some overly com­plex logic (and sev­eral moments of “what the hell was I think­ing”), I came up with this:

?View Code PYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
LOREM = os.path.join(os.path.dirname(__file__), "datafiles", "lorem.txt")
WORDS = open(LOREM, "r").read().split()
 
def chunker(size, seed, chunksize=1000):
    word_q = collections.deque(WORDS)
    seed_q = collections.deque(int(i) for i in str(seed))
    # Rotate the word_q by the seed so that small files are unique.
    word_q.rotate(seed)
    current_size = size
    while current_size > 0:
        data = ' '.join(word_q)
        if chunksize > current_size:
            chunksize = current_size
        chunksize = (yield data[0:chunksize]) or chunksize
        current_size -= chunksize
        word_q.rotate(seed_q[0])
        seed_q.rotate(1)
 
class SyntheticFile(object):
    """ File-Like object backed by the ``chunker`` function. Allows the
    construction of an object which can be passed to something like a pycurl
    handle streaming data to a server """
    def __init__(self, size, seed):
        """ 
        **size**: integer, bytes
        **seed**: integer
        **chunksize**: optional, integer
        """
        self.chunker = None
        self.size = size
        self.seed = seed
 
    def write(self):
        """ unsupported, throw an error if called """
        raise Exception('not supported')
 
    def read(self, readsize):
        """ Support read() - **readsize** is in bytes. """
        if not self.chunker:
            self.chunker = chunker(self.size, self.seed, readsize)
            return self.chunker.next()
        try:
            return self.chunker.send(readsize or 1000)
        except StopIteration:
            pass
        return ""

This ver­sion hit around 618 megabits/second and it used the generator’s send() capa­bil­ity to allow read­ers using the Syn­thet­ic­File imple­men­ta­tion to alter the chunk size they’re read­ing on the fly, which is impor­tant if you have a con­sumer that wants the abil­ity to read small/read big/read small. Well, that’s fine and all, but I was stymied — I wanted to make this thing fly. I want to be able to gen­er­ate this data at at least 1 gigabit/second, if not faster.

Astute read­ers may point out that there’s other ways of doing this — mmap, sim­ply embed­ding the unique seed or a uuid — well, this story isn’t about that, is it?

In any case, I sus­pected the “data = ’ ‘.join(word_q)” line was the cul­prit — deque is pretty opti­mized, and I had removed a mas­sive chunk of code which didn’t make sense, and in fact, cPro­file showed I was right:

woot:synthfiles jesse$ python -m cProfile synthfilegen.py
         2441475 function calls in 203.791 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
...snip...
   610352  180.282    0.000  180.282    0.000 {method 'join' of 'str'  objects}
...snip...

180 out of 203 cpu sec­onds, on the join alone. Curses! So this is when I really went men­tal (this is what hap­pens when you’re too close to some­thing). I decided that I needed to find some mag­i­cal way of skip­ping the join and only read­ing what I needed. I ran down that rathole for a bit, until a friend of mine point out “just make the words bigger”.

Full stop. I ini­tially dis­counted it, I was zeroed in on that join — oh wait. The text in the lorem file when split on white­space is 4368 words. Join­ing those back together within the loop is expen­sive — that much I knew. I hit on the idea that if instead of con­sid­er­ing them words, I thought of them as chunks (which is how I was treat­ing them).

I added a method (process_chunks) which treated the data source as chunks of bytes and made the WORDS vari­able a list of those chunks. Ini­tially, I set the chunk size to 100 (bytes) and here’s the cPro­file output:

woot:synthfiles jesse$ python -m cProfile synthfilegen.py
         2441766 function calls in 51.549 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
...snip...
   610352   29.712    0.000   29.712    0.000 {method 'join' of 'str' objects}
...snip...

And now the gen­er­a­tor is kick­ing data out at 2.34 giga­bits. Huge suc­cess. Obvi­ously, if you increase the chunk size, it speeds up a bit more (e.g. 300 byte chunks is about 2.5 gigabits/second). I cleaned it up a bit and here is the code:

(thanks! bitbucket.org).
Note that the speeds I’m dis­cussing are pass­ing the Syn­th­FileOb­ject to a pycurl han­dle and stream­ing it across the wire: not to disk.

All told, it was a fun lit­tle jaunt, and I’ve suc­ceeded to make some­thing which I con­sid­ered “throw­away” into some­thing that’s a lot more use­ful, clean and fast. I’ve added a hand­ful of unit tests to my sand­box, and I might make this a real mod­ule if any­one wants it. I want to rework the _process_chunk/globals stuff, but I farted around with this long enough for now. I also want to add the abil­ity to remove the chunk­ing alto­gether and sim­ply insert the seed into the data response, and not mess with the lorem text.

edit: I just checked in a new ver­sion which removes the _process_chunks func­tion and other glob­als and moves them into a class. I hate globals.

  • http://mishkovskyi.net/blog mishok13

    ”“618 gigabits/second””” Srsly? ;)

  • jnoller

    I fixed the typo. Low on cof­fee, high on speed.

  • RobK

    I have absolutely no idea what this means, but I think I’m aroused. ;)

  • http://mishkovskyi.net/blog mishok13

    ”“618 gigabits/second””” Srsly? ;)

  • http://mishkovskyi.net/blog mishok13

    ”“618 gigabits/second””” Srsly, that much? ;)

  • http://jessenoller.com jnoller

    I fixed the typo. Low on cof­fee, high on speed.

  • RobK

    I have absolutely no idea what this means, but I think I’m aroused. ;)

What's this?

You are currently reading Generating re-creatable random files… at jessenoller.com.

meta