Making re-creatable random data files really fast in python.

by jesse

Note, I'm just really happy with this - feel free to correct me or give me enhancements. So - let's state the problem:

  • I have to create a lot of files of varying sizes
  • I can not store them long-term
  • I must be able to recreate them at any point
  • Creation must be fast for files both large and small

That all being said - I wanted to use a seed made of integers which only make sense to me, one that embeds certain data relevant to the test within it, so the seed would contain both random and non-random integers.
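One way to build such a seed (a sketch of my own - the field layout, widths, and the `make_seed` name are not from the original, just an illustration of embedding test metadata among random digits):

```python
import random

def make_seed(test_id, size_mb):
    """Build a seed string embedding test metadata plus random digits.

    The layout (4-digit test id, 6-digit size, then 20 random digits)
    is arbitrary; anyone who knows it can pull the fields back out.
    """
    rnd = ''.join(str(random.randint(0, 9)) for _ in range(20))
    return "%04d%06d%s" % (test_id, size_mb, rnd)

seed = make_seed(42, 1024)
print(len(seed))     # 4 + 6 + 20 = 30 digits
print(seed[:4])      # recover the embedded test id: '0042'
print(seed[4:10])    # recover the embedded size: '001024'
```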

I also wanted to avoid using /dev/random and /dev/urandom - both are deceptively fast until you hit them from a bunch of threads and drain your entropy pool. Not to mention, I want this fast, so I don't want the extra read() call. I also need the data put in the file to be a "known thing" - i.e.: randomly generated from a non-random pool of data (a words file).

Ergo, this:

import collections
import os

seed = "1092384956781341341234656953214543219"
words = open("lorem.txt", "r").read().replace("\n", '').split()

def fdata():
    # Rotate the word deque by one seed digit after each chunk, so the
    # stream is deterministic but not a single repeated block.
    a = collections.deque(words)
    b = collections.deque(seed)
    while True:
        yield ' '.join(list(a)[0:1024])
        a.rotate(int(b[0]))
        b.rotate(1)

g = fdata()
size = 1073741824  # 1 GB
fname = "test.out"
fh = open(fname, 'w')
while os.path.getsize(fname) < size:
    fh.write(next(g))
fh.close()

lorem.txt is from here - it's just a Lorem Ipsum file. On my machine I can generate a 1 GB file on disk in 28 seconds; the bonus is that in the final test I don't need to write the data at all - I just need to provide it to the caller.
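To check the "re-creatable" property, you can hash two independent runs of the generator and compare (a sketch; I substitute a tiny inline word list for lorem.txt so it runs stand-alone, and the chunk count is arbitrary):

```python
import collections
import hashlib

seed = "1092384956781341341234656953214543219"
# Stand-in for the lorem.txt word pool.
words = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()

def fdata():
    a = collections.deque(words)
    b = collections.deque(seed)
    while True:
        yield ' '.join(list(a)[0:1024])
        a.rotate(int(b[0]))
        b.rotate(1)

def digest(chunks=100):
    # Hash the first `chunks` chunks from a fresh generator.
    h = hashlib.sha1()
    g = fdata()
    for _ in range(chunks):
        h.update(next(g).encode())
    return h.hexdigest()

print(digest() == digest())  # same seed + words -> identical data: True
```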

It's not as optimized as it could be: I could read bigger chunks of data, things like that. If I dropped the os.path.getsize() call, it might get faster (count the number of chunks needed from size / chunk size), but that requires knowing the chunk size of the generator.
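An alternative that drops the per-iteration getsize() stat without needing a fixed chunk size is to track the byte count in Python (a sketch; fdata() is the generator from the post, again with a tiny inline word list and a small target size so it runs stand-alone):

```python
import collections
import os

seed = "1092384956781341341234656953214543219"
words = "lorem ipsum dolor sit amet".split()  # stand-in for lorem.txt

def fdata():
    a = collections.deque(words)
    b = collections.deque(seed)
    while True:
        yield ' '.join(list(a)[0:1024])
        a.rotate(int(b[0]))
        b.rotate(1)

g = fdata()
size = 4096        # small target just for this sketch
fname = "test.out"
written = 0
with open(fname, 'w') as fh:
    while written < size:
        chunk = next(g)
        fh.write(chunk)
        # Count bytes ourselves instead of stat()ing the file each loop.
        written += len(chunk)

print(os.path.getsize(fname) >= size)  # True
```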

But - I meet my criteria - and can generate large amounts of file data fast, in a re-creatable form.

Oh well. Just something cool on a Friday before motorcycle class.