Making re-creatable random data files really fast in Python.
Note, I'm just really happy with this - feel free to correct me or give me enhancements.
So - let's state the problem:
- I have to create a lot of files of varying sizes
- I can not store them long-term
- I must be able to recreate them at any point
- Creation must be fast for files both large and small
That all being said - I wanted to be able to use a seed made of integers that only make sense to me and that embed certain data relevant to the test. So the seed would contain both random and non-random integers.
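To make the seed idea concrete, here is a minimal sketch of one way such a seed could be put together. The field layout (a four-digit test id and a two-digit size class) and the names make_seed, test_id, size_class and rng_seed are made up for illustration; they are not how the seed used later in this post was built.

```python
import random

def make_seed(test_id, size_class, random_digits=24, rng_seed=0):
    """Build a digit string: fixed, meaningful digits followed by filler digits.

    The layout is hypothetical: 4 digits of test id, 2 digits of size class,
    then `random_digits` digits drawn from a seeded (so repeatable) PRNG.
    """
    rng = random.Random(rng_seed)
    fixed = "%04d%02d" % (test_id, size_class)
    noise = "".join(str(rng.randrange(10)) for _ in range(random_digits))
    return fixed + noise
```

Because the filler digits come from a seeded random.Random instance, calling make_seed with the same arguments always gives back the same seed string, which keeps the whole thing re-creatable.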
I also wanted to avoid using /dev/random and /dev/urandom - both are deceptively fast until you fire them up from a bunch of threads and drain your entropy pool. Not to mention - I want it fast, so I don't want an extra read() call. I also need the data put in the file to be a "known thing" - i.e., randomly generated from a non-random pool of data (a words file).
Ergo, this:
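    import collections
    import os

    seed = "1092384956781341341234656953214543219"
    # Word pool: every whitespace-separated word in the Lorem Ipsum file.
    words = open("lorem.txt", "r").read().replace("\n", '').split()

    def fdata():
        a = collections.deque(words)
        b = collections.deque(seed)
        while True:
            # Emit the first 1024 words, then rotate the pool by the
            # current seed digit and advance to the next digit.
            yield ' '.join(list(a)[0:1024])
            a.rotate(int(b[0]))
            b.rotate(1)

    g = fdata()
    size = 1073741824 # 1gb
    fname = "test.out"
    fh = open(fname, 'w')
    # getsize only sees data already flushed to disk, so the file can
    # end up slightly larger than size.
    while os.path.getsize(fname) < size:
        fh.write(next(g))
    fh.close()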
lorem.txt is from here - it's just a Lorem Ipsum file. On my machine I can generate a 1 GB file on disk in 28 seconds. The bonus is that I don't need to write the data to disk in the final test - I just need to provide it to the caller.
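Here is a small sketch of the "re-creatable" part without touching the disk: hash the first few chunks for a seed and check that the same seed always gives the same digest. The tiny WORDS list is a stand-in for lorem.txt, the digest helper is a made-up name, and the generator is parameterized only so the example is self-contained.

```python
import collections
import hashlib
import itertools

# Stand-in word pool; the post reads its words from lorem.txt instead.
WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()

def fdata(words, seed, chunk_words=1024):
    """Yield text chunks, rotating the word pool by successive seed digits."""
    a = collections.deque(words)
    b = collections.deque(seed)
    while True:
        # islice avoids copying the whole deque into a list for every chunk.
        yield " ".join(itertools.islice(a, chunk_words))
        a.rotate(int(b[0]))
        b.rotate(1)

def digest(seed, nchunks=50):
    """Hash the first nchunks chunks produced for a given seed."""
    h = hashlib.sha256()
    for chunk in itertools.islice(fdata(WORDS, seed), nchunks):
        h.update(chunk.encode("ascii"))
    return h.hexdigest()
```

Calling digest("1092") twice gives the same value, while a different seed such as "2091" rotates the pool differently and gives a different one. A caller that only needs the data can consume fdata() directly and never create a file.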
It's not as optimized as it could be: I could yield bigger chunks of data, things like that. If I dropped the os.path.getsize call, it might get faster (for example, by working out the number of chunks from the target size), but that ties me to knowing the byte size of each chunk the generator yields (a chunk here is 1024 words, not 1024 bytes).
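One possible way to drop the os.path.getsize call without knowing the chunk size in advance is to count the bytes as they are written. This is only a sketch: write_file is a made-up name, WORDS stands in for lorem.txt, it assumes the word pool is plain ASCII, and unlike the code above it puts a space between chunks and truncates the last chunk so the file is exactly the requested size.

```python
import collections
import itertools

# Stand-in word pool; the post reads its words from lorem.txt instead.
WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()

def fdata(words, seed, chunk_words=1024):
    """Yield text chunks, rotating the word pool by successive seed digits."""
    a = collections.deque(words)
    b = collections.deque(seed)
    while True:
        yield " ".join(itertools.islice(a, chunk_words))
        a.rotate(int(b[0]))
        b.rotate(1)

def write_file(fname, size, words, seed):
    """Write exactly `size` bytes to fname and return the byte count."""
    written = 0
    with open(fname, "wb") as fh:
        for chunk in fdata(words, seed):
            data = (chunk + " ").encode("ascii")
            # Trim the final chunk so the file never exceeds `size`.
            data = data[: size - written]
            fh.write(data)
            written += len(data)
            if written >= size:
                break
    return written
```

Something like write_file("test.out", 1073741824, WORDS, seed) would then replace the while loop, with no stat call per chunk. I have not timed this variant, so whether it is actually faster is still an open question.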
But - I meet my criteria - and can generate large amounts of file data fast, in a re-creatable form.
Oh well. Just something cool on a Friday before motorcycle class.

