Making re-creatable random data files really fast in Python.

Note, I’m just really happy with this – feel free to correct me or give me enhancements.

So – let’s state the problem:

  • I have to create a lot of files of varying sizes
  • I can not store them long-term
  • I must be able to recreate them at any point
  • Creation must be fast for both large and small files

That said, I wanted to use a seed made of digits that only make sense to me, with certain data relevant to the test embedded in it, so the seed would contain both random and non-random digits.

I also wanted to avoid using /dev/random and /dev/urandom: both are deceptively fast until you fire them up from a bunch of threads and drain your entropy pool. Not to mention, I want it fast, so I don’t want to have an extra read() call. I need the data put in the file to be a “known thing”, i.e. randomly selected from a non-random pool of data (a words file).

Ergo, this:

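import collections
import os

# The digits of the seed decide how far the word deque rotates between chunks.
seed = "1092384956781341341234656953214543219"
with open("lorem.txt", "r") as f:
    words = f.read().replace("\n", '').split()

def fdata():
    a = collections.deque(words)
    b = collections.deque(seed)
    while True:
        # Emit the first 1024 words, then rotate by the current seed digit.
        yield ' '.join(list(a)[0:1024])
        a.rotate(int(b[0]))
        b.rotate(1)

g = fdata()
size = 1073741824 # 1 GB
fname = "test.out"
fh = open(fname, 'w')
# getsize only sees flushed data, so the file can overshoot size slightly.
while os.path.getsize(fname) < size:
    fh.write(next(g))  # next(g) works on Python 2.6+ and 3
fh.close()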

lorem.txt is from here – it’s just a Lorem Ipsum file. On my machine I can generate a 1 GB file on disk in 28 seconds. The bonus is that I don’t need to write the data to disk in the final test; I just need to provide it to the caller.

It’s not as optimized as it could be: I could read bigger chunks of data, things like that. If I dropped the os.path.getsize call, it might get faster (compute the number of chunks up front from size / 1024), but that ties the writer to knowing the chunk size of the generator.
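One way to drop the os.path.getsize call without hard-coding the chunk size is to count what has been written. This is only a minimal sketch: the inline word list stands in for lorem.txt, write_exact is a hypothetical helper name, and it assumes ASCII text so that one character is one byte. The last chunk is trimmed so the file lands on exactly the requested size.

```python
import collections
import os
import tempfile

SEED = "1092384956781341341234656953214543219"
# Assumption: a tiny inline word list stands in for lorem.txt.
WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()

def fdata(words=WORDS, seed=SEED, chunk_words=1024):
    """Yield deterministic chunks of text, as in the generator above."""
    a = collections.deque(words)
    b = collections.deque(seed)
    while True:
        yield " ".join(list(a)[:chunk_words])
        a.rotate(int(b[0]))
        b.rotate(1)

def write_exact(path, size):
    """Hypothetical helper: write exactly `size` ASCII characters to `path`."""
    remaining = size
    with open(path, "w") as fh:
        for chunk in fdata():
            chunk = chunk[:remaining]   # trim the final chunk to fit
            fh.write(chunk)
            remaining -= len(chunk)
            if remaining <= 0:
                break
    return size - remaining             # number of characters written

# Small demo in a temporary directory rather than a 1 GB file.
demo_path = os.path.join(tempfile.mkdtemp(), "test.out")
written = write_exact(demo_path, 5000)
print(written, os.path.getsize(demo_path))
```

Counting in the loop costs one subtraction per chunk instead of one stat call per chunk, and the writer never needs to know how big a chunk is.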

But – I meet my criteria – and can generate large amounts of file data fast, in a re-creatable form.
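A quick way to convince yourself of the “re-creatable” part is to hash the stream twice and compare. This is a sketch under assumptions: the word list is a small inline stand-in for lorem.txt, the two seeds are arbitrary, and digest is a hypothetical helper name. Nothing touches the disk, which is also roughly how the data could be handed straight to a caller.

```python
import collections
import hashlib
import itertools

# Assumption: a tiny inline word list stands in for lorem.txt.
WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()

def fdata(words, seed):
    """Same rotation scheme as the post: seed digits drive the rotation."""
    a = collections.deque(words)
    b = collections.deque(seed)
    while True:
        yield " ".join(list(a)[:1024])
        a.rotate(int(b[0]))
        b.rotate(1)

def digest(seed, chunks=50):
    """Hypothetical helper: SHA-256 of the first `chunks` chunks for a seed."""
    h = hashlib.sha256()
    for chunk in itertools.islice(fdata(WORDS, seed), chunks):
        h.update(chunk.encode("ascii"))
    return h.hexdigest()

same = digest("1092384956") == digest("1092384956")        # same seed, same data
different = digest("1092384956") != digest("2222222222")   # other seed, other data
print(same, different)
```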

Oh well. Just something cool on a Friday before motorcycle class.

  • Eric Brunson
    Why not use a pseudo-random number generator and start with a known seed? Seeding Python's built in PRNG will result in the same sequence and is fast. Am I missing something?
  • For some reason (which I never got a chance to look into) calls into random are expensive/slow, which is why I did it this way.
  • Sergey
    Nice! Unless I'm missing something, .replace("\n", '').split() could be just .split(). Newline is one of the default separators: http://www.python.org/doc/current/lib/node42.ht....
  • Thanks, I'll give that a shot too
  • Anonymous
    The easiest way to eliminate the getsize call would be something like:

    for chunk in fdata:
        fh.write(chunk)
        size -= len(chunk)
        if size < 0:
            break
  • Anonymous
    Hmm. Let's see if this works better:

    for chunk in fdata():
        fh.write(chunk)
        size -= len(chunk)
        if size < 0:
            break
  • Hey, that is cool. Can't think of where I'd need this right away, but still, very cool.
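
For comparison, the seeded-PRNG idea raised in the comments might look something like the sketch below. This is not the approach the post uses, and no speed claim is made for it here; prng_bytes is a hypothetical helper name and the seed value is arbitrary. It only shows that a privately seeded random.Random instance reproduces the same bytes on every run.

```python
import random

def prng_bytes(seed, n):
    """Hypothetical helper: n reproducible pseudo-random bytes for a seed."""
    rng = random.Random(seed)   # private generator; global random state untouched
    # Draw 8*n bits in one call and pack them into n bytes.
    return rng.getrandbits(8 * n).to_bytes(n, "little")

first = prng_bytes(1092384956, 4096)
second = prng_bytes(1092384956, 4096)
print(first == second, len(first))
```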