Making re-creatable random data files really fast in Python.

May 30th, 2008 § 9 comments

Note: I’m just really happy with this. Feel free to correct me or suggest enhancements.

So, let’s state the problem:

  • I have to create a lot of files of varying sizes
  • I cannot store them long-term
  • I must be able to recreate them at any point
  • Creation must be fast for files both large and small

That all being said, I wanted to use a seed made of integers that only make sense to me and that embeds certain data relevant to the test, so the seed would contain both random and non-random integers.

I also wanted to avoid using /dev/random and /dev/urandom: both are deceptively fast until you fire them up from a bunch of threads and drain your entropy pool. Not to mention, I want it fast, so I don’t want an extra read() call. I need the data put in the file to be a “known thing”, i.e. randomly generated from a non-random pool of data (a words file).

Ergo, this:

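import collections
import os

# The seed digits drive the rotation, so the same seed always yields the same data.
seed = "1092384956781341341234656953214543219"
words = open("lorem.txt", "r").read().replace("\n", '').split()

def fdata():
    a = collections.deque(words)
    b = collections.deque(seed)
    while True:
        # Emit the first 1024 words, then rotate the pool by the current seed digit.
        yield ' '.join(list(a)[0:1024])
        a.rotate(int(b[0]))
        b.rotate(1)

g = fdata()
size = 1073741824  # 1 GB
fname = "test.out"
with open(fname, 'w') as fh:
    # getsize() only sees flushed data, so the final file can slightly overshoot size.
    while os.path.getsize(fname) < size:
        fh.write(next(g))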

lorem.txt is just a Lorem Ipsum file. On my machine I can generate a 1 GB file on disk in 28 seconds. The bonus is that I don’t need to write the data to disk in the final test; I just need to provide it to the caller.

It’s not as optimized as it could be: I could use bigger chunks of data, things like that. If I dropped the os.path.getsize call it might get faster (compute the number of chunks up front from size / 1024), but that ties the writer to knowing the generator’s chunk size.
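One way to drop the getsize call without hard-coding the chunk size is to count what has been written so far. This is a minimal sketch, not the code behind the timing above: the WORDS list stands in for the contents of lorem.txt, and both the parameterized fdata and the write_file helper are assumptions made for illustration.

```python
import collections

# Stand-in for the words read from lorem.txt (assumption for this sketch).
WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()
SEED = "1092384956781341341234656953214543219"


def fdata(words, seed, chunk_words=1024):
    """Yield chunks of text, rotating the word pool by successive seed digits."""
    a = collections.deque(words)
    b = collections.deque(seed)
    while True:
        yield ' '.join(list(a)[0:chunk_words])
        a.rotate(int(b[0]))
        b.rotate(1)


def write_file(fname, size, words=WORDS, seed=SEED):
    """Write exactly `size` characters to fname, tracking progress in memory."""
    remaining = size
    with open(fname, 'w') as fh:
        for chunk in fdata(words, seed):
            # Trim the final chunk so the file does not overshoot the target.
            fh.write(chunk[:remaining])
            remaining -= len(chunk)
            if remaining <= 0:
                break
```

Because the loop measures each chunk with len(), it keeps working if the generator's chunk size changes. For plain ASCII text the character count equals the byte count on disk.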

But I meet my criteria, and can generate large amounts of file data fast, in a re-creatable form.
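To illustrate the re-creatable part: the output depends only on the word pool and the seed, so generating the data twice gives identical content, while a different seed gives a different stream. A minimal sketch, assuming a small in-memory word list in place of lorem.txt; make_data and digest are hypothetical helpers, not part of the script above.

```python
import collections
import hashlib
import itertools

# Stand-in for the words read from lorem.txt (assumption for this sketch).
WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()


def fdata(words, seed, chunk_words=1024):
    """Yield chunks of text, rotating the word pool by successive seed digits."""
    a = collections.deque(words)
    b = collections.deque(seed)
    while True:
        yield ' '.join(list(a)[0:chunk_words])
        a.rotate(int(b[0]))
        b.rotate(1)


def make_data(seed, nchunks, words=WORDS):
    """Return the first nchunks chunks for this seed as one string, without touching disk."""
    return ''.join(itertools.islice(fdata(words, seed), nchunks))


def digest(seed, nchunks):
    """Hash the generated data so two runs can be compared cheaply."""
    return hashlib.sha256(make_data(seed, nchunks).encode('ascii')).hexdigest()
```

Comparing the digest of regenerated data against a stored one is a cheap way to confirm that a file was recreated faithfully, and make_data shows how the same stream can be handed straight to a caller instead of being written out.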

Oh well. Just something cool on a Friday before motorcycle class.

  • http://www.dowski.com Christian Wyglendowski

    Hey, that is cool. Can’t think of where I’d need this right away, but still, very cool.

  • Anonymous

    The easiest way to eliminate the getsize call would be something like:

    for chunk in fdata:
        fh.write(chunk)
        size -= len(chunk)
        if size < 0:
            break

  • Sergey

    Nice! Unless I’m missing something, .replace("\n", '').split() could be just .split(). Newline is one of the default separators: http://www.python.org/doc/current/lib/node42.ht….

  • Anonymous

    Hmm. Let’s see if this works better:

    for chunk in fdata():
        fh.write(chunk)
        size -= len(chunk)
        if size < 0:
            break

  • jnoller

    Thanks, I’ll give that a shot too

  • Eric Brunson

    Why not use a pseudo-random number generator and start with a known seed? Seeding Python’s built-in PRNG will result in the same sequence and is fast. Am I missing something?

  • jnoller

    For some reason (which I never got a chance to look into) calls into random are expensive/slow, which is why I did it this way.


You are currently reading Making re-creatable random data files really fast in python. at jessenoller.com.
