Note: I’m just really happy with this, so feel free to correct me or suggest enhancements.
So — let’s state the problem:
- I have to create a lot of files of varying sizes
- I cannot store them long-term
- I must be able to recreate them at any point
- Creation must be fast for files both large and small
That said, I wanted to use a seed made of integers that only make sense to me and that embeds certain data relevant to the test, so the seed contains both random and non-random integers.
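To make the seed idea concrete, here is a minimal sketch of one way such a seed could be laid out. The field layout (4 digits of test id, 2 digits of size class, then filler digits) and the names `make_seed` and `parse_seed` are hypothetical and only for illustration; they are not the layout behind the seed used in this post.

```python
import random

def make_seed(test_id, size_class, filler_len=24, rng_seed=0):
    """Build a digit string: 4 digits of test id, 2 digits of size class,
    then pseudo-random filler digits (hypothetical layout)."""
    if not (0 <= test_id < 10**4 and 0 <= size_class < 10**2):
        raise ValueError("field out of range")
    # A fixed rng_seed keeps the "random" part repeatable between runs.
    rng = random.Random(rng_seed)
    # Filler digits are 1-9 so each one would rotate the word pool.
    filler = ''.join(str(rng.randint(1, 9)) for _ in range(filler_len))
    return "%04d%02d%s" % (test_id, size_class, filler)

def parse_seed(seed):
    """Recover the embedded (test_id, size_class) fields from a seed."""
    return int(seed[:4]), int(seed[4:6])

seed = make_seed(42, 7)
fields = parse_seed(seed)
```

The embedded fields can be read back with `parse_seed`, while the whole string can still be fed digit by digit to a generator that rotates a word pool.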
I also wanted to avoid /dev/random and /dev/urandom. Both are deceptively fast until you read from them in a bunch of threads and drain your entropy pool. Not to mention, I want it fast, so I don’t want an extra read() call. I need the data put in the file to be a “known thing”, i.e. randomly generated from a non-random pool of data (a words file).
Ergo, this:
    import collections
    import os

    seed = "1092384956781341341234656953214543219"
    # The word pool: every word in the Lorem Ipsum file.
    words = open("lorem.txt", "r").read().replace("\n", '').split()

    def fdata():
        a = collections.deque(words)
        b = collections.deque(seed)
        while True:
            # Emit the first 1024 words, then rotate the pool by the
            # current seed digit and move on to the next digit.
            yield ' '.join(list(a)[0:1024])
            a.rotate(int(b[0]))
            b.rotate(1)

    g = fdata()
    size = 1073741824  # 1 GiB
    fname = "test.out"
    fh = open(fname, 'w')
    # Writes are buffered, so the file may end up slightly over size.
    while os.path.getsize(fname) < size:
        fh.write(next(g))
    fh.close()
lorem.txt is from here; it’s just a Lorem Ipsum file. On my machine I can generate a 1 GB file on disk in 28 seconds. The bonus is that I don’t need to write the data in the final test: I just need to provide it to the caller.
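Handing the data straight to a caller, with no file involved, could look roughly like the sketch below. It assumes a small inline word pool standing in for lorem.txt, and the helper name `take` is made up for this example. Sizes are counted in characters, which equals bytes only for ASCII text.

```python
import collections
import itertools

def fdata(words, seed):
    """Yield an endless, deterministic stream of space-joined word chunks."""
    a = collections.deque(words)
    b = collections.deque(seed)
    while True:
        yield ' '.join(itertools.islice(a, 1024))
        a.rotate(int(b[0]))
        b.rotate(1)

def take(chunks, size):
    """Yield pieces from the iterator `chunks` until exactly `size`
    characters have been produced (hypothetical helper)."""
    remaining = size
    while remaining > 0:
        piece = next(chunks)[:remaining]  # trim the last piece to fit
        remaining -= len(piece)
        yield piece

# Stand-in word pool (assumption); the post reads lorem.txt instead.
WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()
SEED = "1092384956781341341234656953214543219"

data = ''.join(take(fdata(WORDS, SEED), 100))
```

Because the generator is driven only by the word pool and the seed, calling it again with the same inputs yields the same data, which is the re-creatable property I was after.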
It’s not as optimized as it could be: I could read bigger chunks of data, things like that. If I dropped the os.path.getsize call, it might get faster (count the number of chunks from size / 1024), but that ties me to knowing the chunk size of the generator.
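One way to drop the os.path.getsize call without knowing the chunk size is to keep a running count of what has been written. This is a sketch only: `write_exact` is a made-up name, and `itertools.cycle` over a single string stands in for the word generator. Note that the generator in this post yields 1024 words per chunk, not 1024 bytes, so a counter is safer than dividing the size by 1024.

```python
import itertools
import os
import tempfile

def write_exact(fname, size, chunks):
    """Write exactly `size` bytes to `fname` from the iterator `chunks`,
    tracking progress in a local counter instead of stat-ing the file."""
    written = 0
    with open(fname, 'wb') as fh:
        while written < size:
            # Trim the final block so the file is not overshot.
            block = next(chunks).encode('ascii')[:size - written]
            fh.write(block)
            written += len(block)
    return written

# Stand-in for the post's generator (assumption): any iterator of
# ASCII strings works here.
chunks = itertools.cycle(["lorem ipsum dolor sit amet "])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.out")
    written = write_exact(path, 4096, chunks)
    actual_size = os.path.getsize(path)  # checked once, after closing
```

The counter also makes the output size exact, whereas polling the file size while writes are still buffered can let the file run a little past the target.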
But I meet my criteria, and I can generate large amounts of file data fast, in a re-creatable form.
Oh well. Just something cool on a Friday before motorcycle class.