Generating re-creatable random files…

February 27th, 2009 § 7 comments § permalink

… And the case of obsessive optimization. A little while ago, I posted a small snippet of code designed to very quickly generate data files of a given size from a seed (article here). The goals of this code were, and still are, the following:

  • Gen­er­ate large amounts of semi-random data quickly
  • Data generation can not use /dev/urandom or other system entropy buckets. These are too slow, and having hundreds of threads pulling from these buckets is a bad idea. Oh, and it needs to work on Windows.
  • The data must never be sync’ed to disk: when you’re gen­er­at­ing a large data set, on the scale of hun­dreds of mil­lions of files, stor­ing it on disk sucks, and the disk becomes the bottleneck.
  • Cre­ation of the files must be at least 1 gigabit/second — this means a sin­gle thread pass­ing one of these gen­er­a­tors to say, a pycurl han­dle could “in the­ory” hit line speed: the gen­er­a­tor can not be the bottleneck
  • The data in these files must be able to be recreated at any time provided you have the seed.
  • Set­ting a seed in python’s ran­dom() has side-effect issues, and can not be used. Besides, lots of ran­dom calls are expensive.
  • I need the abil­ity to swap out the data source, I use a lorem file here, but a dif­fer­ent type will be needed later.
  • The data source should only be parsed once for the import (sin­gle­ton, ho!)
  • The name, and the file data must be unique — they must hash dif­fer­ently (to pre­vent de-dupers from, well, de dup­ing them)

I am revis­it­ing this code as we found out the orig­i­nal ver­sion could only gen­er­ate file data at around 500 megabits/second. This is much too slow for my tastes, as I might as well be read­ing it from disk. We can make it faster.

After clean­ing things up, remov­ing some overly com­plex logic (and sev­eral moments of “what the hell was I think­ing”), I came up with this:

import collections
import os

# Parse the data source once, at import time.
LOREM = os.path.join(os.path.dirname(__file__), "datafiles", "lorem.txt")
WORDS = open(LOREM, "r").read().split()
 
def chunker(size, seed, chunksize=1000):
    word_q = collections.deque(WORDS)
    seed_q = collections.deque(int(i) for i in str(seed))
    # Rotate the word_q by the seed so that small files are unique.
    word_q.rotate(seed)
    current_size = size
    while current_size > 0:
        data = ' '.join(word_q)
        if chunksize > current_size:
            chunksize = current_size
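        chunk = data[0:chunksize]
        # A reader may send() a new chunk size to use for the next read.
        chunksize = (yield chunk) or chunksize
        # Count the bytes actually handed out, not the next requested size.
        current_size -= len(chunk)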
        word_q.rotate(seed_q[0])
        seed_q.rotate(1)
 
class SyntheticFile(object):
    """ File-Like object backed by the ``chunker`` function. Allows the
    construction of an object which can be passed to something like a pycurl
    handle streaming data to a server """
    def __init__(self, size, seed):
        """ 
        **size**: integer, bytes
        **seed**: integer
        (the chunk size is supplied by each read() call)
        """
        self.chunker = None
        self.size = size
        self.seed = seed
 
    def write(self):
        """ unsupported, throw an error if called """
        raise Exception('not supported')
 
    def read(self, readsize):
        """ Support read() - **readsize** is in bytes. """
        if not self.chunker:
            self.chunker = chunker(self.size, self.seed, readsize)
            # The first read primes the generator.
            return next(self.chunker)
        try:
            return self.chunker.send(readsize or 1000)
        except StopIteration:
            pass
        return ""

This version hit around 618 megabits/second, and it used the generator’s send() capability to let readers using the SyntheticFile implementation alter the chunk size they’re reading on the fly, which is important if you have a consumer that wants to read small/read big/read small. Well, that’s fine and all, but I was stymied: I wanted to make this thing fly. I want to be able to generate this data at 1 gigabit/second or faster.
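If the send() trick is unfamiliar, here is a stripped-down sketch of just that mechanism. It is not the chunker above: `sized_chunks` is a made-up name, and it walks a plain string front to back rather than rotating a deque, purely to show how a reader can change the chunk size between reads.

```python
def sized_chunks(data, chunksize):
    """Yield slices of data; a consumer may send() a new slice size."""
    pos = 0
    while pos < len(data):
        chunk = data[pos:pos + chunksize]
        pos += len(chunk)
        # send(n) resumes the generator here with n as the yield's value;
        # a plain next() resumes it with None, which keeps the old size.
        requested = yield chunk
        if requested:
            chunksize = requested


gen = sized_chunks("abcdefghijklmnopqrstuvwxyz", 4)
first = next(gen)        # primes the generator with the initial size
second = gen.send(10)    # read big
third = gen.send(2)      # read small again
```

The first call has to be next() (or send(None)), because a generator that has not started yet has no yield waiting to receive a value. That is why read() in the class above treats the first read as a special case.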

Astute readers may point out that there are other ways of doing this (mmap, or simply embedding the unique seed or a uuid). Well, this story isn’t about that, is it?

In any case, I suspected the “data = ' '.join(word_q)” line was the culprit. deque is pretty optimized, and I had removed a massive chunk of code which didn’t make sense. In fact, cProfile showed I was right:

woot:synthfiles jesse$ python -m cProfile synthfilegen.py
         2441475 function calls in 203.791 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
...snip...
   610352  180.282    0.000  180.282    0.000 {method 'join' of 'str'  objects}
...snip...

180 out of 203 CPU seconds, on the join alone. Curses! So this is when I really went mental (this is what happens when you’re too close to something). I decided that I needed to find some magical way of skipping the join and only reading what I needed. I ran down that rathole for a bit, until a friend of mine pointed out: “just make the words bigger”.

Full stop. I initially discounted it; I was zeroed in on that join. Oh wait. The text in the lorem file, when split on whitespace, is 4368 words. Joining those back together within the loop is expensive; that much I knew. I hit on the idea of treating them not as words but as chunks (which is how I was using them anyway): the fewer, larger pieces there are, the less work each join has to do.
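To get a feel for the difference, a quick timing sketch along these lines should do. The sample text, the 100-character piece size and the repeat count are all arbitrary assumptions, it uses a stand-in string rather than the real lorem file, and the numbers will vary from machine to machine.

```python
import timeit

# Stand-in for the lorem file (assumption: any large block of text will do).
text = "lorem ipsum dolor sit amet consectetur " * 800

# Many small pieces: one list entry per word.
words = text.split()

# Fewer, larger pieces: fixed 100-character slices of the same text.
chunks = [text[i:i + 100] for i in range(0, len(text), 100)]

word_time = timeit.timeit(lambda: ' '.join(words), number=2000)
chunk_time = timeit.timeit(lambda: ''.join(chunks), number=2000)

print("%d words:  %.3fs" % (len(words), word_time))
print("%d chunks: %.3fs" % (len(chunks), chunk_time))
```

Both joins produce roughly the same amount of text; what changes is how many separate strings have to be stitched together on every pass through the loop.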

I added a method (process_chunks) which treated the data source as chunks of bytes and made the WORDS vari­able a list of those chunks. Ini­tially, I set the chunk size to 100 (bytes) and here’s the cPro­file output:

woot:synthfiles jesse$ python -m cProfile synthfilegen.py
         2441766 function calls in 51.549 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
...snip...
   610352   29.712    0.000   29.712    0.000 {method 'join' of 'str' objects}
...snip...

And now the generator is kicking data out at 2.34 gigabits/second. Huge success. Obviously, if you increase the chunk size, it speeds up a bit more (e.g. 300 byte chunks gives about 2.5 gigabits/second). I cleaned it up a bit and here is the code:

(thanks! bitbucket.org).
Note that the speeds I’m dis­cussing are pass­ing the Syn­th­FileOb­ject to a pycurl han­dle and stream­ing it across the wire: not to disk.
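For illustration only, and not the checked-in code: a minimal sketch of the chunk-based idea, assuming the data source is cut into fixed-size pieces once and the generator then rotates and joins those pieces instead of individual words. `make_chunks` and `chunked_data` are made-up names, the sample text stands in for the lorem file, and the pieces are joined with an empty string so the original text is preserved.

```python
import collections


def make_chunks(text, chunk_bytes=100):
    """Cut the source text into fixed-size pieces (hypothetical helper)."""
    return [text[i:i + chunk_bytes] for i in range(0, len(text), chunk_bytes)]


def chunked_data(pieces, size, seed, chunksize=1000):
    """Yield `size` characters in total, reproducibly for a given seed."""
    if not pieces:
        raise ValueError("need at least one piece of source data")
    piece_q = collections.deque(pieces)
    seed_q = collections.deque(int(digit) for digit in str(seed))
    # Rotate by the seed up front so that small files differ too.
    piece_q.rotate(seed)
    remaining = size
    while remaining > 0:
        # Joining a few hundred pieces is much cheaper than joining
        # thousands of individual words.
        data = ''.join(piece_q)
        chunk = data[:min(chunksize, remaining)]
        # A reader may send() a new chunk size for the next read.
        chunksize = (yield chunk) or chunksize
        remaining -= len(chunk)
        piece_q.rotate(seed_q[0])
        seed_q.rotate(1)


source = "lorem ipsum dolor sit amet " * 50   # stand-in for the lorem file
pieces = make_chunks(source)
payload = ''.join(chunked_data(pieces, size=2500, seed=42))
```

Running the generator again with the same pieces, size and seed gives back the same payload, which is the whole point of the exercise.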

All told, it was a fun little jaunt, and I’ve succeeded in turning something I considered “throwaway” into something that’s a lot more useful, clean and fast. I’ve added a handful of unit tests to my sandbox, and I might make this a real module if anyone wants it. I want to rework the _process_chunks/globals stuff, but I’ve farted around with this long enough for now. I also want to add the ability to remove the chunking altogether and simply insert the seed into the data response, and not mess with the lorem text.

edit: I just checked in a new ver­sion which removes the _process_chunks func­tion and other glob­als and moves them into a class. I hate globals.

Sphinx and auto-building/tests

February 26th, 2009 § 4 comments § permalink

Ok, so I think I’m in sphinx-love. I’ve needed to really begin a largish documentation project for a code base I own and drive (omgmanagerspeak), and since I’d rather not completely rely on API docs, and I have exposure to sphinx courtesy of python-core work, I chose you, sphinx-a-chu!

Sphinx really is awe­some. I started just chug­ging through the docs, and ended up pulling from tip (via mer­cu­r­ial) and using the lat­est ver­sion for the theme sup­port Georg added recently.

Rather than rehash the basics, I’ll simply explain my setup, and why I love it. Starting with the output from sphinx-quickstart, which kicks out a makefile for you, I immediately turned on the ‘sphinx.ext.autodoc’ and ‘sphinx.ext.doctest’ extensions. The former allows you to tell sphinx (in your rst file) to delegate the documentation for this class/method/etc to the API docs, and the latter allows you to pass the examples/snippets you will have through doctest.
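As a rough sketch, turning those two extensions on comes down to one list in the conf.py that sphinx-quickstart generates. Only the two extension names come from the setup described here; everything else a real conf.py contains is left out.

```python
# conf.py (fragment): the rest of the generated file is omitted.

# autodoc pulls documentation out of the docstrings in your code;
# doctest runs the examples embedded in the docs as tests.
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.doctest',
]
```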

These two things make writing/testing the docs awe­some — but it gets more awesome.

With sphinx, you can run make html to run your build, part of which is the ver­i­fi­ca­tion of your .rst syn­tax. I added a tar­get to the Make­file to run the doctests:

doctest:
	mkdir -p build/doctests
	$(SPHINXBUILD) -b doctest source build/doctests
	@echo
	@echo "Doctests passed"

This means I can run “make doctest html” and I get everything I want. Hooray!

But wait.. I don’t want to have to edit, build, and open the html to review/read it — we can make this smarter, faster.

So I loaded up Bruno Bord’s tdaemon and hacked it up so that I could pass it a custom command (I added a sphinx argument). After I point it at my doc/source/ directory, any time I hit save the “make doctest html” target runs.

This is nice: it’s instant build/feedback. I coupled this with a simple “python -m SimpleHTTPServer” running in the build/html/ directory of the docs, and leave my browser pointed at localhost:8000.

All I need to do is hit save, and pro­vided I haven’t fail­boated the tests, I can just hit refresh and see things immediately.
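tdaemon’s own code isn’t shown here, so the following is only a sketch of the general save-and-rebuild loop, not how tdaemon does it. It polls modification times under a directory and runs a command when anything changes. The names `snapshot` and `watch`, the one-second interval and the directory layout in the usage line are all assumptions.

```python
import os
import subprocess
import time


def snapshot(root):
    """Map every file under root to its last-modified time."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                state[path] = os.path.getmtime(path)
            except OSError:
                # The file vanished between listing and stat; skip it.
                pass
    return state


def watch(root, command, interval=1.0, cwd=None):
    """Run command whenever a file under root is added, removed or saved."""
    previous = snapshot(root)
    while True:
        time.sleep(interval)
        current = snapshot(root)
        if current != previous:
            subprocess.call(command, cwd=cwd)
            previous = current
```

Something like `watch("doc/source", ["make", "doctest", "html"], cwd="doc")` would then rebuild on every save, looping until interrupted.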

I’m probably going to hack tdaemon up a bit to either accept a command series, or allow arbitrary commands so I can remove the hardcoded sphinx logic I added, and possibly add the capability of watching multiple directories (and supporting custom (different) commands for each).

This way, I could say “tdaemon --dir1=sphinxsource --cmd="make doctest html" --dir2=pythondir --cmd="nosetests -s -v"”, so if I hit save on any file I’m working on I get this same feedback loop.

I can hack this so it auto-tests/deploys a Django application (on changes) to my local sandbox as well. Just something neat, that’s working well for me. Additionally, I want to add some intelligence so that in the case of a failure, the log file is automatically opened in the editor of my choice via the OS X open command.

I should add, that I also hacked tdae­mon to mod­ify my PYTHONPATH to add the project I’m hack­ing on so I don’t need to install/deploy it for the API inte­gra­tion in sphinx to work. This allows me to have this type of setup on a per-branch basis.
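The tdaemon change itself isn’t shown, but the idea can be sketched as building a modified environment for the build command. `env_with_project` and `run_build` are hypothetical names, and prepending the project directory (rather than appending it) is an assumption.

```python
import os
import subprocess


def env_with_project(project_dir, base_env=None):
    """Return a copy of the environment with project_dir on PYTHONPATH."""
    env = dict(os.environ if base_env is None else base_env)
    parts = [os.path.abspath(project_dir)]
    existing = env.get("PYTHONPATH")
    if existing:
        parts.append(existing)
    # Put the working copy first so it wins over any installed version.
    env["PYTHONPATH"] = os.pathsep.join(parts)
    return env


def run_build(command, project_dir, cwd=None):
    """Run the build command with the project importable by autodoc."""
    return subprocess.call(command, cwd=cwd, env=env_with_project(project_dir))
```

Because the path is set per command rather than system-wide, each branch checkout can point the build at its own copy of the code.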

Stackless: You got your coroutines in my subroutines.

February 23rd, 2009 § 7 comments § permalink

Note: This is another post in what I hope will be a series leading up to my concurrency/distributed systems talk at PyCon. I’m steadily working through experimenting with and learning the various frameworks/libraries in the python ecosystem.

I reserve the right (and probably will) to revise these entries based on feedback from people (mainly the author(s) of said tool(s)). I will also add additional bits and pieces as I learn and explore more.

Stack­less python — here’s another big one on the pile — is much more than a library, or a frame­work which runs on CPython — Stack­less is actu­ally a mod­i­fied ver­sion of the CPython inter­preter. It’s much more than just a C-extension. Stack­less is in use by var­i­ous peo­ple and com­pa­nies — most notably, it’s in use by CCP Games, mak­ers of Eve Online (see this pycon pre­sen­ta­tion). In fact, CCP Games is a large part of why Stack­less is still around today.

» Read the rest of this entry «

PyCon 2009: In ur brain, giving you the pythons

February 21st, 2009 § 4 comments § permalink

I, along with a whole heck of a lot of other people, will be attending PyCon in March. You should know about this by now, unless you’re living under a rock, or in a shoebox (I like shoeboxes).

PyCon 09 is turn­ing out to be one of the ones I am most excited about in some time — bar­ring the fact they let me actu­ally stand up and speak about some­thing, there’s a ton of other excel­lent and excit­ing things going on.

I will be doing two talks — “Intro­duc­tion to Mul­ti­pro­cess­ing in Python” on Fri­day, at 3:20 PM, and “Con­cur­rency and Dis­trib­uted Com­put­ing with Python Today” at 3:20 PM Saturday.

The for­mer talk is easy, given it will be focused on intro­duc­ing the mul­ti­pro­cess­ing mod­ule to the masses, and tak­ing ques­tions about it (be gen­tle, I just work here). The lat­ter is a much big­ger beast. I’ve been blog­ging about my research in my pycon 2009 cat­e­gory, and I still have a pile of things to keep adding to that. The talk will attempt to dif­fer­en­ti­ate con­cur­rency from dis­trib­uted sys­tems, and show the var­i­ous toolkits/frameworks/etc in the ecosys­tem today, to help you build both types of systems.

Given both talks are 45 min­utes in length, I will be pub­lish­ing my talk notes (my slides are not going to be heavy weight) and other infor­ma­tion here.

In addi­tion to me speak­ing, which may or may not be excit­ing, there’s one hel­luva ton of other talks which sim­ply look awesome.

My sched­ule looks like this:

  • Thurs­day: Python Lan­guage Sum­mit, where I will endeavor to be smart.
  • Fri­day: How to give a python talk
  • Fri­day: Using Windmill
  • Fri­day: Intro­duc­tion to Python Pro­fil­ing or How Python is Devel­oped, to harass Brett.
  • Fri­day: Panel — Python VMs
  • Fri­day: Build­ing an Auto­mated QA infra­struc­ture using Open Source Tools
  • Fri­day: Twisted, AMQP and Thrift: Bridg­ing mes­sag­ing and RPC for build­ing scal­able dis­trib­uted applications
  • Fri­day: My Talk (I fig­ure I should go to it)
  • Fri­day: A Whirl­wind Excur­sion through Writ­ing a C Exten­sion or Chal­lenges and Oppor­tu­ni­ties for Python
  • Thurs­day: Plu­g­ins and mon­key­patch­ing: increas­ing flex­i­bil­ity, deal­ing with inflexibility
  • Sat­ur­day: The (lack of) design pat­terns in Python or the Pinax talk
  • Sat­ur­day: Class Dec­o­ra­tors: Rad­i­cally Simple
  • Sat­ur­day: Panel: Object Rela­tional Map­pers: Philoso­phies and Design Decisions.
  • Sat­ur­day: Drop ACID and think about data (I won­der if we can take this lit­er­ally, if he has scary slides though, we all might trip balls)
  • Sat­ur­day: My Talk, which is up against Bruce Eckel and Ray­mond H — I expect no one to show up.
  • Sun­day: Panel: Func­tional Test­ing Tools in Python
  • Sun­day: Design­ing a web frame­work: Django’s design decisions

This doesn’t even cover the open space dis­cus­sions which I might attend — includ­ing the “Writ­ing About Python” one Doug Hell­mann is putting together as well as the “Teach Me Web Test­ing” one by Steve Holden.

After the main conference, I’ll be sticking around until Thursday morning for the sprints, at which point I should be sufficiently burned out on python stuff that I will fully convert over to being a full time burger flipper.

Right now there are 620 registered people who made their attendance public; I don’t know how many there are in total.

Hope to see you there!


Twisted — hello, asynchronous programming

February 11th, 2009 § 5 comments § permalink

Note: This is the third post in what I hope will be a series leading up to my concurrency/distributed systems talk at PyCon. I’m steadily working through experimenting with and learning the various frameworks/libraries in the python ecosystem.

I reserve the right (and probably will) to revise these entries based on feedback from people (mainly the author(s) of said tool(s)). I will also add additional bits and pieces as I learn and explore more. Additionally, thanks to glyph for giving me a hell of a lot of feedback.

Twisted is the 800 lb gorilla of the “concurrency” frameworks. It’s been around for a while, has a large following, and is used by everyone from Apple (iCal server) to Buildbot. It has a literal ton of sub projects and other “semi attached appendages”.
» Read the rest of this entry «

SSH Programming with Paramiko | Completely Different

February 5th, 2009 § 20 comments § permalink

OpenSSH is the ubiquitous method of remote access for secure remote-machine login and file transfers. Many people (systems administrators, test automation engineers, web developers and others) have to use and interact with it daily. Scripting SSH access and file transfers with Python can be frustrating, but the Paramiko module solves that in a powerful way.

This is a reprint of an article I wrote for Python Magazine as a Completely Different column that was published in the October 2008 issue. I have republished this in its original form, bugs and all.

» Read the rest of this entry «

A (brief) introduction to Python-Core development | Completely Different

February 4th, 2009 § 2 comments § permalink

This is a reprint of an arti­cle I wrote for Python Mag­a­zine as a Com­pletely Dif­fer­ent col­umn that was pub­lished in the August 2008 issue.

In the early sum­mer of this year I had the chance to really get started work­ing on/with the core Python source. I had spent some time putting together a Python Enhance­ment Pro­posal (PEP) which was accepted. Now, I just needed to learn the code base, prac­tices and buy a hel­met. Shortly after get­ting the ini­tial patch accepted, I ended up break­ing the build, tests and caused the beta to slip. This arti­cle is an intro­duc­tion to Core devel­op­ment, in which we’ll cover what you need to get started, and where I per­son­ally screwed up.

» Read the rest of this entry «

Get with the program as contextmanager | Completely Different

February 3rd, 2009 § 5 comments § permalink

One of the cooler fea­tures that came with Python 2.5’s release is the ‘with’ state­ment and the con­text man­ager pro­to­col behind it. I could make the argu­ment that these two things alone make the upgrade to Python 2.5 more than com­pelling for those of you trapped in the dark ages of 2.4 or worse: 2.3!

This is a reprint of an article I wrote for Python Magazine as a Completely Different column that was published in the July 2008 issue. I have republished this in its original form, bugs and all.

» Read the rest of this entry «

An Interview With Adam Olsen, Author of Safe Threading | Completely Different

February 2nd, 2009 § 8 comments § permalink

This is a reprint of an arti­cle I wrote for Python Mag­a­zine as a Com­pletely Dif­fer­ent col­umn that was pub­lished in the June 2008 issue.

A world with­out a Global Inter­preter Lock (GIL) — the very thought of it makes some peo­ple very, very happy. At PyCon 2007 Guido openly stated that he would not be against a GIL-less imple­men­ta­tion of Python, pro­vided some­one coughed up the patch itself. Right now, that some­one is Adam Olsen — an ama­teur pro­gram­mer who has been work­ing on a patch to the CPython inter­preter since July of 2007.

It’s PyCon. I’m sup­posed to be lis­ten­ing to a talk, but I’ve fallen down the rab­bit hole of a future with­out a global inter­preter lock. I’m locked in on get­ting a patched ver­sion of the inter­preter up and run­ning on Mac OS/X and the patch author, Adam Olsen, is coach­ing me through changes to some of the deep­est inter­nals of Python itself.
» Read the rest of this entry «

Python Threads and the Global Interpreter Lock

February 1st, 2009 § 22 comments § permalink

There are a plethora of mech­a­nisms and tech­nolo­gies sur­round­ing con­cur­rent pro­gram­ming — Python has sup­port for many of them. In this arti­cle we will explain, exam­ine, and bench­mark Python’s thread­ing sup­port, and dis­cuss the much maligned Global Inter­preter Lock (GIL).

This is a reprint of a fea­tured arti­cle I wrote for Python Mag­a­zine that was pub­lished in the Decem­ber 2007 issue. This arti­cle assisted in inspir­ing me to write PEP 371.

» Read the rest of this entry «

Where am I?

You are currently viewing the archives for February, 2009 at jessenoller.com.