
Benchmark followup: Google-code Edition

September 16th, 2007 | 6 Comments | Posted in Programming, Python

Since last week's tempest in a blogpot, and my subsequent post "Have GIL: Want Benchmarks", I've been doing a lot of reading, planning and discussion with people.

What this has culminated in is a google-code project I started late last week, tapping Daniel Watkins and picking the brains of others to start pulling together information about the entire threaded/concurrency domain.

The google-code project is named "python-distributed" and right now has Daniel Watkins' initial work, plus a bunch of wiki pages with my research in them. It is not exhaustive by any means - I've only just started taking half-baked blog posts, notes scattered all over my hard drive and code that I've got, and trying to organize it all into something.

I know people want "numbers". I know people want hard, fast and decisive numbers - that's what benchmarks are, right? More frequently, benchmarks are a lightning rod for controversy. Some will yell foul, others will trumpet them as the greatest thing since "Goodnight Moon" (truly, a godsend). The only good benchmark is the one without an ulterior motive, and with the information out in the open so that everyone can run it.

I like to think of "numbers" being a non-goal for this project - personally, I want to explore all of the concurrency-like and true concurrency/threading/etc packages out there for python with a simple set of baseline tests. The eventual goal is to find one (or build one) that will make it into the python-lib. The library must be easy to use, safe, and pythonic.

I've spoken to many people and my goal is to generate a lot of tests - my primary background is in test engineering (automation) - so laying out what code tests we want, what packages we want to test and how we want to test them is primary in my mind. I mean, just the chance to visit all of these packages and ideas and hack up some code - delicious, delicious code.

So, with that all said: I'm sure people want to contribute, review, etc. With any luck I'll be able to start pushing in real code shortly, and Daniel has already pushed in his initial test code to his sandbox.

If you want to add to it, contribute, anything - feel free to shoot me an email and ask me to add you as a contributor [1].

  1. Again, the project is here.

workingenv is dead, long live virtualenv

September 14th, 2007 | Posted in Programming, Python

Just saw the PyPI record show up - looks like Ian Bicking has replaced good ol' workingenv.py, which I so love, with virtualenv.py. Check it out; we loves our sandboxes, yes we do.

Off to bed though - a trip to the vet, pending infant awakening for food - good times. Although, I find it amusing still that I took both the cat, and my baby daughter to the vet in plastic carriers (one a crate, one a car seat).

Have GIL: Want Benchmarks.

September 12th, 2007 | 9 Comments | Posted in Programming, Python

IMG_0869.JPG [1]

So, with the recent furor over the GIL, one of the things GvR asked for was for someone to provide some benchmark numbers of Single v. Multi v. Other in various tests.

I started working on this a night or so ago, and a lot of things have fallen out of it - not the least of which is my burning desire to really look at the GIL, threading in python and the ecosystem of alternatives.

A thing of Note: I've spawned an initial google code project to explore the alternatives/benchmarks. Read about it here.

The first test script I wrote was a simple one: take a function which calculated a number of fibonacci numbers and run it in a loop with larger and larger sets. I then:

  • Ran it twice, synchronously.
  • Ran it twice, once in each thread (2 threads).
  • Ran it twice, but used the processing module to spawn two processes, running it once in each.
  • Ran it twice, but used the parallel python module to spawn a pool of workers for 2 processors, running it once in each.

Yes. I know that fib calculations are a processor-only activity, and that's all I wanted to do at first. I want to add tests for file access, network access, and I want to run the tests (as allowed by the extension modules) in PyPy, Jython and IronPython [2].
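The runs listed above can be sketched roughly like this. It is not the script from the post (that one isn't shown here), just a minimal sketch under a few assumptions: Python 3, the standard-library multiprocessing module standing in for the processing package it grew out of, and made-up names (fib_batch, run_sync, run_threads, run_processes, timed). The parallel python run is left out because pp is a third-party install.

```python
import threading
import time
from multiprocessing import Process, Queue


def fib(n):
    """Iteratively compute the n-th Fibonacci number (fib(0) == 0)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_batch(count):
    """CPU-bound unit of work: the first `count` Fibonacci numbers."""
    return [fib(i) for i in range(count)]


def run_sync(count):
    """Run the workload twice, one call after the other."""
    return [fib_batch(count), fib_batch(count)]


def run_threads(count):
    """Run the workload twice, once in each of two threads."""
    results = [None, None]

    def worker(slot):
        results[slot] = fib_batch(count)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def _process_worker(slot, count, queue):
    # Runs in the child process; ships its result back through the queue.
    queue.put((slot, fib_batch(count)))


def run_processes(count):
    """Run the workload twice, once in each of two child processes."""
    queue = Queue()
    procs = [Process(target=_process_worker, args=(i, count, queue))
             for i in range(2)]
    for proc in procs:
        proc.start()
    # Drain the queue before joining so the children can exit cleanly.
    collected = dict(queue.get() for _ in procs)
    for proc in procs:
        proc.join()
    return [collected[0], collected[1]]


def timed(func, count):
    """Return the wall-clock seconds taken by func(count)."""
    start = time.perf_counter()
    func(count)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Larger and larger sets, as in the description above (sizes are arbitrary).
    for size in (100, 200, 400):
        for func in (run_sync, run_threads, run_processes):
            print("%-14s count=%-4d %.4fs" % (func.__name__, size, timed(func, size)))
```

On a GIL-bound interpreter the threaded run would not be expected to beat the synchronous one for pure CPU work like this; the actual timings depend on the machine and interpreter, so none are quoted here.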

In the long run - I want to make something of a test suite to not only demo the various methods/alternatives but also to work towards a future where something like parallel python or the processing module can make it into the standard python library.

Not to mention, it should help those looking for alternatives to basic threading in python (or more information on threading in python). Things that have to be taken into account:

  • Shared vs. unshared memory and state.
  • Ease of use/API
  • Amount of Fun.
  • Can the solution spread across machines?
  • Is it, or can it be, useful for web applications? (i.e: can it be used to spread load across processors)

If you have a test or a suggestion - drop me an email or post a comment. Heck, this might just turn out to be a learning experience for me or it could turn out to be something bigger.

Update: Wow, you guys are awesome. The private and public comments I've gotten on this have left me with a lot to think about - and hell - a lot to learn. I'm still in the toying/planning phase, so please keep feeding me information. When I get more time this weekend (if I get time) I will try to put up something cohesive, including a subversion repository and maybe a wiki.
Update 2: Hello Reddit.

  1. Yes, this is a gratuitous picture of my daughter.
  2. Yes, Daniel Watkins has started this process as well.

Chandler: 0.7.0.1 is out.

September 11th, 2007 | 3 Comments | Posted in Programming, Python

To quote:

First release suitable for regular end users.

Remember boys and girls, it's in python (and most of the modules are on the cheeseshop) - be sure to give it a whirl.

Whoops! Forgot a link.

Bruce Eckel: Parallel Python

September 11th, 2007 | 1 Comment | Posted in Programming, Python

Woop! Bruce ran into the PP module after the flareup over the GIL yesterday - he seems to like it. I'm writing a few things up on all of this (and providing demos) and PP is one of the modules that, due to its simplicity and ease-of-use, I am focusing on.

Parallel Python: "My previous post led me to this library, which appears to solve the coarse-grained parallelism problem quite elegantly."

Update to add: A "quick benchmark" has also been posted (via reddit)

Interesting Read: Tear Down that GIL!

September 10th, 2007 | Posted in Programming, Python

This is a post that popped up this AM for me - "An open letter to Guido van Rossum: Mr Rossum, tear down that GIL!". It's an interesting read, if for nothing else than the fact that the GIL is the biggest piece of cane with which people beat python (I don't count significant whitespace whining). Take a peek at that, and then read the reddit discussion.

I've been writing a piece on the GIL off and on for a few weeks now - collecting discussions and information and trying to piece together the most cogent series of thoughts I could muster.

The interesting thing is that the GIL was also one of the points Bruce Eckel touched on in his recent "Python 3K or Python 2.9?" post on the 8th - be sure to read the comments there too.

I've collected about 30 or so GIL/Concurrency related tidbits I'll put together in a piece, but I am interested in hearing other opinions about the GIL before I reveal my particular horse in this race.

Call me weird - but threading itself is a bit of a beast in my eyes: it could be because I've never built a giant UI application, or because I am used to talking amongst machines rather than cores on a single machine. It seems(?) better to plan to scale across nodes in a cluster rather than processors on the machine. Coding to the former begets the latter, no? Note that I do use threading - quite frequently, I might add.

Update: Michael Tsai weighs in on the "open letter".

Update Part 2: Guido's posted a response to the open letter. I think this clarifies the BDFL's position as one of "give me something and we'll see". (And the reddit discussion now.)

Advanced Django Presentation from Simon Willison

September 8th, 2007 | 2 Comments | Posted in Personal, Programming

Just crossed Reddit, it's right here - I don't think poor Titus will ever live down the quote from the last PyCon:

"I don’t do test driven development. I do stupidity driven testing... I wait until I do something stupid, and then write tests to avoid doing it again." - Titus Brown [1]

There's a lot of django-testing information in the slides. Also, newforms is covered. I'm still trying to wrap my head around newforms for a knowledge base system I am trying to write in my free time in django [2].

Take a look at the presentation. I <3 django.

  1. Hey, I was there when he said it!
  2. I have sketches of MODELS man, SKETCHES! If I fire up some UML I can ship it, right?!

CheeseShopping: python-application, mglob and pickleshare

September 8th, 2007 | 2 Comments | Posted in Programming, Python

Well, it's been a bit since I've done one of these run throughs. One of the RSS feeds I watch (out of 120) is for the Python CheeseShop - this is where a lot of very interesting modules are uploaded by community authors, some of more interest than others.

When looking at modules at the cheeseshop I always keep an eye towards code examples - finding a particularly interesting implementation of something (say, the debug/memory.py module in python-application) always helps me improve my code/applications/etc.

I like to check out (albeit briefly) and write down notes about modules of interest that I see - I have a backlog of around forty modules I have notes on. This morning, I saw 3 that piqued my interest. (Note, I started writing this earlier this week - only just now finishing it.)

As a side note: many of these modules can be installed via easy_install.py - I tend not to randomly install modules (and pollute my path!), preferring instead to grab the tarball and poke around in a sandbox/workingenv.py style environment.

First up is python-application (v 1.0.9) which is, to quote:

This package is a collection of modules that are useful when building python applications. Their purpose is to eliminate the need to divert resources into implementing the small tasks that every application needs to do in order to run successfully and focus instead on the application logic itself.

I snagged this, and there are some excellent code examples/useful tidbits in the package, though I don't know if I would use the entire thing in a given application. Of particular note was this snippet from application/debug/memory.py:

 
import gc
def memory_dump():
    print "\nGARBAGE:"
    gc.collect()
    print "\nGARBAGE OBJECTS:"
    for x in gc.garbage:
        s = str(x)
        if len(s) > 80:
            s = s[:77] + '...'
        print "%s\n  %s" % (type(x), s)
gc.enable()
gc.collect() ## Ignore collectable garbage up to this point
gc.set_debug(gc.DEBUG_LEAK)
 

The module is documented well - all you have to do is import * from the module and then call memory_dump() later. The datatypes.py module in the configuration directory was also a very nice example, as was the process.py module at the top level. I'd suggest taking a look at it just to learn more - everyone has their own mise en place or tool box, so to speak - we all have our own little bits of code we carry from application to application. This package is a good example of simple, useful things that we always end up doing (and I learned a few tricks too).
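To see what that kind of garbage dump actually catches, here is a small sketch of my own (not code from python-application): it builds a reference cycle on purpose and asks the collector to keep what it finds. It assumes Python 3 and uses only gc.DEBUG_SAVEALL, which is one of the flags bundled into gc.DEBUG_LEAK, to avoid the extra stderr chatter. Node and find_cycle_garbage are made-up names.

```python
import gc


class Node(object):
    """Tiny object that can point at a partner, to build a reference cycle."""

    def __init__(self, name):
        self.name = name
        self.partner = None

    def __repr__(self):
        return "Node(%r)" % self.name


def find_cycle_garbage():
    """Create a two-object cycle and return the Nodes the collector found."""
    old_flags = gc.get_debug()
    gc.collect()                    # ignore garbage that existed before this point
    del gc.garbage[:]
    gc.set_debug(gc.DEBUG_SAVEALL)  # keep unreachable objects in gc.garbage
    try:
        a, b = Node("a"), Node("b")
        a.partner, b.partner = b, a  # a -> b -> a: refcounts never reach zero
        del a, b
        gc.collect()
        found = [obj for obj in gc.garbage if isinstance(obj, Node)]
    finally:
        gc.set_debug(old_flags)     # restore whatever debug flags were set
        del gc.garbage[:]
    return found


if __name__ == "__main__":
    for node in find_cycle_garbage():
        print(type(node), node)
```

The save-and-restore of the debug flags is there so the sketch can be dropped into a larger program without leaving the collector in leak-debugging mode.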

As a side note, my toolbox code has evolved so rapidly and so drastically from when I first started hacking python, I sometimes wonder if one day my basic if __name__ == "__main__": setup will gain sentience. I'm glad that most programmers like to share though - if we guarded our tools as closely as BBQ masters guard their rubs and sauces, we'd be in trouble.

Update to note: The author of both pickleshare and mglob added a comment to this post outlining some good information, as well as the fact both tools are/will be in IPython. Also, that mglob's syntax was intentional - "it’s optimized for brevity and convenience" (which is why I found it cryptic). For this particular tool (mglob) Path would not have helped him.

The next one is pickleshare (v0.3), to quote:

PickleShare - a small 'shelve' like datastore with concurrency support
Like shelve, a PickleShareDB object acts like a normal dictionary. Unlike shelve, many processes can access the database simultaneously. Changing a value in database is immediately visible to other processes accessing the same database. Concurrency is possible because the values are stored in separate files. Hence the "database" is a directory where all files are governed by PickleShare.

Another quote from the readme:

Version note: this is an early beta version of the module. It has been tested (and works) in both Linux and Windows. This will probably end up as the interactive persistence system for IPython 0.7.2+, to make inter-ipython-session data sharing possible in real time.

This is an interesting module - shared objects/dbs in a concurrent system run the risk of various deadlock issues/data syncing issues/etc. This module aims to bypass that with the simple file-based workaround. In my (admittedly small) testing it seems to get the job done just fine - the fact that the "database" is written to disk (and therefore accessible without the pickleshare module itself and maintained through app runs obviously) is quite nice.

Cracking open the pickleshare.py module itself showed some very interesting code (again, teaching more tricks) - for more enlightenment, read the test() method [1]. This is a very interesting module, and the usage/class style: PickleShareDB(UserDict.DictMixin) was very useful.
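The one-file-per-value idea is easy to play with on its own. What follows is not PickleShare's code or API, only a toy sketch of the same shape under my own assumptions: Python 3, string keys, and collections.abc.MutableMapping doing the job UserDict.DictMixin did in Python 2. FilePerKeyStore and its "k-" file prefix are invented for the example.

```python
import os
import pickle
import tempfile
from collections.abc import MutableMapping
from urllib.parse import quote, unquote


class FilePerKeyStore(MutableMapping):
    """Dictionary-like store keeping each value in its own pickle file."""

    PREFIX = "k-"  # hypothetical naming scheme; keeps temp files out of listings

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key):
        # Percent-encode the key so "/" and friends are safe in a file name.
        return os.path.join(self.root, self.PREFIX + quote(key, safe=""))

    def __getitem__(self, key):
        try:
            with open(self._path(key), "rb") as handle:
                return pickle.load(handle)
        except FileNotFoundError:
            raise KeyError(key)

    def __setitem__(self, key, value):
        # Write to a temp file, then rename, so readers never see a partial file.
        fd, tmp_path = tempfile.mkstemp(dir=self.root, prefix=".tmp-")
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(value, handle)
        os.replace(tmp_path, self._path(key))

    def __delitem__(self, key):
        try:
            os.remove(self._path(key))
        except FileNotFoundError:
            raise KeyError(key)

    def __iter__(self):
        for name in os.listdir(self.root):
            if name.startswith(self.PREFIX):
                yield unquote(name[len(self.PREFIX):])

    def __len__(self):
        return sum(1 for _ in self)


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    writer, reader = FilePerKeyStore(root), FilePerKeyStore(root)
    writer["jobs/today"] = ["benchmark", "write it up"]
    print(reader["jobs/today"])  # a second handle on the same directory sees it
```

Because every read goes back to disk, two handles on the same directory (or two processes) see each other's writes without any shared in-memory state. The price is a file open per lookup, and this sketch makes no attempt at locking for read-modify-write sequences.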

I'd like to play with this module + the processing module.

And finally, mglob (v0.4), which is:

Usable as stand-alone utility (for xargs, backticks etc.), or as a globbing library for own python programs. Globbing the sys.argv is something that almost every Windows script has to perform manually, and this module is here to help with that task. Also Unix users will benefit from enhanced features such as recursion, exclusion, and directory omission.

I put this in my little ~/toolbox binary dir as soon as I started playing with it - as a command line utility, it's insanely useful (yes, I also know there are other tools out there like this). The command line syntax is sort of counter-intuitive at first (for example, I wanted to find all of my mp3s):

woot:~/Desktop/Downloads/tmp/mglob-0.4 jesse$ python mglob.py rec:/Users/=*.mp3

The syntax is rec: (recursive) /Users/ (directory to traverse) =*.mp3 (files to find). This is of course explained in the help functions of the script. Using this in a python application is also sort of cryptic, but matches the command line:

 
from mglob import expand
expand("rec:/Users/=*.mp3")
 

You get a full list back from the result of the glob - this would be a serious problem for massive-recursive globs (the option to use a generator to yield() would be nice). I like the command-line usage, but for pure code-globbing, things like Jason Orendorff's Path module and other alternatives just feel better API-wise [2].
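For the generator wish, something along these lines would do. This is not part of mglob and does not understand its rec:/= syntax; it is a plain standard-library sketch (os.walk plus fnmatch) with a made-up name, iter_glob, that yields matches one at a time instead of building the whole list first.

```python
import fnmatch
import os
import tempfile


def iter_glob(root, pattern):
    """Lazily yield paths under `root` whose file name matches `pattern`."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort makes the walk order predictable
        for name in sorted(filenames):
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    # Build a throwaway tree so the demo does not depend on real files.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "albums", "live"))
        for rel in ("a.mp3",
                    os.path.join("albums", "b.mp3"),
                    os.path.join("albums", "live", "notes.txt")):
            open(os.path.join(root, rel), "w").close()
        for path in iter_glob(root, "*.mp3"):
            print(os.path.relpath(path, root))
```

Since nothing is collected up front, a caller can stop after the first hit or stream results into another tool without holding a huge list in memory.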

  1. I've gotten into the habit of TDD/reading tests to determine functionality more and more lately while learning Java.
  2. Jason's site seems down, I put a copy of path.py I had here.

David Stanek: Announcing Design Python Pattern of the Week

September 4th, 2007 | Posted in Programming, Python

I know, I know - I meant to do this myself a little while ago. All of my projects seem to be backing up behind the wreck on my mental freeway, between a new infant and learning Java end-to-end. David Stanek is promising one of the GoF Patterns a week - check it out! Announcing Design Python Pattern of the Week