A short list of things I don’t like about Python


Yeah, I haven’t posted in a while - since pycon I’ve been sick off and on, working my butt off at the place which allows me to purchase ice cream for my kid, and so on. Busy busy. Not to mention, I’ve been suffering a slight case of burnout - long story.

That all being said, I think it was last week when I twittered a minor philosophical point which was picked up and run with by pydanny. The little point I made was something like:

I don’t think it’s unreasonable to be able to name at least 5 things you don’t like/would change about something you love. Implementation details are fair game too

Now, before I delve into my personal list, I want to provide some context to this comment. It actually has some history, and it’s not an original thought - I think Titus Brown started a meme around this last year. In his case, it was purely based around python.

In my case, I’ve long maintained that if you cannot name things that irk you, that you would change, or that you generally dislike about something you supposedly love (not just a language - whether it be a tool, a language, an OS, etc.), then it shows you have a certain lack of self-awareness or pragmatism (there is another word I’m grasping for here, but it escapes me).

Historically, I’ll ask this in interview situations, whether I’m speaking with someone who is a test engineer (name and explain 5 things you love/hate about automated testing) or a programmer (name and explain 5 things you love/hate about language $FOO). Generally speaking, this is great discussion fodder, and allows you to probe the thought process of the candidate.

For example, if someone says “I hate Python’s whitespace” and they’re interviewing for a Python coding position, I think it fair game to dig into that a bit and see if it’s rational, and ultimately ask the question: If you hate something so fundamental, why do you use it/why do you want to program in it for the foreseeable future?

In any case, I promised a few people I’d give them the shortlist of nits (I don’t hate these things, I simply dislike them) I have with Python. It’s important to remember that:

  1. I contribute work to python-core (see the multiprocessing module)
  2. I program in Python daily for work, and in my free time too.
  3. I participate (when I’m not on a self-imposed exile) on the mailing lists and discussions (see Python-dev, etc.)
  4. I too, am a strong believer in “put up or shut up”

Now, part of me is sad that I have to preface me being critical with a disclaimer like the above; but alas - some people, especially those on the internet, thrive on controversy and fail to read more than 5 or 6 words before posting some half-witted response, or worse yet, someone skims to the gripes I have, finds one they want to take me to task about and says “SUBMIT A PATCH !!!11”.

I do contribute back, so you can avoid telling me to submit a patch, ok?

That all being said, here’s my list:

  1. Concurrency: This is actually a love/hate topic for me. Obviously, I’m the maintainer of the multiprocessing module, which sidesteps the GIL, but the GIL is still an irritant for me (given I do write a lot of threaded code). A lot of people are very familiar with the fact I am a proponent of threads and processes/IPC, as both serve different (yet overlapping) purposes. There is room for both. Hopefully unladen-swallow will be able to get rid of the GIL, and then we can all move on with our lives: So long as in killing it, we don’t hose the ecosystem of C extensions.
    • Additionally, I would love to see a decent coroutine implementation included in the standard library once PEP 380 is done and in the bag. If you need justification, see David Beazley’s coroutine talk. Again, while people might disagree with this, saying that coroutines/processes/threads all “do the same thing” and would violate “TIOWTDI” (There’s Only One Way To Do It), I would strongly disagree with them. In the case of concurrency, different solutions fit different problems. We do not have a grand unified theory of concurrency within python.
    • Also in the concurrency vein, I would like to see a cross language messaging/serialization system/format eventually come in. Right now, we have pickle; most recently, JSON - and JSON might be the final answer in this regard, but something akin to protocol buffers has also piqued my interest. Given we have JSON, I’m not terribly hot on this one.
    • Finally, I’d like to see more of the java.util.concurrent abstractions migrated in. I mean, using python threads isn’t hard, seriously, but more/better abstractions make things nicer for everyone.

  2. The Standard Library: This, again, is a love/hate thing - I love the standard library, and I will gladly argue with anyone who suggests getting rid of it. That said, I would like to see the entire thing get a much better documentation treatment; the docs, while good, could be 1000x better, clearer, etc. I would also make every single module in there PEP 8 compliant. I know that sounds like a style-nazi thing, but if that’s the style we’re to use, I think the first thing to adhere to it should be the standard lib.
    • It’s also disorganized. While flat is better than nested, I’m sorry - but I think making it deeper and putting all the things like one another into the same namespaces does make sense.
    • I would also break out the stdlib from core. This idea was discussed at the python language summit, and I think almost everyone there was in agreement. The idea would be to separate out the stdlib into its own path inside the repo, and other python implementations (such as Jython/etc) could use that copy as their copy of the stdlib modules. Anything which was CPython specific (such as multiprocessing) would stay with core/be marked as CPython only.
    • Taking this concept of breaking out the standard library a little further: I would begin to evolve it a little more quickly. There’s a strong difference between changes to the language, and changes to the standard library. In the case of the former, it should evolve slowly and carefully. In the case of the latter (the stdlib), I think it could - and possibly should - evolve more quickly. By evolve, I mean “get cleaned up, have things removed/added” more quickly. I do not, however, mean with less thought. There’s obviously a lot of “buts” and other concerns with this idea, but it’s just a thought. I think compartmentalizing this into python-core and python-stdlib meshes with how a lot of people think about things.

  3. The Docs: I touched on this in the stdlib one, but the standard library documentation, as well as rich examples for a lot of the core features are lacking. Many of them focus on syntax and not necessarily on use. For example, I would gladly integrate all of Doug Hellmann’s Python Module of the Week posts into the standard library documentation tomorrow, and wholesale if I could - his examples are much more rich than those we find in the current docs.
    Many people, including myself, have been working on making these better - in my case, I need to overhaul the multiprocessing docs when I have a chance.
    • Don’t get me wrong - I actually appreciate the docs we have, they keep me sane, but they can be better, more clear and in some cases, more practical. One or two examples for usage just doesn’t cut it.

    • update see: http://tosh.pl/gminick/gsoc/sphinx/


  4. Packaging: Ahhhhh! I’m not going to go too deep into this rabbit hole, especially given I know Tarek is hacking away at making python packaging a much better animal, but the entire setuptools/eggs/distutils/etc pile is, well, frustrating. I just want a clean, standard way of packaging my packages, built into core, that doesn’t force me to install into the global site-packages directory. Also, uninstall, dammit. I know setuptools and easy_install and eggs were designed to scratch an itch, and I do use easy_install, but the entire pile of things needs to be made into a standard, implemented in core, and we need to move on.
    • However, as I pointed out during the language summit - I don’t think something like easy_install belongs in core, instead I think core should make what easy_install does (to a certain extent) easier and standard, so people can use whatever tools/scripts/etc they want. One ring to bind them!

  5. Linting: Ok, face it, if you’re on a big enough team, you need to have a pre commit hook for your VCS that lints the code, and yacks if it doesn’t conform. I would love for one to be built into the stdlib, but something like pylint is too big, pychecker is too simple, and I haven’t used pyflakes recently enough to comment. There was a thread on python-ideas about this recently - and maybe Jeremy Hylton is right, and it doesn’t belong in core, but if that’s the case, we need to pick one to “endorse” on the python doc website. Maybe in a “getting started with developing python” document, which is linked in size 30 font, and links to a linter, maybe the pep8.py and reindent.py scripts, etc. It should be painfully obvious where to get and how to use these tools. Yeah, I know “waaaaah why didn’t they link to mine” - well, because we liked this one over here more. QQ.
    • As it stands, I can not count the number of times I’ve been asked about linters and style checkers for python code. Maybe we make three packages: python-core, python-stdlib and python-tools.

  6. Optional Static Typing: This one doesn’t make me feel like I’ll make any friends, but I would love to have pre-runtime, static typing as an option to python - maybe as a --anal-types flag. Guido has discussed (part 2) the difficulties of this before, so I don’t think this will ever come (the closest we get to “type safety” is function annotations, which make me feel funny in sensitive places). The biggest reason I have for static typing of any flavor, is that I would prefer to have the ability to catch some errors prior to runtime. That’s all. On a big enough team, all hacking on the same (massive) python code base, I’ve found you do want the ability to turn something like this on - it helps you with a (small) class of very annoying bugs.
    • I do love me the dynamic/late typing system of python, and I use it to my advantage as much as possible. So, I wouldn’t trade the dynamism of Python for static types, it’s just a nit I have. Of course, maybe something like interfaces (as Jacob points out in his list) might solve some of the issues I have (mainly bad people doing silly things). The rest of the stuff is why I write unit tests and actually run the damned code.
    • Yes, I know the drawbacks of something like this; I also don’t have some sort of magic solution to be able to wave a wand and do this. Nor do I have a concrete proposal, otherwise you’d see an email on python-dev. Other people much smarter than me have pointed out the sheer enormity and numerous drawbacks of something like this. No, I don’t expect magic fairy dust to suddenly appear and just “make this work”.

  7. Standard Library Part II: Yeah, you might notice a lot of my gripes are around the stdlib - but in particular, I want to point out the state of XML handling in the standard library is about as clear as wearing glasses made of meat. Additionally, the httplib/urllib/urllib2 thing? Yeah. No.
    • While I’m harping on this stuff: get rid of the commands module; anything in it that is not in subprocess should be put there. Since I mentioned subprocess, it needs more documentation too, and non-blocking asynchronous input/output/handling of subprocess data should be easy and built in. There’s a GSoC project around this spinning up, so we’ll see.
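
To make the commands-versus-subprocess point a little more concrete, here is a minimal sketch of a getstatusoutput-style helper built only on subprocess. The helper name get_status_output is made up for this illustration, and it takes an argument list rather than the shell string the old commands module accepted; it assumes a Python 3 interpreter.

```python
import subprocess
import sys


def get_status_output(argv):
    """Run argv and return (exit_status, combined stdout/stderr text).

    Hypothetical helper for illustration; not part of the standard library.
    """
    completed = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # fold stderr into stdout
        universal_newlines=True,    # decode bytes to text
    )
    return completed.returncode, completed.stdout.rstrip("\n")


# Use the running interpreter as the child so the example works anywhere.
status, output = get_status_output([sys.executable, "-c", "print('hello')"])
print(status, output)
```

This blocks until the child exits, so it does nothing for the non-blocking I/O wish above; it only shows that the commands-style convenience fits comfortably on top of subprocess.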

That all being said, would I trade python for something else? Not right now. Most of my nits are exactly that: nits, and most of all, they’re not impossible to change or resolve (given enough time, and resources).

I can make a similar list for OS/X, Linux and other things I use day in and day out - hell, I can make one for myself (ask my wife about me griping about me sometime). I can probably make a list like this for every single thing I’ve written, tools, scripts, apps, etc.

Like I said, being aware of, and trying to overcome your own shortcomings is how we all improve. In the case of a language, you can’t just keep adding things into a standard library and call it “better” - you have to take a constant look at what you’ve done to date with a critical eye, and ask yourself “what can we do better”.

YAML ain’t Markup Language | Completely Different

When someone says “pick a markup language,” most people would immediately respond with “XML!”, but there’s an alternative out there. YAML is human-readable, easy to use, and overall quite fantastic.

This is a reprint of an article I wrote for Python Magazine as a Completely Different column that was published in the December 2008 issue. I have republished this in its original form, bugs and all


PyCon Wrapup or “stop fidgeting”

Ah, PyCon 2009 has come and gone. I’m a little late in doing a wrap up - but that’s primarily due to the fact I wanted to spend time with the family, and not on the computer since I got back mid-day Thursday of last week. That - and other things in my life are a little worse-for-wear, so my brain’s been full of “other stuff” rather than basking in the glow of:

The best pycon, ever

Yeah, I came out and said it. For me, it was my 4th(?) PyCon, and not to disparage the past, but this one was the best planned, best executed and most fun for me. Sure, it had a few negatives (Chicago being one of them), but by far the pluses outweighed any minuses.

Here’s a summary of things I can remember.

The summits

I got in mid-day Wednesday and broke into the VM summit. I was immediately slapped with the unladen-swallow announcement, which made me do a little dance and immediately start pulling down the code. For the most part, since I was past the introduction piece, I sat in back and had interesting discussions with Thomas Wouters, Brett Cannon and others.

Later that night, I sat down and spent some time with unladen swallow - mainly getting LLVM all happy and pulling down the trunk (when I should have been using the released tag). I also spent a fair amount of time massaging my slides for both of my talks. Nothing quite like last minute changes.

The next day was the language summit. Here a bunch of people much smarter than me sat down and discussed the future of the language. I can’t even begin to describe everything that happened - some basic highlights are:

  • Some discussions around making python 2.7 be the last release of the 2.x line. I think the general consensus was to continue to use 2.x as a ramp into 3.0, but in general if migration to (and from) 3.0 was faster/easier, then the adoption barrier for 3.x would be greatly lessened. Most of us finally agreed that the 2.x line should end with 2.7.
  • On 3.0 - porting issues and pains were discussed, things like 2to3 being slow, lack of incentive (3k being “a better language” notwithstanding) being one of the big issues. Someone (Guido I think) floated the idea that big features only go into the 3.x branch, ergo providing greater upgrade incentive. I don’t think a lot of people disagreed with this and in fact, I took it to heart during the sprints when looking at feature requests for multiprocessing.
  • Which brings us to 3to2 - several developers expressed that if we had a tool which could translate python3 code to python2, then they could start writing their frameworks/tools in pure python3 and then backport them. I don’t recall anyone disliking this idea - but no one wanted to get the ball rolling. So I raised my hand. In essence, I volunteered to get the ball rolling, which I have done. Benjamin Peterson will be heading it up primarily - but anyone is free to contribute (and some already have). It’s got potential as a Google Summer of Code project.
  • After that we discussed what CPython can do to make integration/testing easier for alternative implementations. We agreed on being able to mark tests as “cpython only”, and on breaking things in the repository up in such a way as to make integration of the standard library into alternative distributions easier. Brett has more on this, but the essential idea is to compartmentalize the standard library out into a separate project within source control, and have other implementations pull/integrate that. Anything that’s Jython/CPython/etc specific would be marked/stored appropriately. For example, multiprocessing as it exists today is a CPython artifact, and not needed for say, Jython.
  • And then there was the packaging discussion. Rather than rehash everything that was discussed, I’ll simply mention that I proposed a much simpler approach. Specifically, I noted that we should simplify the core of distutils, pulling in setuptools features where appropriate, offer plugins/hooks where appropriate, standardize on a simple, extensible metadata format and remove any of the “non core” bells and whistles from distutils. In essence my argument was that things like RPM building, easy_install, virtualenv, and so on simply do not belong in core python; rather, they belong outside of the core, where consumers/developers can tailor solutions to suit their needs. Ultimately distutils can help them, and make the APIs/Plugins/Metadata standard, but it should not be a fully fledged “package manager” or build utility.
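
As a rough idea of what marking tests as “cpython only” could look like, here is a sketch built on unittest’s skip decorators. The decorator name cpython_only and the test class are defined locally for this example and are assumptions, not anything the summit settled on.

```python
import platform
import sys
import unittest

IS_CPYTHON = platform.python_implementation() == "CPython"

# Illustrative name, defined here: skip the test on any other implementation.
cpython_only = unittest.skipUnless(IS_CPYTHON, "requires CPython")


class ExampleTests(unittest.TestCase):
    def test_portable_behaviour(self):
        # Should hold on every implementation.
        self.assertEqual(sorted([3, 1, 2]), [1, 2, 3])

    @cpython_only
    def test_refcount_detail(self):
        # Reference counts are a CPython implementation detail.
        self.assertGreaterEqual(sys.getrefcount(object()), 1)


def run_example():
    """Run the example tests and return the unittest.TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTests)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_example()
print(result.testsRun, len(result.skipped))
```

On another implementation the second test would be reported as skipped instead of failing, which is the whole point of the marker.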

Ultimately, the summits were a ton of fun, and being lucky enough to be invited was really great for me. At the end I was worried about people being tired of my chirping up, but it was resoundingly a success.

The conference

This is where things begin to blur - hopping in between hotels, lightning talks, hallway discussions - it was insane. Add to that my fear/stress about both of my talks, and I simply can’t fit everything that happened in my brain.

I think I started the day in “How to Give a Python Talk” by AMK - which was great, if disheartening simply due to the fact that I had made some of the mistakes he pointed out in my talks. Ergo more slide hacking. After that it was hallway discussions and hitting up “How Python is developed” by Brett - I was going to heckle him, but I played nice.

Then there was the Python VMs panel, which I swapped out of and into “Building an Automated QA Infrastructure using Open-Source Python Tools” - which can be summed up as “Yay buildbot” (I’m hard to impress, I’ve built a few of these by now). I then hit up the Twisted/AMQP talk but ducked out early to go prep for my talk.

Then, it was my multiprocessing talk time. You can see the full video here. I think it went well, and feedback has been positive. Given I’ve done variations on this talk elsewhere, I was a little more confident in myself. I think I could have slowed down in some parts and stopped fidgeting, as well as a few other nits. I had some great conversations with users of the package as well as people new to it, which pretty much absorbed all my time until dinner. I think I hit up the lightning talks - but I can’t remember.

The next day was the now infamous LINDBERG’D lightning talk (video here). For those who don’t know: yes, Van knew - he even wrote the legal disclaimer I have in my slides. Basically, Brett and I had stayed up until what - 1am one night laughing about this idea of sending Van (who isn’t very threatening) after commenters on the internet. The day after we stayed up laughing at this (much to the dismay of our neighbors) I hit Van up for some text and let him in on it. No, it wasn’t serious (people asked). Yes, I was also voted in with a pile of other people into the PSF. Good times.

Afterward came the guidonote, and then I know I saw the state of django talk, as well as Jack’s “Class Decorators: Radically Simple” - which was fantastic (although I saw a variation at the Boston python meetup the week before). I dropped in on the ORM panel, and the GAE talk (which was disappointing). Then I hit Bob’s “Drop ACID and think about data” which was awesome. I again ducked out to double check my stuff for my distributed systems talk. Also on Saturday was the “Writing About Python” BoF, which was cool, but in which I may have over-expressed several cough strong opinions.

The video of my second talk is here. In this one, my lack of confidence is apparent, and I’m still fidgeting. I think part of my problem is I like to express with my hands, almost like a spastic air traffic control man, and holding the mic/trying not to fidget made my body just sort of spaz. This time though, I rushed from my talk into Alex Martelli’s excellent “Abstractions as Leverage” talk. After that, more people found me and picked my brain.

I can’t stress how awesome Alex’s talk was - much of my hallway discussion time was spent discussing abstractions (especially around distributed systems) with him.

We’ll see how the call I made in my second talk for some cohesion and a real “django for distributed systems” goes. Everyone I have spoken to likes the idea and had a lot of great feedback, but lord knows I don’t have the time to lead it up right now. Maybe soon.

Late Saturday was the teach me web testing BoF, which I ducked into, and after that the Testing In Python BoF, which provided me much in the way of fun, ideas, discussions and uh. Beer. Yeah. I can’t exactly remember what I said when Titus put me on the spot to discuss “what sucks about testing in python”. But I do remember having some great laughs. I think the heckling got recursive at one point, not sure though.

Sunday came, and with it, my sorry attempt to sit in the “Functional Testing Tools in Python” panel - which was thwarted by a 45 minute bloody nose. Apparently Titus talked a lot, so I don’t think I missed anything. End of main con.

Before I move into the sprints - I want to point out that I had fantastic hallway discussions with tons of people - people I look up to, and people I’d never met before. I had excellent lunches with groups of people ranging from the highly experienced to someone who had learned python a week before. I can’t say enough about the simple fact that you get to relax, hang out and just talk with all of these people. Python is defined by its community - and ours is pretty awesome.

The sprints

Monday morning, I was off to the python core sprint. I immediately started hacking on multiprocessing bugs - I think all told, I managed to close around 12 or so bugs by the time Thursday morning came around. The sprint was awesome simply due to the fact I could look across the room/table and ask any number of core developers questions. Martin von Löwis came to my rescue/aid a few times, and being able to bounce ideas off of everyone is pure unbridled awesome.

Having dedicated face to face time to beat on these bugs and share information/debate things is immeasurable. I also got access to snakebite thanks to Trent, who is still working out the kinks. Alas, even having access did not make cross-platform support (mainly the BSDs) any easier for multiprocessing. I spent a good chunk of my time futzing with virtual machines so I could do further work - I ended up cutting that short because I wanted to fix things at the sprints: not fart around with tool chains.

Other than hacking until late at night through the sprints, there was quite a fun night where bourbon flowed, and many partook - much fun was had by all. There’s nothing quite like having drinks and making awful threading/process/compiler jokes.

Bug fixes/patch work for multiprocessing:

In Summary

PyCon is simply getting better - I’m looking forward to Atlanta next year, if for nothing more than better weather. I got to talk to a lot of great people - Collin Winter, Thomas Wouters, Brett Cannon (be careful, he’s an odd one), Guido, Alex Martelli and many others - and had many hallway discussions with language users, encouraging them to get involved. PyCon is AwesomeCon - nowhere else do you get to hear so much great information, discussions and points of view.

All of the videos for this pycon are going up on pycon.blip.tv.

It was a great time, and I can’t thank everyone involved enough.

PyCon: Concurrency/Distributed systems talk slides online


Slides from my intro to Concurrency/Distributed systems talk @pycon 2009 are here. Also up on the pycon site here

PyCon: Multiprocessing Talk Slides


Slides from my intro to multiprocessing talk @pycon 2009 are here. Also up on the pycon site here

Pycon: Unladen-Swallow


So, by now some particular set of people (mainly those at the VM-Summit and twitter) have heard about unladen-swallow, a new project out of “the Google” which is working on providing some serious speed increases to the CPython interpreter.

This is being worked on by Collin Winter, Jeffrey Yasskin and Thomas Wouters - it’s a branch of CPython: Not a Fork. Some of the improvements could possibly be rapidly integrated into python-trunk, some of them (such as using LLVM) are a longer road obviously, but given the people involved, and others in that arena, I could easily see this supplanting the current interpreter quickly.

But I’m biased, because they sped up cPickle (which is what multiprocessing uses for sharing data between processes). Oh, and they include psyco (port to 64 bit ok please).
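
For a sense of why pickle speed matters here, this is roughly the round trip an object makes when it is handed from one process to another: serialized to bytes on one side, rebuilt on the other. A minimal sketch, assuming Python 3 (where the C implementation sits behind the plain pickle module); the task dictionary is invented example data.

```python
import pickle

# Made-up payload standing in for whatever a worker process would receive.
task = {"func": "resize", "args": (640, 480), "tags": ["thumb", "jpeg"]}

# Sender side: turn the object into bytes that can cross a pipe or socket.
wire_bytes = pickle.dumps(task, protocol=pickle.HIGHEST_PROTOCOL)

# Receiver side: rebuild an equal, but separate, object from those bytes.
restored = pickle.loads(wire_bytes)

print(len(wire_bytes), restored == task)
```

Every object crossing a process boundary pays this cost in both directions, so a faster pickler speeds up anything built on it.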

The goals are nice, quoting a few choice ones from the project plan:

We want to make Python faster, but we also want to make it easy for large, well-established applications to switch to Unladen Swallow.

  1. Produce a version of Python at least 5x faster than CPython.
  2. Python application performance should be stable.
  3. Maintain source-level compatibility with CPython applications.
  4. Maintain source-level compatibility with CPython extension modules.
  5. We do not want to maintain a Python implementation forever; we view our work as a branch, not a fork.

And (from 2009 Q3 Goals):

In addition, we intend to remove the GIL and fix the state of multithreading in Python. We believe this is possible through the implementation of a more sophisticated GC system, something like IBM’s Recycler (Bacon et al, 2001).

Our long-term goal is to make Python fast enough to start moving performance-important types and functions from C back to Python.

The great thing is: They have a working implementation right now. Yessir, it’s not vapor! Hooray!

I’ve got it down, compiled and I’m futzing around with it now, yes it works. Unfortunately, while testing it I found a bug in multiprocessing (not unladen). Damn!

Job Opening: Senior Web Engineer (Django, Javascript)


A good friend of mine is looking for a strong web developer/engineer to help in a funded, early stage startup. I can not recommend applying for this enough - I know both founders and they really know their stuff. The position is in Natick, MA US.

Below is the job description, feel free to email me, or directly.

CONSULTING SOFTWARE ENGINEER

Nasuni is turning storage as service into a real product businesses can use. We will deliver unlimited protected storage at an exceptional value to our customers. We are bringing together a stellar engineering team to tackle the challenge. We value talent, intelligence and the desire to work hard as part of a team of A-players to create something new. Founded by storage veterans, Nasuni is backed by leading Boston venture capital firms North Bridge and Sigma Partners. Our headquarters are in Natick, MA.

Responsibilities
Nasuni is looking for a Javascript developer and leader with 5+ years of building DHTML-based interfaces for rich Internet applications. The role is for an engineer (as opposed to a designer) who is fluent in web development, expert in Javascript/DOM manipulation, and capable of working with a fast-moving and dynamic team across a whole series of disciplines to distill lots of functionality into something simple and beautiful.

Experience and Education

  • Expertise in HTML, CSS, DOM, and Javascript: these are highly dynamic sites with a lot of interactive pieces. Candidate must be able to look at a comp and quickly implement it in HTML+CSS and have good knowledge of how to hook into Javascript.
  • Attention to detail is a must: we’ll have a high-traffic Internet site with award-winning design. Careful coordination/work with designers, and an understanding of meeting design requirements, is necessary
  • Strong experience with Django (or a good argument for an alternative platform)
  • Test-driven development experience necessary. Developer is responsible for his own unit tests (JSUnit), as well as for designing functional test plans for QA
  • Successful track record of leading agile web development teams

Strongly Desired

  • Knowledge of one of the following open source Javascript libraries is a big plus: jQuery, Prototype, Yahoo UI Library.
  • Experience with any internal or external cloud (i.e. Amazon S3/EC2) services and APIs
  • eCommerce knowledge
  • Experience with relational databases (Postgresql, MySQL, Oracle, SQLServer, etc)
  • Understanding of I18N and L10N issues in globally targeted websites, particularly for multibyte languages
  • Open source philosophy

Additional information
You should be able to point to a couple AJAX-based web apps that you’ve built and be able to discuss them. Demonstration of previous work is required.

Must be authorized to work in the United States on a full-time basis for any employer.

So you want to use python on the mac?

In a complete tangent from my numerous other projects, I’ve had a few people ask me recently about python on the mac, how to get started/etc.

I’m going to solely focus on python in Leopard (10.5.x) and not anything before that. Anything before that is dead to me! DEAD!


Generating re-creatable random files…

… And the case of obsessive optimization. A little while ago, I posted a small snippet of code that was designed to generate data files of a given size, based off a seed, very quickly (article here). The goals of this code were the following:

  • Generate large amounts of semi-random data quickly
  • Data generation can not use /dev/urandom or other system entropy buckets. These are too slow, and having hundreds of threads pulling from these buckets is a bad idea. Oh - and it needs to work on Windows.
  • The data must never be sync’ed to disk: when you’re generating a large data set, on the scale of hundreds of millions of files, storing it on disk sucks, and the disk becomes the bottleneck.
  • Creation of the files must be at least 1 gigabit/second - this means a single thread passing one of these generators to say, a pycurl handle could “in theory” hit line speed: the generator can not be the bottleneck
  • The data in these files must be able to be recreated at any time provided you have the seed.
  • Setting a seed in python’s random() has side-effect issues, and can not be used. Besides, lots of random calls are expensive.
  • I need the ability to swap out the data source, I use a lorem file here, but a different type will be needed later.
  • The data source should only be parsed once for the import (singleton, ho!)
  • The name, and the file data must be unique - they must hash differently (to prevent de-dupers from, well, de duping them)

I am revisiting this code as we found out the original version could only generate file data at around 500 megabits/second. This is much too slow for my tastes, as I might as well be reading it from disk. We can make it faster.
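
For reference, a figure like “500 megabits/second” can be obtained by draining a generator and dividing the bits produced by the elapsed time. The sketch below is an assumption about how one might measure it, not the harness actually used for the numbers in this post; the function names and the stand-in data source are invented for the example.

```python
import time


def to_megabits_per_second(total_bytes, seconds):
    """Convert a byte count and a duration into megabits/second."""
    return (total_bytes * 8) / (seconds * 1000000.0)


def measure(chunks):
    """Drain an iterable of chunks and return the observed megabits/second."""
    start = time.perf_counter()
    total_bytes = 0
    for chunk in chunks:
        # len() counts characters; for ASCII text that equals bytes.
        total_bytes += len(chunk)
    # Guard against a zero duration on very small inputs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return to_megabits_per_second(total_bytes, elapsed)


# Stand-in data source: 10,000 chunks of 1,000 ASCII characters each.
rate = measure("x" * 1000 for _ in range(10000))
print("%.1f megabits/second" % rate)
```

By this arithmetic the 1 gigabit/second target means producing 125,000,000 bytes every second.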

After cleaning things up, removing some overly complex logic (and several moments of “what the hell was I thinking”), I came up with this:

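import collections
import os

LOREM = os.path.join(os.path.dirname(__file__), "datafiles", "lorem.txt")
# Parse the data source once, at import time.
with open(LOREM, "r") as lorem_file:
    WORDS = lorem_file.read().split()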
 
def chunker(size, seed, chunksize=1000):
    word_q = collections.deque(WORDS)
    seed_q = collections.deque(int(i) for i in str(seed))
    # Rotate the word_q by the seed so that small files are unique.
    word_q.rotate(seed)
    current_size = size
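    while current_size > 0:
        data = ' '.join(word_q)
        if chunksize > current_size:
            chunksize = current_size
        chunk = data[0:chunksize]
        # A consumer may send() a new chunk size; a plain next() sends None.
        requested = yield chunk
        # Count what was actually handed out, not the size requested next.
        current_size -= len(chunk)
        chunksize = requested or chunksize
        word_q.rotate(seed_q[0])
        seed_q.rotate(1)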
 
class SyntheticFile(object):
    """ File-Like object backed by the ``chunker`` function. Allows the
    construction of an object which can be passed to something like a pycurl
    handle streaming data to a server """
    def __init__(self, size, seed):
        """ 
        **size**: integer, bytes
        **seed**: integer
        """
        self.chunker = None
        self.size = size
        self.seed = seed
 
    def write(self):
        """ unsupported, throw an error if called """
        raise Exception('not supported')
 
    def read(self, readsize):
        """ Support read() - **readsize** is in bytes. """
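        try:
            if not self.chunker:
                # First read: create the generator and prime it.
                self.chunker = chunker(self.size, self.seed, readsize)
                return next(self.chunker)
            return self.chunker.send(readsize or 1000)
        except StopIteration:
            # All requested bytes have been produced: signal end of file.
            return ""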

This version hit around 618 megabits/second. It uses the generator’s send() capability to let readers of the SyntheticFile implementation alter the chunk size they’re reading on the fly, which is important if you have a consumer that wants to read small/read big/read small. Well, that’s fine and all, but I was stymied - I wanted to make this thing fly. I want to be able to generate this data at 1 gigabit/second at the very least, if not faster.
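For anyone who hasn’t played with send(), here is a minimal, self-contained sketch of the idea. The sized_chunks function and its sample string are made up for illustration; it is not the chunker above, just the same mechanism in miniature:

```python
def sized_chunks(data, chunksize):
    """Yield slices of data; a consumer may send() a new chunk size."""
    pos = 0
    while pos < len(data):
        chunk = data[pos:pos + chunksize]
        pos += len(chunk)
        # send(n) makes the yield expression evaluate to n;
        # a plain next() makes it evaluate to None.
        requested = yield chunk
        if requested:
            chunksize = requested

gen = sized_chunks("abcdefghij", 2)
first = next(gen)       # primes the generator with the initial size of 2
second = gen.send(5)    # asks for 5 bytes from now on
third = next(gen)       # size stays at 5, but only 3 bytes remain
```

The first call has to be next() (or send(None)), since a generator that has not started yet has no pending yield to receive a value.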

Astute readers may point out that there are other ways of doing this - mmap, simply embedding the unique seed or a uuid - well, this story isn’t about that, is it?

In any case, I suspected the "data = ' '.join(word_q)" line was the culprit - deque is pretty optimized, and I had removed a massive chunk of code which didn’t make sense, and in fact, cProfile showed I was right:

woot:synthfiles jesse$ python -m cProfile synthfilegen.py
         2441475 function calls in 203.791 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
...snip...
   610352  180.282    0.000  180.282    0.000 {method 'join' of 'str'  objects}
...snip...

180 out of 203 CPU seconds, on the join alone. Curses! So this is when I really went mental (this is what happens when you’re too close to something). I decided that I needed to find some magical way of skipping the join and only reading what I needed. I ran down that rathole for a bit, until a friend of mine pointed out: “just make the words bigger”.

Full stop. I initially discounted it, I was zeroed in on that join - oh wait. The text in the lorem file, when split on whitespace, is 4368 words. Joining those back together within the loop is expensive - that much I knew. Then I hit on the idea: instead of considering them words, I could think of them as chunks (which is how I was treating them anyway), and bigger chunks mean far fewer items to join.

I added a method (process_chunks) which treated the data source as chunks of bytes and made the WORDS variable a list of those chunks. Initially, I set the chunk size to 100 (bytes) and here’s the cProfile output:

woot:synthfiles jesse$ python -m cProfile synthfilegen.py
         2441766 function calls in 51.549 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
...snip...
   610352   29.712    0.000   29.712    0.000 {method 'join' of 'str' objects}
...snip...

And now the generator is kicking data out at 2.34 gigabits/second. Huge success. Obviously, if you increase the chunk size, it speeds up a bit more (e.g. 300-byte chunks give about 2.5 gigabits/second). I cleaned it up a bit and here is the code:

(thanks! bitbucket.org).
Note that the speeds I’m discussing are passing the SynthFileObject to a pycurl handle and streaming it across the wire: not to disk.
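To make the “bigger words” idea concrete, here is a rough sketch. It is not the checked-in code: this process_chunks is a guess at the shape of the helper, the repeated “lorem” string is a stand-in for the real lorem file, and joining the chunks with an empty string is an assumption.

```python
import timeit

def process_chunks(text, chunk_bytes=100):
    """Split text into fixed-size pieces (hypothetical helper)."""
    return [text[i:i + chunk_bytes]
            for i in range(0, len(text), chunk_bytes)]

# Stand-in for the lorem file: 4368 short "words".
text = " ".join(["lorem"] * 4368)
words = text.split()
chunks = process_chunks(text, 100)

# Both lists rebuild the same string; the chunk list is just much shorter,
# so each join inside the generator loop has far fewer items to walk.
assert " ".join(words) == "".join(chunks) == text

word_time = timeit.timeit(lambda: " ".join(words), number=2000)
chunk_time = timeit.timeit(lambda: "".join(chunks), number=2000)
print("words:  %d items, %.3fs" % (len(words), word_time))
print("chunks: %d items, %.3fs" % (len(chunks), chunk_time))
```

The absolute timings will vary by machine; the point is only that the number of items handed to join drops by an order of magnitude.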

All told, it was a fun little jaunt, and I’ve succeeded in turning something which I considered “throwaway” into something that’s a lot more useful, clean and fast. I’ve added a handful of unit tests to my sandbox, and I might make this a real module if anyone wants it. I want to rework the _process_chunks/globals stuff, but I’ve farted around with this long enough for now. I also want to add the ability to remove the chunking altogether and simply insert the seed into the data response, and not mess with the lorem text.

edit: I just checked in a new version which removes the _process_chunks function and other globals and moves them into a class. I hate globals.

Sphinx and auto-building/tests

Ok, so I think I’m in sphinx-love. I’ve needed to really begin a largish documentation project for a code base I own and drive (omgmanagerspeak) and since I’d rather not completely rely on API docs, and I have exposure to sphinx courtesy of python-core work, I chose you sphinx-a-chu!

Sphinx really is awesome. I started just chugging through the docs, and ended up pulling from tip (via mercurial) and using the latest version for the theme support Georg added recently.

Rather than rehash the basics, I’ll simply explain my setup, and why I love it. Starting with the output from sphinx-quickstart, which kicks out a makefile for you, I immediately turned on the 'sphinx.ext.autodoc' and 'sphinx.ext.doctest' extensions - the former allows you to tell sphinx (in your rst file) to delegate the documentation for this class/method/etc to the API docs, and the latter allows you to pass the examples/snippets you will have through doctest.
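In the generated conf.py, turning those two on comes down to the extensions list. A minimal excerpt (the rest of the generated file is omitted, and the comments are mine):

```python
# conf.py (excerpt)
extensions = [
    "sphinx.ext.autodoc",  # pull API documentation from docstrings
    "sphinx.ext.doctest",  # run the examples embedded in the docs
]
```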

These two things make writing/testing the docs awesome - but it gets more awesome.

With sphinx, you can run make html to run your build, part of which is the verification of your .rst syntax. I added a target to the Makefile to run the doctests:

doctest:
	mkdir -p build/doctests
	$(SPHINXBUILD) -b doctest source build/doctests
	@echo
	@echo "Doctests passed"

This means I can run “make doctest html” and I get everything I want. Hooray!

But wait.. I don’t want to have to edit, build, and open the html to review/read it - we can make this smarter, faster.

So I loaded up Bruno Bord’s tdaemon and hacked it up so that I could pass it a custom command (I added a sphinx argument). After I point it at my doc/source/ directory, any time I hit save, “make doctest html” runs.

This is nice - it’s instant build/feedback. I coupled this with the simple “python -m SimpleHTTPServer” running in the build/html/ directory of the docs and leave my browser pointed at localhost:8000.

All I need to do is hit save, and provided I haven’t failboated the tests, I can just hit refresh and see things immediately.

I’m probably going to hack tdaemon up a bit to either accept a command series, or allow arbitrary commands so I can remove the hardcoded sphinx logic I added, and possibly add the capability of watching multiple directories (with support for a different custom command for each).

This way, I could say “tdaemon --dir1=sphinxsource --cmd="make doctest html" --dir2=pythondir --cmd="nosetests -s -v"”. Then, if I hit save on any file I’m working on, I get this same feedback loop.
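As a rough sketch of what that multi-directory loop could look like (this is not tdaemon’s code; snapshot, watch, and the directory/command pairs are all hypothetical), a simple polling watcher is enough to get the idea across:

```python
import os
import subprocess
import time

def snapshot(root):
    """Map each file under root to its (mtime, size) signature."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished between listing and stat
            state[path] = (info.st_mtime, info.st_size)
    return state

def watch(targets, interval=1.0, rounds=None):
    """Poll each directory; run its command whenever something changed.

    targets maps a directory to a shell command string.
    rounds limits the number of polls; None means run until interrupted.
    """
    seen = dict((root, snapshot(root)) for root in targets)
    polls = 0
    while rounds is None or polls < rounds:
        time.sleep(interval)
        polls += 1
        for root, command in targets.items():
            current = snapshot(root)
            if current != seen[root]:
                seen[root] = current
                subprocess.call(command, shell=True)

# Hypothetical usage, run from the directory holding the Makefile:
# watch({"source": "make doctest html",
#        "../pythondir": "nosetests -s -v"})
```

Polling is the lazy option; it trades a little latency and some stat() calls for not depending on any platform-specific file notification API.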

I can hack this so it also auto-tests/deploys a Django application (on changes) to my local sandbox. Just something neat that’s working well for me. Additionally, I want to add some intelligence so that in the case of a failure, the log file is automatically opened in the editor of my choice via the OS X open command.

I should add that I also hacked tdaemon to modify my PYTHONPATH to add the project I’m hacking on, so I don’t need to install/deploy it for the API integration in sphinx to work. This allows me to have this type of setup on a per-branch basis.