February 27th, 2009 § § permalink
… And the case of obsessive optimization. A little while ago, I posted a small snippet of code designed to very quickly generate data files of a given size from a seed (article here). The goals of this code were (and still are) the following:
- Generate large amounts of semi-random data quickly
- Data generation cannot use /dev/urandom or other system entropy buckets. These are too slow, and having hundreds of threads pulling from these buckets is a bad idea. Oh, and it needs to work on Windows.
- The data must never be sync’ed to disk: when you’re generating a large data set, on the scale of hundreds of millions of files, storing it on disk sucks, and the disk becomes the bottleneck.
- File creation must run at 1 gigabit/second or better, so that a single thread passing one of these generators to, say, a pycurl handle could “in theory” hit line speed: the generator cannot be the bottleneck.
- The data in these files must be reproducible at any time, provided you have the seed.
- Setting a seed in python’s random() has side-effect issues, and can not be used. Besides, lots of random calls are expensive.
- I need the ability to swap out the data source, I use a lorem file here, but a different type will be needed later.
- The data source should only be parsed once for the import (singleton, ho!)
- The name, and the file data must be unique — they must hash differently (to prevent de-dupers from, well, de duping them)
I am revisiting this code as we found out the original version could only generate file data at around 500 megabits/second. This is much too slow for my tastes, as I might as well be reading it from disk. We can make it faster.
After cleaning things up, removing some overly complex logic (and several moments of “what the hell was I thinking”), I came up with this:
import collections
import os

# Parse the data source once, at import time.
LOREM = os.path.join(os.path.dirname(__file__), "datafiles", "lorem.txt")
WORDS = open(LOREM, "r").read().split()


def chunker(size, seed, chunksize=1000):
    word_q = collections.deque(WORDS)
    seed_q = collections.deque(int(i) for i in str(seed))
    # Rotate the word_q by the seed so that small files are unique.
    word_q.rotate(seed)
    current_size = size
    while current_size > 0:
        data = ' '.join(word_q)
        if chunksize > current_size:
            chunksize = current_size
        # A caller can send() a new chunk size; otherwise keep the current one.
        chunksize = (yield data[0:chunksize]) or chunksize
        current_size -= chunksize
        # Shuffle the words around, driven by the digits of the seed.
        word_q.rotate(seed_q[0])
        seed_q.rotate(1)


class SyntheticFile(object):
    """ File-Like object backed by the ``chunker`` function. Allows the
    construction of an object which can be passed to something like a pycurl
    handle streaming data to a server """

    def __init__(self, size, seed):
        """
        **size**: integer, bytes
        **seed**: integer
        **chunksize**: optional, integer
        """
        self.chunker = None
        self.size = size
        self.seed = seed

    def write(self):
        """ unsupported, throw an error if called """
        raise Exception('not supported')

    def read(self, readsize):
        """ Support read() - **readsize** is in bytes. """
        if not self.chunker:
            # First read: create the generator and prime it.
            self.chunker = chunker(self.size, self.seed, readsize)
            return next(self.chunker)
        try:
            return self.chunker.send(readsize or 1000)
        except StopIteration:
            pass
        return ""
This version hit around 618 megabits/second, and it used the generator’s send() capability to let readers of the SyntheticFile implementation alter the chunk size they’re reading on the fly, which is important if you have a consumer that wants to read small, then big, then small again. Well, that’s fine and all, but I was stymied: I wanted to make this thing fly. I want to be able to generate this data at 1 gigabit/second at the very least, if not faster.
Astute readers may point out that there are other ways of doing this (mmap, simply embedding the unique seed or a uuid). Well, this story isn’t about that, is it?
In any case, I suspected the “data = ' '.join(word_q)” line was the culprit: deque is pretty optimized, and I had removed a massive chunk of code which didn’t make sense. In fact, cProfile showed I was right:
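For anyone who hasn't played with send() before, here is a minimal, standalone sketch of the mechanism. It is not the code from this post: `sized_chunks` and its parameters are hypothetical names made up for the illustration, and it simply slices a fixed string instead of rotating a deque.

```python
def sized_chunks(data, total, chunksize=4):
    """Yield pieces of `data` until `total` characters have been produced.

    A caller may pass a new chunk size via send(); it applies to the
    next piece that is yielded. (Hypothetical helper, for illustration.)
    """
    total = min(total, len(data))  # never promise more than we have
    position = 0
    while position < total:
        chunksize = min(chunksize, total - position)
        piece = data[position:position + chunksize]
        position += len(piece)
        # The value given to send() (None for a plain next()) lands here.
        requested = yield piece
        chunksize = requested or chunksize


gen = sized_chunks("abcdefghijklmnopqrstuvwxyz", total=12)
first = next(gen)      # prime the generator with the default size
second = gen.send(2)   # read small
third = gen.send(6)    # read big
print(first, second, third)
```

The first call has to be next() (or send(None)), because a generator that hasn't started yet has no pending yield to receive a value.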
woot:synthfiles jesse$ python -m cProfile synthfilegen.py
2441475 function calls in 203.791 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
...snip...
610352 180.282 0.000 180.282 0.000 {method 'join' of 'str' objects}
...snip...
180 out of 203 CPU seconds, on the join alone. Curses! So this is when I really went mental (this is what happens when you’re too close to something). I decided that I needed to find some magical way of skipping the join and only reading what I needed. I ran down that rathole for a bit, until a friend of mine pointed out: “just make the words bigger”.
Full stop. I initially discounted it, since I was zeroed in on that join. Oh, wait. The text in the lorem file, when split on whitespace, is 4368 words. Joining those back together within the loop is expensive; that much I knew. I hit on the idea of treating them as chunks instead of words (which is how I was using them anyway), so that each join has far fewer pieces to stitch together.
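A rough way to see the effect in isolation is to time the two joins side by side. This is only a sketch under stated assumptions: the text below is a synthetic stand-in for the lorem file (same word count, made-up words), and the absolute timings will vary from machine to machine.

```python
import timeit

# Stand-in for the lorem text: 4368 short "words" (the count quoted above).
text = " ".join("word%04d" % i for i in range(4368))

words = text.split()
# The same text cut into fixed 100-character pieces instead of words.
chunks = [text[i:i + 100] for i in range(0, len(text), 100)]

words_time = timeit.timeit(lambda: " ".join(words), number=200)
chunks_time = timeit.timeit(lambda: "".join(chunks), number=200)

print("%d words:  %.4f s for 200 joins" % (len(words), words_time))
print("%d chunks: %.4f s for 200 joins" % (len(chunks), chunks_time))
```

Both joins produce a string of the same length; the only thing that changes is how many pieces have to be walked to build it.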
I added a method (process_chunks) which treated the data source as chunks of bytes and made the WORDS variable a list of those chunks. Initially, I set the chunk size to 100 (bytes) and here’s the cProfile output:
woot:synthfiles jesse$ python -m cProfile synthfilegen.py
2441766 function calls in 51.549 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
...snip...
610352 29.712 0.000 29.712 0.000 {method 'join' of 'str' objects}
...snip...
And now the generator is kicking data out at 2.34 gigabits/second. Huge success. Obviously, if you increase the chunk size, it speeds up a bit more (e.g. 300-byte chunks give about 2.5 gigabits/second). I cleaned it up a bit and here is the code:
(thanks! bitbucket.org).
Note that the speeds I’m discussing are passing the SynthFileObject to a pycurl handle and streaming it across the wire: not to disk.
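The bitbucket embed isn't reproduced here, so what follows is only a sketch of the chunking idea and not the checked-in code. The name `process_chunks` comes from the post, but its signature, the `chunks` argument to `chunker`, and the inline stand-in text are all assumptions made so the example runs on its own.

```python
import collections


def process_chunks(text, chunk_bytes=100):
    """Cut `text` into pieces of `chunk_bytes` characters (the last may be shorter)."""
    return [text[i:i + chunk_bytes] for i in range(0, len(text), chunk_bytes)]


def chunker(size, seed, chunks, chunksize=1000):
    """Yield `size` characters of seed-dependent data built from `chunks`.

    Assumes `chunks` is non-empty.
    """
    chunk_q = collections.deque(chunks)
    seed_q = collections.deque(int(digit) for digit in str(seed))
    chunk_q.rotate(seed)  # start at a seed-dependent offset
    remaining = size
    while remaining > 0:
        data = "".join(chunk_q)  # far fewer items to join than with words
        piece = data[:min(chunksize, remaining)]
        remaining -= len(piece)
        requested = yield piece  # a reader may send() a new chunk size
        chunksize = requested or chunksize
        chunk_q.rotate(seed_q[0])
        seed_q.rotate(1)


SOURCE = "lorem ipsum dolor sit amet " * 200  # stand-in for lorem.txt
CHUNKS = process_chunks(SOURCE)
payload = "".join(chunker(5000, 42, CHUNKS))
print(len(payload))
```

The same seed always yields the same payload, and a different seed starts from a different rotation of the chunk queue, which is what keeps small files from hashing identically.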
All told, it was a fun little jaunt, and I’ve succeeded in turning something I considered “throwaway” into something that’s a lot more useful, clean and fast. I’ve added a handful of unit tests to my sandbox, and I might make this a real module if anyone wants it. I want to rework the _process_chunks/globals stuff, but I’ve farted around with this long enough for now. I also want to add the ability to remove the chunking altogether and simply insert the seed into the data response, and not mess with the lorem text.
edit: I just checked in a new version which removes the _process_chunks function and other globals and moves them into a class. I hate globals.
February 26th, 2009 § § permalink
Ok, so I think I’m in sphinx-love. I’ve needed to really begin a largish documentation project for a code base I own and drive (omgmanagerspeak), and since I’d rather not rely completely on API docs, and I have exposure to sphinx courtesy of python-core work, I chose you, sphinx-a-chu!
Sphinx really is awesome. I started just chugging through the docs, and ended up pulling from tip (via mercurial) and using the latest version for the theme support Georg added recently.
Rather than rehash the basics, I’ll simply explain my setup, and why I love it. Starting with the output from sphinx-quickstart, which kicks out a makefile for you, I immediately turned on the ‘sphinx.ext.autodoc’ and ‘sphinx.ext.doctest’ extensions. The former allows you to tell sphinx (in your rst file) to delegate the documentation for a given class/method/etc. to the API docs, and the latter allows you to run the examples/snippets in your docs through doctest.
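For reference, turning those two extensions on is a one-line change to the `conf.py` that sphinx-quickstart generates. A minimal fragment (only the `extensions` setting is shown; the rest of the generated file is left as it is):

```python
# conf.py (fragment), as generated by sphinx-quickstart and then edited
extensions = [
    "sphinx.ext.autodoc",   # pull API documentation from docstrings
    "sphinx.ext.doctest",   # run the examples embedded in the docs
]
```

With these enabled, autodoc directives such as `.. autoclass::` and doctest blocks such as `.. doctest::` become available in the .rst sources.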
These two things make writing/testing the docs awesome — but it gets more awesome.
With sphinx, you can run make html to run your build, part of which is the verification of your .rst syntax. I added a target to the Makefile to run the doctests:
doctest:
	mkdir -p build/doctests
	$(SPHINXBUILD) -b doctest source build/doctests
	@echo
	@echo "Doctests passed"
This means I can run “make doctest html” and I get everything I want. Hooray!
But wait.. I don’t want to have to edit, build, and open the html to review/read it — we can make this smarter, faster.
So, I loaded up Bruno Bord’s tdaemon and hacked it up so that I could pass in a custom command (I added a sphinx argument). After I point it at my doc/source/ directory, any time I hit save the “make doctest html” target runs.
This is nice: it’s instant build/feedback. I coupled this with a simple “python -m SimpleHTTPServer” running in the build/html/ directory of the docs, and I leave my browser pointed at localhost:8000.
All I need to do is hit save, and provided I haven’t failboated the tests, I can just hit refresh and see things immediately.
I’m probably going to hack tdaemon up a bit to either accept a command series or allow arbitrary commands, so I can remove the hardcoded sphinx logic I added, and possibly add the capability of watching multiple directories (with a different custom command for each).
This way, I could say “tdaemon --dir1=sphinxsource --cmd="make doctest html" --dir2=pythondir --cmd="nosetests -s -v"”, and if I hit save on any file I’m working on I get this same feedback loop.
I can hack this so it auto-tests/deploys a Django application (on changes) to my local sandbox as well. Just something neat that’s working well for me. Additionally, I want to add some intelligence so that, in the case of a failure, the log file is automatically opened in the editor of my choice via the OS/X open command.
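The watch-several-directories idea can be sketched with nothing but the standard library. To be clear, this is not tdaemon's code or its command line: `snapshot` and `watch` are hypothetical names, and the approach (polling modification times once a second) is just the simplest thing that shows the loop.

```python
import os
import subprocess
import time


def snapshot(directory):
    """Map every file path under `directory` to its last-modified time."""
    mtimes = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                mtimes[path] = os.path.getmtime(path)
            except OSError:
                pass  # file vanished between listing and stat
    return mtimes


def watch(jobs, interval=1.0):
    """Poll forever; `jobs` maps a directory to the command (argv list) to run on change."""
    seen = {directory: snapshot(directory) for directory in jobs}
    while True:
        time.sleep(interval)
        for directory, command in jobs.items():
            current = snapshot(directory)
            if current != seen[directory]:
                seen[directory] = current
                subprocess.call(command)  # runs in the current working directory


# Example call (blocks until interrupted), using the directories from the text:
# watch({
#     "sphinxsource": ["make", "doctest", "html"],
#     "pythondir": ["nosetests", "-s", "-v"],
# })
```

Comparing whole snapshots means added and deleted files trigger a run too, not just saves to existing files.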
I should add, that I also hacked tdaemon to modify my PYTHONPATH to add the project I’m hacking on so I don’t need to install/deploy it for the API integration in sphinx to work. This allows me to have this type of setup on a per-branch basis.
February 23rd, 2009 § § permalink
Note: This is another post in what I hope will be a series leading up to my concurrency/distributed systems talk at PyCon. I’m steadily working through experimenting with and learning the various frameworks/libraries in the python ecosystem.
I reserve the right (and probably will) to revise these entries based on feedback from people (mainly the author(s) of said tool(s)). I will also add additional bits and pieces as I learn and explore more.
Stackless python — here’s another big one on the pile — is much more than a library, or a framework which runs on CPython — Stackless is actually a modified version of the CPython interpreter. It’s much more than just a C-extension. Stackless is in use by various people and companies — most notably, it’s in use by CCP Games, makers of Eve Online (see this pycon presentation). In fact, CCP Games is a large part of why Stackless is still around today.
» Read the rest of this entry «
February 21st, 2009 § § permalink
I, along with a whole heck of a lot of other people, will be attending PyCon in March. You should know about this by now, unless you’re living under a rock, or in a shoebox (I like shoeboxes).
PyCon 09 is turning out to be one of the ones I am most excited about in some time — barring the fact they let me actually stand up and speak about something, there’s a ton of other excellent and exciting things going on.
I will be doing two talks — “Introduction to Multiprocessing in Python” on Friday, at 3:20 PM, and “Concurrency and Distributed Computing with Python Today” at 3:20 PM Saturday.
The former talk is easy, given it will be focused on introducing the multiprocessing module to the masses, and taking questions about it (be gentle, I just work here). The latter is a much bigger beast. I’ve been blogging about my research in my pycon 2009 category, and I still have a pile of things to keep adding to that. The talk will attempt to differentiate concurrency from distributed systems, and show the various toolkits/frameworks/etc in the ecosystem today, to help you build both types of systems.
Given both talks are 45 minutes in length, I will be publishing my talk notes (my slides are not going to be heavy weight) and other information here.
In addition to me speaking, which may or may not be exciting, there’s one helluva ton of other talks which simply look awesome.
My schedule looks like this:
- Thursday: Python Language Summit, where I will endeavor to be smart.
- Friday: How to give a python talk
- Friday: Using Windmill
- Friday: Introduction to Python Profiling or How Python is Developed, to harass Brett.
- Friday: Panel — Python VMs
- Friday: Building an Automated QA infrastructure using Open Source Tools
- Friday: Twisted, AMQP and Thrift: Bridging messaging and RPC for building scalable distributed applications
- Friday: My Talk (I figure I should go to it)
- Friday: A Whirlwind Excursion through Writing a C Extension or Challenges and Opportunities for Python
- Thursday: Plugins and monkeypatching: increasing flexibility, dealing with inflexibility
- Saturday: The (lack of) design patterns in Python or the Pinax talk
- Saturday: Class Decorators: Radically Simple
- Saturday: Panel: Object Relational Mappers: Philosophies and Design Decisions.
- Saturday: Drop ACID and think about data (I wonder if we can take this literally, if he has scary slides though, we all might trip balls)
- Saturday: My Talk, which is up against Bruce Eckel and Raymond H — I expect no one to show up.
- Sunday: Panel: Functional Testing Tools in Python
- Sunday: Designing a web framework: Django’s design decisions
This doesn’t even cover the open space discussions which I might attend — including the “Writing About Python” one Doug Hellmann is putting together as well as the “Teach Me Web Testing” one by Steve Holden.
After the main conference, I’ll be sticking around until Thursday morning for the sprints, at which point I should be sufficiently burned out on python stuff that I will fully convert over to being a full time burger flipper.
Right now there are 620 registered people who made their attendance public; I don’t know how many there are in total.
Hope to see you there!
February 11th, 2009 § § permalink
Note: This is the third post in what I hope will be a series leading up to my concurrency/distributed systems talk at PyCon. I’m steadily working through experimenting with and learning the various frameworks/libraries in the python ecosystem.
I reserve the right (and probably will) to revise these entries based on feedback from people (mainly the author(s) of said tool(s)). I will also add additional bits and pieces as I learn and explore more. Additionally, thanks to glyph for giving me a hell of a lot of feedback.
Twisted is the 800 lb gorilla of the “concurrency” frameworks. It’s been around for a while and has a large following: it’s used by everyone from Apple (iCal server) to Buildbot. It has a literal ton of sub projects and other “semi attached appendages”.
» Read the rest of this entry «
February 5th, 2009 § § permalink
OpenSSH is the ubiquitous method of remote access for secure remote-machine login and file transfers. Many people — systems administrators, test automation engineers, web developers and others have to use and interact with it daily. Scripting SSH access and file transfers with Python can be frustrating — but the Paramiko module solves that in a powerful way.
This is a reprint of an article I wrote for Python Magazine as a Completely Different column that was published in the October 2008 issue. I have republished this in its original form, bugs and all.
» Read the rest of this entry «
February 4th, 2009 § § permalink
This is a reprint of an article I wrote for Python Magazine as a Completely Different column that was published in the August 2008 issue.
In the early summer of this year I had the chance to really get started working on/with the core Python source. I had spent some time putting together a Python Enhancement Proposal (PEP) which was accepted. Now, I just needed to learn the code base and practices, and buy a helmet. Shortly after getting the initial patch accepted, I ended up breaking the build and the tests, and caused the beta to slip. This article is an introduction to Core development, in which we’ll cover what you need to get started, and where I personally screwed up.
» Read the rest of this entry «
February 3rd, 2009 § § permalink
One of the cooler features that came with Python 2.5’s release is the ‘with’ statement and the context manager protocol behind it. I could make the argument that these two things alone make the upgrade to Python 2.5 more than compelling for those of you trapped in the dark ages of 2.4 or worse: 2.3!
This is a reprint of an article I wrote for Python Magazine as a Completely Different column that was published in the July 2008 issue. I have republished this in its original form, bugs and all.
» Read the rest of this entry «
February 2nd, 2009 § § permalink
This is a reprint of an article I wrote for Python Magazine as a Completely Different column that was published in the June 2008 issue.
A world without a Global Interpreter Lock (GIL) — the very thought of it makes some people very, very happy. At PyCon 2007 Guido openly stated that he would not be against a GIL-less implementation of Python, provided someone coughed up the patch itself. Right now, that someone is Adam Olsen — an amateur programmer who has been working on a patch to the CPython interpreter since July of 2007.
It’s PyCon. I’m supposed to be listening to a talk, but I’ve fallen down the rabbit hole of a future without a global interpreter lock. I’m locked in on getting a patched version of the interpreter up and running on Mac OS/X and the patch author, Adam Olsen, is coaching me through changes to some of the deepest internals of Python itself.
» Read the rest of this entry «
February 1st, 2009 § § permalink
There are a plethora of mechanisms and technologies surrounding concurrent programming — Python has support for many of them. In this article we will explain, examine, and benchmark Python’s threading support, and discuss the much maligned Global Interpreter Lock (GIL).
This is a reprint of a featured article I wrote for Python Magazine that was published in the December 2007 issue. This article assisted in inspiring me to write PEP 371.
» Read the rest of this entry «