Stackless: You got your coroutines in my subroutines.

by jesse


Note: This is another post in what I hope will be a series leading up to my concurrency/distributed systems talk at PyCon. I'm steadily working through, experimenting with, and learning the various frameworks/libraries in the Python ecosystem. I reserve the right (and probably will exercise it) to revise these entries based on feedback from people (mainly the author(s) of said tool(s)). I will also add additional bits and pieces as I learn and explore more.

Stackless Python - here's another big one on the pile - is more than a library or a framework which runs on CPython: Stackless is actually a modified version of the CPython interpreter itself, not just a C extension. Stackless is in use by various people and companies - most notably, it's in use by CCP Games, makers of Eve Online (see this PyCon presentation). In fact, CCP Games is a large part of why Stackless is still around today.

I say that intentionally - quoting the readme in the Stackless/ directory of the distribution (here):

In 2003, the fabulous PyPy project was started, which is still performing very well. I have implemented Stackless for PyPy (with lots of support from Armin), and it is so incredibly much nicer. No more fiddling with existing calling conventions, no compromises, everything that needs to be stackless also is.

Unfortunately, PyPy is still not fast and complete enough to take over. This means my users are still whining for an update every time CPython gets an update. And maintaining this code gets more and more of a nightmare for me, since I have the nice PyPy version, and I hate hacking this clumsy C code again and again.

The original author, Christian Tismer, has largely moved on to PyPy, which is still in its infancy (although I read through bits of the code base frequently - it's pretty), and further development has mostly stalled, minus the improvements Richard M. Tew (CCP Games) and others have made. There's still life in it.

Fundamentally, Stackless changes the interpreter internals to alter the way the C call stack is manipulated/used, as well as to add the other nice bits Stackless offers. Stackless simply doesn't use the C call stack for each microthread - all told, each microthread only has a few kilobytes of overhead, which is awesome.

It adds something called microthreads and does other patching to python-core. Normal OS/POSIX threads require a fair amount of resources to create and run - in the case of Python, each thread has to get its own stack, and this costs memory. With Stackless' microthread support, you get "threads", but threads which cost significantly less, and potentially execute faster due to context-switching improvements (no need to go from user->kernel->user and so on).

Point of Order: Before I continue, I want to clear up a common misconception I've heard - Stackless does not, in any way, remove the Global Interpreter Lock. No sir. It's still there. Lurking. Waiting to steal your candy. Also, it still has a stack, so it's not truly "stackless".

So, microthreads are smaller and require less OS hand-holding for context switching, and ultimately can be (and are) scheduled by the interpreter, rather than the operating system.

Install note for OS X users: you need to pass "--enable-stacklessfewerregisters" to configure; otherwise, make pukes on you.

Stackless is a basic implementation of these - for a simple resource-usage example, I wrote a script which spawns 2,000 threads (sorry, Windows) and 2,000 tasklets. I watched the memory usage of both:

import time
import threading

def func():
    # Hold each thread alive for two minutes so we can watch memory usage.
    time.sleep(120)

threads = [threading.Thread(target=func) for i in range(2000)]
for i in threads:
    i.start()
for i in threads:
    i.join()

For the threaded script, the resident size was 42M and the virtual size was 1037M. Versus the Stackless version:

import time
import stackless

def func():
    # Sleep in one-second slices, yielding to the scheduler between each
    # so the other tasklets get a chance to run.
    for i in range(120):
        time.sleep(1)
        stackless.schedule()

for i in range(2000):
    stackless.tasklet(func)()

stackless.run()

Stackless had a resident size of 3416K and a virtual size of 22M - virtually microscopic next to the heavier thread version. Obviously, they are not line-for-line comparisons - the Stackless version, like other cooperative multitasking systems, requires that each tasklet be a good citizen and not block execution forever, instead rescheduling itself or otherwise yielding to allow others to run. If a tasklet blocks on a socket, everyone blocks on that tasklet.
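To make the good-citizen point concrete, here's a minimal sketch of my own (not from the Stackless docs) of two cooperating tasklets - swap the schedule() call for a busy loop or a blocking socket read, and the other tasklet never runs again:

import stackless

def polite(name):
    for i in range(3):
        print name, 'step', i
        stackless.schedule()   # yield so the other tasklet gets a turn

stackless.tasklet(polite)('a')
stackless.tasklet(polite)('b')
stackless.run()   # output interleaves: a/b/a/b/a/b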

Someone asked me to track the linear growth of the threaded numbers vs. the tasklet numbers. Since I'm a sucker, I thought I'd take him up on it (OS X 10.5, 4GB of RAM, Core 2 Duo):

Threads:

Num Threads    Resident Size    Virtual Size
2              3412K            23M
200            7336K            123M
2,000          42M              1037M

Tasklets:

Num Tasklets    Resident Size    Virtual Size
2               3128K            21M
200             3164K            21M
2,000           3408K            22M
20,000          5920K            24M
100,000         17M              34M

One note - the Stackless numbers should be low, but not this low (from my understanding, and a review from others). Anyone have any ideas?

There are the numbers - lots of threads are going to consume lots of RAM. With Stackless, a given tasklet only averages a few kilobytes in size, so the memory footprint stays small as you raise the count. Additionally, note the two counts at the bottom of the tasklets table; you can't spawn that many threads (depending on your OS and configuration), and even if you could, the memory footprint would be costly.

Now, in the age of cheap-ass RAM, where you can trick out a desktop or server with 16GB sticks, people might argue "so what" - but on machines where memory is constrained, such as smaller notebooks, embedded devices, or game consoles, this is a critical thing to take into consideration.

If you look at the Stackless code, there is another big thing to realize: Stackless, like other frameworks or systems which use a scheduler built into the interpreter, gives you the benefit (and the task) of scheduling when your tasklets/components/etc. execute. This gives you more control, but more responsibility. Stackless offers both cooperative and preemptive scheduling, though the preemptive scheduling doesn't feel right. More on scheduling here.
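For the curious, here's a rough sketch of the preemptive mode - if I'm reading the docs right, stackless.run() takes a timeout measured in interpreter instructions, and returns the tasklet it interrupted once that budget is spent:

import stackless

def spinner():
    # Deliberately hogs the interpreter and never yields.
    n = 0
    while True:
        n += 1

stackless.tasklet(spinner)()

# A plain stackless.run() would spin forever here. With a timeout, the
# scheduler's watchdog interrupts the running tasklet after roughly
# 10,000 bytecode instructions and hands it back to us.
victim = stackless.run(10000)
print 'interrupted:', victim
victim.kill()   # raise TaskletExit in the runaway tasklet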

So, we've determined that Stackless tasklets are smaller, right? Pretty simple.

If you've read the other things I've written on Kamaelia/Twisted/etc., you'll recognize the concepts within Stackless pretty quickly - a tasklet is a component, a thread of work, and tasklets intercommunicate via channels. Here's a little example of two tasklets communicating:

import stackless

def chicken(channel):
    # send() blocks until another tasklet is ready to receive.
    channel.send('cluck')

def egg(channel):
    # receive() blocks until another tasklet sends.
    print channel.receive()

channel = stackless.channel()
stackless.tasklet(chicken)(channel)
stackless.tasklet(egg)(channel)
stackless.run()

Pretty easy, and a tiny amount of code. The concept of tasklets/microthreads isn't a new one - in fact, it's how Erlang gets its groove on. Erlang doesn't use native OS threads; instead, it uses microthreads scheduled by the Erlang runtime. However, they are not directly comparable. Stackless isn't running across cores - Erlang does; Stackless, due to the GIL, has to obey the same rules as the rest of python-core. For more on "erlang v. stackless", see this.

Oh, and you can share normal objects via the channel too:

import stackless

def yes(channel):
    # Take the list, tag it, and pass it along.
    x = channel.receive()
    x.append('yes')
    channel.send(x)

def no(channel):
    x = channel.receive()
    x.append('no')
    channel.send(x)

channel = stackless.channel()
stackless.tasklet(yes)(channel)
stackless.tasklet(no)(channel)
channel.send([])         # kick off the relay with an empty list
stackless.run()
print channel.receive()  # ['yes', 'no']

Moving on, Stackless offers something else - the ability to pickle tasklets. This means you can pickle up a tasklet and send it over the wire to another machine, then unpickle it and continue running it - channels get pickled too. Locally, this means you can save it to disk, and then resume state easily.
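Here's a small sketch of the local case, patterned loosely on the example in the Stackless docs - run a tasklet partway, freeze it with pickle, kill the original, thaw the copy, and watch it resume mid-loop (I believe a tasklet paused at schedule() pickles cleanly, but treat this as a sketch, not gospel):

import pickle
import stackless

def counter():
    n = 0
    while n < 5:
        n += 1
        print 'tick', n
        stackless.schedule()

task = stackless.tasklet(counter)()
stackless.schedule()   # tick 1
stackless.schedule()   # tick 2

frozen = pickle.dumps(task)   # snapshot mid-loop; n rides along in the frame
task.kill()

task = pickle.loads(frozen)
task.insert()          # put the thawed tasklet back in the scheduler
stackless.run()        # tick 3, tick 4, tick 5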

You could use this to create a tasklet which listened on a port for data on the local machine and passed the data off the wire to the channel - when you pickled the channel (or the tasklets with that channel in scope) and sent it over the wire, they would pick up listening on the same port number on the remote machine. You lose current sessions, yes, but you could also detect active sessions and handle those gracefully.

This is nice for, say, a component you wanted to be able to easily send to other machines to help with load balancing. In theory, you could auto-detect new servers being added to a cluster, and when a server came up into a "ready" state, send it the daemon it should handle - and the tasklets would pick up where they left off (minus the sessions).
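In that spirit, here's a hand-wavy sketch of what the shipping side might look like - ship() and receive_and_resume() are hypothetical helpers of mine, not Stackless API; only pickle, socket, and the tasklet methods are real:

import pickle
import socket
import stackless

def ship(task, host, port):
    # Hypothetical helper: freeze a tasklet and push the bytes to
    # another machine running the same Stackless build.
    blob = pickle.dumps(task)
    conn = socket.create_connection((host, port))
    conn.sendall(blob)
    conn.close()

def receive_and_resume(port):
    # Hypothetical helper: accept the bytes, thaw the tasklet, and
    # let it pick up where it left off in the local scheduler.
    srv = socket.socket()
    srv.bind(('', port))
    srv.listen(1)
    conn, _ = srv.accept()
    chunks = []
    while True:
        data = conn.recv(4096)
        if not data:
            break
        chunks.append(data)
    conn.close()
    task = pickle.loads(''.join(chunks))
    task.insert()
    stackless.run()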

Otherwise, pickling channels and tasklets could be used for a few things - you have to think of them in terms of coroutines (here): you should be able to suspend processing state and then simply resume where you left off. If you've got python brains - pickle-able generators. You could put them in a database, but the use of that escapes me at the moment.

Oh - and pickled tasklets and channels can be shared amongst different architectures too, as long as the target architecture runs the same version of Stackless and supports Stackless.

To continue on - if you want to add threads into the mix as well, Stackless tasklets can be run within Python threads; however, those tasklets are local to that thread, and each thread gets its own scheduler. Your main application thread has its own scheduler, and so on.
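A quick sketch of that, assuming I've got the semantics right - each OS thread builds and runs its own little tasklet world, and stackless.run() in one thread never touches the other thread's tasklets:

import threading
import stackless

def tick(name):
    for i in range(3):
        print name, i
        stackless.schedule()

def worker(name):
    # Tasklets created here belong to this thread's scheduler.
    stackless.tasklet(tick)(name + '-a')
    stackless.tasklet(tick)(name + '-b')
    stackless.run()   # runs only this thread's tasklets

threads = [threading.Thread(target=worker, args=('t%d' % i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()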

Stackless' tasklet/channel system is quite nice; however, note that I'm not saying Stackless is the only way into this magical world - it's not, especially with the plethora of coroutine/greenlet/etc. packages for Python today, and the continued work towards making generators more awesome. I'm just showing what Stackless can/could do.

The primitives within Stackless are nice - frankly, I'd like a lightweight green-thread implementation in python-core on which we could build a nice Actor library, as well as get the lower memory footprint/etc. However, in order to use these primitives within Stackless, you'd find yourself building your own abstraction layer/framework (for example, concurrence) to really get a lot of mileage out of it. This is why people run Twisted on top of it, CCP Games has the uthread library (which you can see here), and so on.
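To illustrate the kind of abstraction layer I mean, here's a toy Actor sketch of my own on top of tasklets and channels - nothing like a production library, just the shape of one:

import stackless

class Actor(object):
    # A toy actor: one tasklet draining one channel used as a mailbox.
    def __init__(self):
        self.mailbox = stackless.channel()
        stackless.tasklet(self._loop)()

    def send(self, msg):
        self.mailbox.send(msg)

    def _loop(self):
        while True:
            self.receive(self.mailbox.receive())

    def receive(self, msg):
        raise NotImplementedError

class Printer(Actor):
    def receive(self, msg):
        print 'got:', msg

p = Printer()
p.send('hello')   # blocks until the actor's loop takes it
stackless.run()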

The cost of a deployment of Stackless shouldn't be underestimated, though - it's got some magic assembler code within it, which isn't the most portable of goods (across OSes, compilers, compiler versions, etc.). Some platforms simply aren't supported because of this. Not to mention, it's an entirely new interpreter, which has a cost much higher than that of an extension module.

A few people I've been talking with asked me the simple question - "Why hasn't any of this been pushed into python-core?" Well, in short, it was never really proposed (by Christian), and the changes within Stackless are invasive; the last serious discussion was from 2007 (see this).

With Stackless, it's difficult - I think there is a perceived complexity about the code, and then there is real complexity. I suspect both of these are high in the case of Stackless due to the nature of the problem it is trying to solve, namely bolting a feature like this onto an interpreter not meant for it. I think that due to this, and due to Christian and others moving on to the greener pastures of PyPy, inclusion into core simply won't happen.

Resources: