Stackless: You got your coroutines in my subroutines.

February 23rd, 2009 § 7 comments

Note:This is another post in what I hope will be a series lead­ing up to my concurrency/distributed sys­tems talk at PyCon. I’m steadily work­ing through exper­i­ment­ing with and learn­ing the var­i­ous frameworks/libraries in the python ecosystem.

I reserve the right (and prob­a­bly will) to revise these entries based on feed­back from peo­ple (mainly the author(s) of said tool(s)). I will also add addi­tional bits and pieces as I learn and explore more./Note

Stack­less python — here’s another big one on the pile — is much more than a library, or a frame­work which runs on CPython — Stack­less is actu­ally a mod­i­fied ver­sion of the CPython inter­preter. It’s much more than just a C-extension. Stack­less is in use by var­i­ous peo­ple and com­pa­nies — most notably, it’s in use by CCP Games, mak­ers of Eve Online (see this pycon pre­sen­ta­tion). In fact, CCP Games is a large part of why Stack­less is still around today.

I say that inten­tion­ally — quot­ing the readme in the Stackless/ direc­tory of the dis­tri­b­u­tion(here):

In 2003, the fab­u­lous PyPy por­ject was started, which is still
per­form­ing very well. I have imple­mented Stack­less for PyPy
(with lots of sup­port from Armin), and it is so incred­i­bly
much nicer. No more fid­dling with exist­ing call­ing con­ven­tions,
no com­pro­mizes, every­thing that needs to be stack­less also is.

Unfor­tu­an­tely, PyPy is still not fast and com­plete enough to
take over. This means, my users are still whin­ing for an update
all the time CPython gets an update.
And main­tain­ing this code gets more and more a night­mare for
me, since I have the nice PyPy ver­sion, and I hate hack­ing this
clumsy C code again and again.

The orig­i­nal author, Chris­t­ian Tismer largely moved onto PyPy, which is still largely in it infancy (although I read through bits of the code base fre­quently, it’s pretty), and fur­ther devel­op­ment has largely been stalled minus the improve­ments Richard M. Tew (CCP Games) and oth­ers have done. There’s still life in it.

Fun­da­men­tally, Stack­less mod­i­fies the inter­preter inter­nals a bit to mod­ify the way that the C call stack is manipulated/used as well as to add other other nice bits Stack­less offers (call stack). Stack­less sim­ply doesn’t use the C call stack — all told, each microthread only has a few kilo­bytes of over­head, which is awesome.

It adds some­thing called microthreads and does other patch­ing to python-core. Nor­mal OS/Posix threads require a fair amount of resources to cre­ate and run — in the case of Python, each thread has to get its own stack, this costs mem­ory. With Stack­less’ microthread sup­port — you get “threads”, but threads which cost a sig­nif­i­cantly less, and poten­tially exe­cute faster due to con­text switch­ing improve­ments (no need to go from user->kernel->user and so on).

Point of Order: Before I con­tinue, I want to clear up a com­mon mis­con­cep­tion I’ve heard — Stack­less, does not in any way, remove the Global Inter­preter Lock. No sir. It’s still there. Lurk­ing. Wait­ing to steal your candy. Also, it still has a stack, so it’s not truly “stackless”.

So, microthreads are smaller and require less OS hand hold­ing for con­text switch­ing, and ulti­mately can (and are) sched­uled by the inter­preter, rather than the oper­at­ing system.

install note for os/x users: You need to pass the “–enable-stacklessfewerregisters” to con­fig­ure, oth­er­wise, make pukes on you.

Stack­less is a basic imple­men­ta­tion of these — for a sim­ple resource exam­ple usage exam­ple, I wrote a sim­ple script which spawns 2,000 threads (sorry win­dows) and 2,000 tasklets. I watched the mem­ory usage of both:

?View Code PYTHON
1
2
3
4
5
6
7
8
9
10
import time
import threading
def func():
    time.sleep(120)
 
threads = [threading.Thread(target=func) for i in range(2000)]
for i in threads:
    i.start()
for i in threads:
    i.join()

For the threaded script — the res­i­dent size was 42M and the vir­tual size was 1037. Ver­sus the stack­less version:

?View Code PYTHON
1
2
3
4
5
6
7
8
9
10
11
12
import time
import stackless
 
def func():
    for i in range(120):
        time.sleep(1)
        stackless.schedule()
 
for i in range(2000):
    stackless.tasklet(func)()
 
stackless.run()

Stack­less had a res­i­dent size of 3416K and a vir­tual size of 22M — vir­tu­ally micro­scopic ver­sus the heav­ier thread ver­sion. Obvi­ously, they are not line for line com­par­isons — the Stack­less ver­sion, like other coop­er­a­tive mul­ti­task­ing sys­tems requires that each tasklet be a good cit­i­zen, and not block exe­cu­tion for­ever, instead resched­ul­ing itself or oth­er­wise yield­ing to allow for oth­ers to run. If a tasklet blocks on a socket, every­one blocks on that tasklet.

Some­one asked me to track the lin­ear growth of the threaded num­bers vs. the tasklet num­bers. Since I’m a sucker, I thought I’d take him up on it (OS/X 10.5, 4GB of ram, Core 2 Duo):

Threads:

Num Threads Res­i­dent Size Vir­tual Size
2 3412K 23M
200 7336K 123M
2,000 42M 1037M

Tasklets:

Num Tasklets Res­i­dent Size Vir­tual Size
2 3128K 21M
200 3164K 21M
2,000 3408K 22M
20,000 5920K 24M
100,000 17M 34M

One note — the Stack­less num­bers should be low, but not this low (from my under­stand­ing, and review from oth­ers), any­one have any ideas?

There’s the num­bers — lots of threads is going to con­sume lots of ram. With stack­less, a given tasklet is only a few kilo­bytes in aver­age size and there­fore the mem­ory foot­print is small when you start rais­ing the count. Addi­tion­ally, note the two counts at the bot­tom of the tasklets table; you can’t spawn that many threads (depend­ing on your OS and con­fig­u­ra­tion) and even if you could, the mem­ory foot­print would be costly.

Now, in the age of cheap-ass-ram, where you can trick out a desk­top or server with 16GB sticks, peo­ple might argue “so what” — but on machines where mem­ory is con­strained, such as smaller note­books, embed­ded devices, or game con­soles — this is a crit­i­cal thing to take into consideration.

If you look at the stack­less code, there is another big thing to real­ize; Stack­less like other frame­works or sys­tems which use an sched­uler built into the inter­preter gives you the benefit/task of sched­ul­ing when your tasklets/components/etc exe­cute. This gives you more con­trol, but more respon­si­bil­ity. Stack­less offers both coop­er­a­tive and pre­emp­tive sched­ul­ing, how­ever the pre­emp­tive sched­ul­ing doesn’t feel right. more on sched­ul­ing here

So, we’ve deter­mined that stack­less tasklets are smaller, right? Pretty simple.

If you’ve read the other things I’ve writ­ten on Kamaelia/Twisted/etc, you’ll rec­og­nize the con­cepts within Stack­less pretty quickly — a tasklet is a com­po­nent, a thread of work and tasklets inter­com­mu­ni­cate via chan­nels. For exam­ple, here’s a lit­tle exam­ple of two tasklets communicating:

?View Code PYTHON
1
2
3
4
5
6
7
8
9
10
11
12
import stackless
 
def chicken(channel):
    channel.send('cluck')
 
def egg(channel):
    print channel.receive()
 
channel = stackless.channel()
stackless.tasklet(chicken)(channel)
stackless.tasklet(egg)(channel)
stackless.run()

Pretty easy, and a tiny amount of code. The con­cept of tasklets/microthreads isn’t a new one — in fact, it’s how Erlang gets it’s groove on — Erlang doesn’t use native OS threads, instead, it uses microthreads sched­uled by the Erlang com­piler. How­ever, they are not directly com­pa­ra­ble. Stack­less isn’t run­ning across cores — Erlang does, stack­less, due to the GIL, has to obey the same rules as the rest of python-core. For more on “erlang v. stack­less”, see this.

Oh, and you can share nor­mal object via the chan­nel too:

?View Code PYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
import stackless
 
def yes(channel):
    x = channel.receive()
    x.append('yes')
    channel.send(x)
 
def no(channel):
    x = channel.receive()
    x.append('no')
    channel.send(x)
 
channel = stackless.channel()
stackless.tasklet(yes)(channel)
stackless.tasklet(no)(channel)
channel.send([])
stackless.run()
print channel.receive()

Mov­ing on, Stack­less offers some­thing else — the abil­ity to pickle tasklets. This means you can pickle up a tasklet and send it over the wire to another machine and then unpickle it and con­tinue run­ning it — chan­nels get pick­led too. Locally, this means to can save it to disk, and then resume state easily.

You could use this to gen­er­ate a tasklet which lis­tened on a port for data on the local machine, and passed the data off the wire to the chan­nel — when you pick­led the chan­nel or the tasklets with that chan­nel in scope, and sent it over the wire, they would pick up lis­ten­ing on the same port num­ber on the remote machine. You loose cur­rent ses­sions, yes, but you could also detect active ses­sions and han­dle those gracefully.

This is nice for say, a com­po­nent you wanted to be able to eas­ily send to other machines to help load bal­anc­ing. In the­ory, you could auto-detect new servers being added to a clus­ter, and when that server came up into a “ready” state, send it the dae­mon it should han­dle — and the tasklets would pick up where they left off (minus the sessions).

Oth­er­wise, pick­ling chan­nels and tasklets could be used for a few things — you have to think of them in terms of corou­tines (here) — you should be able to sus­pend pro­cess­ing state and then sim­ply resume where you left off. If you’ve got python brains — picke-able gen­er­a­tors. You could put them in a data­base; but the use of that escapes me at the moment.

Oh — and pick­led tasklets and chan­nels can be shared amongst dif­fer­ent archi­tec­tures too, as long as that archi­tec­ture is run­ning the same ver­sion of Stack­less, and sup­ports Stackless.

To con­tinue on — if you wanted to add threads into the mix as well — Stack­less tasklets can be run within Python threads, how­ever those tasklets are local to that thread, and each thread gets it’s own sched­uler. Your main appli­ca­tion thread has it’s own sched­uler, and so on.

Stack­less’ tasklet/channel sys­tem is quite nice, how­ever note that I’m not say­ing Stack­less is the only way into this mag­i­cal world — it’s not, espe­cially with a plethora of coroutine/greenlet/etc pack­ages for python today, and the con­tin­ued work towards mak­ing gen­er­a­tors more awe­some. I’m just show­ing what Stack­less can/could do.

The prim­i­tives within Stack­less are nice — frankly, I’d like a light weight green thread imple­men­ta­tion in python core on which we could build a nice Actor library, as well as sup­port the lower mem­ory footprint/etc. How­ever, in order to use these prim­i­tives within Stack­less — you’d find your­self build­ing your own abstrac­tion layer/framework (for exam­ple, con­cur­rence) to really get a lot of mileage out of it. This is why peo­ple run twisted on top of it, CCP Games has the uthread library (which you can see here) and so on.

The cost of a deploy­ment of Stack­less can not be under­es­ti­mated though — it’s got some magic assem­bler code within it, which isn’t the most portable of goods (ver­sus OSes, com­piler ver­sions, com­pil­ers, etc). Some plat­forms sim­ply aren’t sup­ported due to this. Not to men­tion, it’s an entirely new inter­preter, which has a cost much higher than that of an exten­sion module.

A few peo­ple who I’ve been talk­ing with asked me the sim­ple ques­tion — “Why hasn’t any of this been pushed into python-core”. Well, in short — it was never really pro­posed (by Chris­t­ian), and the changes within Stack­less — the last seri­ous dis­cus­sion was from 2007 (see this).

With Stack­less, it’s dif­fi­cult — I think there is a per­ceived com­plex­ity about the code and then there is real com­plex­ity. I sus­pect both of these are high in the case of Stack­less due to the nature of the prob­lem it is try­ing to solve; namely bolt­ing a fea­ture like this onto an inter­preter not meant for it. I think that due to this, and due to Chris­t­ian and oth­ers mov­ing onto the greener pas­tures of PyPy — inclu­sion into core sim­ply won’t happen.

Resources:

  • Richard Tew

    Regard­ing “Some plat­forms sim­ply aren’t sup­ported, or are only par­tially sup­ported due to this.”

    Either plat­forms are sup­ported or they are not. If there is no stack switch­ing code for a given plat­form, then Stack­less will com­pile as nor­mal Python. There is no such thing as par­tial support.

  • jnoller

    I’ve fixed it — I mis­read some­thing, I apologize.

  • fijal

    Note that there is one frus­trat­ing lim­i­ta­tion with threads — on linux, by default you can­not have more than 372 threads.
    period. That’s because each thread con­sumes 8M vir­tual mem­ory and you run out of addresss space.

    Cheers,
    fijal

  • jnoller

    Actu­ally, that’s incor­rect. I know for a fact given I have a machine
    run­ning with over 7325 threads active. I had to raise the ulimit a bit
    because they’re chew­ing on sock­ets. This is on ubuntu hardy heron,
    FWIW.

  • Adam Olsen

    That’s true for a 32-bit box with default stack sizes. Pro­grams using a large num­ber of threads fre­quently use much lower stack sizes (a fac­tor of 100 is pos­si­ble, depend­ing on what the threads do). 64-bit CPUs make this moot how­ever, as they have 64-bit point­ers and cur­rently sup­port up to 48-bits of address space, mean­ing about 25 mil­lion threads.

  • jnoller

    Approved

  • Adam Olsen

    That’s true for a 32-bit box with default stack sizes. Pro­grams using a large num­ber of threads fre­quently use much lower stack sizes (a fac­tor of 100 is pos­si­ble, depend­ing on what the threads do). 64-bit CPUs make this moot how­ever, as they have 64-bit point­ers and cur­rently sup­port up to 48-bits of address space, mean­ing about 25 mil­lion threads.

What's this?

You are currently reading Stackless: You got your coroutines in my subroutines. at jessenoller.com.

meta