An Interview With Adam Olsen, Author of Safe Threading | Completely Different

February 2nd, 2009 § 8 comments

This is a reprint of an arti­cle I wrote for Python Mag­a­zine as a Com­pletely Dif­fer­ent col­umn that was pub­lished in the June 2008 issue.

A world with­out a Global Inter­preter Lock (GIL) — the very thought of it makes some peo­ple very, very happy. At PyCon 2007 Guido openly stated that he would not be against a GIL-less imple­men­ta­tion of Python, pro­vided some­one coughed up the patch itself. Right now, that some­one is Adam Olsen — an ama­teur pro­gram­mer who has been work­ing on a patch to the CPython inter­preter since July of 2007.

It’s PyCon. I’m sup­posed to be lis­ten­ing to a talk, but I’ve fallen down the rab­bit hole of a future with­out a global inter­preter lock. I’m locked in on get­ting a patched ver­sion of the inter­preter up and run­ning on Mac OS/X and the patch author, Adam Olsen, is coach­ing me through changes to some of the deep­est inter­nals of Python itself.

For about a year, Adam has been work­ing on the “safe thread­ing” project for Python 3000. In this project, he has attempted to address many of the com­mon issues pro­gram­mers fac­ing highly threaded and highly con­cur­rent appli­ca­tions. These prob­lems include dead­locks, iso­la­tion of shared objects (to pre­vent corruption/locking issues) and finally, as a side-effect of mak­ing thread­ing safer, the removal of the Global Inter­preter Lock.

Adam would be the first to point out that adding ”–without-gil” to the Make­file for the C ver­sion of the inter­preter was actu­ally a side-effect of the bulk of his work. At 938 kilo­bytes, I would say his diff against the CPython code base that pro­duces an inter­preter with a safe, clear, and con­cise thread­ing model for local con­cur­rency is a bit more than a side effect.

It is clear that he lives for a con­cur­rent and threaded world, and Adam has filled in a lot of gaps in my knowl­edge about con­cur­rency in our past con­ver­sa­tions. I’ve been lucky enough to inter­view him about the safe thread­ing project and his out­look on all of this as well.

First off, what’s your background?

I’m an ama­teur pro­gram­mer, self taught. I’ve had a long inter­est in object mod­els and con­cur­rency, such as how wid­gets in GTK interact.

I’ve explored twisted a bit, as well as Python’s exist­ing thread­ing. Addi­tion­ally, I’ve exper­i­mented a great deal with dif­fer­ent ways to uti­lize gen­er­a­tors or threads, actors, futures, coop­er­a­tive ver­sus pre­emp­tive sched­ul­ing, and so on.

Can you explain the basic premise behind the Safe Thread­ing part of the project?

Make the com­mon uses easy. Don’t nec­es­sar­ily make it impos­si­ble to get wrong (every­thing is a trade­off!), but give the pro­gram­mer a fight­ing chance.

How about the “Free Thread­ing” part (–without-gil)?

Every­body seems to know you don’t need lock­ing if you’re not mod­i­fy­ing an object, but Python demands a trace­back, not a seg­fault if the pro­gram­mer gets it wrong. Mon­i­tors and share­abil­ity pro­vide a frame­work that sat­is­fies both.

In essence, remov­ing the GIL was a bonus to avoid­ing unin­tended con­flicts between threads.

Many peo­ple accuse the “threaded pro­gram­ming” par­a­digm as impos­si­ble to get right. Even Brian Goetz has stated that it is extremely hard to “get right”. If this is the case, why try to “fix” thread­ing in Python?

What most peo­ple see as a prob­lem with threads, I see as a prob­lem with the mem­ory model. They let threads mod­ify the same object simul­ta­ne­ously, result­ing in arbi­trary, ill-defined results. The com­plex­ity here explodes, quickly turn­ing the programmer’s brain into mush.

The solu­tion is to iso­late most objects. Keep all these muta­ble objects back in the sequen­tial, single-threaded world. Mul­ti­ple processes let you do this. Actors do it too. So do monitors.

edi­tor: Actors can be thought of as Objects (for Object Ori­ented pro­gram­mers) except that in the Actor model, all Actors exe­cute simul­ta­ne­ously and can cre­ate more Actors, main­tain their own state, and com­mu­ni­cate via asyn­chro­nous mes­sage pass­ing. A Mon­i­tor can be thought of as an object that encap­su­lates another object intended for use by mul­ti­ple threads. The Mon­i­tor con­trols all lock­ing seman­tics for the encap­su­lated objects.

So from that view, processes, actors, and mon­i­tors are all equiv­a­lent. The only rea­son I use mon­i­tors and build them on OS threads is that it fits bet­ter with the exist­ing Python lan­guage and is much more effi­cient for the way Python uses them. I could take a page from Erlang and call them “processes”, but I think in the long run that would be more con­fus­ing, not less.

In look­ing through the diff for safe thread, you’ve had to touch a lot. Every­thing from object allo­ca­tion to the entire way threads are man­aged. What was the hairi­est series of changes you’ve had to make?

Hard to say — I’ve been at this nearly a year with still lots to do. I could men­tion when I found that atomic ref­count­ing didn’t scale, it spelled doom for the removal of GIL until I came up with a viable asyn­chro­nous scheme. Straight­en­ing out object allo­ca­tion was also pretty nasty, as stock CPython uses a twisted maze of macros and wrap­per func­tions for that.

The worst was prob­a­bly decid­ing how to han­dle class and mod­ule dic­tio­nar­ies. What you’ve got here are muta­ble objects, inher­ently used simul­ta­ne­ously by mul­ti­ple threads, no clear cut­off point after which they’re no longer mod­i­fied, and a mas­sive amount of implicit accesses ingrained as a fun­da­men­tal part of the lan­guage. I really wanted to impose some order on this, add some clear bound­aries to when they’re mod­i­fied ver­sus accessed, and make them live up to the “explicit is bet­ter than implicit” ideal.

I couldn’t do it though. Implicit access is too ingrained into the lan­guage. Even­tu­ally I con­ceded defeat, then embraced it, cod­i­fy­ing dict’s exist­ing API as shareddict’s actual API. In doing this I also switched to a rel­a­tively sim­ple read/write lock as shareddict’s pro­tec­tion (rel­a­tive to what came before!). In the end, the only restric­tion was that the con­tents of shared­dict must them­selves be shareable.

Isn’t a mon­i­tor equiv­a­lent to adding an @synchronized or a with:lock state­ment around your code? Is using a mon­i­tor for the muta­ble objects that much faster than lock.acquire and lock.release?

edi­tor: See Listing1.py for a mon­i­tor exam­ple. You will need to be run­ning Python 3000 with Adam’s patch for the code to work.

List­ing 1:

?View Code PYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
# See the requirements: You must apply Adam's patch to Python 3000
# for this.
from __future__ import shared_module
from threadtools import Monitor, monitormethod, branch
 
class Counter(Monitor):
    """A simple counter, shared between threads"""
    __shared__ = True  # More shared_module boilerplate
    def __init__(self):
        self.count = 0
 
    @monitormethod
    def tick(self):
        self.count += 1
 
    @monitormethod
    def value(self):
        return self.count
 
def work(c):
    for i in range(20):
        c.tick()
 
def main():
    c = Counter()
 
    with branch() as children:
        for i in range(10):
            children.add(work, c)
 
    print("Number of ticks:", c.value())

Super­fi­cially, yes, but it has deeper seman­tics as well. The biggest is that it imposes a share­abil­ity require­ment to all objects passed in or out, and there’s no way to bypass it. This basi­cally forces you to be explicit about how you expect threads to mod­ify muta­ble objects.

It also lets me use a bit saner recov­ery from dead­locks than would be pos­si­ble using with:lock. Not that much though and there’s cer­tain cir­cum­stances when there are no ideal ways to recover.

Per­for­mance wise, lock.acquire/lock.release are irrel­e­vant. The real com­pe­ti­tion is with adding a lock to every object, such as a list. What seems like a sim­ple ”if x: x.append(42)” actu­ally requires 2 acquire/release pairs — some­thing like ”x.extend(y)” would require a pair for every item in y. This could eas­ily add up to thou­sands of lock oper­a­tions where a mon­i­tor lets you get away with just one.

How did you han­dle Garbage Collection?

Painfully. I orig­i­nally attempted to use sim­ple atomic inte­ger oper­a­tions for the ref­count­ing, but I found they didn’t work. Well, they were cor­rect, but they didn’t give me the ben­e­fit of remov­ing the GIL. Mul­ti­ple CPUs/cores would fight over the cache line con­tain­ing the ref­count, slow­ing every­thing to a crawl.

I solved that by adding a sec­ond mode for ref­count­ing. An object starts out like nor­mal, but once a sec­ond thread accesses the ref­count it switches to an asyn­chro­nous mode. In this mode each thread buffers up their own ref­count changes, writ­ing out all the changes to a sin­gle ref­count at once. Even bet­ter, if the net change is 0 it can avoid writ­ing any­thing at all!

The catch is, you can no longer delete the object when the ref­count hits 0, as another thread might have out­stand­ing changes. Instead, I mod­i­fied the trac­ing GC to keep a list of all objects and had it occa­sion­ally flush the buffers and check for objects with a ref­count of 0.

Your Branching-as-children method (see List­ing 2) of spawn­ing, imple­mented in your patch, devi­ates from the cur­rent thread­ing mod­ule approach. Why not just over­lay your work on the exist­ing API?

List­ing 2:

?View Code PYTHON
1
2
3
with branch() as children:
    for i in range(10):
        children.add(work, arg1, arg2)

Branch basi­cally wraps up best prac­tices into a sin­gle con­struct. It prop­a­gates excep­tions, han­dles can­cel­la­tion, lets you pass out return val­ues, and ensures you don’t acci­den­tally leave a child thread run­ning after you’ve returned.

You can still leave threads run­ning after a func­tion returns, you just need to use a branch that’s higher up in the call stack. Later on, I might add a built-in one just above the main mod­ule just for this purpose.

What about stopping/pausing child threads?

Paus­ing isn’t pos­si­ble, but can­cel­la­tion serves the pur­pose of stop­ping. Essen­tially it sets a flag on that thread to tell it to stop, as well as mak­ing sure par­tic­i­pat­ing I/O func­tions will check that flag and end them­selves promptly.

How did you han­dle thread-safe imports?

Most of this isn’t imple­mented yet, but the basic idea is that each mod­ule will be either share­able or unshare­able. Unshare­able mod­ules work nor­mally if imported from the main thread, but if another thread tries to import one they won’t get past the pars­ing phase — just enough to try to detect ”from __future__ import shared_module”.

Mod­ules found to be share­able are placed in their own Mon­i­tor­Space (the under­ly­ing tool used by a Mon­i­tor) before the Python code in them is exe­cuted. This sep­a­rates them from the main thread, so I won’t need the main thread’s coop­er­a­tion to load them.

In your imple­men­ta­tion, you use libatomic-ops, essen­tially adding a new python build/library depen­dency — what does this buy you over using stan­dard lock­ing primitives?

Scal­a­bil­ity. I can use an atomic read and, so long as the mem­ory doesn’t get mod­i­fied, all the CPUs/cores will pull it into their own cache. If I used a lock it would inher­ently involve a write as well, mean­ing only one CPU/core would have it cached at a time.

For some appli­ca­tions it also hap­pens to be a great deal lighter than a lock. It may be both eas­ier to use and faster.

You state on your page about the Dead Lock Fal­lacy, that “Ulti­mately, good style and a robust lan­guage will pro­duce cor­rect pro­grams, not a lan­guage that tries to make it impos­si­ble to go wrong.” What lan­guage tries to make it impos­si­ble to go wrong? Why (again) not just ditch thread­ing and move to, say, Erlang?

Con­cur­rent Pas­cal would be the great old exam­ple — they intro­duce mon­i­tors there, but apply a great deal more restric­tions as well. Ulti­mately though, the lan­guage is focused on hard real-time appli­ca­tions, and it shows. Python and safethread are focused on gen­eral pur­pose appli­ca­tions, so usabil­ity is more important.

Erlang’s a pretty sim­i­lar sit­u­a­tion. It was designed for real time, dis­trib­uted, fault-tolerant appli­ca­tions. It wants you to use one-way mes­sages (not the two-way func­tion call). It copies every­thing passed in those mes­sages. Good trade­offs for its focus, but bad for a gen­eral pur­pose language.

What sorts of CPython bugs have you found delv­ing this deep into the codebase?

Just a few scat­tered lit­tle ones. My favorite was a ref­count­ing bug involv­ing dicts, but it could only occur using recur­sive mod­i­fi­ca­tion or thread­ing — obvi­ously with shared­dict I make the lat­ter a lit­tle more likely (but only recur­sive mod­i­fi­ca­tion is pos­si­ble with the nor­mal dict).

Although, that’s not includ­ing the threading/interpreter state APIs. Most of that code was pretty messy; lots of bugs lurk­ing around. It was quite sat­is­fy­ing to rip it out.

How have your changes altered the API that C exten­sion writ­ers use? Given that the “bonus” of the GIL is a sim­ple inter­face for exten­sion writ­ers via the Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS macros — does safethread intro­duce more complexity?

Most of my changes are cleanup and sim­pli­fi­ca­tion. The tp_alloc/tp_free slots are gone — every­thing uses PyObject_New and PyObject_Del. PyObject_GC_New/Del are gone too. The old GIL macros are directly replaced by PyState_Suspend()/PyState_Resume().

How­ever, there are new options to take advan­tage of. Exten­sions doing I/O or other block­ing oper­a­tions should use the can­cel­la­tion API. Mod­ules wish­ing to be share­able should be audited, then apply Py_TPFLAGS_SHAREABLE/METH_SHARED, as appro­pri­ate. How­ever, if they do that they also need to call PyArg_RequireShareable/PyArg_RequireShareableReturn if there’s the poten­tial to share objects between threads (Mon­i­tor­Spaces technically).

You don’t sup­port the old thread­ing API right now — but would it be pos­si­ble to add in backwards-compatible sup­port, or is it sim­ply unfeasible?

For the C API, it’d be easy to retain the old GIL macros. Other parts may not be so easy.

For the Python API, some are easy, some aren’t pos­si­ble, and some are just painful.

Adding equiv­a­lents to Lock, Sem­a­phore, and Queue is easy. Eas­ier than the orig­i­nals in fact. Get­ting all the minor details right (such as, if you sub­class it) might be harder/impossible. Lock would not sup­port dead­lock detec­tion, but it would be cancelable.

Dae­mon threads will likely not be sup­ported, but, in my opin­ion, they’re bro­ken by design anyway.

The painful part is res­ur­rect­ing the GIL, so these “clas­sic” threads can share arbi­trary objects like they always did. How­ever, I won’t make it so global — they’ll acquire/release the main Mon­i­tor­Space instead, so all the new-style threads (cre­ated using branch()) will not be slowed down.

Finally, you’ve pointed out that “real thread­ing” does not equal dis­trib­uted pro­gram­ming, only local con­cur­rency (i.e: sup­port for mul­ti­ple cores). What do you think Python could do to sup­port dis­trib­uted com­put­ing (pro­vid­ing the GIL-less world comes to fruition)?

At this point, it’s con­fus­ing. Much of the focus is to work around the GIL, to take advan­tage of mul­ti­ple cores. With safethread inte­grated into Python, I think many of the distributed/multiprocess projects would die off. What’s left would be the ones that *really* want to be dis­trib­uted and need mul­ti­ple boxes, not mul­ti­ple cores.

In my mind, there are three main char­ac­ter­is­tics of dis­trib­uted pro­gram­ming; although, a given frame­work, may only use one or two:

- secu­rity — you don’t trust the other nodes, they don’t trust you. This often takes the form of sand­boxes on a local box or capa­bil­ity sys­tem.
fault-tolerance — a hard­ware fail­ure on one box should only bring down that box, not every other box con­nected to it. Upgrad­ing the soft­ware of one box at a time should also be pos­si­ble.
latency — ask­ing another node (even on a LAN) can eas­ily be sev­eral orders of mag­ni­tude slower than read­ing from your own RAM or, even bet­ter, your cache.

All these lead to dif­fer­ent trade­offs. You really need to min­i­mize the com­mu­ni­ca­tion between nodes by push­ing them apart, whereas safethread is only con­cerned about mak­ing it eas­ier to write cor­rect programs.

The bot­tom line is that safethread lets you do the easy stuff (local con­cur­rency) so that you only need to do the hard stuff (dis­trib­uted pro­gram­ming) when you really need it.

Con­clu­sion

Every­one is wel­come to down­load, con­tribute, and try out Adam’s patches. Bug reports, code, emails are all wel­come. There are active dis­cus­sion on the Python 3000 mail­ing list about all of this and more sug­ges­tions are welcome.

In all, there is a lot of inter­est in Adam’s work. There was a lot of dis­cus­sion around con­cur­rency, threads, and the GIL at Pycon this year, and with Python 3000 com­ing down the pipe with the “mul­ti­core future” loom­ing, things are get­ting interesting.

Related Links

  • Stan

    A very inter­est­ing read, but it sounds like python-safethread is already dead, accord­ing to the front page:

    Note: As an imple­men­ta­tion, python-safethread is dead. It is not worth the effort to con­tinue rewrit­ing CPython to have a true trac­ing GC.

    How­ever, the seman­tics pre­sented are still viable and I intend to reuse them in future projects.”

    Does this mean we will have to wait for PyPy in the far future to gen­er­ate a “CPython” (of course, not *the* CPython) imple­men­ta­tion with­out a GIL?

  • Adam Olsen

    Wait for PyPy or wait for me to do an alter­nate imple­men­ta­tion. Hard to say which’ll be first. ;)

  • Jospeh Lisee

    This is sad, I was really hop­ing for your work to make it into core. Maybe you should of (still should?) try for a branch in offi­cial python SVN so your work could of (still could?) been inte­grated bet­ter and worked on by more Python devs.

    Yes, I have a multi-threaded appli­ca­tion and pythons GIL cop-out requires me to use too much C++ code. I also don’t think things like mul­ti­pro­cess­ing are the answer.

  • Adam Olsen

    Unfor­tu­nately, my GC changes didn’t pan out. They looked promis­ing on my old dual core lap­top, but very dis­ap­point­ing on my newer quad core desktop.

    I almost started sep­a­rat­ing the seman­tics of branch­ing and mon­i­tors from the GC and GIL removal, but it’s not nearly as inter­est­ing. Most peo­ple want per­for­mance, not safety and ease of use.

  • http://lazypython.blogspot.com Alex

    Isn’t one pos­si­ble solution(and I know it’s been dis­cussed on python-dev) replac­ing the ref count­ing with a gen­er­a­tional GC, since this means you don’t need all those locks? If yes could this hypo­thet­i­cally be han­dled as a diff out­side of svn just as safe-threading was?

  • Adam Olsen

    The locks aren’t the issue — con­tention of atomic ref­count­ing is.

    Chang­ing the core to use a trac­ing GC (note: a gen­er­a­tional GC is just one vari­ant), not built on ref­count­ing, is pos­si­ble. The issue is that, for an exact GC, it’d be a rather mas­sive diff (as safethread is). The result­ing code would also be rather ugly. Between the two there isn’t enough impe­tus to do it.

    A con­ser­v­a­tive GC, such as the Boehm-Demers-Weiser GC, would require less changes. How­ever, retain­ing the Py_INCREF/Py_DECREF API for third-party exten­sions would be harder, and there’s a risk of com­pat­i­bil­ity issues (point­ers stored in untracked mem­ory, etc).

  • http://lazypython.blogspot.com Alex

    Isn’t one pos­si­ble solution(and I know it’s been dis­cussed on python-dev) replac­ing the ref count­ing with a gen­er­a­tional GC, since this means you don’t need all those locks? If yes could this hypo­thet­i­cally be han­dled as a diff out­side of svn just as safe-threading was?

  • Adam Olsen

    The locks aren’t the issue — con­tention of atomic ref­count­ing is.

    Chang­ing the core to use a trac­ing GC (note: a gen­er­a­tional GC is just one vari­ant), not built on ref­count­ing, is pos­si­ble. The issue is that, for an exact GC, it’d be a rather mas­sive diff (as safethread is). The result­ing code would also be rather ugly. Between the two there isn’t enough impe­tus to do it.

    A con­ser­v­a­tive GC, such as the Boehm-Demers-Weiser GC, would require less changes. How­ever, retain­ing the Py_INCREF/Py_DECREF API for third-party exten­sions would be harder, and there’s a risk of com­pat­i­bil­ity issues (point­ers stored in untracked mem­ory, etc).

What's this?

You are currently reading An Interview With Adam Olsen, Author of Safe Threading | Completely Different at jessenoller.com.

meta