This is a reprint of an article I wrote for Python Magazine as a Completely Different column that was published in the June 2008 issue.
A world without a Global Interpreter Lock (GIL) - the very thought of it makes some people very, very happy. At PyCon 2007 Guido openly stated that he would not be against a GIL-less implementation of Python, provided someone coughed up the patch itself. Right now, that someone is Adam Olsen - an amateur programmer who has been working on a patch to the CPython interpreter since July of 2007.
It's PyCon. I'm supposed to be listening to a talk, but I've fallen down the rabbit hole of a future without a global interpreter lock. I'm locked in on getting a patched version of the interpreter up and running on Mac OS X, and the patch author, Adam Olsen, is coaching me through changes to some of the deepest internals of Python itself. For about a year, Adam has been working on the "safe threading" project for Python 3000. In this project, he has attempted to address many of the common issues programmers face in highly threaded and highly concurrent applications. These problems include deadlocks, corruption of shared objects (and the locking headaches that come with protecting them), and finally, as a side effect of making threading safer, the removal of the Global Interpreter Lock.
Adam would be the first to point out that adding ''--without-gil'' to the Makefile for the C version of the interpreter was actually a side-effect of the bulk of his work. At 938 kilobytes, I would say his diff against the CPython code base that produces an interpreter with a safe, clear, and concise threading model for local concurrency is a bit more than a side effect.
It is clear that he lives for a concurrent and threaded world, and Adam has filled in a lot of gaps in my knowledge about concurrency in our past conversations. I've been lucky enough to interview him about the safe threading project and his outlook on all of this as well.
First off, what's your background?
I'm an amateur programmer, self taught. I've had a long interest in object models and concurrency, such as how widgets in GTK interact.
I've explored Twisted a bit, as well as Python's existing threading. Additionally, I've experimented a great deal with different ways to utilize generators or threads: actors, futures, cooperative versus preemptive scheduling, and so on.
Can you explain the basic premise behind the Safe Threading part of the project?
Make the common uses easy. Don't necessarily make it impossible to get wrong (everything is a tradeoff!), but give the programmer a fighting chance.
How about the "Free Threading" part (--without-gil)?
Everybody seems to know you don't need locking if you're not modifying an object, but Python demands a traceback, not a segfault if the programmer gets it wrong. Monitors and shareability provide a framework that satisfies both.
In essence, removing the GIL was a bonus to avoiding unintended conflicts between threads.
Many people dismiss the "threaded programming" paradigm as impossible to get right. Even Brian Goetz has stated that it is extremely hard to "get right". If that is the case, why try to "fix" threading in Python?
What most people see as a problem with threads, I see as a problem with the memory model. They let threads modify the same object simultaneously, resulting in arbitrary, ill-defined results. The complexity here explodes, quickly turning the programmer's brain into mush.
The solution is to isolate most objects. Keep all these mutable objects back in the sequential, single-threaded world. Multiple processes let you do this. Actors do it too. So do monitors.
editor: Actors can be thought of as Objects (for Object Oriented programmers) except that in the Actor model, all Actors execute simultaneously and can create more Actors, maintain their own state, and communicate via asynchronous message passing. A Monitor can be thought of as an object that encapsulates another object intended for use by multiple threads. The Monitor controls all locking semantics for the encapsulated objects.
So from that view, processes, actors, and monitors are all equivalent. The only reason I use monitors and build them on OS threads is that it fits better with the existing Python language and is much more efficient for the way Python uses them. I could take a page from Erlang and call them "processes", but I think in the long run that would be more confusing, not less.
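For readers without the patch, the locking discipline a monitor imposes can be sketched in ordinary Python with the standard threading module. The Monitor and monitormethod names below mimic Adam's API, but this is only an illustration of the idea, not his implementation:

```python
import threading

class Monitor:
    """Sketch of a monitor: one lock guards all of the object's state."""
    def __init__(self):
        self._monitor_lock = threading.RLock()

def monitormethod(func):
    """Run a method only while holding the object's monitor lock."""
    def wrapper(self, *args, **kwargs):
        with self._monitor_lock:
            return func(self, *args, **kwargs)
    return wrapper

class Counter(Monitor):
    def __init__(self):
        super().__init__()
        self.count = 0

    @monitormethod
    def tick(self):
        self.count += 1

c = Counter()
threads = [threading.Thread(target=lambda: [c.tick() for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(c.count)  # 4000 - no lost updates, because every tick() is serialized
```

Because every public method runs under the same lock, callers never see the counter mid-update; that is the "encapsulation" the editor's note describes.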
In looking through the diff for safethread, you've had to touch a lot: everything from object allocation to the entire way threads are managed. What was the hairiest series of changes you've had to make?
Hard to say - I've been at this nearly a year, with lots still to do. I could mention that when I found that atomic refcounting didn't scale, it spelled doom for the removal of the GIL until I came up with a viable asynchronous scheme. Straightening out object allocation was also pretty nasty, as stock CPython uses a twisted maze of macros and wrapper functions for that.
The worst was probably deciding how to handle class and module dictionaries. What you've got here are mutable objects, inherently used simultaneously by multiple threads, no clear cutoff point after which they're no longer modified, and a massive amount of implicit accesses ingrained as a fundamental part of the language. I really wanted to impose some order on this, add some clear boundaries to when they're modified versus accessed, and make them live up to the "explicit is better than implicit" ideal.
I couldn't do it though. Implicit access is too ingrained into the language. Eventually I conceded defeat, then embraced it, codifying dict's existing API as shareddict's actual API. In doing this I also switched to a relatively simple read/write lock as shareddict's protection (relative to what came before!). In the end, the only restriction was that the contents of shareddict must themselves be shareable.
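Python's standard library has no read/write lock, but the idea behind shareddict's protection can be sketched as follows. This is a deliberately minimal version (readers can starve a writer) and is hypothetical, not the shareddict code:

```python
import threading

class RWLock:
    """Minimal readers-writer lock: many concurrent readers, or one writer."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_read(self):
        with self._cond:
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        # Hold the underlying lock for the whole write, and wait until
        # all readers have drained before proceeding.
        self._cond.acquire()
        while self._readers:
            self._cond.wait()

    def release_write(self):
        self._cond.release()

shared = {}
lock = RWLock()

lock.acquire_write()
shared["key"] = "value"
lock.release_write()

lock.acquire_read()
value = shared["key"]
lock.release_read()
```

The payoff is that reads, which dominate for class and module dictionaries, never exclude each other; only a writer blocks anyone.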
Isn't a monitor equivalent to adding an @synchronized decorator or a ''with lock:'' statement around your code? Is using a monitor for mutable objects that much faster than lock.acquire and lock.release?
editor: See Listing1.py for a monitor example. You will need to be running Python 3000 with Adam's patch for the code to work.
# See the requirements: you must apply Adam's patch to Python 3000
# for this.
from __future__ import shared_module
from threadtools import Monitor, monitormethod, branch

class Counter(Monitor):
    """A simple counter, shared between threads"""
    __shared__ = True  # More shared_module boilerplate

    def __init__(self):
        self.count = 0

    @monitormethod
    def tick(self):
        self.count += 1

    @monitormethod
    def value(self):
        return self.count

def work(c):
    for i in range(20):
        c.tick()

def main():
    c = Counter()
    with branch() as children:
        for i in range(10):
            children.add(work, c)
    print("Number of ticks:", c.value())
Superficially, yes, but it has deeper semantics as well. The biggest is that it imposes a shareability requirement to all objects passed in or out, and there's no way to bypass it. This basically forces you to be explicit about how you expect threads to modify mutable objects.
It also lets me use somewhat saner recovery from deadlocks than would be possible using ''with lock:''. Not that much saner though, and there are certain circumstances where there is no ideal way to recover.
Performance-wise, lock.acquire/lock.release are irrelevant. The real competition is with adding a lock to every object, such as a list. What seems like a simple ''if x: x.append(42)'' actually requires 2 acquire/release pairs, and something like ''x.extend(y)'' would require a pair for every item in y. This could easily add up to thousands of lock operations where a monitor lets you get away with just one.
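To see where those pairs come from, here is a hypothetical per-object-locked list that counts its own lock operations (an illustration of the overhead Adam describes, not anything in the patch):

```python
import threading

class LockedList:
    """Hypothetical list with one lock per object, counting the
    acquire/release pairs that fine-grained locking would perform."""
    def __init__(self):
        self._lock = threading.Lock()
        self._items = []
        self.pairs = 0  # acquire/release pairs performed so far

    def _locked(self):
        self.pairs += 1
        return self._lock

    def __bool__(self):
        with self._locked():
            return bool(self._items)

    def append(self, item):
        with self._locked():
            self._items.append(item)

x = LockedList()
x.append(1)        # 1 pair
if x:              # 1 pair for the truth test...
    x.append(42)   # ...and 1 more for the append
print(x.pairs)     # 3

for item in range(1000):   # an item-by-item extend: one pair per element
    x.append(item)
print(x.pairs)     # 1003
```

A monitor holding the whole object would perform one acquire/release for the entire extend, regardless of length.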
How did you handle Garbage Collection?
Painfully. I originally attempted to use simple atomic integer operations for the refcounting, but I found they didn't work. Well, they were correct, but they didn't give me the benefit of removing the GIL. Multiple CPUs/cores would fight over the cache line containing the refcount, slowing everything to a crawl.
I solved that by adding a second mode for refcounting. An object starts out like normal, but once a second thread accesses the refcount it switches to an asynchronous mode. In this mode each thread buffers up their own refcount changes, writing out all the changes to a single refcount at once. Even better, if the net change is 0 it can avoid writing anything at all!
The catch is, you can no longer delete the object when the refcount hits 0, as another thread might have outstanding changes. Instead, I modified the tracing GC to keep a list of all objects and had it occasionally flush the buffers and check for objects with a refcount of 0.
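A toy model of that buffering scheme, with hypothetical names, might look like this. It captures the two key behaviors Adam mentions: deltas are applied in one step per flush, and a net change of zero writes nothing at all:

```python
from collections import Counter, defaultdict

class BufferedRefcounts:
    """Toy model of asynchronous refcounting: each thread buffers its own
    incref/decref deltas and writes the net change out in one step."""
    def __init__(self):
        self.counts = {}                      # object -> canonical refcount
        self.buffers = defaultdict(Counter)   # thread id -> {object: delta}

    def incref(self, tid, obj):
        self.buffers[tid][obj] += 1

    def decref(self, tid, obj):
        self.buffers[tid][obj] -= 1

    def flush(self, tid):
        """Apply a thread's buffered deltas to the canonical counts."""
        for obj, delta in self.buffers.pop(tid, {}).items():
            if delta:  # a net change of 0 never touches the shared count
                self.counts[obj] = self.counts.get(obj, 0) + delta

rc = BufferedRefcounts()
rc.counts["obj"] = 1
rc.incref(1, "obj")
rc.decref(1, "obj")
rc.flush(1)              # net 0: the shared cache line is never written
print(rc.counts["obj"])  # 1
```

The model also shows the catch: the canonical count can read 0 while some thread still holds buffered increfs, which is why deletion has to wait for the collector's flush-and-check pass rather than firing the moment a count hits 0.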
Your branching-as-children method of spawning threads (see Listing 2), implemented in your patch, deviates from the current threading module's approach. Why not just overlay your work on the existing API?
with branch() as children:
    for i in range(10):
        children.add(work, arg1, arg2)
Branch basically wraps up best practices into a single construct. It propagates exceptions, handles cancellation, lets you pass out return values, and ensures you don't accidentally leave a child thread running after you've returned.
You can still leave threads running after a function returns, you just need to use a branch that's higher up in the call stack. Later on, I might add a built-in one just above the main module just for this purpose.
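Those best practices can be approximated in today's Python with a context manager over ordinary threads. This is only a sketch of branch()'s join-and-propagate semantics, not the patched construct:

```python
import threading

class branch:
    """Sketch of safethread's branch(): join every child on exit and
    propagate the first child exception (not the real implementation)."""
    def __enter__(self):
        self._threads = []
        self._errors = []
        return self

    def add(self, func, *args):
        def runner():
            try:
                func(*args)
            except BaseException as exc:
                self._errors.append(exc)
        thread = threading.Thread(target=runner)
        self._threads.append(thread)
        thread.start()

    def __exit__(self, exc_type, exc, tb):
        for thread in self._threads:
            thread.join()          # no child outlives the with-block
        if exc_type is None and self._errors:
            raise self._errors[0]  # surface the first child exception

results = []
with branch() as children:
    for i in range(5):
        children.add(results.append, i)
print(sorted(results))  # [0, 1, 2, 3, 4]
```

The real construct also handles cancellation and return values, but even this sketch guarantees the two properties Adam emphasizes: no forgotten children, and no silently swallowed exceptions.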
What about stopping/pausing child threads?
Pausing isn't possible, but cancellation serves the purpose of stopping. Essentially it sets a flag on that thread to tell it to stop, as well as making sure participating I/O functions will check that flag and end themselves promptly.
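Cooperative cancellation of this sort can be modeled with a threading.Event standing in for the flag (a sketch of the concept, not safethread's mechanism):

```python
import threading
import time

def worker(cancelled, results):
    """Checks the cancellation flag between units of work, the way
    participating I/O functions would check it under safethread."""
    for i in range(1_000_000):
        if cancelled.is_set():
            return           # end promptly once asked to stop
        results.append(i)

cancelled = threading.Event()
results = []
thread = threading.Thread(target=worker, args=(cancelled, results))
thread.start()
time.sleep(0.01)
cancelled.set()              # set the flag: "please stop"
thread.join()
print(thread.is_alive())     # False - the worker noticed and returned
```

The worker decides when to check, which is exactly why pausing can't be forced on it: cancellation is a request, honored at well-defined points.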
How did you handle thread-safe imports?
Most of this isn't implemented yet, but the basic idea is that each module will be either shareable or unshareable. Unshareable modules work normally if imported from the main thread, but if another thread tries to import one they won't get past the parsing phase - just enough to try to detect ''from __future__ import shared_module''.
Modules found to be shareable are placed in their own MonitorSpace (the underlying tool used by a Monitor) before the Python code in them is executed. This separates them from the main thread, so I won't need the main thread's cooperation to load them.
In your implementation, you use libatomic-ops, essentially adding a new Python build/library dependency - what does this buy you over using standard locking primitives?
Scalability. I can use an atomic read and, so long as the memory doesn't get modified, all the CPUs/cores will pull it into their own cache. If I used a lock it would inherently involve a write as well, meaning only one CPU/core would have it cached at a time.
For some applications it also happens to be a great deal lighter than a lock. It may be both easier to use and faster.
You state on your page about the Deadlock Fallacy that "Ultimately, good style and a robust language will produce correct programs, not a language that tries to make it impossible to go wrong." What language tries to make it impossible to go wrong? And why (again) not just ditch threading and move to, say, Erlang?
Concurrent Pascal would be the great old example - they introduce monitors there, but apply a great deal more restrictions as well. Ultimately though, the language is focused on hard real-time applications, and it shows. Python and safethread are focused on general purpose applications, so usability is more important.
Erlang's a pretty similar situation. It was designed for real time, distributed, fault-tolerant applications. It wants you to use one-way messages (not the two-way function call). It copies everything passed in those messages. Good tradeoffs for its focus, but bad for a general purpose language.
What sorts of CPython bugs have you found delving this deep into the codebase?
Just a few scattered little ones. My favorite was a refcounting bug involving dicts, but it could only occur via recursive modification or threading - obviously, with shareddict, I make the latter a little more likely (only recursive modification is possible with the normal dict).
Although, that's not including the threading/interpreter state APIs. Most of that code was pretty messy; lots of bugs lurking around. It was quite satisfying to rip it out.
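The specific refcounting bug isn't shown here, but CPython's defensiveness about dicts is easy to demonstrate in any interpreter: mutating a dict while iterating over it raises a RuntimeError instead of corrupting the structure - the traceback-not-segfault philosophy mentioned earlier. (This guard is a simpler cousin of the recursive-modification case, not the bug Adam found.)

```python
d = {"a": 1, "b": 2}
try:
    for key in d:
        d[key + "_copy"] = 0   # mutate the dict mid-iteration
except RuntimeError as exc:
    print(exc)  # dictionary changed size during iteration
```

shareddict has to uphold the same guarantee while multiple threads are doing the mutating, which is what makes its locking design hard.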
How have your changes altered the API that C extension writers use? Given that the "bonus" of the GIL is a simple interface for extension writers via the Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS macros - does safethread introduce more complexity?
Most of my changes are cleanup and simplification. The tp_alloc/tp_free slots are gone - everything uses PyObject_New and PyObject_Del. PyObject_GC_New/Del are gone too. The old GIL macros are directly replaced by PyState_Suspend()/PyState_Resume().
However, there are new options to take advantage of. Extensions doing I/O or other blocking operations should use the cancellation API. Modules wishing to be shareable should be audited, then apply Py_TPFLAGS_SHAREABLE/METH_SHARED, as appropriate. However, if they do that they also need to call PyArg_RequireShareable/PyArg_RequireShareableReturn if there's the potential to share objects between threads (MonitorSpaces technically).
You don't support the old threading API right now - but would it be possible to add in backwards-compatible support, or is it simply unfeasible?
For the C API, it'd be easy to retain the old GIL macros. Other parts may not be so easy.
For the Python API, some are easy, some aren't possible, and some are just painful.
Adding equivalents to Lock, Semaphore, and Queue is easy - easier than the originals, in fact. Getting all the minor details right (such as what happens if you subclass them) might be harder or impossible. Lock would not support deadlock detection, but it would be cancelable.
Daemon threads will likely not be supported, but, in my opinion, they're broken by design anyway.
The painful part is resurrecting the GIL, so these "classic" threads can share arbitrary objects like they always did. However, I won't make it so global - they'll acquire/release the main MonitorSpace instead, so all the new-style threads (created using branch()) will not be slowed down.
Finally, you've pointed out that "real threading" does not equal distributed programming, only local concurrency (i.e: support for multiple cores). What do you think Python could do to support distributed computing (providing the GIL-less world comes to fruition)?
At this point, it's confusing. Much of the focus is to work around the GIL, to take advantage of multiple cores. With safethread integrated into Python, I think many of the distributed/multiprocess projects would die off. What's left would be the ones that *really* want to be distributed and need multiple boxes, not multiple cores.
In my mind, there are three main characteristics of distributed programming, although a given framework may only address one or two:
- security - you don't trust the other nodes, and they don't trust you. This often takes the form of sandboxes on a local box or a capability system.
- fault-tolerance - a hardware failure on one box should only bring down that box, not every other box connected to it. Upgrading the software of one box at a time should also be possible.
- latency - asking another node (even on a LAN) can easily be several orders of magnitude slower than reading from your own RAM or, better still, your cache.
All these lead to different tradeoffs. You really need to minimize the communication between nodes by pushing them apart, whereas safethread is only concerned about making it easier to write correct programs.
The bottom line is that safethread lets you do the easy stuff (local concurrency) so that you only need to do the hard stuff (distributed programming) when you really need it.
Everyone is welcome to download, try out, and contribute to Adam's patches. Bug reports, code, and emails are all welcome. There is active discussion on the Python 3000 mailing list about all of this, and more suggestions are welcome.
In all, there is a lot of interest in Adam's work. There was a great deal of discussion around concurrency, threads, and the GIL at PyCon this year, and with Python 3000 coming down the pipe and the "multicore future" looming, things are getting interesting.