Python Threads and the Global Interpreter Lock

February 1st, 2009 § 22 comments

There are a plethora of mech­a­nisms and tech­nolo­gies sur­round­ing con­cur­rent pro­gram­ming — Python has sup­port for many of them. In this arti­cle we will explain, exam­ine, and bench­mark Python’s thread­ing sup­port, and dis­cuss the much maligned Global Inter­preter Lock (GIL).

This is a reprint of a fea­tured arti­cle I wrote for Python Mag­a­zine that was pub­lished in the Decem­ber 2007 issue. This arti­cle assisted in inspir­ing me to write PEP 371.

Intro­duc­tion

For many years, and espe­cially within the last year, there has been an ongo­ing dis­cus­sion about the con­cepts of con­cur­rent and par­al­lel pro­gram­ming, and how Python as a lan­guage might fit into both of these. The dis­cus­sion ranges from com­plaints about the global inter­preter lock, to a dis­cus­sion on the valid­ity of threaded pro­gram­ming in general.

Blog posts, open let­ters, and arti­cles belea­guer the fact that Python and other lan­guages are not truly “con­cur­rent” and can not scale to tens of cores on the mod­ern proces­sor. They all dis­cuss the myr­iad selec­tion of pro­gram­ming par­a­digms, solu­tions (both old and new) and com­par­isons to other languages.

Of spe­cial note, given that this is a Python mag­a­zine, is the dis­cus­sion around the valid­ity of Python within a highly par­al­lel and con­cur­rent envi­ron­ment, due to the cur­rent struc­ture of the CPython inter­preter, the global inter­preter lock, and a lack of “Erlang-like” concurrency.

What is con­cur­rency? That’s an easy thing to answer. Con­cur­rency, when applied to application/program logic, is the simul­ta­ne­ous exe­cu­tion of tasks. For the most part, these tasks inter­act and pass infor­ma­tion to and from each other and the par­ent. Gen­er­ally speak­ing, any­thing worth doing is, sooner or later, worth doing concurrently.

While con­cur­rency defines the prob­lem (or rather, the the­ory behind a solu­tion), par­al­lelism defines the imple­men­ta­tion. Par­al­lelism, when applied to appli­ca­tions, gen­er­ally refers to the actual imple­men­ta­tion of con­cur­rent pro­gram­ming. Specif­i­cally, it refers to the simul­ta­ne­ous exe­cu­tion of tasks in such a way as to take advan­tage of mul­ti­ple proces­sors, mul­ti­ple proces­sor cores, or even mul­ti­ple machines within a com­put­ing grid.

There are, as with all things in com­put­ing, a myr­iad selec­tion of “solu­tions” or par­a­digms to address the chal­lenges of both con­cur­rent and par­al­lel com­put­ing. These include options rang­ing from the focus of this arti­cle (stan­dard thread­ing or pthreads) to micro-threads, asyn­chro­nous pro­gram­ming (a la Twisted), process fork­ing, et cetera. Some lan­guages, like Erlang, eschew many of these to build the con­cept of con­cur­rency and com­mu­ni­ca­tion deep into the lan­guage itself, mak­ing it a fun­da­men­tal truth and assumption.

Each one of these paradigms,technologies, or solu­tions has it’s own pros, cons, and intri­ca­cies that have to be under­stood before you as a pro­gram­mer can really choose one to use. Ques­tions have to be answered:

  • Do you need shared state?
  • Do you need to scale across machines/clusters?
  • Are you will­ing to use message-passing, IPC, or mem­ory mapping?

Most peo­ple won’t, or don’t have to, answer these ques­tions, as they will sim­ply approach the prob­lem with what is now the near ubiq­ui­tous solu­tion to the con­cur­rency “prob­lem”: Threads, and threaded programming.

A brief descrip­tion of Threads

A Thread is sim­ply an agent spawned by the appli­ca­tion to per­form work inde­pen­dent of the par­ent process. While the term Thread and thread­ing have referred to the con­cept of spawn­ing (fork­ing) mul­ti­ple processes, more fre­quently they refer specif­i­cally to a pthread, or a worker which is spawned by the par­ent process, and which shares that parent’s resources.

Both processes and threads are cre­ated within a given pro­gram­ming lan­guage and then sched­uled to run, either by the inter­preter itself (com­monly known as “green threads”), or by the oper­at­ing sys­tem (“native threads”). Threads which are sched­uled by the oper­at­ing sys­tem are gov­erned by the oper­at­ing system’s sched­uler, which dic­tates many things. Among them is the usage and allo­ca­tion of mul­ti­ple proces­sor resources for the exe­cu­tion of the child threads.

Now, what’s the dif­fer­ence between a thread and a process which you can cre­ate if both are sim­ply work­ers spawned by the par­ent and sched­uled by the oper­at­ing sys­tem or inter­preter? Threads fun­da­men­tally dif­fer from processes in that they are light weight and share mem­ory. The term “light weight” refers to the up-front cost of per­form­ing the oper­at­ing sys­tem level process cre­ation (and the require­ment of pass­ing a large amount of infor­ma­tion and state to the spawned process). Shar­ing mem­ory speaks to one of the ben­e­fits of using threads over processes. Namely, threads can share state, objects, and other infor­ma­tion with each other and the main thread of exe­cu­tion (this is nor­mally called shared con­text). Threads, as they live within the space of a sin­gle process, share all of the parent’s address space.

The very fact that threads share the same address space is one of their key and most lauded fea­tures. When a pro­gram spawns twenty threads, all twenty of those threads can have unfet­tered access to data within both them­selves, each other, and the par­ent. This “sim­pli­fies” data shar­ing; every­one has everything.

With shared mem­ory you side­step most, if not all, of the inter-child com­mu­ni­ca­tions ques­tions and issues. The catch? You sac­ri­fice your abil­ity to eas­ily jump from a sin­gle mem­ory space on one machine to many machines (with­out a lot of work, that is).

Threads, just like processes, are designed to be passed off to the oper­at­ing sys­tem sched­uler, which dic­tates which CPU a given process is run on, when it is run on that CPU which means that not only do you gain shared mem­ory with threads, you can lever­age all of the local machine’s resources (in the­ory) includ­ing mul­ti­ple cores and/or CPUs.

Threads are also, as men­tioned before, ubiq­ui­tous. Most mod­ern lan­guages, along with some of the old crusty ones like, say, C (I kid!) have sup­port for threads. Java has the con­cur­rency pack­age, Ruby has the thread struct, Perl, well perl has some thing(s). C, C++, etc., they all sup­port threads.

Another ben­e­fit of threaded pro­gram­ming, which is fre­quently under­stated, is orga­ni­za­tion and design. Take,for exam­ple, any sys­tem which fol­lows the clas­si­cal Producer/Consumer model, in which you have a group of pro­ducer threads which fill a shared queue with work items and the con­sumer threads work off of that shared queue, con­cur­rently or asyn­chro­nously. Threads are a nat­ural fit for this pat­tern because they lend them­selves to the clear and clean encap­su­la­tion of logic for all of the objects, while also lever­ag­ing the shared nature of threads them­selves via the work queue. This separation/encapsulation of logic (and duties) within threaded appli­ca­tions can fre­quently make them eas­ier to read, main­tain, and understand.

Great, right?

Well, not every­thing is happy-fun-time in Thread land. Many a great pro­gram­mer has been stymied by the one thing that brought peo­ple run­ning to Thread’s dinner-table: Shared Mem­ory. Take, for instance, the syn­chro­niza­tion error.

Syn­chro­niza­tion errors occur when you have two threads access­ing a com­mon muta­ble chunk of data, whether it’s a func­tion that in turn mod­i­fies or calls some­thing else, or some­thing like a shared dic­tio­nary where the two attempt to alter the same key at once (more on this later). Syn­chro­niza­tion errors are gen­er­ally hang­ing out behind the school smok­ing with… Race con­di­tions! Take the exam­ple in List­ing 1:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
#!/usr/bin/python
from threading import Thread
 
class myObject(object):
    def __init__(self):
        self._val = 1
    def get(self):
        return self._val
    def increment(self):
        self._val += 1
 
def t1(ob):
    ob.increment()
    print 't1:', ob.get() == 2
 
def t2(ob):
    ob.increment()
    print 't2:', ob.get() == 2
 
ob = myObject()
 
# Create two threads modifying the same ob instance
thread1 = Thread(target=t1, args=(ob,))
thread2 = Thread(target=t2, args=(ob,))
 
# Run the threads
thread1.start()
thread2.start()
thread1.join()
thread2.join()

The banal­ity of the code notwith­stand­ing, one of the two tests will always fail. This is because both threads are incre­ment­ing an unlocked, global value. This is a sim­ple exam­ple of a syn­chro­niza­tion error, one of the threads beats the other to access a shared resource, and the asser­tion fails. Errors such as this are com­mon. Alarm­ingly so, given that when peo­ple begin explor­ing threaded pro­gram­ming, their first instinct is to share every­thing (whereas many con­cur­rency solu­tions are referred to as “shared-nothing”).

This banal exam­ple shows a fun­da­men­tal truth when applied to threaded pro­gram­ming in gen­eral: with great shared mem­ory comes great respon­si­bil­ity. It’s so easy to have threads throw­ing around objects and data that pro­gram­mers fre­quently over share. Threaded pro­gram­ming requires that the pro­gram­mer be very, very mind­ful of what data needs to be shared amongst work­ers, how to pro­tect (i.e. lock) that data so that you do not run into the dreaded race-condition (as the code above would) or the even more dread­ful Deadlock.

A quick note for those of you unfa­mil­iar with the con­cept: dead­locks occur when your appli­ca­tion seizes up while any num­ber of threads lock-up wait­ing for the other threads to free required resources, never to return. Dead­locks and syn­chro­niza­tion errors in threaded appli­ca­tions are not only painful to debug, they’re also insid­i­ous, hard to find, and easy to make.

The rise of threaded pro­gram­ming in mod­ern lan­guages really brings con­cur­rent pro­grams to every­one. They are the com­modi­ti­za­tion of con­cur­rency and par­al­lelism; every­one can (and sooner or later does) write a pro­gram with threads. The mod­ern thread is the de facto con­cur­rent pro­gram­ming solu­tion to many mod­ern pro­gram­mers. They have truly become ubiquitous.

Every­one who writes appli­ca­tions (or even a sim­ple script) will write an appli­ca­tion that does more than one thing at once, whether it’s cal­cu­lat­ing prime num­bers or down­load­ing pic­tures of cats on the inter­net. And sooner or later, they’re being yelled at by their spouse to come to bed because they’re up at 3 am and have been chas­ing syn­chro­niza­tion issues for the past twelve hours. 3 am is also when dead­locks jump out of the closet and lock up your appli­ca­tions, right after your 3 month old spits up on your keyboard.

Python Thread Support

To quote the Python thread mod­ule documentation:

“The design of this mod­ule is loosely based on Java’s thread­ing model. How­ever, where Java makes locks and con­di­tion vari­ables basic behav­ior of every object, they are sep­a­rate objects in Python. Python’s Thread class sup­ports a sub­set of the behav­ior of Java’s Thread class. Cur­rently, there are no pri­or­i­ties, no thread groups, and threads can­not be destroyed, stopped, sus­pended, resumed, or inter­rupted. The sta­tic meth­ods of Java’s Thread, when imple­mented, are mapped to module-level functions.”

Python’s thread sup­port, obvi­ously, is loosely based off of Java’s (some­thing many a Java-to-Python con­vert has been happy for, and stymied by). Python’s thread imple­men­ta­tion is very sim­ple at it’s core, and builds eas­ily on the con­cept of indi­vid­u­al­ized work­ers and shared state. It pro­vides all of the threaded pro­gram­ming prim­i­tives: locks, sem­a­phores, and all the nor­mal good­ies that come with a good threaded implementation.

Let’s look at the sim­plest thread example:

1
2
3
4
5
6
7
8
from threading import Thread
 
def myfunc():
    print "hello, world!"
 
thread1 = Thread(target=myfunc)
thread1.start()
thread1.join()

There, you’ve got a multi-threaded appli­ca­tion in seven lines of code. This lit­tle script has two threads: the main thread, and the ”thread1” object we cre­ated. Obvi­ously, here, I’m not shar­ing any­thing. Heck, I’m hardly doing any­thing at all! What we are doing is sim­ple: I am cre­at­ing a new Thread object and pass­ing it a func­tion to run. We are then call­ing ”start()”, which sends the thread off to be exe­cuted. The ”join()” method blocks the main appli­ca­tion exe­cu­tion until the thread that we’ve called ”join()” on exits, this pre­vents a generic “poll the thread until it is done” loop one might oth­er­wise construct.

Python’s library has “two” threading-related mod­ules: thread and thread­ing. Two is in quo­ta­tion marks because really it’s only got one that most peo­ple would care about, and that’s thread­ing. The thread­ing mod­ule builds on the prim­i­tives from the thread mod­ule that we men­tioned ear­lier. Thread­ing is built using thread, and there are few, if any, rea­sons for most devel­op­ers to use the thread mod­ule directly.

Let’s take a moment to revisit the state­ment that Python’s thread sup­port is a Java-Like imple­men­ta­tion. Let’s set a basic Python exam­ple next to a sim­ple Java exam­ple. First Java,

MyThread.java:

1
2
3
4
5
6
package com.demo.threads;
public class MyThread implements Runnable {
    public void run() {
        System.out.println("I am a thread!");
    }
}

MyThreadDemo.java:

1
2
3
4
5
6
7
8
package com.demo.threads;
public class MyThreadDemo {
    public static void main( String[] args ) {
        MyThread foobar = new MyThread();
        Thread1 = new Thread(foobar);
        Thread1.start();
    }
}

Now, let’s do the same thing, with Python:

1
2
3
4
5
6
7
8
9
from threading import Thread
 
class MyThread(Thread):
    def run(self):
        print "I am a thread!"
 
foobar = MyThread()
foobar.start()
foobar.join()

As you can see, the two are not that far off (albeit the Python exam­ple is shorter). Both MyThread classes extend a super­class Runnable or Thread. These super­classes pro­vide the basic inter­faces which dic­tate that the object is a valid “thread object”. In both cases, we can skip over defin­ing the con­struc­tor for the object, as it’s already been han­dled by the super­class (but we could make a new one if we wanted to).

Both MyThread objects have a com­mon method, though: ”run()”. When mak­ing an object a thread this is the only method which really needs to be defined. Once a valid ”run()” method has been defined, the object is now thread­able. A thread object may have any num­ber of meth­ods, call­backs, dec­o­ra­tors, etc. We have, in wan­der­ing around the Python Cheese­shop, seen some mod­ules sim­ply extend thread so that an object can sup­port threaded imple­men­ta­tions out of the box. The imple­men­ta­tion that’s pro­vided may not be threaded itself.

Now, as men­tioned before, syn­chro­niza­tion errors plague threaded pro­grams. Unless access to a shared resource has proper con­trols (locks) applied to it, mul­ti­ple threads access­ing the same resource will con­tend for that resource. Con­tention leads to dead­locks, dead­locks lead to suf­fer­ing, and all that jazz.

Let’s take a look at another sim­ple exam­ple com­monly used to explain thread­ing — the Bank exam­ple, in List­ing 2. This is a sim­ple thing really; the Bank is a shared object which is manip­u­lated simul­ta­ne­ously by the threads we spawn. In this exam­ple, we spawn a Bank with 100 accounts, and a 1000 “dol­lar” bal­ance. This means that the Bank should only ever con­tain $100,000. The bank accounts should never lose money, nor should they ever gain money above what was in the accounts to begin with.

List­ing 2:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
from threading import Thread
from operator import add
import random
 
class Bank(object):
    def __init__(self, naccounts, ibalance):
        self._naccounts = naccounts
        self._ibalance = ibalance
 
        self.accounts = []
        for n in range(self._naccounts):
            self.accounts.append(self._ibalance)
 
    def size(self):
        return len(self.accounts)
 
    def getTotalBalance(self):
        return reduce(add, self.accounts)
 
    def transfer(self, name, afrom, ato, amount):
 
        if self.accounts[afrom] < amount: return
 
        self.accounts[afrom] -= amount
        self.accounts[ato] += amount
 
        print "%-9s %8.2f from %2d to %2d Balance: %10.2f" % \
            (name, amount, afrom, ato, self.getTotalBalance())
 
 
class transfer(Thread):
    def __init__(self, bank, afrom, maxamt):
        Thread.__init__(self)
        self._bank = bank
        self._afrom = afrom
        self._maxamt = maxamt
    def run(self):
        for i in range(0, 3000):
            ato = random.choice(range(b.size()))
            amount = round((self._maxamt * random.random()), 2)
            self._bank.transfer(self.getName(), self._afrom, ato, amount)
 
naccounts = 100
initial_balance = 1000
 
b = Bank(naccounts, initial_balance)
 
threads = []
for i in range(0, naccounts):
    threads.append(transfer(b, i, 100))
    threads[i].start()
 
for i in range(0, naccounts):
    threads[i].join()

In the nor­mal exam­ple, the ”for i in range(0, 3000):” would be a ”while True:”, but we don’t need for­ever, baby. Go ahead, run it. You’ll notice some­thing within a few thou­sand trans­ac­tions: the Bank bal­ance goes awry. It may not always hap­pen, though. I’ve run this suc­cess­fully four times in a row now with zero cor­rup­tion. How­ever, the fifth time is the charm:

...
Thread-88     1.26 from 87 to 41 Balance:  100000.00
Thread-88     0.65 from 87 to 23 Balance:  100000.00
Thread-88     0.47 from 87 to  8 Balance:  100000.00
Thread-88     0.04 from 87 to 28 Balance:  100000.00
Thread-83    30.14 from 82 to  9 Balance:   99948.90
Thread-83    35.70 from 82 to 67 Balance:   99948.90
Thread-80     0.11 from 79 to 98 Balance:   99948.90
Thread-83    30.82 from 82 to 89 Balance:   99948.90
Thread-83    40.51 from 82 to 95 Balance:   99948.90
Thread-83     5.20 from 82 to 52 Balance:   99948.90
...

That’s the beauty of syn­chro­niza­tion errors and race con­di­tions. They won’t always hap­pen. They may not even hap­pen fifty per­cent of the time. But they will hap­pen. They good news? the fix is a few lines:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
from threading import Thread, Lock
...
lock = Lock()
class bank(object):
    def __init__(self, naccounts, ibalance):
...
    def transfer(self, name, afrom, ato, amount):
        if self.accounts[afrom] < amount: return
        lock.acquire()
        try:
            self.accounts[afrom] -= amount
            self.accounts[ato] += amount
        finally:
            lock.release()

There. Prob­lem fixed! See, Python, much like Java, allows the bro­ken ver­sion of this exam­ple to have one thread per­form­ing the account decre­ment while another per­forms the incre­ment. Both threads are in the same code block of the shared object at once, and while one has already decre­mented an account another thread may be decre­ment­ing the same account again.

Wait what? What did we do? I’ve added a Lock, one of those nice fun­da­men­tals that thread­ing sup­port needs. “Wrap­ping” a resource (whether it is a vari­able, method or some other object) in a lock as I’ve done above makes exe­cu­tion of that resource (in this case the accounts being mod­i­fied) thread safe, which is to say mul­ti­ple threads can be run­ning, but only one can be mod­i­fy­ing that par­tic­u­lar resource at a time. In order to run the code, a thread has to first get the lock.

Locks and their man­age­ment are the fix for the prob­lem thread­ing intro­duces when access­ing shared mem­ory and objects. But don’t think all’s well that ends well; lock man­age­ment can also drive you bonkers.

So, Python has all of the nice bits of thread­ing, right? Well, one thing a Java pro­gram­mer might be miss­ing is the magic of the ”syn­chro­nized” key­word. In Java, the ”syn­chro­nized” key­word, applied to a method or vari­able, turns access to that method, or mod­i­fi­ca­tion of the vari­able, into an instantly thread safe action.

For exam­ple, the Java ver­sion of the trans­fer code:

1
2
3
4
5
public synchronized void transfer(int afrom, 
                                  int ato, 
                                  double amount) {
	...
}

The ”syn­chro­nized” key­word basi­cally man­ages the locks and con­di­tions automag­i­cally. When the method is entered, the lock is acquired. When it is exited, the lock is released. The good news for Python is that with dec­o­ra­tors we can at least fudge out a syn­chro­nized key­word, so instead of pep­per­ing our code with lock calls, we can do this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
from threading import Lock
from __future__ import with_statement
def synchronized():
    the_lock = Lock()
    def fwrap(function):
        def newFunction(*args, **kw):
            with the_lock:
                return function(*args, **kw)
        return newFunction
    return fwrap
 
...
    @synchronized()
    def transfer(self, name, afrom, ato, amount):
        if self.accounts[afrom] < amount: return
...

Now, in this code exam­ple we used a bit of Python 2.5: the ”with” state­ment, which allows you to con­dense some amount of ”try”/”except”/”finally” code via con­text man­agers. Since the inner work­ings of ”with” are out of scope, take a look at PEP 343 (http://docs.python.org/whatsnew/pep-343.html).

For easy recipes with imple­men­ta­tions of syn­chro­nized dec­o­ra­tors see:

So, now that you’ve got all of the basic lock sup­port, and mag­i­cal dec­o­ra­tors, and you’ve been warned to care­fully think about what to share, all’s fun and games and you’re going to run off and promptly write an appli­ca­tion with 1000 threads and do cool stuff, right? Well, there’s some­thing else you need to know about: the Global Inter­preter Lock.

The Global Inter­preter Lock

Nev­er­the­less, you’re right the GIL is not as bad as you would ini­tially think: you just have to undo the brain­wash­ing you got from Win­dows and Java pro­po­nents who seem to con­sider threads as the only way to approach con­cur­rent activ­i­ties.” Guido Van Rossum http://mail.python.org/pipermail/python-3000/2007-May/007414.html

This is the part of the arti­cle where we take some air out of your threaded sails. It’s unfor­tu­nate this respon­si­bil­ity must be mine, but let us not dwell on it. First let me describe what the GIL is.

The GIL is an interpreter-level lock. This lock pre­vents exe­cu­tion of mul­ti­ple threads at once in the Python inter­preter. Each thread that wants to run must wait for the GIL to be released by the other thread, which means your multi-threaded Python appli­ca­tion is essen­tially sin­gle threaded, right? Yes. Not exactly. Sort of.

CPython uses what’s called “oper­at­ing sys­tem” threads under the cov­ers, which is to say each time a request to make a new thread is made, the inter­preter actu­ally calls into the oper­at­ing system’s libraries and ker­nel to gen­er­ate a new thread. This is the same as Java, for exam­ple. So in mem­ory you really do have mul­ti­ple threads and nor­mally the oper­at­ing sys­tem con­trols which thread is sched­uled to run. On a mul­ti­ple proces­sor machine, this means you could have many threads spread across mul­ti­ple proces­sors, all hap­pily chug­ging away doing work.

How­ever, while CPython does use oper­at­ing sys­tem threads (in the­ory allow­ing mul­ti­ple threads to exe­cute within the inter­preter simul­ta­ne­ously), the inter­preter also forces the GIL to be acquired by a thread before it can access the inter­preter and stack and can mod­ify Python objects in mem­ory all willy-nilly. The lat­ter point is why the GIL exists: The GIL pre­vents simul­ta­ne­ous access to Python objects by mul­ti­ple threads. But this does not save you (as illus­trated by the Bank exam­ple) from being a lock-sensitive crea­ture; you don’t get a free ride. The GIL is there to pro­tect the inter­preters mem­ory, not your sanity.

The GIL also keeps garbage col­lec­tion (the rea­son you don’t have to worry about mem­ory man­age­ment, you bum) work­ing. It pre­vents one thread from decre­ment­ing the coun­ters for an object and let­ting the object go into the ether while another object is work­ing with that object. Python’s garbage col­lec­tion (deal­lo­cat­ing unused objects to free mem­ory) uti­lizes the con­cept of ref­er­ence count­ing. This is where all ref­er­ences to a given object (inte­ger, string or ”YourCat(object)”) are tracked. When the num­ber of ref­er­ences reaches zero, the object is deleted. The GIL pre­vents any two threads from decre­ment­ing the ref­er­ence count to any object to 0 while another thread is work­ing on that object. Remem­ber, only one thread can access a Python object at a time.

In real­ity, the CPython GIL is designed with an eye towards a sim­pler inter­preter imple­men­ta­tion and sin­gle threaded exe­cu­tion speed. It makes the main­te­nance of the inter­preter (and by def­i­n­i­tion, the writ­ing of exten­sion mod­ules) eas­ier by remov­ing the need to worry about many mem­ory man­age­ment and con­cur­rency issues that oth­er­wise might be prob­lem­atic to main­tain­ers and exten­sion authors. It keeps the ref­er­ence inter­preter sim­ple. It is not, ulti­mately, an “end user” fea­ture unless you’re writ­ing C code for Python.

Python has had thread­ing sup­port, and the GIL, since as far back as ver­sion 1.5, so it’s not new. In 1999 Greg Stein cre­ated a patch set for the inter­preter that removed the GIL, but added gran­u­lar lock­ing around sen­si­tive inter­preter oper­a­tions. This patch set had the direct effect of speed­ing up threaded exe­cu­tion, but made sin­gle threaded exe­cu­tion two times slower.

So you may be won­der­ing, “if we have the GIL, and a thread must own it to exe­cute within the inter­preter, what decides if the GIL should be released?” Deli­cious byte code instruc­tions. When a Python appli­ca­tion is exe­cuted, it is com­piled down to byte code, the actual instruc­tions that the inter­preter uses for exe­cu­tion of the appli­ca­tion. Nor­mally, byte code files end with a name like “.pyc” or “.pyo”. A given line of a Python appli­ca­tion might be a sin­gle byte code, while oth­ers, such as an import state­ment, may ulti­mately expand into many byte code instruc­tions for the interpreter.

That all being said, the CPython inter­preter, when work­ing with pure Python code (more on this in a moment) will force the GIL to be released every hun­dred byte code instruc­tions. This means that if you have a com­plex line of code like a com­plex math func­tion that in real­ity acts as a sin­gle byte code the GIL will not be released for the period that that state­ment takes to run.

There is an excep­tion though: C mod­ules! C exten­sion mod­ules (and built in C mod­ules) can be built in such a way that they release the GIL vol­un­tar­ily and do their own magic. Take for instance the time mod­ule (”timemodule.c” in the Python source tree). The ”sleep()” func­tion looks some­thing like this:

1
2
3
4
5
6
 
    ...
    Py_BEGIN_ALLOW_THREADS
        sleep((int)secs);
    Py_END_ALLOW_THREADS
    ....

In a C exten­sion, the ”Py_BEGIN_ALLOW_THREADS” and ”Py_END_ALLOW_THREADS” macros sig­nal the inter­preter and basi­cally state “hey, I’m enter­ing some block­ing oper­a­tion, here’s the GIL back” and “hey, I’m return­ing, I need the GIL”. This means that any­thing in your appli­ca­tion that uses a block­ing I/O func­tion (network/socket manip­u­la­tion, file manip­u­la­tion) or a thread-safe C exten­sion (most of the built-in ones are) can “bypass” the GIL. This means you can get closer to hav­ing mul­ti­ple threads run­ning at concurrently.

Take for a moment, the ”timemodule.c” code we pasted above. This means that if you have a threaded appli­ca­tion, and want the GIL to be released reg­u­larly by your threads, you can call ”time.sleep(.0001)” or some other tiny amount, and the GIL will be mag­i­cally released, and your other thread(s) will run. Most appli­ca­tion devel­op­ers wouldn’t like this solu­tion, but it is a com­mon “work around” for the GIL limitation.

There are other macros and a lot more details about the C API and the GIL. The newer macros ”PyGILState_STATE_Ensure” and ”PyGILState_STATE_Release” do all of the low level state and GIL manip­u­la­tion for you. we rec­om­mend read­ing sec­tion 8.1 of the Python/C API Ref­er­ence Man­ual. http://docs.python.org/api/threads.html

From a pro­gram­ming stand­point, the GIL is equiv­a­lent to wrap­ping all of your code in a ”syn­chro­nize” key­word (with­out the mem­ory safety). No two threads can run at once, they can only seem to via GIL acquisition/releasing tricks.

There are other ways to accel­er­ate the GIL manip­u­la­tion or avoid it:

- call ”time.sleep()”
– set ”sys.setcheckinterval()”
– run Python in opti­mized mode
– dump process-intensive tasks into C-extensions
– use the sub­process mod­ule to exe­cute commands

The fact is, the GIL does pre­vent you as a pro­gram­mer from using mul­ti­ple CPUs simul­ta­ne­ously. Python as a lan­guage, how­ever, does not. If the CPython inter­preter had the GIL removed, the oper­at­ing system’s pthread sys­tem would be free to run every­thing in par­al­lel. The GIL does not pre­vent a process from run­ning on a dif­fer­ent proces­sor of a machine. It sim­ply only allows one thread to run at once within the interpreter.

The real ques­tion you have to ask your­self is: does the GIL actu­ally affect you and your appli­ca­tion? Is it really harm­ing you or is it sim­ply a con­ve­nient excuse for peo­ple to dis­miss Python? Let’s exam­ine code and numbers.

Bench­mark­ing Python Threads

Let’s get our hands dirty with more code. First, a word on these bench­marks: Bench­marks are handy excuses for peo­ple to get offended, or to tweak them to make an argu­ment. I do not intend either of these with these num­bers; run the code your­self and make your own assumptions.

All tests are being run on a Mac­Book Pro, 2.33 GHz Intel Core 2 Duo with 3GB of ram and a 7200 RPM hard drive run­ning Leop­ard and CPython 2.5.1 from http://www.python.org. For the process tests, we’re using the pro­cess­ing mod­ule, pre­vi­ously cov­ered in this mag­a­zine. http://pypi.python.org/pypi/processing/

The fol­low­ing tests will call a given func­tion one time, in one hun­dred loops. I then dis­play the best of 100 calls. I cycle between a non-threaded iter­a­tion based call, a threaded call, and finally a pro­cess­ing mod­ule (fork and exec) call. I iter­ate each test for an increas­ing num­ber of calls/threads. I will go through 1, 2, 4 and finally 8 calls/threads/processes.

Why are we adding process results into this? The answer is sim­ple: we know in advance the GIL is going to penal­ize our exe­cu­tion, and using it side­steps the GIL entirely, allow­ing us to see how fast exe­cu­tion could be.

For non-threaded exe­cu­tion, to keep things fair, we sim­ply call the defined func­tion sequen­tially the same num­ber of times as threads/processes we would oth­er­wise use. All exam­ples use new-style classes to fur­ther even the play­ing field, although we skip explicit ”__init__()” setup as it’s not needed.

To keep things sim­ple, I’ve del­e­gated all of the tim­ings in this to Python’s timeit mod­ule. The timeit mod­ule is designed to bench­mark pieces of Python code — gen­er­ally sin­gle state­ments. In our case how­ever, it pro­vides some nice facil­i­ties for loop­ing over a given func­tion a set num­ber of times and return­ing to us the fastest run time.

You can see the tim­ing exe­cu­tion script in List­ing 3. In one argu­ment, he script accepts a mod­ule to look into for ”function_to_run()”, so for each one we just save the exam­ple func­tion to a file. It then iter­ates over the tests and dis­plays the results. This setup allows us to swap out the test func­tion easily.

List­ing 3:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
#!/usr/bin/python
 
from threading import Thread
from processing import Process
 
class threads_object(Thread):
    def run(self):
        function_to_run()
 
class nothreads_object(object):
    def run(self):
        function_to_run()
 
class process_object(Process):
    def run(self):
        function_to_run()
 
def non_threaded(num_iter):
    funcs = []
    for i in range(int(num_iter)):
        funcs.append(nothreads_object())
    for i in funcs:
        i.run()
 
def threaded(num_threads):
    funcs = []
    for i in range(int(num_threads)):
        funcs.append(threads_object())
    for i in funcs:
        i.start()
    for i in funcs:
        i.join()
 
def processed(num_processes):
    funcs = []
    for i in range(int(num_processes)):
        funcs.append(process_object())
    for i in funcs:
        i.start()
    for i in funcs:
        i.join()
 
def show_results(func_name, results):
    print "%-23s %4.6f seconds" % (func_name, results)
 
if __name__ == "__main__":
    import sys
    from timeit import Timer
 
    repeat = 100
    number = 1
 
    num_threads = [ 1, 2, 4, 8 ]
 
    if len(sys.argv) < 2:
        print 'Usage: %s module_name' % sys.argv[0]
        print '  where module_name contains a function_to_run function'
        sys.exit(1)
    module_name = sys.argv[1]
    if module_name.endswith('.py'):
        module_name = module_name[:-3]
    print 'Importing %s' % module_name
    m = __import__(module_name)
    function_to_run = m.function_to_run
 
    print 'Starting tests'
    for i in num_threads:
        t = Timer("non_threaded(%s)" % i, "from __main__ import non_threaded")
        best_result = min(t.repeat(repeat=repeat, number=number))
        show_results("non_threaded (%s iters)" % i, best_result)
 
        t = Timer("threaded(%s)" % i, "from __main__ import threaded")
        best_result = min(t.repeat(repeat=repeat, number=number))
        show_results("threaded (%s threads)" % i, best_result)
 
        t = Timer("processed(%s)" % i, "from __main__ import processed")
        best_result = min(t.repeat(repeat=repeat, number=number))
        show_results("processes (%s procs)" % i, best_result)
        print "\n",
 
    print 'Iterations complete'

Test one estab­lishes some base­line num­bers by exe­cut­ing an empty func­tion. This will show us the over­head asso­ci­ated with each of the mech­a­nisms I am test­ing as applied to object cre­ation and execution.

1
2
def function_to_run():
    pass

Results of the code above:

non_threaded (1 iters)  0.000003 seconds
threaded (1 threads)    0.010256 seconds
processes (1 procs)     0.004803 seconds

non_threaded (2 iters)  0.000007 seconds
threaded (2 threads)    0.020478 seconds
processes (2 procs)     0.012630 seconds

non_threaded (4 iters)  0.000010 seconds
threaded (4 threads)    0.040831 seconds
processes (4 procs)     0.010525 seconds

non_threaded (8 iters)  0.000017 seconds
threaded (8 threads)    0.080949 seconds
processes (8 procs)     0.017513 seconds

So, we can see that we take a hit just to add the threads and process spawn­ing to the calls. This is expected; besides the fact that Python is opti­mized for sin­gle threaded per­for­mance, the sim­ple act of cre­at­ing the threads and sub­processes adds an up-front cost. Look at the first group of results — the threaded call has a much higher cost than the non-threaded ver­sion. Also of inter­est is the fact that the cost of adding threads/runs grows par­al­lel to the num­ber of threads added — 8 threads tak­ing 0.080949, 4 threads tak­ing 0.040831, and so on.

Keep in mind that the point of adding threads is not to speed up ini­tial­iza­tion, but rather to add con­cur­rency to the appli­ca­tion. In a less con­trived exam­ple, we might cre­ate a pool of threads once and reuse the work­ers, allow­ing us to split up a large data set and run the same func­tion over dif­fer­ent parts (a.k.a: the Producer/Consumer model). So while this is not the norm for con­cur­rent appli­ca­tions, these tests are designed to be simple.

One com­mon appli­ca­tion of threaded (and non-threaded) pro­gram­ming is to do num­ber crunch­ing. Let’s take a sim­ple, brute force approach to doing Fibonacci num­ber crunch­ing, not­ing of course that we’re not shar­ing state here. I am just try­ing to have two tasks gen­er­ate a set num­ber of Fibonacci numbers.

1
2
3
4
5
\
def function_to_run():
  a, b = 0, 1
  for i in range(100000):
    a, b = b, a + b

Results of the code above:

non_threaded (1 iters)  0.276594 seconds
threaded (1 threads)    0.280199 seconds
processes (1 procs)     0.290740 seconds

non_threaded (2 iters)  0.559094 seconds
threaded (2 threads)    0.564981 seconds
processes (2 procs)     0.299791 seconds

non_threaded (4 iters)  1.117339 seconds
threaded (4 threads)    1.133981 seconds
processes (4 procs)     0.580096 seconds

non_threaded (8 iters)  2.235245 seconds
threaded (8 threads)    2.275226 seconds
processes (8 procs)     1.159978 seconds

As you can see from this data, adding the addi­tional threads doesn’t buy us any­thing — you would expect that the threaded exam­ples would run in par­al­lel — but we can see that the threaded exam­ple runs at the same speed (slightly slower) than the single-threaded exam­ples. Adding threads in this case actively harms us. The func­tion exe­cuted is pure Python, and due to the sim­ple over­head of cre­at­ing the threads and the GIL, the threaded runs could never be faster than the sin­gle threaded or process-based exam­ples. Again, remem­ber that the GIL only allows one thread to be active in the inter­preter at any given time.

Now, let’s do an I/O-bound task, like read­ing 1000 1-kilobyte chunks off of ”/dev/urandom”!

1
2
3
4
def function_to_run():
    fh = open("/dev/urandom", "rb")
    for i in range(1000):
        fh.read(1024)

Results of the code above:

non_threaded (1 iters)  0.125532 seconds
threaded (1 threads)    0.125908 seconds
processes (1 procs)     0.140314 seconds

non_threaded (2 iters)  0.251784 seconds
threaded (2 threads)    0.250818 seconds
processes (2 procs)     0.261338 seconds

non_threaded (4 iters)  0.503835 seconds
threaded (4 threads)    0.501558 seconds
processes (4 procs)     0.511969 seconds

non_threaded (8 iters)  1.006956 seconds
threaded (8 threads)    1.003003 seconds
processes (8 procs)     1.009011 seconds

We’re start­ing to see threads pass by the sin­gle threaded approach with the file I/O task, but not by much. How­ever, it is at least neck-and-neck with the sin­gle threaded imple­men­ta­tion and faster than the pro­cess­ing exam­ple. The lat­ter is an inter­est­ing point, too. This means that if you can “dodge” the GIL you can poten­tially hit process speeds.

Keep in mind that you would not really use threads like the bench­mark script we’ve put together does. Gen­er­ally speak­ing, you’d be append­ing things to a queue, tak­ing them off, and doing other shared-state tasks. Hav­ing a bunch of threads off run­ning the same func­tion, while use­ful, is not a com­mon use-case for a con­cur­rent pro­gram, unless you’re split­ting up and pro­cess­ing large data sets.

For a quick final exam­ple let’s look at the results when using the socket mod­ule. This is the mod­ule that all net­work I/O goes through, it’s in C, and it’s thread safe. To exclude net­work latency issues we will con­nect to another computer’s Apache web server on the LAN (not opti­mized for load) and we’ll use urllib2 instead of the raw socket library — urllib2 uses the socket library under the cov­ers. Grab­bing URLs is a com­mon enough use case, rather than just con­nect­ing to a socket over and over. I will also lower the count of the requests since, gen­er­ally speak­ing, jack­ham­mer­ing your web server makes your web server the bot­tle­neck. Given that this one is not tuned, we will keep it sim­ple. All that is being pulled down from Apache is the default welcome-page.

1
2
3
4
5
def function_to_run():
    import urllib
    for i in range(10):
        f = urllib.urlopen("http://10.0.1.197")
        f.read()

Results of the code above:

non_threaded (1 iters)  0.123033 seconds
threaded (1 threads)    0.121244 seconds
processes (1 procs)     0.141433 seconds

non_threaded (2 iters)  0.250751 seconds
threaded (2 threads)    0.223357 seconds
processes (2 procs)     0.242443 seconds

non_threaded (4 iters)  0.486189 seconds
threaded (4 threads)    0.438107 seconds
processes (4 procs)     0.448466 seconds

non_threaded (8 iters)  0.986121 seconds
threaded (8 threads)    0.881546 seconds
processes (8 procs)     0.859714 seconds

Take a look at the last two sets of num­bers, minus the pro­cess­ing examples:

non_threaded (4 iters)  0.486189 seconds
threaded (4 threads)    0.438107 seconds

non_threaded (8 iters)  0.986121 seconds
threaded (8 threads)    0.881546 seconds

As you can see, doing I/O does, in fact, release the GIL. The threaded exam­ples are obvi­ously get­ting faster than the single-threaded exe­cu­tion. Given that most appli­ca­tions do per­form a cer­tain amount of I/O (most appli­ca­tions spend a large amount of their time in I/O) the GIL does not pre­vent users from cre­at­ing multi-threaded appli­ca­tions that can act in a con­cur­rent man­ner and add speed to their applications.

Does the GIL block those peo­ple who are work­ing in pure Python from truly tak­ing advan­tage of mul­ti­ple cores? Sim­ply put: Yes, it does. While threads them­selves are a lan­guage con­struct, the inter­preter is the gate­keeper to the map­ping between threads and the OS. This is why Jython and Iron­Python have no GIL. It was sim­ply not needed and left out of the imple­men­ta­tion of the interpreter.

Obvi­ously, based on the num­bers above, switch­ing to processes side-steps the entire GIL issue, allow­ing all of the chil­dren to run con­cur­rently. It’s some­thing to think about and it’s been pointed out quite a bit.

In Sum­mary

Threaded pro­gram­ming is the con­cur­rency solu­tion for the “com­mon” man, but the prob­lems one runs into when delv­ing into threaded pro­gram­ming are not for the faint of heart, and they are not easy to over­come. Thread­ing is the first solu­tion to the con­cur­rency prob­lem that many peo­ple run to when they first get the urge to begin doing tasks in parallel.

Python itself has good thread­ing sup­port, includ­ing all of the lock­ing prim­i­tives, queues, events and sem­a­phores. That’s every­thing Java and many other lan­guages have, includ­ing some higher-level “cool” thread things. Can CPython take advan­tage of mul­ti­ple threads for con­cur­rency? Yes, with caveats. The caveats applied ham­per a par­tic­u­lar seg­ment of appli­ca­tion devel­op­ers for sure, but for most of us work­ing in high I/O envi­ron­ments, CPython’s thread sys­tem with the GIL works out fine. Even in those envi­ron­ments though, threads may not be the fastest option.

The GIL is some­thing that’s been debated, and con­tin­ues to be debated amongst many. Some call it a fea­ture, some call it a mis­take. Some peo­ple will be quick to say that threaded pro­gram­ming is too dif­fi­cult for a large swath of the pop­u­la­tion, and that, in and of itself is a true statement.

An impor­tant point to remem­ber: The GIL is an inter­preter issue. This means that, again, other inter­preters, such as Jython and Iron­Python do not suf­fer the “penalty” of the GIL. In the same vein, there are a few peo­ple out there cur­rently work­ing with the Python code base to exper­i­ment with the removal of the GIL in the CPython interpreter.

Guido (the BDFL) has already expressed open­ness to accept­ing a patch set to the CPython tree that could option­ally enable or dis­able the GIL or, if some enter­pris­ing indi­vid­ual wanted to, to imple­ment the inter­preter in such a way as to remove the GIL entirely with­out sac­ri­fic­ing sin­gle threaded performance.

There con­tin­ues to be a large swath of peo­ple that state that threads are not a true solu­tion to the con­cur­rency prob­lem, despite many hun­dreds of thou­sands of threaded appli­ca­tions cur­rently in pro­duc­tion. These peo­ple yearn for some­thing cleaner, less prone to the dark and sor­did syn­chro­niza­tion prob­lems too much shared-state brings to the table.

Thread­ing is ubiq­ui­tous, but it is only a sin­gle solu­tion for con­cur­rency and we hope that this arti­cle helps you think about it’s pit­falls, poten­tial and its state in the Python language.

For more thoughts on the GIL, see Guido’s blog (read the com­ments): http://www.artima.com/weblogs/index.jsp?blogger=guido

For another excel­lent com­par­i­son of threads/processes/etc., see the recent eff­bot post­ing on the Tim Bray wide finder project: http://effbot.org/zone/wide-finder.htm

Note, for a producer/consumer thread­ing exam­ple, see the threading.py module’s _test() method.

  • Orestis Markou

    Hm, that syn­chro­nised imple­men­ta­tion seems wrong to me — the top level func­tion doesn’t receive any argu­ment. You would have to call it to use it, which makes sense, since you want to cre­ate the lock once per dec­la­ra­tion. Still a bit nasty unfortunately.

    @synchronised()
    def some­thing():
    pass

  • jnoller

    Eh, I reprinted it in all it’s orig­i­nal glory. Maybe I’ll start an adden­dum for it :)

  • Pingback: On Concurrency, Threads, and Python - Standard Deviations

  • Pingback: Mailund on the Internet » Blog Archive » Python threads

  • Orestis Markou

    BTW, great arti­cle! Sorry for only point­ing out the mistake!

  • Ser­gio

    Very nice arti­cle about thread­ing. The only thing I think is miss­ing here to check GIL per­for­mance impact, would be to have the same tests exe­cuted using Jython and IronPython.

  • Eric LEBIGOT (EOL)

    Love the style and the clar­ity of your paper! Thanks!

  • http://blog.thecapacity.org jay

    Thanks for the arti­cle, but down­load­ing the list­ings seems bro­ken.
    I did; wget http://jessenoller.com/wp-content/plugins/wp-co…

    and got;

    Fatal error: Call to unde­fined func­tion: add_action() in /home/jnoller/public_html/wp-content/plugins/wp-codebox/wp-codebox.php on line 47

  • Pingback: links for 2009-02-04 « My Weblog

  • http://www.example.com a pedant

    There are a plethora of mech­a­nisms and tech­nolo­gies…” Plethora does not mean “a lot of,” even though it’s fre­quently mis­used that way. See dictionary.com, but it is a neg­a­tive term refer­ring to a super­abun­dance of some­thing. It’s some­thing exces­sive. While your sen­tence could be stretched to that, you clearly were not com­ment­ing on there being too many tech­niques. A minor nit in an oth­er­wise very good arti­cle. Thanks!

  • jnoller

    there’s a bug in the plu­gin, since I don’t have time to mon­key with it right now, I switched the down­load to a view code link

  • jnoller

    Huh. Okie doke.

  • Pingback: Mailund on the Internet » Blog Archive » This week in the blogs

  • http://www.andreagrandi.it Andrea Grandi

    You say “The banal­ity of the code notwith­stand­ing, one of the two tests will always fail”. It’s not complete/exact: both can be FALSE. I run this code on my machine and I get a FALSE / FALSE every 20–30 times.

  • Sal

    As a fairly new Python user, I found your arti­cle to be very help­ful. It helped me under­stand the pros and cons of thread­ing in Python and put con­cur­rency into con­text. Thank you.

  • Chris

    Great arti­cle; thank you. I hope to read more on this issue so keep up the posts!

  • Masci

    Thanks for your arti­cle, I found it clear and fun and I think it could be a pre­cious resource for those who visit Thread Land for the first time ;-)
    Jesse, any objec­tion if I trans­late it in italian?

  • http://kedeligdata.blogspot.com/ Mads

    Excel­lent arti­cle! I was won­der­ing why my stack­less python code would not let me uti­lize my multi-core CPU fully, and your arti­cle explains it with great clar­ity. Thank you.

  • http://kedeligdata.blogspot.com/ Mads

    Excel­lent arti­cle! I was won­der­ing why my stack­less python code would not let me uti­lize my multi-core CPU fully, and your arti­cle explains it with great clar­ity. Thank you.

  • Pingback: I thread Python ed il Global Interpreter Lock « MasciBlog

  • Pingback: Can a Python runtime environment be multithreaded? - Quora

  • Pingback: Tier 1 Sell-Side Language War: Slang vs Python vs Scala « Tales from a Trading Desk

What's this?

You are currently reading Python Threads and the Global Interpreter Lock at jessenoller.com.

meta