YAML ain't Markup Language | Completely Different

When someone says "pick a markup language," most people would immediately respond with "XML!", but there's an alternative out there. YAML is human-readable, easy to use, and overall quite fantastic.

This is a reprint of an article I wrote for Python Magazine as a Completely Different column that was published in the December 2008 issue. I have republished this in its original form, bugs and all.

I hate markup languages. There, I said it. The first time I had the pleasure of "using" (being abused by) XML, I said to myself "there has got to be a better way of doing this." Well, after years of sticking with plain text ini files and custom syntaxes based off of using ''eval()'', I've come to not only use, but love, YAML.

YAML, or "YAML Ain't Markup Language", is "a human friendly data serialization standard for all programming languages." It has the advantage of leaning towards dynamic languages a la Python, Ruby, etc.

It is important to note that friendliness and readability are core to the design of YAML. The number of format characters is very low and, like Python, YAML uses whitespace to indicate the scoping of items. Tabs are not allowed, so there is no chance of confusion about indentation level. Additionally, the constructs within YAML such as mappings, sequences, and scalars all mesh nicely with existing Python data types like dictionaries, lists, strings, and integers. It's also fully Unicode-enabled, which should make a lot of people who normally worry about UTF-8 happy.

What really attracted me to YAML are some of the key things that drew me to Python: cleanliness and approachability. Too often, I've had to deal with monstrous XML files for data passing or -- worse yet -- configuration, and with ini-style configuration files that simply don't scale or communicate enough information. So far, I've used YAML in about six different projects with great success and found that it scales quite well while staying human-readable.

Syntax is Key

YAML, on its face, is amazingly simple. Take the code below, for example, and run it through the PyYAML load function (more on PyYAML in a moment):

 # YAML
name: Jesse

This YAML yields the following Python dictionary:

>>> import yaml
>>> yaml.load("""
...  # YAML
... name: Jesse
... """)
{'name': 'Jesse'}
>>>

This is a simple example. Line 1 of the YAML file, or document, is a simple comment. Note that there is a space character right before that # sign. The next line is a simple key-value pair which, after being parsed, gets returned to us in a Python dictionary. Simple as pie!

A simple name-value pair is easy to do. Here is a document with some additional structures and details to try:

 # YAML
object:
    attributes:
        - attr1
        - attr2
        - attr3
    methods: [ getter, setter ]

Here, we have defined a top-level entity named "object". This object has two block mappings related to it, ''attributes'' and ''methods''. The ''attributes'' mapping uses the more verbose YAML syntax for a list, in this case:

attributes:
    - attr1
    - attr2
    - attr3

In this case the YAML represents a key with a name of ''attributes'' while each item underneath it, prefaced with a "''-''", represents an item that will appear in a list as a value for that key. Here it is printed after a load:

{'object': {'attributes': ['attr1', 'attr2', 'attr3'], ...

The ''methods'' key uses YAML shorthand to accomplish the same thing. In my experience, non-programmers tend to understand the first method, "''-''" prefacing, a bit more than the second method. Both parse to Python lists:

{'object': {'attributes': ['attr1', 'attr2',
                           'attr3'],
            'methods': ['getter', 'setter']}}

I included both examples to illustrate a point. Most of YAML's syntax has two ways of achieving the same intended goal. There is the verbose, multi-line method, and the more compact method. Both methods are human-readable, so choosing one is a matter of personal preference.

As you can see, the most basic syntax is as follows:

dicts/hashes: key, value separated by a colon and space, e.g. ''key: value''; additionally, you can use ''{key: value}''

lists: dash followed by a space then the item, e.g. ''- item''; additionally, you can use ''[item, item, item]''

Strings do not require quotation. You can preserve line breaks with the ''|'' character; for example:

 # YAML
sonnet: |
    I wish I could
    write a poem
    but I can't

This would parse to:

{'sonnet': "I wish I could\nwrite a poem\nbut I can't\n"}

Trailing and preceding whitespace is trimmed out in the basic use case of ''|''. See the "Scalar indicators" section of the compact cheat sheet for modifiers to the ''|'' character.
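
As a quick illustration of two of those modifiers (a sketch; the full set is in the spec): ''|-'' chomps the trailing newline, while ''>'' folds line breaks into spaces:

>>> import yaml
>>> yaml.load("poem: |-\n    one\n    two\n")
{'poem': 'one\ntwo'}
>>> yaml.load("poem: >\n    one\n    two\n")
{'poem': 'one two\n'}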

Core to YAML is the concept of documents. A document is not just a separate file in this case. Instead, think of a document as just a chunk of YAML. You can have multiple documents in a single stream of YAML, if each one is separated by ''---'', like:

 # YAML
---
document: this is doc 1
---
document: this is doc 2
...

Using an ellipsis (''...'') explicitly ends a document. The nice thing about documents is that you can treat each one as a separate entity; say, "people" and "cars" live in the same file. You can also use them for a series of entities that look alike, e.g.:

name: SomeObject
attributes:
    - attr1
    - attr2
    - attr3
methods: [ getter, setter ]
---
name: MyPrettyObject
attributes:
    - attr1
    - attr2
    - attr3
methods: [ getter, setter ]

which parses to:

{'attributes': ['attr1', 'attr2', 'attr3'],
 'methods': ['getter', 'setter'],
 'name': 'SomeObject'}
{'attributes': ['attr1', 'attr2', 'attr3'],
 'methods': ['getter', 'setter'],
 'name': 'MyPrettyObject'}

YAML also supports variables, or repeated nodes, which at first didn't click for me. The simplest explanation is that you define something as a variable by preceding it with ''&NAME value'' and you can refer to it with ''*NAME'' e.g.:

 # YAML
some_thing: &NAME foobar
other_thing: *NAME

Parses to:

{'other_thing': 'foobar', 'some_thing': 'foobar'}

As you can see, the syntax is pretty simple. It's easy to represent information in a way that is clear, concise and, well... fun. What's really cool is the fact it meshes so well with Python!

Note that fans of JSON (JavaScript Object Notation) will quickly realize that the concise version of the syntax (e.g. using ''[value, value]'') looks a lot like JSON. In fact, for the most part, JSON is a subset of YAML syntax, so in most cases you can pass your JSON off as YAML directly; the reverse only works when your YAML sticks to the compact flow style.
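
For example, a small JSON document passes straight through ''yaml.load()'' unchanged:

>>> import yaml
>>> yaml.load('{"name": "Jesse", "tags": ["python", "yaml"]}')
{'name': 'Jesse', 'tags': ['python', 'yaml']}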

And with that, PyYAML

After reading the basics of the syntax, you're jazzed to get started with YAML, right? Well, getting started with YAML is only a single ''easy_install'' away. The **PyYAML** module is pretty much the de facto parser and emitter for YAML in Python. The core of the module is written in pure Python but, as of version 3.04, it also supports binding to the high-speed LibYAML implementation written in C.

PyYAML is blindingly simple to use for most cases. To generate all of the output I've used in the article so far, all I used was:

import yaml
import pprint
for project in yaml.load_all(open('test.yaml')):
    pprint.pprint(project)

The ''load_all()'' function goes back to the "multiple documents within a stream" concept. In the case above, I am not assuming there will be just a single document, so I use ''yaml.load_all()'', rather than ''load()'', and then iterate over the results. ''yaml.load_all()'' returns a generator yielding each document in the stream. The ''yaml.load()'' function accepts a string (Unicode or otherwise) or an open file object.
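
A tiny sketch of the difference, using an in-line string containing two documents:

>>> docs = yaml.load_all("---\nname: doc1\n---\nname: doc2\n")
>>> list(docs)
[{'name': 'doc1'}, {'name': 'doc2'}]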

For many cases, you'll be loading a single document. You might use it for configuration loading:

configuration = yaml.load(open('test.yaml').read())

Of course, one of the other aspects to PyYAML is dumping Python data structures to a YAML file. Take, for example, Listing 1:

#!/usr/bin/python

import yaml

mydata = {'person' : 'jesse',
          'hobby' : 'python',
          'employed' : True,
          'limbs': {'arms' : 2, 'legs' : 2},
          'family' : ['wife', 'toddler']}

print yaml.dump(mydata)

In this case, I am constructing a dictionary containing all of the data I want to include in the YAML file. Then I simply call ''yaml.dump()'' and the output of Listing 1 looks like well-formed YAML:

$ python Listing1.py
employed: true
family: [wife, toddler]
hobby: python
limbs: {arms: 2, legs: 2}
person: jesse

Additionally, PyYAML includes ''yaml.dump_all()''. It accepts a list of objects to serialize and writes to the target stream. Let's make Listing 1 handle a series of objects:

mydata = [ mydata for i in range(2) ]
print yaml.dump_all(mydata, explicit_start=True)

And our output is fairly obvious:

---
employed: true
family: [wife, toddler]
hobby: python
limbs: {arms: 2, legs: 2}
person: jesse
---
employed: true
family: [wife, toddler]
hobby: python
limbs: {arms: 2, legs: 2}
person: jesse

By default, you don't need to pass additional arguments to ''yaml.dump()'' or ''yaml.dump_all()'', as you can see above. In the ''dump_all()'' example, I added the ''explicit_start'' argument. The dump functions support this flag, along with some others that you should know about, to control formatting.

The ''explicit_start'' argument adds the "---" string prior to the data structure being dumped. This allows you to dump multiple objects/documents to the same stream, say, an open file handle, without worrying about the document separators yourself.
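
For instance, here's a minimal sketch that writes two documents to a single file (the ''people.yaml'' filename is just an example):

import yaml

records = [{'person': 'jesse'}, {'person': 'brett'}]
stream = open('people.yaml', 'w')
yaml.dump_all(records, stream, explicit_start=True)
stream.close()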

Adding the ''default_flow_style'' argument changes the output from the default compact style to the more verbose, "humane" style:

print yaml.dump(mydata, default_flow_style=False)

And the output:

employed: true
family:
- wife
- toddler
hobby: python
limbs:
  arms: 2
  legs: 2
person: jesse

You can also control indentation, width, and so on, or switch to canonical mode, which explicitly declares the type of each value within the YAML:

print yaml.dump(mydata, canonical=True)

And the matching output:

!!map {
  ? !!str "employed"
  : !!bool "true",
  ? !!str "family"
  : !!seq [
    !!str "wife",
    !!str "toddler",
  ],
  ? !!str "hobby"
  : !!str "python",
  ? !!str "limbs"
  : !!map {
    ? !!str "arms"
    : !!int "2",
    ? !!str "legs"
    : !!int "2",
  },
  ? !!str "person"
  : !!str "jesse",
}

Yes, I just jumped the tracks on that last one. YAML and PyYAML both support explicit type declaration within the YAML documents. This is obviously handy for inter-language data exchange, but, as you can see in the output, is not so good on the side of readability if you're a non-programmer. On the other hand, it allows for a nice segue!

Turning the Awesome Up

We have covered the basics of YAML and, by extension, PyYAML, but PyYAML offers some additional niceties for Python users. Obviously, these advanced features start to edge out approachability, but they are actually really useful.

In the last example of the last section, we turned on the ''canonical'' flag to the ''dump'' function, which caused it to spit out explicitly typed YAML. Each type was in the format of ''!!'' followed by the type name. These are standard YAML tags, and they're fully covered in the spec.

Internally, PyYAML converts these tags to the expected Python types. ''!!null'' is ''None'', ''!!timestamp'' is ''datetime.datetime'', ''!!seq'' is ''list'', and so on. You don't need to explicitly put these in your YAML documents. In most cases the types are inferred from the document, but being able to explicitly define them is handy.
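
A quick sketch of overriding the inferred types (untagged, these two values would come back as a float and an int):

>>> import yaml
>>> yaml.load("version: !!str 1.0")
{'version': '1.0'}
>>> yaml.load("count: !!float 3")
{'count': 3.0}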

PyYAML can take the ''!!'' syntax a bit further though, and adds a series of Python-specific tags which are exceedingly useful. Each one of the Python-specific tags is prefaced with ''!!python/''. PyYAML defines explicit Python types such as ''float'', ''complex'', ''list'', ''tuple'' and ''dict''. In my opinion, the ''tuple'' and ''complex'' ones are the most useful, simply due to the fact that ''dicts'' and ''lists'' can be derived from the YAML file itself.
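
Here's a short sketch of the type-oriented tags in action:

>>> import yaml
>>> yaml.load("point: !!python/tuple [1, 2]")
{'point': (1, 2)}
>>> yaml.load("z: !!python/complex 1+2j")
{'z': (1+2j)}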

However, PyYAML also offers "non-type" ''!!python'' extensions. These are referred to as "Complex Python Tags" and they allow you to add things to your YAML document such as Python modules, packages, class instances, and the output of a method call with a passed-in variable.

Say we want a YAML file which defines some number of variables and then passes one or more of them to a given module's method. For example, I wanted something to list the contents of my home directory on parsing:

 # YAML
directory: &DIRECTORY /Users/jesse
contents: !!python/object/apply:os.listdir [*DIRECTORY]

And the abbreviated output:

{'contents': ['.bash_history',
              '.bash_profile',
              'todo.txt'],
 'directory': '/Users/jesse'}

Virtually any function can be called this way, and you can pass in keyword arguments and other data as required. Instantiating a class is just as easy. Here's an example YAML file which uses the PyYAML ''new:module.class'' tag to create a ''Queue.Queue'' at load-time with a defined max size:

qsize: &SIZE 10
queue: !!python/object/new:Queue.Queue {maxsize: *SIZE}

Which, of course, passes you back the correct class instance:

{'qsize': 10, 'queue': <Queue.Queue instance at 0x...>}

In theory, and in my rather abusive practice, this would allow you to define a very rich configuration which constructed all of the relevant objects at parse-time to significantly alter the behavior of the application (or in my case, test) to which the YAML file was passed. One catch when you are using the ''!!python/object/*'' tag(s) is that the objects you are creating must be pickle-compatible.

For example, if you tried this:

 # YAML
threadpool:
 - !!python/object/new:threading.Thread
  target: myapp.myfunction

It would fail with an assertion error:

AssertionError: Thread.__init__() was not called

PyYAML is not calling ''__init__()'' when creating the object. Both ''yaml.load()'' and ''yaml.dump()'' are designed to work exactly like ''pickle.load()'' and ''pickle.dump()''. Objects must implement the pickle protocol.
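
To illustrate, here's a minimal sketch (the ''Point'' class is made up): a plain new-style class round-trips fine because its instance dictionary is pickle-friendly, and note that ''yaml.load()'' rebuilds it without calling ''__init__()'':

import yaml

class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

doc = yaml.dump(Point(1, 2))
print doc        # !!python/object:__main__.Point {x: 1, y: 2}
p = yaml.load(doc)
print p.x, p.y   # 1 2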

Conclusion

YAML and, by extension, PyYAML, are incredibly useful if you want something easy on the eyes, easy to understand, and easy to use in a markup language. It's straightforward to customize, it's cross-language, and fundamentally simple. YAML is popping up in all sorts of places, such as the configuration settings for Google's AppEngine, and in Django, where it is used for a serialization format and to load data fixtures.

Obviously some of the advanced features of PyYAML are Python-specific, but the fundamentals make it an easy win for cross-language communication. Sure, XML does this, too, and there's support in every known language for XML parsing (including the stuff toddlers speak), but how readable is XML, seriously?

I do hope more and more people adopt this user-friendly format. It's simply great as a configuration language, and if you need to expose anything to humans and later serialize and deserialize it, just say "no" to XML.

The revolution will be readable.

A (brief) introduction to Python-Core development | Completely Different

This is a reprint of an article I wrote for Python Magazine as a Completely Different column that was published in the August 2008 issue.

In the early summer of this year I had the chance to really get started working on/with the core Python source. I had spent some time putting together a Python Enhancement Proposal (PEP), which was accepted. Now I just needed to learn the code base and practices, and buy a helmet. Shortly after getting the initial patch accepted, I ended up breaking the build and the tests, and caused the beta to slip. This article is an introduction to core development, in which we'll cover what you need to get started, and where I personally screwed up.

Introduction

Core Python development (or, "hacking on python-core" as it may be called) is, like all great open-source projects, a highly distributed, highly active, high-participation project. There are developers all over the world filing bugs, submitting patches for code and documentation, and participating on the python-dev mailing list and IRC channel.

Like all other good open source communities, it's a meritocracy of the technical persuasion. A good idea is simply that: a good idea. If a good idea is the best of breed, it will be adopted or adapted to the language and project. If an idea or a patch is clear, concise, and solves a problem, there is generally no difficulty in getting traction or getting a patch put into the core code base.

Let's start from the beginning

While Python is a meritocracy where any person can submit a patch, file a bug, or send emails to python-dev (sometimes, that last is more of a curse than a blessing), there is a particular group of people that has commit privileges. This group is responsible for judging all patches, proposed bugs and associated fixes, and ultimately committing the actual code to the tree.

Python's code, documentation, PEPs, and other artifacts are all hosted within a Subversion (svn) repository. While the core is in svn, you can also access it via other popular version control tools. There are Bazaar, Git, and Mercurial mirrors of the svn repository. All of the examples in this article will revolve around subversion, though, because the other trees are still experimental.

In order to view the repository, you need to check out a read-only version of the source tree. Write access is only available via svn+ssh authenticated access, but you can use HTTP for a read-only copy. So, to check it out:

mkdir -p python/trunk
svn co http://svn.python.org/projects/python/trunk python/trunk

This is your own, pristine copy: any edits you make in this tree will come up on a ''svn diff'' (which you'll use to make patches). Avoid editing files you don't need to so you don't accidentally taint a diff or checkin.

The basic layout of the tree is unsurprisingly simple, so I'll only really cover the important files/directories:

''Doc/'' contains all of the documentation for the language, which will be discussed in more detail later. If you want to see the standard library documentation, look in Doc/library.

You will find the brain-melting grammar definition for the Python language in ''Grammar/''.

Header files for C code go in ''Include/''.

Libraries written in Python are in ''Lib/''. You'll note a distinct lack of C code in this directory. That's because C modules go in the ''Modules'' directory. Also found in ''Lib/'' is the ''test/'' directory, which we'll be focusing on later. If you want to see some pretty Python code, read the files in this directory. Except anything I've done.

C extensions, such as multiprocessing, ctypes, cStringIO, et cetera, can be found in ''Modules/''. Generally speaking, these are optimized modules for the standard library. Some of them are in subdirectories for cleanliness, but most of them are in the top-level Modules/ directory. Note that there is a style guide for C code in the standard library, outlined in PEP 7.

The ''Misc/'' directory contains things that don't belong elsewhere within the tree. This includes the NEWS file, build notes, configuration for valgrind (a code profiling/debugging utility), a cheat sheet (somewhat dated, but still useful), and some editor plugins. A really good file here is SpecialBuilds.txt, which goes over all the magic flags for Python builds you should know about.

Python objects are defined in ''Objects/''. It contains all C code, and is pretty well documented. If you suddenly get the urge to make a new type, start here.

Miscellaneous tools go in ''Tools/''. I haven't had to use much of anything down here except for the scripts in the ''scripts/'' subdirectory, which is filled with cool things like untabify.py, crlf.py, and google.py.

There are two build files. The main build file, sort of, is ''setup.py''. I list it here because you need to look at this file to realize how things are built. The make steps we cover later are wrappers around this script for the most part. The "other" build file is ''Makefile.pre.in''. It works with ''setup.py'' to control the entire compilation process and has some nifty targets, like "make tags". Who knew the build process could spit out a tags file for ''vi''?

It is important that you pay attention to both ''setup.py'' and ''Makefile.pre.in''. When I forgot one line in the Makefile, my extension module seemed to work, but didn't really. I could "import multiprocessing" from within the svn tree using the local python interpreter. However, after running "make install" the extension module was not installed, so it did not work with the installed interpreter. I finally discovered this was due to a single missing entry in LIBSUBDIRS.

Whew. That's a lot of directories. I skipped over the Windows build stuff, and I am going to continue to do so, noting that I am not a Windows expert. I do know that if you are on Windows you will need to look in the ''PCBuild/'' directory for build information, Visual Studio projects, etc.

Building

Before we go any further, let's walk through the basic build process. Remember, I'm a Linux and OS X guy, so I will be walking you through the steps you would take on a Unix machine. Windows users will need to either use Visual Studio, or install Cygwin (a Unix tool chain for Windows). Installing the Cygwin tool chain means you should be able to compile just fine following these directions.

First off, the ./configure step. If you're familiar with autoconf, automake, and the like, you're more than familiar with this. For those that aren't, the configure, make, etc. steps are common to configuring and compiling/installing a given application. See the link to Autoconf in the requirements section for more details. There are some custom options for configure (of course), which you can see with ''./configure --help''. The main one you want to know about and use is ''--with-pydebug'', which enables a special debug build of Python. You are going to want to have the debug build if you start heavily working on the core of the interpreter. The ''--with-pydebug'' flag enables, in no particular order, LLTRACE, Py_REF_DEBUG, Py_TRACE_REFS, PYMALLOC_DEBUG, C code assertions, and all code that has ''#ifdef Py_DEBUG'' blocks. In other words, it turns on just about every debugging feature you could possibly need or want, short of something that fixes your code for you automatically.

For the exact details on all of the configure flags, including platform specific options, see Misc/SpecialBuilds.txt.

To start a build, just fire off a

$ ./configure --with-pydebug

in ''python/trunk''. Once this is done, unless you really want to twiddle the options, you shouldn't need to do this again for a while. Brett Cannon once told me, when talking about some development TextMate macros, "I left out configure stuff because that becomes rather personal".

Next up, execute ''make'' in the python/trunk directory. You'll see your normal make output, but there are a few caveats to keep in mind.

Here is some example output from the ./configure and make steps:

$ ./configure
checking for --with-universal-archs... 32-bit
checking MACHDEP... darwin
checking EXTRAPLATDIR... $(PLATMACDIRS)
...snip...
creating Modules/Setup
creating Modules/Setup.local
creating Makefile
woot:python-trunk jesse$ make
... gcc output snipped ...
Failed to find the necessary bits to build
these modules:
_bsddb             gdbm               linuxaudiodev
ossaudiodev        readline           spwd
sunaudiodev
To find the necessary bits, look in setup.py in
detect_modules() for the module's name.

running build_scripts
$

Pay attention to the build output. If you're working on a module with C extensions or the interpreter itself, what can go wrong here will go wrong. For example, while integrating the _multiprocessing library into ''Modules/'', the initial issues around simple compilation were exposed here.

As you can see, there is an important report at the end of the make step (the log line looks like: "Failed to find the necessary bits to build these modules:"). The information given in that report is especially important if you need access to the skipped modules. For example, on OS X the ''readline'' module doesn't compile out of the box. You will need to resolve the dependencies listed in ''trunk/setup.py'' in order to get it up and running.

If you want to "quiet down" the make step, adding the "-s" flag will make it less verbose. Also, if you want to speed it up, consider using the "-j NUM" flag to increase the number of concurrent commands being performed.

Once the build completes successfully, you should have a working Python binary in your local directory. On OS X and Windows it's named ''python.exe'' and on Unixes it's named simply ''python''. If you wanted, you could fire this version up and poke around, but for development your next step should be to run the tests.

Running Tests

The tests in Python's source tree are primarily executed with the ''Lib/test/regrtest.py'' utility (this may change in the future) and ''make test''. If you were to run ''make test'' in the ''trunk/'' directory right after building, you would run a subset of all of the tests located in ''Lib/test''. Certain tests, such as large file tests and others that take a lot of time or resources, are excluded in favor of brevity.

For details on what a ''make test'' step does, open Makefile.pre.in and search for "# Test the interpreter" (it should be around line 660). You will find the definitions for what happens during the ''test*'' steps as well as the options that invoke ''regrtest.py''. You can change the test options via the ''TESTOPTS='' flag to ''make test''. For example, to run a single test:

$ make test TESTOPTS=test_multiprocessing

The real magic happens in regrtest.py, the Python regression test execution script. You need to run this for any change made to the code, period. A basic run is the same as the basic ''make test'' execution. This means that certain tests are excluded, but you can enable those tests (and a lot more) via additional arguments to regrtest.py. There is even an option to enable coverage analysis.

A basic invocation of regrtest.py looks like this:

$ ./python.exe Lib/test/regrtest.py
test_grammar
test_opcodes
test_dict
...snip...
test_zlib
327 tests OK.
32 tests skipped:
    test_al test_bsddb test_bsddb3 test_cd test_cl
    ...
    test_winsound test_zipfile64
Those skips are all expected on darwin.

Pretty painless, but if something goes wrong, there's not a lot of information to go on. A better way to run it is with the ''-w'' option, which will re-run any failed test with additional verbosity. For example, in Listing 1 I added a line that causes one of the tests to crash.

Listing 1:

$ ./python.exe Lib/test/regrtest.py test_multiprocessing
test_multiprocessing
test test_multiprocessing crashed -- <type 'exceptions.NameError'>: name 'mportasdl' is not defined
1 test failed:
    test_multiprocessing
$ ./python.exe Lib/test/regrtest.py -w test_multiprocessing
test_multiprocessing
test test_multiprocessing crashed -- <type 'exceptions.NameError'>: name 'mportasdl' is not defined
1 test failed:
    test_multiprocessing
Re-running failed tests in verbose mode
Re-running test 'test_multiprocessing' in verbose mode
test test_multiprocessing crashed -- <type 'exceptions.NameError'>: name 'mportasdl' is not defined
Traceback (most recent call last):
  File "Lib/test/regrtest.py", line 549, in runtest_inner
    the_package = __import__(abstest, globals(), locals(), [])
  File "/Users/jesse/open_source/subversion/python-trunk/Lib/test/test_multiprocessing.py", line 6, in <module>
    mportasdl;fj
NameError: name 'mportasdl' is not defined
$ 

There's one more important flag to regrtest.py you need to know about, and that's ''-uall''. This option will run all of the tests, and obviously, when you're changing something really low level, you need to run these tests. They take a long time, so I recommend running them before going to bed.
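
Such a run might look like this (grab a coffee, or a pillow):

$ ./python.exe Lib/test/regrtest.py -w -uall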

Documentation

Yes, even documentation has bugs. All of Python's documentation resides in the ''Doc/'' directory, and it has its own build scripts and system, called Sphinx. The standard library documentation module overviews we all know and love are located in ''Doc/library/''. When you are making a change that will be public in nature (say, adding a method) you need to find and update the associated documentation.

Also, when adding new packages, modules or methods, you should really consider adding an example in the appropriate section of the module's .rst file (not the ''Doc/examples'' directory). It is common for new Python users to have difficulty finding clear examples on standard library module usage, so the more examples the merrier.

If you're stuck with the documentation, feel free to send an email to docs@python.org and ask for help. There are a lot of good people signed up for that list and they're willing to help you if you're stuck.

The documentation is all in ReST (ReStructuredText) format, and there is some Python-specific syntax that can be of use to you. See the "Documenting Python" page for more information. A nice nugget I found was breaking the bigger examples out of the main ''module.rst'' file (the documentation file for a given module) and including them separately with:

.. literalinclude:: ../includes/mp_webserver.py

This means you can drop the python code into the ''Doc/includes'' directory and it will be popped in place when the documentation is built.

When you want to try building the docs, simply go into ''trunk/Doc'' and type ''make html'' to convert all of the documentation into the HTML files you know so well from the Python doc site. Don't worry about installing Sphinx in advance; the build rules do that for you. Once built, the HTML documents live in ''Doc/build/html''.

At the very least, whenever you make a change to core, you should update the ''Misc/NEWS'' file to add a brief description of your change, and also add your name to ''Misc/ACKS''.

Making a change

Let's assume for the moment you're about to provide a patch to fix a bug from the python bug tracker. Most fixes will require the following minimal changes:

  • Updated Python module
  • Updated documentation (At least an entry in the NEWS file)
  • Updated Tests (you will update the tests)

In a few cases you also will need to update the C code. After you've done the initial check out of the branch you'll be working on, and you've confirmed the build and tests pass on your machine, you should be set to make your changes locally, apply any patches you are testing, etc.

When you're updating or adding new tests you need to drop into the ''Lib/test'' directory and find the "best place" for the test. Typically, if you're making a bug fix, you're simply going to append the test onto the suite for the module. Larger scale changes, including creating new packages or modules, will need their own ''test_*.py'' file in ''Lib/test''.

It's important when you're adding tests that your tests are clear, well documented, and most of all smart. They will need to know when not to run (say, a network test should not run when no network is present) and they need to be reliable (i.e.: they should never just hang). The tests and code you submit will be viewed by many people, and compiled and tested on more platforms than most of us have ever used. The smarter you make the test, the better off everyone will be.

An important tool in the test developer's arsenal is the ''test_support'' library included in ''Lib/test/test_support.py''. In it you will find a variety of functions, exceptions, and tools to help you to write core tests. Most of all, look at the other tests!

Once your changes work, you should run a ''make check'' to perform some housekeeping operations you want to do prior to generating the diff. These include fixing whitespace, checking the NEWS/ACKS file for updates, and reminding you to run the test suite! See ''Tools/scripts/patchcheck.py'' for everything ''make check'' does.

On Code Bombs

It's important to avoid making widespread changes in a vacuum. Large scale refactoring or changes to an API used by a lot of the standard library should be reviewed carefully and often. Typically, it's better to post an initial patch up on the bug tracker and then revise it as other people/contributors make comments than to drop a huge patch on everyone and say "it's done".

A recent python-dev post from Guido highlighted this issue, the take-away quote (from both his email and the blog post he linked to) being: "The story's main moral: submit your code for review early and often; work in a branch if you need to, but don't hide your code from review in a local repository until it's 'perfect'." For more details, see the "Code Bombs" thread on python-dev.

One of the tools at your disposal for publishing patches for review is Rietveld, the review application created by Guido van Rossum. Typically, if you have a small enough change, putting a patch in the bug tracker is sufficient.

How do you generate a patch, big or small? It's easy: cd into your ''trunk/'' directory and run ''svn diff >mychange.patch''. This will create a patch containing only your changes which can then be uploaded to the bug tracker, emailed to the community, etc.

Applying the patch is also easy. Just hop into the ''trunk/'' directory and run ''patch -p0 < mychange.patch''.

Conclusion

A good first step to contributing to core is to consult the bug tracker. There you can find everything from mind-melting interpreter issues to simple one-line fixes (famous last words). There's even a query to find "Easy" issues (see the sidebar on bugs.python.org).

One great thing about Python development is that anyone can propose an idea. Should it stand on its own merit, it will probably be accepted. So even if you don't find a bug in an area you're passionate about, why not find something you are interested in and make a Python Enhancement Proposal for the change? Publish it to python-dev and put together the patch for the code. You can do this for existing modules or even new ones.

Ultimately, Python is your language. Without the people constantly contributing to core in the form of bug fixes, documentation and new programming concepts, Python would simply die on the vine. The more help, the better the language becomes, and the wider the appeal and audience.

Get with the program as contextmanager | Completely Different

One of the cooler features that came with Python 2.5's release is the 'with' statement and the context manager protocol behind it. I could make the argument that these two things alone make the upgrade to Python 2.5 more than compelling for those of you trapped in the dark ages of 2.4 or worse: 2.3!

This is a reprint of an article I wrote for Python Magazine as a Completely Different column that was published in the July 2008 issue. I have republished this in its original form, bugs and all.

Introduction

In Python 2.5, a with_statement hook was added to the ''__future__'' module. This was brought on by PEP (Python Enhancement Proposal) 343, "The with statement". PEP 343, like many PEPs in Python, was a fusion of good ideas into a rather elegant solution. See http://www.python.org/dev/peps/ for a complete listing of PEPs, including those referenced in this article.

Two of the influencing PEPs, 310 (Reliable Acquisition/Release Pairs) and 319 (Python Synchronize/Asynchronize Block), were primarily focused on a system to add a simple method of acquiring and then releasing a lock. PEP 310 proposed the ''with'' statement (i.e., ''with lock:'') and PEP 319 proposed ''synchronize'' and ''asynchronize'' keywords that would allow you to define a function or method that uses the proposed keywords to access and modify shared objects, essentially hiding the common form of managing the lock directly:

initialize_lock()
...
acquire_lock()
try:
    change_shared_data()
finally:
    release_lock()

While both PEPs 310 and 319 were (are) good ideas, there were additional influences from other PEPs as well. PEP 340, "Anonymous Block Statements", and PEP 346, "User Defined ('with') Statements", by Nick Coghlan were both important. In the end, what I think is an elegant and powerful middle ground was reached.

If you want a very detailed overview of all of the reasoning behind the introduction of the with statement, I recommend reading PEP 346 http://www.python.org/dev/peps/pep-0346/, where Nick Coghlan explains it in excellent detail with many examples.

Context Managers

The key thing to understand about ''with'' and all of the work in the PEP is that under the covers, when you write:

with EXPRESSION [as VARIABLE]:
    BLOCK OF CODE

The EXPRESSION is expanded into two calls. The first call is to the ''__enter__()'' method on the object. After the nested block completes, the object's ''__exit__()'' method is run. "as VARIABLE" is in brackets because it is optional; when present, the return value of the ''__enter__()'' call is bound to VARIABLE for use within the BLOCK.

Take a look at Listing 1 for an example. In order to illustrate the methods and call order, I've created a simple class, Foo, that defines the required protocol methods, and then used it at the bottom of the listing. When an instance of Foo is used in the ''with Foo():'' call, the output is simply:

I like turtles

Listing 1:

from __future__ import with_statement

class Foo(object):
    def __init__(self):
        pass
    def __enter__(self):
        print "I"
    def __exit__(self, type, value, traceback):
        print "turtles"

with Foo():
    print "like"

As you can see, the ''__enter__()'' method is called on the object, control is released and the ''print "like"'' code block is executed. Once the block is completed, the ''__exit__()'' method is called.

Per the PEP, the ''__enter__()'' method on the object accepts no arguments, but can perform actions (in this case, print) or return data. If an object has no data to return it should return self, although that is not required.

The ''__exit__()'' method on the object has to accept three arguments: type, value, and traceback; these correspond to the arguments of the ''raise'' statement. They are passed in so the context manager can deal with any exception raised inside the nested block. If type is ''None'', the nested block executed successfully, without error. Otherwise, the ''__exit__()'' method can properly handle the exception condition and clean up the resource.
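
Here's a minimal sketch of that idea (the ''Suppress'' class is hypothetical, not part of the standard library): an ''__exit__()'' that inspects those arguments and, by returning True, tells Python the exception has been dealt with:

from __future__ import with_statement

class Suppress(object):
    """Swallow exceptions of a given type raised inside the block."""
    def __init__(self, exc_type):
        self.exc_type = exc_type
    def __enter__(self):
        return self
    def __exit__(self, type, value, traceback):
        if type is None:
            return False  # block finished cleanly, nothing to handle
        # A true return value suppresses the exception
        return issubclass(type, self.exc_type)

with Suppress(ZeroDivisionError):
    1 / 0
print "still running"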

For example, you might ask what happens to the ''__exit__()'' method execution if an exception is raised when the code block is executing. Let's examine this further by changing the bottom part of Listing 1 to be:

with Foo():
    raise Exception

The output now looks like this:

I
turtles
Traceback (most recent call last):
  File "scratch.py", line 12, in 
    raise Exception
Exception

If the code block being executed raises an exception, ''__exit__()'' is still called on the Foo() object. This makes it darn handy for, say, cleaning up locks, database handles, sockets, unruly children, etc. Earlier I mentioned that objects that define the new protocol could also return ''self'', which would then be bound to the variable defined in the [as VARIABLE].

Listing 2 provides a class with an ''__enter__()'' method that returns the instance of the object for access by the code block. In the example, the instance of the object is associated with the variable name "baz". Take a look at the output:

setting count to 0
<__main__.Foo object at 0x73bb0>
count is now: 4

Listing 2:

from __future__ import with_statement

class Foo(object):
    def __init__(self):
        pass

    def __enter__(self):
        print "setting count to 0"
        self.count = 0
        return self

    def __exit__(self, type, value, traceback):
        print "count is now: %d" % self.count

    def incr(self):
        self.count += 1

with Foo() as baz:
    print baz
    for i in range(4):
        baz.incr()

As you can see, within the for-loop in the main block of code we were able to alter the state of the object we're reliant on. We can access all of its internals, change state, call methods, etc. Again, this is especially handy if you want to create something that acts as some sort of handle.

Let's look at two snippets, the old way of declaring a lock, then later acquiring it to modify state:

lock = RLock()

class thread_object(Thread):
    def run(self):
        lock.acquire()
        try:
            print self.getName()
        except:
            raise Exception("Something is broken")
        finally:
            lock.release()

Now, let's look at code refactored to use ''with'':

lock = RLock()

class thread_object(Thread):
    def run(self):
        with lock:
            print self.getName()

This is possible because threading.RLock implements the new context manager protocol. Go ahead, take a peek at threading.py yourself, or look at the code below:

class _RLock(_Verbose):
    __enter__ = acquire
    ...snip...
    def __exit__(self, t, v, tb):
        self.release()

The lock management classes are not the only ones to implement the protocol. The io.py, tempfile.py, and other modules all implement the protocol to allow you to do something like the following:

with open("hey", "r") as mfile:
    mfile.readlines()

This will automatically open and close the file on the way in and the way out. Magic! Obviously, the simple way of thinking of these is as resource managers. For example, what if you wanted to ensure a given state was set for a particular code block? PEP 346 points out an excellent example of disabling signals during the BLOCK execution. Take a look at Listing 3, where I have implemented that very code to simply catch and ignore SIGABRT signals.

When the script is run in one window, and in another we start running ''kill -6 <pid>'' against it, we see:

Tis but a scratch!
Tis but a scratch!
I got an abort, but I like it here.
Tis but a scratch!
Tis but a scratch!

Listing 3:

from __future__ import with_statement
from contextlib import contextmanager
import signal

def handler(signum, frame):
    print "I got an abort, but I like it here."
    pass

@contextmanager
def no_sigabort():
    signal.signal(signal.SIGABRT, handler)
    yield
    signal.signal(signal.SIGABRT, signal.SIG_DFL)

with no_sigabort():
    # code executed without worrying about signals
    while True:
        print "Tis but a scratch!"

Instead of passing in the handler function, we could also pass in signal.SIG_IGN, which simply causes the signal to be ignored. You can easily catch all sorts of state and react to it. Another one of the examples in PEP 346 is committing or rolling back database transactions:

@contextmanager
def transaction(db):
    try:
        yield
    except:
        db.rollback()
    else:
        db.commit()
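
Used like so (assuming ''db'' is a connection object with ''commit()'', ''rollback()'', and ''execute()'' methods):

with transaction(db):
    db.execute("UPDATE people SET name = 'jesse'")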

Using this style, your code becomes a lot more succinct and clear, and you drastically reduce the amount of boilerplate you have to add to your application.

Contextlib

As part of Python 2.5, a new module, ''contextlib'', was introduced. This module is an excellent reference point for how to use context managers (it's great example code!). It also provides some pretty cool tools. You've already seen me use contextlib.contextmanager in the last example to remove the need to define an object with ''__enter__()'' and ''__exit__()'' methods.

The contextlib.contextmanager decorator allows you to create nice user statements out of a simple function that yields at one point in the middle. This means you could do:

@contextmanager
def test_setup():
    # start database...
    # inject fake data...
    yield  # (control passes to the test here)
    # confirm result...
    # shut database down...

Which allows you to:

def mytest():
    with test_setup():
        pass  # ... do stuff ...

You can technically do anything else you want within that decorated function, and it can take as long as you want as long as:

- It yields exactly once.
- It does not yield again after an exception is raised.

The other nice thing is that you could change the test_setup example above to accept any number or type of arguments, so tests could pass identity and other information into the test_setup function.
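
Here's a runnable sketch of that idea (the fixture name and data are made up; a real version would talk to an actual database):

from __future__ import with_statement
from contextlib import contextmanager

@contextmanager
def test_setup(name, data):
    # Stand-ins for real setup/teardown work
    print "setting up fixture:", name
    fixture = dict(data)
    yield fixture
    print "tearing down fixture:", name

def mytest():
    with test_setup('users', {'jesse': 1}) as fixture:
        assert fixture['jesse'] == 1

mytest()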

Now let's turn this up to 11. Up until now, I've shown you simple examples - basically, how to get/set some resource and then release it. But did you know you could nest them? Via the contextlib.nested function, you can define a series of nested contextmanagers and then bind each one to a different variable name.

Let's try a simple nested context out for starters. In the first example in Listing 4, we want to move the data from file1 to file2. It's easy to list the open file handles as arguments to ''nested()'', but what about mixing types? The second example in Listing 4 (lines 8-11) mixes file handles with thread locks.

Listing 4:

#!/usr/bin/env python
from __future__ import with_statement
from contextlib import nested

with nested(open("file1", "r"), open("file2", "w")) as (a, b):
    b.write(a.read())

from threading import RLock
lock = RLock()
with nested(lock, open("file1", "r"), open("file2", "w")) as (a, b, c):
    c.write(b.read())

Yes, we have officially crossed into "maybe that's too much" territory. But you can see we can pass in any number of contextmanagers and all of them will be handled as needed. This is great if, like above, you need to acquire a lock and then perform an action which requires some cleanup.

Finally, we have contextlib.closing. This is, as the documentation states, "a context manager that closes thing upon completion of the block". Anything with a ''close()'' method is eligible to be used here. At last count on trunk, ''close()'' occurred at least 71 times in the Lib directory. You can use ''closing'' on URLs from urllib, StringIO objects, as well as gzip objects.

For example, from the standard library documentation:

from __future__ import with_statement
from contextlib import closing
import urllib

url = 'http://www.python.org'
with closing(urllib.urlopen(url)) as page:
    for line in page:
        print line

All three of these make it easy to factor out code which we all end up repeating; that's the nature of boilerplate. As we all know, less boilerplate and copy-pasted code means code that is easier to read and easier to manage.

Let's Go Off-Roading

As I was writing this, I was trying to think of something really interesting to do with an object defining ''__enter__()'' and ''__exit__()'' methods that wasn't just resource management. Then I realized, given I'm doing a lot of parallel stuff right now, I could create a threadpool that allowed jobs to be submitted to it, and the ''__exit__()'' would call ''join()'' on the threads and so on.

Fantastic idea! Within Listing 5, I have defined a basic thread object that subclasses threading.Thread. Then, in Listing 6 I define a ThreadPool, which is the context manager I will use.

Listing 5:

from __future__ import with_statement
from threading import Thread
from Queue import Empty
from Listing6 import ThreadPool

class myThread(Thread):
    def __init__(self, myq):
        Thread.__init__(self)
        self.myq = myq
    def run(self):
        while True:
            try:
                job = self.myq.get()
                if job == 'STOP':
                    break
                print self.getName(), job
            except Empty:
                continue

with ThreadPool(10, myThread) as pool:
    for i in range(100):
        pool.put(i)

Listing 6:

from Queue import Queue

class ThreadPool(object):
    def __init__(self, workers, workerClass):
        self.myq = Queue()
        self.workers = workers
        self.workerClass = workerClass
        self.pool = []

    def __enter__(self):
        # On entering, start all the workers, who will block trying to
        # get work off the queue
        for i in range(self.workers):
            self.pool.append(self.workerClass(self.myq))
        for i in self.pool:
            i.start()
        return self.myq

    def __exit__(self, type, value, traceback):
        # Now, shut down the pool once all work is done
        for i in self.pool:
            self.myq.put('STOP')
        for i in self.pool:
            i.join()

Note that ThreadPool returns a value from __enter__(). After it builds up the worker-pool, instead of returning ''self'' (which would be silly), it actually returns the queue built in the constructor. This makes it so that when we call it on line 20 in Listing 5, we get the reference to the queue we need.

Now, this is a nominal example. We're not returning any results or anything, we're just printing the numbers off of the queue as we get them. But it demonstrates the concept of creating an object that tracks some state, sets up a resource, and then ultimately manages that resource.

In Listing 6, I made sure we built the pool at ''__enter__()'' time rather than in the constructor because what happens if we need to do more customization or hit an exception? If we do hit an exception, we will immediately jump out and the BLOCK we're running will not be executed. In the ''__exit__()'' method, I insert STOP tokens to tell the threads to exit their work loop.

If you wanted, you could use this code inside of your own application (once you make it so it returns data to the caller) to spawn worker pools on-demand, do some processing, and then cleanly shut them down with a minimal amount of boilerplate involved.

The nice thing about this is that all of the responsibility for management is done in the object that does all of the work itself. There is no more needing to remember to shut down the worker pool, release the database connection, or close that socket.

Conclusion

I hope I've shown you a compelling new feature within Python that you might not have known about. Python is evolving rapidly every day. We don't just have things like context managers and Python 3000 to look forward to. We have a wealth of improvements going into core every single day.

I think people are going to really love context managers for their elegance once they become mainstream in the language (in 2.6). Centralizing the control and management of state, resources and other such things, while reducing the total lines of code you have to debug, manage and read, is a good thing.

Well, as long as the end result is still readable.

An Interview With Adam Olsen, Author of Safe Threading | Completely Different

This is a reprint of an article I wrote for Python Magazine as a Completely Different column that was published in the June 2008 issue.

A world without a Global Interpreter Lock (GIL) - the very thought of it makes some people very, very happy. At PyCon 2007 Guido openly stated that he would not be against a GIL-less implementation of Python, provided someone coughed up the patch itself. Right now, that someone is Adam Olsen - an amateur programmer who has been working on a patch to the CPython interpreter since July of 2007.

It's PyCon. I'm supposed to be listening to a talk, but I've fallen down the rabbit hole of a future without a global interpreter lock. I'm locked in on getting a patched version of the interpreter up and running on Mac OS X, and the patch author, Adam Olsen, is coaching me through changes to some of the deepest internals of Python itself. For about a year, Adam has been working on the "safe threading" project for Python 3000. In this project, he has attempted to address many of the common issues programmers face in highly threaded and highly concurrent applications. These problems include deadlocks, isolation of shared objects (to prevent corruption/locking issues) and finally, as a side-effect of making threading safer, the removal of the Global Interpreter Lock.

Adam would be the first to point out that adding ''--without-gil'' to the Makefile for the C version of the interpreter was actually a side-effect of the bulk of his work. At 938 kilobytes, I would say his diff against the CPython code base that produces an interpreter with a safe, clear, and concise threading model for local concurrency is a bit more than a side effect.

It is clear that he lives for a concurrent and threaded world, and Adam has filled in a lot of gaps in my knowledge about concurrency in our past conversations. I've been lucky enough to interview him about the safe threading project and his outlook on all of this as well.

First off, what's your background?

I'm an amateur programmer, self taught. I've had a long interest in object models and concurrency, such as how widgets in GTK interact.

I've explored twisted a bit, as well as Python's existing threading. Additionally, I've experimented a great deal with different ways to utilize generators or threads, actors, futures, cooperative versus preemptive scheduling, and so on.

Can you explain the basic premise behind the Safe Threading part of the project?

Make the common uses easy. Don't necessarily make it impossible to get wrong (everything is a tradeoff!), but give the programmer a fighting chance.

How about the "Free Threading" part (--without-gil)?

Everybody seems to know you don't need locking if you're not modifying an object, but Python demands a traceback, not a segfault if the programmer gets it wrong. Monitors and shareability provide a framework that satisfies both.

In essence, removing the GIL was a bonus to avoiding unintended conflicts between threads.

Many people accuse the "threaded programming" paradigm as impossible to get right. Even Brian Goetz has stated that it is extremely hard to "get right". If this is the case, why try to "fix" threading in Python?

What most people see as a problem with threads, I see as a problem with the memory model. They let threads modify the same object simultaneously, resulting in arbitrary, ill-defined results. The complexity here explodes, quickly turning the programmer's brain into mush.

The solution is to isolate most objects. Keep all these mutable objects back in the sequential, single-threaded world. Multiple processes let you do this. Actors do it too. So do monitors.

editor: Actors can be thought of as Objects (for Object Oriented programmers) except that in the Actor model, all Actors execute simultaneously and can create more Actors, maintain their own state, and communicate via asynchronous message passing. A Monitor can be thought of as an object that encapsulates another object intended for use by multiple threads. The Monitor controls all locking semantics for the encapsulated objects.

So from that view, processes, actors, and monitors are all equivalent. The only reason I use monitors and build them on OS threads is that it fits better with the existing Python language and is much more efficient for the way Python uses them. I could take a page from Erlang and call them "processes", but I think in the long run that would be more confusing, not less.

In looking through the diff for safe thread, you've had to touch a lot. Everything from object allocation to the entire way threads are managed. What was the hairiest series of changes you've had to make?

Hard to say - I've been at this nearly a year with still lots to do. I could mention when I found that atomic refcounting didn't scale, it spelled doom for the removal of the GIL until I came up with a viable asynchronous scheme. Straightening out object allocation was also pretty nasty, as stock CPython uses a twisted maze of macros and wrapper functions for that.

The worst was probably deciding how to handle class and module dictionaries. What you've got here are mutable objects, inherently used simultaneously by multiple threads, no clear cutoff point after which they're no longer modified, and a massive amount of implicit accesses ingrained as a fundamental part of the language. I really wanted to impose some order on this, add some clear boundaries to when they're modified versus accessed, and make them live up to the "explicit is better than implicit" ideal.

I couldn't do it though. Implicit access is too ingrained into the language. Eventually I conceded defeat, then embraced it, codifying dict's existing API as shareddict's actual API. In doing this I also switched to a relatively simple read/write lock as shareddict's protection (relative to what came before!). In the end, the only restriction was that the contents of shareddict must themselves be shareable.

Isn't a monitor equivalent to adding an @synchronized or a with:lock statement around your code? Is using a monitor for the mutable objects that much faster than lock.acquire and lock.release?

editor: See Listing 1 for a monitor example. You will need to be running Python 3000 with Adam's patch for the code to work.

Listing 1:

# See the requirements: you must apply Adam's patch to Python 3000
# for this.
from __future__ import shared_module
from threadtools import Monitor, monitormethod, branch

class Counter(Monitor):
    """A simple counter, shared between threads"""
    __shared__ = True  # More shared_module boilerplate
    def __init__(self):
        self.count = 0

    @monitormethod
    def tick(self):
        # Only one thread at a time may be inside the monitor,
        # so this read-modify-write is safe
        self.count += 1

    @monitormethod
    def value(self):
        return self.count

def work(c):
    for i in range(20):
        c.tick()

def main():
    c = Counter()

    # The with block does not exit until every child has finished
    with branch() as children:
        for i in range(10):
            children.add(work, c)

    print("Number of ticks:", c.value())  # 10 threads * 20 ticks = 200

main()

Superficially, yes, but it has deeper semantics as well. The biggest is that it imposes a shareability requirement on all objects passed in or out, and there's no way to bypass it. This basically forces you to be explicit about how you expect threads to modify mutable objects.

It also lets me use somewhat saner recovery from deadlocks than would be possible using ''with lock:''. Not that much saner, though, and there are certain circumstances in which there is no ideal way to recover.

Performance-wise, lock.acquire/lock.release are irrelevant. The real competition is with adding a lock to every object, such as a list. What seems like a simple ''if x: x.append(42)'' actually requires two acquire/release pairs - something like ''x.extend(y)'' would require a pair for every item in y. This could easily add up to thousands of lock operations where a monitor lets you get away with just one.
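editor: A sketch of that arithmetic, using a hypothetical ''LockedList'' that takes its own lock on every operation (standard library only; not code from the patch):

# Hypothetical fine-grained locking, to count lock operations.
import threading

class LockedList:
    """Every operation acquires the object's own lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self._items = []

    def __bool__(self):
        with self._lock:       # pair #1 of "if x: x.append(42)"
            return bool(self._items)

    def append(self, item):
        with self._lock:       # pair #2 of "if x: x.append(42)"
            self._items.append(item)

    def extend(self, iterable):
        for item in iterable:
            self.append(item)  # one acquire/release pair per item

# A monitor, by contrast, is entered once: an entire extend happens
# inside a single monitormethod call, costing one lock operation.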

How did you handle Garbage Collection?

Painfully. I originally attempted to use simple atomic integer operations for the refcounting, but I found they didn't work. Well, they were correct, but they didn't give me the benefit of removing the GIL. Multiple CPUs/cores would fight over the cache line containing the refcount, slowing everything to a crawl.

I solved that by adding a second mode for refcounting. An object starts out like normal, but once a second thread accesses the refcount it switches to an asynchronous mode. In this mode each thread buffers up its own refcount changes, writing out all the changes to a given refcount at once. Even better, if the net change is 0, it can avoid writing anything at all!

The catch is, you can no longer delete the object when the refcount hits 0, as another thread might have outstanding changes. Instead, I modified the tracing GC to keep a list of all objects and had it occasionally flush the buffers and check for objects with a refcount of 0.
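editor: A toy model of the buffering Adam describes; ''RefcountBuffer'' is purely illustrative, as the real implementation lives in C inside the patch:

# Toy model of asynchronous refcounting: each thread batches deltas
# locally, and only non-zero net changes touch the shared count.
from collections import defaultdict

class RefcountBuffer:
    """One per thread: buffers refcount changes before flushing."""
    def __init__(self):
        self._deltas = defaultdict(int)

    def incref(self, obj_id):
        self._deltas[obj_id] += 1

    def decref(self, obj_id):
        self._deltas[obj_id] -= 1

    def flush(self, shared_counts):
        # In the real design the tracing GC triggers this occasionally.
        # A net change of 0 never writes, so the object's cache line
        # stays unshared between CPUs/cores.
        for obj_id, delta in self._deltas.items():
            if delta:
                shared_counts[obj_id] += delta
        self._deltas.clear()

# An object can't be freed the moment its count hits 0 - another
# thread may hold buffered changes - so the GC flushes all buffers
# first, then looks for counts that are genuinely 0.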

Your Branching-as-children method (see Listing 2) of spawning, implemented in your patch, deviates from the current threading module approach. Why not just overlay your work on the existing API?

Listing 2:

with branch() as children:
    for i in range(10):
        children.add(work, arg1, arg2)

Branch basically wraps up best practices into a single construct. It propagates exceptions, handles cancellation, lets you pass out return values, and ensures you don't accidentally leave a child thread running after you've returned.

You can still leave threads running after a function returns, you just need to use a branch that's higher up in the call stack. Later on, I might add a built-in one just above the main module just for this purpose.
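editor: A hedged sketch of that pattern, assuming the patch's ''branch()'' API; ''helper'' and ''background_task'' are made-up names:

# Passing a caller's branch down the call stack lets a helper spawn
# work that outlives the helper itself (requires the patch).
from threadtools import branch

def background_task():
    pass  # some long-running work

def helper(children):
    # This child joins when the *caller's* branch exits,
    # not when helper() returns.
    children.add(background_task)

def main():
    with branch() as children:
        helper(children)  # returns immediately...
        # ...but background_task may still be running here
    # once the with block exits, every child is done

main()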

What about stopping/pausing child threads?

Pausing isn't possible, but cancellation serves the purpose of stopping. Essentially it sets a flag on that thread to tell it to stop, as well as making sure participating I/O functions will check that flag and end themselves promptly.
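editor: The shape of the idea can be modeled with a standard-library ''threading.Event'' standing in for the patch's cancellation flag (''process()'' is a stand-in for real work):

# Cooperative cancellation modeled with threading.Event: the worker
# checks the flag at safe points and ends itself promptly.
import threading

def process(chunk):
    pass  # stand-in for a unit of real work

def worker(cancel_flag, chunks):
    for chunk in chunks:
        if cancel_flag.is_set():
            return  # a participating I/O call would raise here instead
        process(chunk)

cancel_flag = threading.Event()
t = threading.Thread(target=worker, args=(cancel_flag, range(1000000)))
t.start()
cancel_flag.set()  # tell the worker to stop at its next check
t.join()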

How did you handle thread-safe imports?

Most of this isn't implemented yet, but the basic idea is that each module will be either shareable or unshareable. Unshareable modules work normally if imported from the main thread, but if another thread tries to import one, the import won't get past the parsing phase - just enough to try to detect ''from __future__ import shared_module''.

Modules found to be shareable are placed in their own MonitorSpace (the underlying tool used by a Monitor) before the Python code in them is executed. This separates them from the main thread, so I won't need the main thread's cooperation to load them.
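editor: As a sketch, a shareable module would announce itself at the very top so the parser can detect the declaration before any code runs; the module below is hypothetical:

# sharedlib.py - a hypothetical module marked shareable. The future
# import must sit at the top so it is detectable at parse time, before
# any of the module's code runs.
from __future__ import shared_module

# Module contents must themselves be shareable; the body executes in
# its own MonitorSpace rather than in the importing thread's.
SCALE = 100

def scaled(value):
    return value * SCALE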

In your implementation, you use libatomic-ops, essentially adding a new python build/library dependency - what does this buy you over using standard locking primitives?

Scalability. I can use an atomic read and, so long as the memory doesn't get modified, all the CPUs/cores will pull it into their own cache. If I used a lock it would inherently involve a write as well, meaning only one CPU/core would have it cached at a time.

For some applications it also happens to be a great deal lighter than a lock. It may be both easier to use and faster.

You state on your page about the Dead Lock Fallacy that "Ultimately, good style and a robust language will produce correct programs, not a language that tries to make it impossible to go wrong." What language tries to make it impossible to go wrong? Why (again) not just ditch threading and move to, say, Erlang?

Concurrent Pascal would be the great old example - it introduced monitors, but applied a great many more restrictions as well. Ultimately though, the language was focused on hard real-time applications, and it shows. Python and safethread are focused on general purpose applications, so usability is more important.

Erlang's a pretty similar situation. It was designed for real-time, distributed, fault-tolerant applications. It wants you to use one-way messages (not two-way function calls), and it copies everything passed in those messages. Good tradeoffs for its focus, but bad ones for a general purpose language.

What sorts of CPython bugs have you found delving this deep into the codebase?

Just a few scattered little ones. My favorite was a refcounting bug involving dicts, but it could only occur using recursive modification or threading - obviously with shareddict I make the latter a little more likely (but only recursive modification is possible with the normal dict).

That's not counting the threading/interpreter state APIs, though. Most of that code was pretty messy, with lots of bugs lurking around. It was quite satisfying to rip it out.

How have your changes altered the API that C extension writers use? Given that the "bonus" of the GIL is a simple interface for extension writers via the Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS macros - does safethread introduce more complexity?

Most of my changes are cleanup and simplification. The tp_alloc/tp_free slots are gone - everything uses PyObject_New and PyObject_Del. PyObject_GC_New/Del are gone too. The old GIL macros are directly replaced by PyState_Suspend()/PyState_Resume().

However, there are new options to take advantage of. Extensions doing I/O or other blocking operations should use the cancellation API. Modules wishing to be shareable should be audited, then have Py_TPFLAGS_SHAREABLE/METH_SHARED applied as appropriate. If they do that, they also need to call PyArg_RequireShareable/PyArg_RequireShareableReturn wherever there's the potential to share objects between threads (between MonitorSpaces, technically).

You don't support the old threading API right now - but would it be possible to add in backwards-compatible support, or is it simply unfeasible?

For the C API, it'd be easy to retain the old GIL macros. Other parts may not be so easy.

For the Python API, some are easy, some aren't possible, and some are just painful.

Adding equivalents to Lock, Semaphore, and Queue is easy - easier than the originals, in fact. Getting all the minor details right (such as subclassing behavior) might be harder, or impossible. Lock would not support deadlock detection, but it would be cancelable.

Daemon threads will likely not be supported, but, in my opinion, they're broken by design anyway.

The painful part is resurrecting the GIL, so these "classic" threads can share arbitrary objects like they always did. However, I won't make it so global - they'll acquire/release the main MonitorSpace instead, so all the new-style threads (created using branch()) will not be slowed down.

Finally, you've pointed out that "real threading" does not equal distributed programming, only local concurrency (i.e., support for multiple cores). What do you think Python could do to support distributed computing (provided the GIL-less world comes to fruition)?

At this point, it's confusing. Much of the focus is to work around the GIL, to take advantage of multiple cores. With safethread integrated into Python, I think many of the distributed/multiprocess projects would die off. What's left would be the ones that *really* want to be distributed and need multiple boxes, not multiple cores.

In my mind, there are three main characteristics of distributed programming, although a given framework may only use one or two:

- security - you don't trust the other nodes, and they don't trust you. This often takes the form of sandboxes on a local box or a capability system.
- fault-tolerance - a hardware failure on one box should only bring down that box, not every other box connected to it. Upgrading the software of one box at a time should also be possible.
- latency - asking another node (even on a LAN) can easily be several orders of magnitude slower than reading from your own RAM or, even better, your cache.

All these lead to different tradeoffs. You really need to minimize the communication between nodes by pushing them apart, whereas safethread is only concerned about making it easier to write correct programs.

The bottom line is that safethread lets you do the easy stuff (local concurrency) so that you only need to do the hard stuff (distributed programming) when you really need it.

Conclusion

Everyone is welcome to download, try out, and contribute to Adam's patches. Bug reports, code, and emails are all welcome. There are active discussions on the Python 3000 mailing list about all of this, and more suggestions are welcome.

In all, there is a lot of interest in Adam's work. There was plenty of discussion around concurrency, threads, and the GIL at PyCon this year, and with Python 3000 coming down the pipe and the "multicore future" looming, things are getting interesting.

Related Links