YAML ain’t Markup Language | Completely Different

When someone says “pick a markup language,” most people would immediately respond with “XML!”, but there’s an alternative out there. YAML is human-readable, easy to use, and overall quite fantastic.

This is a reprint of an article I wrote for Python Magazine as a Completely Different column that was published in the December 2008 issue. I have republished this in its original form, bugs and all


I hate markup languages. There, I said it. The first time I had the pleasure of “using” (being abused by) XML, I said to myself “there has got to be a better way of doing this.” Well, after years of sticking with plain text ini files and custom syntaxes based off of using ”eval()”, I’ve come to not only use, but love, YAML.

YAML, or “YAML Ain’t Markup Language”, is “a human friendly data serialization standard for all programming languages.” It has the advantage of leaning towards dynamic languages a la Python, Ruby, etc.

It is important to note that friendliness and readability are very core to the design of YAML. The number of format characters is very low and, like Python, YAML’s markup can use whitespace to indicate scoping of items. Tabs are not allowed, so there is no chance for confusion about indention level. Additionally, the constructs within YAML such as mappings, sequences, and scalars all mesh nicely with existing Python data types like dictionaries, lists, strings, and integers. It’s also fully unicode-enabled, which should make happy a lot of people who are normally worried about UTF-8.

What really attracted me to YAML are some of the key things that drew me to Python: cleanliness and approachability. Too often, I’ve had to deal with monstrous XML files for data passing or — worse yet — configuration and sometimes ini-style configuration files that simply don’t scale, or communicate enough information. So far, I’ve used YAML in about six different projects with great success and found that it scales quite well while staying human-readable.

Syntax is Key

YAML, on its face, is amazingly simple. Take the code below, for example. Run through the pyyaml load function (more on PyYAML in a moment):

 # YAML
name: Jesse

This YAML will get the following Python dictionary:

?View Code PYTHON
1
2
3
4
5
6
7
>>> import yaml
>>> yaml.load("""
...  # YAML
... name: Jesse
... """)
{'name': 'Jesse'}
>>>

This is a simple example. Line 1 of the YAML file, or document, is a simple comment. Note that there is a space character right before that # sign. The next line is a simple key value pair which, after being parsed, gets returned to us in a Python dictionary. Simple as pie!

A simple name-value pair is easy to do. Here is a document with some additional structures and details to try:

 # YAML
object:
    attributes:
        - attr1
        - attr2
        - attr3
    methods: [ getter, setter ]

Here, we have defined a top-level entity named “object”. This object has two block mappings related to it, ”attributes” and ”methods”. The ”attributes” mapping uses the more verbose YAML syntax for a list, in this case:

attributes:
    - attr1
    - attr2
    - attr3

In this case the YAML represents a key with a name of ”attributes” while each item underneath it, prefaced with a “”-””, represents an item that will appear in a list as a value for that key. Here it is printed after a load:

?View Code PYTHON
1
{'object': {'attributes': ['attr1', 'attr2', 'attr3'], ...

The ”methods” key uses YAML shorthand to accomplish the same thing. In my experience, non-programmers tend to understand the first method, “”-”” prefacing, a bit more than the second method. Both parse to Python lists:

?View Code PYTHON
1
2
3
{'object': {'attributes': ['attr1', 'attr2',
                           'attr3'],
            'methods': ['getter', 'setter']}}

I included both examples to illustrate a point. Most of YAML’s syntax has two ways of achieving the same intended goal. There is the verbose, multi-line method, and the more compact method. Both methods are human-readable, so choosing one is a matter of personal preference.

As you can see, the most basic syntax is as follows:

dicts/hashes: key, value separated by a colon and space, e.g. ”key: value”; additionally, you can use ”{key: value}”

lists: dash followed by a space then the item, e.g. ”- item”; additionally, you can use ”[item, item, item]”

Strings do not require quotation. You can preserve line breaks with the ”|” character; for example:

 # YAML
sonnet: |
    I wish I could
    write a poem
    but I can't

This would parse to:

?View Code PYTHON
1
{'sonnet': "I wish I could\nwrite a poem\nbut I can't\n"}

Trailing and preceding whitespace is trimmed out in the basic use case of ”|”. See the “Scalar indicators” section of the compact cheat sheet for modifiers to the ”|” character.

Core to YAML is the concept of documents. A document is not just a separate file in this case. Instead, think of a document as just a chunk of YAML. You can have multiple documents in a single stream of YAML, if each one is separated by ”—”, like:

 # YAML
---
document: this is doc 1
---
document: this is doc 2
...

Using an ellipsis explicitly ends a document. The nice thing about documents is you can treat them as different entities. Let’s say, “people” and “cars” are in the same file. You can use them for a bunch of entities that look alike, e.g.:

name: SomeObject
attributes:
    - attr1
    - attr2
    - attr3
methods: [ getter, setter ]
---
name: MyPrettyObject
attributes:
    - attr1
    - attr2
    - attr3
methods: [ getter, setter ]

which parses to:

?View Code PYTHON
1
2
3
4
5
6
{'attributes': ['attr1', 'attr2', 'attr3'],
 'methods': ['getter', 'setter'],
 'name': 'SomeObject'}
{'attributes': ['attr1', 'attr2', 'attr3'],
 'methods': ['getter', 'setter'],
 'name': 'MyPrettyObject'}

YAML also supports variables, or repeated nodes, which at first didn’t click for me. The simplest explanation is that you define something as a variable by preceding it with ”&NAME value” and you can refer to it with ”*NAME” e.g.:

 # YAML
some_thing: &NAME foobar
other_thing: *NAME

Parses to:

?View Code PYTHON
1
{'other_thing': 'foobar', 'some_thing': 'foobar'}

As you can see, the syntax is pretty simple. It’s easy to represent information in a way that is both clear, concise and, well… fun. What’s really cool is the fact it meshes so well with Python!

Note that fans of JSON (JavaScript Object Notation) will quickly realize that the concise-version of the syntax (e.g. using ”[value, value]”) looks a lot like JSON. In fact, for the most part, JSON is a subset of YAML syntax. With a little bit of additional pre-processing you should be able to pass your JSON off as YAML and vice-versa.

And with that, PyYAML

After reading the basic of the syntax, you’re jazzed to get started with YAML, right? Well, getting started with YAML is only a single ”easy_install” away. The **PyYAML** module is pretty much the de-facto parser and emitter for YAML. The core of the module is written in pure Python, but, as of version 3.0.4, it also supports binding to the high-speed LibYAML implementation written in C.

PyYAML is blindingly simple to use for most cases. To generate all of the output I’ve used in the article so far, all I used was:

?View Code PYTHON
1
2
3
4
import yaml
import pprint
for project in yaml.load_all(open('test.yaml')):
    pprint.pprint(project)

The ”load_all()” function goes back to the “multiple documents within a stream” concept. In the case above I am assuming that there won’t be just a single document. I am using ”yaml.load_all()”, rather than ”load()”, then iterating over the results. ”yaml.load_all()” returns a generator yielding each document in the stream. The ”yaml.load()” function accepts a string (Unicode or otherwise), or an open file object.

For many cases, you’ll be loading a single document. You might use it for configuration loading:

?View Code PYTHON
1
configuration = yaml.load(open('test.yaml').read())

Of course, one of the other aspects to PyYAML is dumping Python data structures to a YAML file. Take, for example, Listing 1:

?View Code PYTHON
1
2
3
4
5
6
7
8
9
10
11
#!/usr/bin/python
 
import yaml
 
mydata = {'person' : 'jesse',
          'hobby' : 'python',
          'employed' : True,
          'limbs': {'arms' : 2, 'legs' : 2},
          'family' : ['wife', 'toddler']}
 
print yaml.dump(mydata)

In this case, I am constructing a dictionary containing all of the data I want to include in the YAML file. Then I simply call ”yaml.dump()” and the output of Listing 1 looks like well-formed YAML:

?View Code PYTHON
1
2
3
4
5
6
$ python Listing1.py
employed: true
family: [wife, toddler]
hobby: python
limbs: {arms: 2, legs: 2}
person: jesse

Additionally, PyYAML includes ”yaml.dump_all()”. It accepts a list of objects to serialize and writes to the target stream. Let’s make Listing 1 handle a series of objects:

?View Code PYTHON
1
2
mydata = [ mydata for i in range(2) ]
print yaml.dump_all(mydata, explicit_start=True)

And our output is fairly obvious:

---
employed: true
family: [wife, toddler]
hobby: python
limbs: {arms: 2, legs: 2}
person: jesse
---
employed: true
family: [wife, toddler]
hobby: python
limbs: {arms: 2, legs: 2}
person: jesse

By default, you don’t need to pass additional arguments to ”yaml.dump()” or ”yaml.dump_all()”, as you can see above. In the ”dump_all()” example, I added the ”explicit_start” argument. The dump functions support this flag, along with some others that you should know about, to control formatting.

The ”explicit_start” argument adds the “—” string prior to the data structure being dumped. This allows you to dump multiple objects/documents to the same stream, say, an open file handle, without worrying about the document separators yourself.

Adding the ”default_flow_style” argument changes the output from the default compact style of output, to the more verbose, “humane” output:

?View Code PYTHON
1
print yaml.dump(mydata, default_flow_style=False)

And the output:

?View Code PYTHON
1
2
3
4
5
6
7
8
9
employed: true
family:
- wife
- toddler
hobby: python
limbs:
  arms: 2
  legs: 2
person: jesse

You can also control indenting, width, and so on. You can also switch it to canonical mode, which explicitly defines the type of the value within the YAML:

?View Code PYTHON
1
print yaml.dump(mydata, canonical=True)

And the matching output:

!!map {
  ? !!str "employed"
  : !!bool "true",
  ? !!str "family"
  : !!seq [
    !!str "wife",
    !!str "toddler",
  ],
  ? !!str "hobby"
  : !!str "python",
  ? !!str "limbs"
  : !!map {
    ? !!str "arms"
    : !!int "2",
    ? !!str "legs"
    : !!int "2",
  },
  ? !!str "person"
  : !!str "jesse",
}

Yes, I just jumped the tracks on that last one. YAML and PyYAML both support explicit type declaration within the YAML documents. This is obviously handy for inter-language data exchange, but, as you can see in the output, is not so good on the side of readability if you’re a non-programmer. On the other hand, it allows for a nice segue!

=h=Turning the Awesome Up=h=

We have covered the basics of YAML and, by extension, PyYAML, but PyYAML offers some additional niceties for Python users. Obviously, these advanced features start to edge out approachability, but they are actually really useful.

In the last example of the last section, we turned on the ”canonical” flag to the ”dump” function, which caused it to spit out explicitly typed YAML. Each type was in the format of

''!!''

. These are standard YAML tags, and they’re fully covered in the spec.

Internally, PyYAML converts these tags to the expected Python types. ”!!null” is ”None”, ”!!timestamp” is ”datetime.datetime”, ”!!seq” is ”list”, and so on. You don’t need to explicitly put these in your YAML documents. In most cases the types are inferred from the document, but being able to explicitly define them is handy.

PyYAML can take the ”!!” syntax a bit further though, and adds a series of Python-specific tags which are exceedingly useful. Each one of the Python-specific tags is prefaced with

''!!python/''

. PyYAML defines explicit Python types such as ”float”, ”complex”, ”list”, ”tuple” and ”dict”. In my opinion, the ”tuple” and the ”integer” ones are more useful simply due to the fact that ”dicts” and ”lists” can be derived from the YAML file itself.

However, PyYAML also offers “non-type” ”!!python” extensions. These are referred to as “Complex Python Tags” and they allow you to add things to your YAML document such as Python modules, packages, class instances, and the output of a method call with a passed-in variable.

Say we wanted to have a YAML file which defined some number of variables, but then passed one or more of them to a given module’s method. I wanted something to list the contents of my home directory on parsing:

 # YAML
directory: &DIRECTORY /Users/jesse
contents: !!python/object/apply:os.listdir [*DIRECTORY]

And the abbreviated output:

?View Code PYTHON
1
2
3
4
{'contents': ['.bash_history',
              '.bash_profile',
              'todo.txt'],
 'directory': '/Users/jesse'}

Virtually any function can be called this way. You can also pass in keyword arguments and other data as required. Calling a function, though, is rather easy. Here’s an example YAML file which uses the PyYAML ”new:module.class” tag to create a ”Queue.Queue” at load-time with a defined max size:

qsize: &SIZE 10
queue: !!python/object/new:Queue.Queue {maxsize: *SIZE}

Which, of course, passes you back the correct class instance:

?View Code PYTHON
1
{'qsize': 10, 'queue': <Queue.Queue instance at 0x292fa8>}

In theory, and in my rather abusive practice, this would allow you to define a very rich configuration which constructed all of the relevant objects at parse-time to significantly alter the behavior of the application (or in my case, test) to which the YAML file was passed. One catch when you are using the ”!!python/object/*” tag(s) is that the objects you are creating must be pickle-compatible.

For example, if you tried this:

 # YAML
threadpool:
 - !!python/object/new:threading.Thread
  target: myapp.myfunction

It would fail with an assertion error:

?View Code PYTHON
1
AssertionError: Thread.__init__() was not called

PyYAML is not calling ”__init__()” when creating the object. Both ”yaml.load()” and ”yaml.dump()” are designed to work exactly like ”pickle.load()” and ”pickle.dump()”. Objects must implement the pickle protocol.

Conclusion

YAML and, by extension, PyYAML, are incredibly useful if you want something easy on the eyes, easy to understand, and easy to use in a markup language. It’s straightforward to customize, it’s cross-language, and fundamentally simple. YAML is popping up in all sorts of places, such as the configuration settings for Google’s AppEngine, and in Django, where it is used for a serialization format and to load data fixtures.

Obviously some of the advanced features of PyYAML are Python-specific, but the fundamentals make it an easy win for cross-language communication. Sure, XML does this, too, and there’s support in every known language for XML parsing (including the stuff toddlers speak), but how readable is XML, seriously?

I do hope more and more people adopt this user-friendly format. It’s simply great as a configuration language, and if you need to expose anything to humans and later serialize and deserialize it, just say “no” to XML.

The revolution will be readable.

Requirements:

Related Links

  • vsapre
    Hi Jesse,

    I have been using YAML for a few projects for my users as configuration files and YAML certainly rocks. Interestingly I came to YAML much the same way you've described your journey.

    Just as a side remark, I've found that we read yaml more than we dump it. And so since its a one shot load (there is no SAX approach to reading a YAML file, as yet), 'syck' seems to be better suited for it...as long as you stay with YAML 1.0. Since 'syck' is C, its fast and loading it using 'syck' certainly shows up as compared to using PyYAML to do the same.

    On the other hand, 'dump' is best done using PyYAML, especially because of its dump flags that can help you restore the exact file as it is.

    Thought of sharing this with you.

    BTW: Thanks for the multiprocessing module and your talks at PyCon 2009 were great !!

    Thanks and best regards,
    Vishal Sapre
  • vsapre
    Hi Jesse,

    I have been using YAML for a few projects for my users as configuration files and YAML certainly rocks. Interestingly I came to YAML much the same way you've described your journey.

    Just as a side remark, I've found that we read yaml more than we dump it. And so since its a one shot load (there is no SAX approach to reading a YAML file, as yet), 'syck' seems to be better suited for it...as long as you stay with YAML 1.0. Since 'syck' is C, its fast and loading it using 'syck' certainly shows up as compared to using PyYAML to do the same.

    On the other hand, 'dump' is best done using PyYAML, especially because of its dump flags that can help you restore the exact file as it is.

    Thought of sharing this with you.

    BTW: Thanks for the multiprocessing module and your talks at PyCon 2009 were great !!

    Thanks and best regards,
    Vishal Sapre
  • Doug Farrell
    Hi Jessie,
    Very nice write-up about Yaml, which I'm using. Based on your article in Python Magazine, I started using the syntax you also show above, creating Python objects from the configuration file. I'm even passing them arguments from the Yaml file as you've also shown above.

    I do have one question for you, I'd like to put all my configuration files together into one big file. Is there a way to get Yaml to only read one section (or one document)? What I'm trying to avoid is having all the objects constructed by programs that don't care about or use those objects when reading their own configuration section.

    Thanks in advance!
    Doug
  • Ryan
    Here's a mostly non-material comment for everyone. With an historical understanding that "YA" at the start of an acronym means "Yet Another" I was a bit taken aback when that wasn't the case here. Then when I saw it was just a recursive acronym, I wondered why "Y" was chosen as the first character, as you could choose from any of the wonderful characters available and still have it work. Then I despised it for choosing one ("Y") that when combined with the next character ("A") already had a recognized meaning. But, like I said, mostly non-material.
  • Doug Napoleone
    oops.. Meant to also say:

    Thank you for such a fantastic writeup on using YAML in python. You would be amazed by the number of research projects and research institutions which use YAML as their lingual franca. There was a point in time when our research department was considering going over to XML based interchange formats (custom internal ones). There was quite some steam behind it, until people started developing the actual specs, and implementations. The last straw was when we wanted to make a small change to one of the formats. When properly implemented you need many, many XML files. YAML is makes even the pains of upgrading older formats easier.
  • Doug Napoleone
    While reading this I had the song 'Momma said knock you out', but with the words "don't call it a markup, it's been here for years..." and so on.. And the image of LL Cool J with a set of gold knuckles spelling out YAML.... I won't bore you with the rest....

blog comments powered by Disqus