YAML ain’t Markup Language | Completely Different

April 13th, 2009 § 7 comments

When some­one says “pick a markup lan­guage,” most peo­ple would imme­di­ately respond with “XML!”, but there’s an alter­na­tive out there. YAML is human-readable, easy to use, and over­all quite fantastic.

This is a reprint of an article I wrote for Python Magazine as a Completely Different column that was published in the December 2008 issue. I have republished this in its original form, bugs and all.


I hate markup languages. There, I said it. The first time I had the pleasure of “using” (being abused by) XML, I said to myself “there has got to be a better way of doing this.” Well, after years of sticking with plain-text ini files and custom syntaxes built on “eval()”, I’ve come to not only use, but love, YAML.

YAML, or “YAML Ain’t Markup Lan­guage”, is “a human friendly data seri­al­iza­tion stan­dard for all pro­gram­ming lan­guages.” It has the advan­tage of lean­ing towards dynamic lan­guages a la Python, Ruby, etc.

It is important to note that friendliness and readability are core to the design of YAML. The number of format characters is very low and, like Python, YAML’s markup can use whitespace to indicate scoping of items. Tabs are not allowed for indentation, so there is no chance for confusion about indentation level. Additionally, the constructs within YAML such as mappings, sequences, and scalars all mesh nicely with existing Python data types like dictionaries, lists, strings, and integers. It’s also fully Unicode-enabled, which should please a lot of people who normally worry about UTF-8.

What really attracted me to YAML are some of the key things that drew me to Python: cleanliness and approachability. Too often, I’ve had to deal with monstrous XML files for data passing or, worse yet, configuration, and sometimes with ini-style configuration files that simply don’t scale or communicate enough information. So far, I’ve used YAML in about six different projects with great success and found that it scales quite well while staying human-readable.

Syn­tax is Key

YAML, on its face, is amazingly simple. Take the code below, for example, and run it through the PyYAML load function (more on PyYAML in a moment):

 # YAML
name: Jesse

This YAML produces the following Python dictionary:

>>> import yaml
>>> yaml.load("""
...  # YAML
... name: Jesse
... """)
{'name': 'Jesse'}
>>>

This is a sim­ple exam­ple. Line 1 of the YAML file, or doc­u­ment, is a sim­ple com­ment. Note that there is a space char­ac­ter right before that # sign. The next line is a sim­ple key value pair which, after being parsed, gets returned to us in a Python dic­tio­nary. Sim­ple as pie!
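For readers on a current PyYAML release, here is a minimal sketch of the same load. It assumes a recent PyYAML, where “yaml.load()” expects an explicit loader, so it swaps in “yaml.safe_load()”, which only builds plain Python types. That swap is an assumption on my part, not what the listing above used.

```python
import yaml

# The same two-line document as above: a comment, then one key/value pair.
DOCUMENT = """
 # YAML
name: Jesse
"""

# safe_load() only constructs plain Python types (dict, list, str, int, ...).
data = yaml.safe_load(DOCUMENT)
print(data)
```

The comment line is dropped during parsing, so only the mapping reaches Python.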

A sim­ple name-value pair is easy to do. Here is a doc­u­ment with some addi­tional struc­tures and details to try:

 # YAML
object:
    attributes:
        - attr1
        - attr2
        - attr3
    methods: [ getter, setter ]

Here, we have defined a top-level entity named “object”. This object is a mapping with two keys, “attributes” and “methods”. The “attributes” key uses the more verbose YAML syntax for a list, in this case:

attributes:
    - attr1
    - attr2
    - attr3

In this case the YAML represents a key named “attributes”, while each item underneath it, prefaced with “-”, represents an item that will appear in a list as the value for that key. Here it is printed after a load:

{'object': {'attributes': ['attr1', 'attr2', 'attr3'], ...

The “methods” key uses YAML shorthand to accomplish the same thing. In my experience, non-programmers tend to understand the first method, “-” prefacing, a bit better than the second method. Both parse to Python lists:

{'object': {'attributes': ['attr1', 'attr2',
                           'attr3'],
            'methods': ['getter', 'setter']}}

I included both exam­ples to illus­trate a point. Most of YAML’s syn­tax has two ways of achiev­ing the same intended goal. There is the ver­bose, multi-line method, and the more com­pact method. Both meth­ods are human-readable, so choos­ing one is a mat­ter of per­sonal preference.
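To make the “two ways to say the same thing” point concrete, here is a small sketch that loads one document written in the multi-line style and one written in the compact style, then compares the results. It assumes a recent PyYAML and uses “yaml.safe_load()”; the constant names are made up for illustration.

```python
import yaml

# Verbose, multi-line (block) style.
BLOCK_STYLE = """
object:
    attributes:
        - attr1
        - attr2
    methods:
        - getter
        - setter
"""

# Compact (flow) style carrying the same data.
FLOW_STYLE = """
object: {attributes: [attr1, attr2], methods: [getter, setter]}
"""

block = yaml.safe_load(BLOCK_STYLE)
flow = yaml.safe_load(FLOW_STYLE)

# Both spellings should produce identical Python structures.
print(block == flow)
```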

As you can see, the most basic syn­tax is as follows:

dicts/hashes: key, value sep­a­rated by a colon and space, e.g. ”key: value”; addi­tion­ally, you can use ”{key: value}”

lists: dash fol­lowed by a space then the item, e.g. ”- item”; addi­tion­ally, you can use ”[item, item, item]”

Strings do not require quo­ta­tion. You can pre­serve line breaks with the ”|” char­ac­ter; for example:

 # YAML
sonnet: |
    I wish I could
    write a poem
    but I can't

This would parse to:

{'sonnet': "I wish I could\nwrite a poem\nbut I can't\n"}

In the basic use of “|”, the block’s leading indentation is stripped and the text ends with a single trailing newline, as the output above shows. See the “Scalar indicators” section of the compact cheat sheet for modifiers to the “|” character.
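As a rough illustration of those modifiers, the sketch below loads the same two lines three ways. As I understand the YAML spec, “|-” drops the final newline and “>” folds line breaks into spaces. The key names (“keep”, “strip”, “fold”) are hypothetical, and “yaml.safe_load()” assumes a recent PyYAML.

```python
import yaml

DOCUMENT = """\
keep: |
    I wish I could
    write a poem
strip: |-
    I wish I could
    write a poem
fold: >
    I wish I could
    write a poem
"""

data = yaml.safe_load(DOCUMENT)

# repr() makes the newlines visible so the three styles can be compared.
for key in ("keep", "strip", "fold"):
    print(key, repr(data[key]))
```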

Core to YAML is the concept of documents. A document is not just a separate file in this case. Instead, think of a document as just a chunk of YAML. You can have multiple documents in a single stream of YAML, if each one is separated by “---”, like:

 # YAML
---
document: this is doc 1
---
document: this is doc 2
...

Using an ellipsis (“...”) explicitly ends a document. The nice thing about documents is that you can treat them as different entities: say, “people” and “cars” in the same file. You can also use them for a bunch of entities that look alike, e.g.:

name: SomeObject
attributes:
    - attr1
    - attr2
    - attr3
methods: [ getter, setter ]
---
name: MyPrettyObject
attributes:
    - attr1
    - attr2
    - attr3
methods: [ getter, setter ]

which parses to:

{'attributes': ['attr1', 'attr2', 'attr3'],
 'methods': ['getter', 'setter'],
 'name': 'SomeObject'}
{'attributes': ['attr1', 'attr2', 'attr3'],
 'methods': ['getter', 'setter'],
 'name': 'MyPrettyObject'}
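A minimal sketch of reading such a multi-document stream from a string follows. It assumes a recent PyYAML and uses “yaml.safe_load_all()”; the trimmed-down documents are only there to keep the example short.

```python
import yaml

STREAM = """\
---
name: SomeObject
methods: [getter, setter]
---
name: MyPrettyObject
methods: [getter, setter]
...
"""

# Each "---" starts a new document; the trailing "..." ends the last one.
documents = list(yaml.safe_load_all(STREAM))
names = [doc["name"] for doc in documents]
print(names)
```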

YAML also supports variables, or repeated nodes, which at first didn’t click for me. The simplest explanation is that you define something as a variable (an anchor, in YAML terms) by preceding its value with “&NAME”, and you can refer to it elsewhere with “*NAME” (an alias), e.g.:

 # YAML
some_thing: &NAME foobar
other_thing: *NAME

Parses to:

{'other_thing': 'foobar', 'some_thing': 'foobar'}
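The same mechanism works for whole lists and mappings, not just strings. The sketch below is an assumption-laden illustration: the key names are invented, it uses “yaml.safe_load()” from a recent PyYAML, and the identity check reflects how PyYAML behaves as far as I can tell (every alias resolves to the one object built for the anchor).

```python
import yaml

DOCUMENT = """\
defaults: &DEFAULTS [getter, setter]
some_thing:
    methods: *DEFAULTS
other_thing:
    methods: *DEFAULTS
"""

data = yaml.safe_load(DOCUMENT)
some = data["some_thing"]["methods"]
other = data["other_thing"]["methods"]

print(some == other)  # same contents
print(some is other)  # same list object, so mutating one changes the other
```

If the loaded data will be modified, that sharing is worth keeping in mind; copying the list first avoids surprises.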

As you can see, the syntax is pretty simple. It’s easy to represent information in a way that is clear, concise and, well… fun. What’s really cool is the fact that it meshes so well with Python!

Note that fans of JSON (JavaScript Object Nota­tion) will quickly real­ize that the concise-version of the syn­tax (e.g. using ”[value, value]”) looks a lot like JSON. In fact, for the most part, JSON is a sub­set of YAML syn­tax. With a lit­tle bit of addi­tional pre-processing you should be able to pass your JSON off as YAML and vice-versa.
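As a quick check of the “for the most part” claim, this sketch feeds one simple JSON string to both the standard “json” module and PyYAML and compares the results. It only shows that this particular document loads identically; it is not evidence that every JSON document will. “yaml.safe_load()” assumes a recent PyYAML.

```python
import json
import yaml

JSON_TEXT = (
    '{"name": "Jesse", "employed": true, '
    '"family": ["wife", "toddler"], '
    '"limbs": {"arms": 2, "legs": 2}, "car": null}'
)

from_json = json.loads(JSON_TEXT)
from_yaml = yaml.safe_load(JSON_TEXT)

# true/null map to True/None in both parsers.
print(from_json == from_yaml)
```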

And with that, PyYAML

After reading the basics of the syntax, you’re jazzed to get started with YAML, right? Well, getting started with YAML is only a single “easy_install” away. The PyYAML module is pretty much the de facto parser and emitter for YAML. The core of the module is written in pure Python, but, as of version 3.04, it also supports binding to the high-speed LibYAML implementation written in C.

PyYAML is blind­ingly sim­ple to use for most cases. To gen­er­ate all of the out­put I’ve used in the arti­cle so far, all I used was:

import yaml
import pprint
for project in yaml.load_all(open('test.yaml')):
    pprint.pprint(project)

The ”load_all()” func­tion goes back to the “mul­ti­ple doc­u­ments within a stream” con­cept. In the case above I am assum­ing that there won’t be just a sin­gle doc­u­ment. I am using ”yaml.load_all()”, rather than ”load()”, then iter­at­ing over the results. ”yaml.load_all()” returns a gen­er­a­tor yield­ing each doc­u­ment in the stream. The ”yaml.load()” func­tion accepts a string (Uni­code or oth­er­wise), or an open file object.
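Here is a sketch of the same idea with an explicit file handle. It assumes a recent PyYAML and uses “yaml.safe_load_all()” rather than “load_all()”; the temporary file stands in for “test.yaml” so the example does not depend on a file that may not exist.

```python
import os
import tempfile
import yaml

STREAM = "---\ndocument: this is doc 1\n---\ndocument: this is doc 2\n"

# Write the stream to a throwaway file so there is a real file object to read.
with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as handle:
    handle.write(STREAM)
    path = handle.name

try:
    with open(path) as stream:
        # safe_load_all() returns a generator, so the file must stay open
        # while we iterate over the documents.
        loaded = [doc["document"] for doc in yaml.safe_load_all(stream)]
finally:
    os.remove(path)

print(loaded)
```

Because the result is a generator, consuming it inside the “with” block matters: once the file is closed there is nothing left to parse.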

For many cases, you’ll be load­ing a sin­gle doc­u­ment. You might use it for con­fig­u­ra­tion loading:

configuration = yaml.load(open('test.yaml').read())

Of course, one of the other aspects to PyYAML is dump­ing Python data struc­tures to a YAML file. Take, for exam­ple, List­ing 1:

#!/usr/bin/python
 
import yaml
 
mydata = {'person' : 'jesse',
          'hobby' : 'python',
          'employed' : True,
          'limbs': {'arms' : 2, 'legs' : 2},
          'family' : ['wife', 'toddler']}
 
print(yaml.dump(mydata))

In this case, I am con­struct­ing a dic­tio­nary con­tain­ing all of the data I want to include in the YAML file. Then I sim­ply call ”yaml.dump()” and the out­put of List­ing 1 looks like well-formed YAML:

$ python Listing1.py
employed: true
family: [wife, toddler]
hobby: python
limbs: {arms: 2, legs: 2}
person: jesse
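For anyone running this on Python 3 with a newer PyYAML, here is a hedged variant of Listing 1. As far as I know, newer releases default to the multi-line style, so the sketch passes “default_flow_style=None” explicitly to ask for the compact leaf collections shown above, and it uses “yaml.safe_dump()” instead of “yaml.dump()”. Both choices are assumptions on my part.

```python
import yaml

mydata = {'person': 'jesse',
          'hobby': 'python',
          'employed': True,
          'limbs': {'arms': 2, 'legs': 2},
          'family': ['wife', 'toddler']}

# default_flow_style=None lets PyYAML use flow style only for collections
# whose items are all plain scalars.
text = yaml.safe_dump(mydata, default_flow_style=None)
print(text)

# Loading the text back should give an equal dictionary.
restored = yaml.safe_load(text)
```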

Addi­tion­ally, PyYAML includes ”yaml.dump_all()”. It accepts a list of objects to seri­al­ize and writes to the tar­get stream. Let’s make List­ing 1 han­dle a series of objects:

mydata = [ mydata for i in range(2) ]
print(yaml.dump_all(mydata, explicit_start=True))

And our out­put is fairly obvious:

---
employed: true
family: [wife, toddler]
hobby: python
limbs: {arms: 2, legs: 2}
person: jesse
---
employed: true
family: [wife, toddler]
hobby: python
limbs: {arms: 2, legs: 2}
person: jesse
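A small round-trip sketch of the multi-document dump follows. It assumes a recent PyYAML and uses the “safe_” variants; the two tiny documents are invented for the example.

```python
import yaml

documents = [{'person': 'jesse', 'hobby': 'python'},
             {'person': 'toddler', 'hobby': 'blocks'}]

# explicit_start=True writes a "---" line before each document.
text = yaml.safe_dump_all(documents, explicit_start=True,
                          default_flow_style=False)
print(text)

# Reading the stream back should give the same list of documents.
restored = list(yaml.safe_load_all(text))
separators = sum(1 for line in text.splitlines() if line == '---')
```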

By default, you don’t need to pass addi­tional argu­ments to ”yaml.dump()” or ”yaml.dump_all()”, as you can see above. In the ”dump_all()” exam­ple, I added the ”explicit_start” argu­ment. The dump func­tions sup­port this flag, along with some oth­ers that you should know about, to con­trol formatting.

The “explicit_start” argument adds the “---” string prior to the data structure being dumped. This allows you to dump multiple objects/documents to the same stream, say, an open file handle, without worrying about the document separators yourself.

Setting the “default_flow_style” argument to “False” changes the output from the default compact style to the more verbose, “humane” output:

print(yaml.dump(mydata, default_flow_style=False))

And the output:

employed: true
family:
- wife
- toddler
hobby: python
limbs:
  arms: 2
  legs: 2
person: jesse
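The formatting flags can be compared side by side. This sketch assumes a recent PyYAML, uses “yaml.safe_dump()”, and passes “default_flow_style” explicitly in both calls so the result does not depend on the installed version’s default. The “indent” keyword is the one I believe controls nested mapping indentation.

```python
import yaml

mydata = {'family': ['wife', 'toddler'],
          'limbs': {'arms': 2, 'legs': 2}}

# Verbose block style, with nested mappings indented by four spaces.
block = yaml.safe_dump(mydata, default_flow_style=False, indent=4)

# Everything in compact flow style on one line.
flow = yaml.safe_dump(mydata, default_flow_style=True)

print(block)
print(flow)
```

Both strings load back to the same dictionary, so the choice is purely about who has to read the file.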

You can also con­trol indent­ing, width, and so on. You can also switch it to canon­i­cal mode, which explic­itly defines the type of the value within the YAML:

print(yaml.dump(mydata, canonical=True))

And the match­ing output:

!!map {
  ? !!str "employed"
  : !!bool "true",
  ? !!str "family"
  : !!seq [
    !!str "wife",
    !!str "toddler",
  ],
  ? !!str "hobby"
  : !!str "python",
  ? !!str "limbs"
  : !!map {
    ? !!str "arms"
    : !!int "2",
    ? !!str "legs"
    : !!int "2",
  },
  ? !!str "person"
  : !!str "jesse",
}

Yes, I just jumped the tracks on that last one. YAML and PyYAML both sup­port explicit type dec­la­ra­tion within the YAML doc­u­ments. This is obvi­ously handy for inter-language data exchange, but, as you can see in the out­put, is not so good on the side of read­abil­ity if you’re a non-programmer. On the other hand, it allows for a nice segue!

Turning the Awesome Up

We have cov­ered the basics of YAML and, by exten­sion, PyYAML, but PyYAML offers some addi­tional niceties for Python users. Obvi­ously, these advanced fea­tures start to edge out approach­a­bil­ity, but they are actu­ally really useful.

In the last example of the last section, we turned on the “canonical” flag to the “dump” function, which caused it to spit out explicitly typed YAML. Each type was written with a “!!” prefix, such as “!!str” or “!!map”. These are standard YAML tags, and they’re fully covered in the spec.

Inter­nally, PyYAML con­verts these tags to the expected Python types. ”!!null” is ”None”, ”!!time­stamp” is ”datetime.datetime”, ”!!seq” is ”list”, and so on. You don’t need to explic­itly put these in your YAML doc­u­ments. In most cases the types are inferred from the doc­u­ment, but being able to explic­itly define them is handy.
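To see inferred and explicit types next to each other, here is a short sketch. The key names are hypothetical and “yaml.safe_load()” assumes a recent PyYAML; the standard “!!” tags are, to my knowledge, all handled by the safe loader.

```python
import datetime
import yaml

DOCUMENT = """\
inferred_int: 123
forced_str: !!str 123
forced_float: !!float 1
nothing: !!null ~
released: 2008-12-01
"""

data = yaml.safe_load(DOCUMENT)

# Show each value with the Python type PyYAML chose for it.
for key, value in data.items():
    print(key, repr(value), type(value).__name__)
```

Note that a bare date with no time part comes back as a “datetime.date” rather than a “datetime.datetime”.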

PyYAML can take the “!!” syntax a bit further though, and adds a series of Python-specific tags which are exceedingly useful. Each one of the Python-specific tags is prefaced with “!!python/”. PyYAML defines explicit Python types such as “float”, “complex”, “list”, “tuple” and “dict”. In my opinion, the “tuple” and the “integer” ones are more useful, simply because dicts and lists can be derived from the YAML file itself.
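A minimal sketch of two of these tags follows, assuming a recent PyYAML. The Python-specific tags are not available through “yaml.safe_load()”, so the example passes “Loader=yaml.Loader” explicitly; that loader should only ever see input you trust. The key names are invented.

```python
import yaml

DOCUMENT = """\
point: !!python/tuple [1, 2]
impedance: !!python/complex 1+2j
"""

# yaml.Loader understands the Python-specific tags. Trusted input only.
data = yaml.load(DOCUMENT, Loader=yaml.Loader)
print(data)

# safe_load() deliberately refuses these tags and raises a YAML error.
try:
    yaml.safe_load(DOCUMENT)
    rejected = False
except yaml.YAMLError:
    rejected = True
print(rejected)
```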

How­ever, PyYAML also offers “non-type” ”!!python” exten­sions. These are referred to as “Com­plex Python Tags” and they allow you to add things to your YAML doc­u­ment such as Python mod­ules, pack­ages, class instances, and the out­put of a method call with a passed-in variable.

Say we wanted to have a YAML file which defined some num­ber of vari­ables, but then passed one or more of them to a given module’s method. I wanted some­thing to list the con­tents of my home direc­tory on parsing:

 # YAML
directory: &DIRECTORY /Users/jesse
contents: !!python/object/apply:os.listdir [*DIRECTORY]

And the abbre­vi­ated output:

{'contents': ['.bash_history',
              '.bash_profile',
              'todo.txt'],
 'directory': '/Users/jesse'}
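Here is a sketch of the same pattern that does not touch the filesystem: it swaps “os.listdir” for “os.path.basename” so the result is predictable anywhere. It assumes a recent PyYAML, where function-calling tags need an explicit “Loader=yaml.Loader”. Since this loader runs whatever callable the document names, it is only suitable for files you wrote yourself.

```python
import yaml

DOCUMENT = """\
directory: &DIRECTORY /Users/jesse
leaf: !!python/object/apply:os.path.basename [*DIRECTORY]
"""

# The sequence after the tag becomes the positional arguments of the call,
# so this evaluates os.path.basename('/Users/jesse') at load time.
data = yaml.load(DOCUMENT, Loader=yaml.Loader)
print(data)
```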

Vir­tu­ally any func­tion can be called this way. You can also pass in key­word argu­ments and other data as required. Call­ing a func­tion, though, is rather easy. Here’s an exam­ple YAML file which uses the PyYAML ”new:module.class” tag to cre­ate a ”Queue.Queue” at load-time with a defined max size:

qsize: &SIZE 10
queue: !!python/object/new:Queue.Queue {maxsize: *SIZE}

Which, of course, passes you back the cor­rect class instance:

{'qsize': 10, 'queue': <Queue.Queue instance at 0x292fa8>}
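For Python 3, where the module is named “queue”, a hedged variant is sketched below. As I read the PyYAML documentation, arguments for these tags go under “args” and “kwds” keys, and the “object/apply” form calls the class so that its constructor runs. This is my reading rather than a restatement of the example above, and it needs “Loader=yaml.Loader”, so trusted input only.

```python
import queue
import yaml

DOCUMENT = """\
qsize: &SIZE 10
queue: !!python/object/apply:queue.Queue
    kwds: {maxsize: *SIZE}
"""

# object/apply calls queue.Queue(maxsize=10) while the document is loaded.
data = yaml.load(DOCUMENT, Loader=yaml.Loader)
work_queue = data["queue"]
print(type(work_queue).__name__, work_queue.maxsize)
```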

In the­ory, and in my rather abu­sive prac­tice, this would allow you to define a very rich con­fig­u­ra­tion which con­structed all of the rel­e­vant objects at parse-time to sig­nif­i­cantly alter the behav­ior of the appli­ca­tion (or in my case, test) to which the YAML file was passed. One catch when you are using the ”!!python/object/*” tag(s) is that the objects you are cre­at­ing must be pickle-compatible.

For exam­ple, if you tried this:

 # YAML
threadpool:
 - !!python/object/new:threading.Thread
  target: myapp.myfunction

It would fail with an asser­tion error:

AssertionError: Thread.__init__() was not called

PyYAML is not call­ing ”__init__()” when cre­at­ing the object. Both ”yaml.load()” and ”yaml.dump()” are designed to work exactly like ”pickle.load()” and ”pickle.dump()”. Objects must imple­ment the pickle protocol.
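The difference can be seen with a simpler standard-library class. This sketch uses “textwrap.TextWrapper” purely as a stand-in, because its “__init__()” sets a public “width” attribute. It assumes a recent PyYAML with “Loader=yaml.Loader” (trusted input only), and the key names are invented.

```python
import yaml

DOCUMENT = """\
skipped_init: !!python/object/new:textwrap.TextWrapper []
ran_init: !!python/object/apply:textwrap.TextWrapper
    kwds: {width: 40}
"""

data = yaml.load(DOCUMENT, Loader=yaml.Loader)

# object/new only allocates the instance, so attributes set in __init__()
# are missing.
print(hasattr(data["skipped_init"], "width"))

# object/apply calls the class, so __init__() runs with the given keywords.
print(data["ran_init"].width)
```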

Con­clu­sion

YAML and, by exten­sion, PyYAML, are incred­i­bly use­ful if you want some­thing easy on the eyes, easy to under­stand, and easy to use in a markup lan­guage. It’s straight­for­ward to cus­tomize, it’s cross-language, and fun­da­men­tally sim­ple. YAML is pop­ping up in all sorts of places, such as the con­fig­u­ra­tion set­tings for Google’s AppEngine, and in Django, where it is used for a seri­al­iza­tion for­mat and to load data fixtures.

Obvi­ously some of the advanced fea­tures of PyYAML are Python-specific, but the fun­da­men­tals make it an easy win for cross-language com­mu­ni­ca­tion. Sure, XML does this, too, and there’s sup­port in every known lan­guage for XML pars­ing (includ­ing the stuff tod­dlers speak), but how read­able is XML, seriously?

I do hope more and more peo­ple adopt this user-friendly for­mat. It’s sim­ply great as a con­fig­u­ra­tion lan­guage, and if you need to expose any­thing to humans and later seri­al­ize and dese­ri­al­ize it, just say “no” to XML.

The rev­o­lu­tion will be readable.


  • Doug Napoleone

    While read­ing this I had the song ‘Momma said knock you out’, but with the words “don’t call it a markup, it’s been here for years…” and so on.. And the image of LL Cool J with a set of gold knuck­les spelling out YAML.… I won’t bore you with the rest.…

  • Doug Napoleone

    oops.. Meant to also say:

    Thank you for such a fantastic writeup on using YAML in Python. You would be amazed by the number of research projects and research institutions which use YAML as their lingua franca. There was a point in time when our research department was considering going over to XML-based interchange formats (custom internal ones). There was quite some steam behind it, until people started developing the actual specs and implementations. The last straw was when we wanted to make a small change to one of the formats. When properly implemented you need many, many XML files. YAML makes even the pains of upgrading older formats easier.

  • Ryan

    Here’s a mostly non-material com­ment for every­one. With an his­tor­i­cal under­stand­ing that “YA” at the start of an acronym means “Yet Another” I was a bit taken aback when that wasn’t the case here. Then when I saw it was just a recur­sive acronym, I won­dered why “Y” was cho­sen as the first char­ac­ter, as you could choose from any of the won­der­ful char­ac­ters avail­able and still have it work. Then I despised it for choos­ing one (“Y”) that when com­bined with the next char­ac­ter (“A”) already had a rec­og­nized mean­ing. But, like I said, mostly non-material.

  • Doug Far­rell

    Hi Jessie,
    Very nice write-up about Yaml, which I’m using. Based on your arti­cle in Python Mag­a­zine, I started using the syn­tax you also show above, cre­at­ing Python objects from the con­fig­u­ra­tion file. I’m even pass­ing them argu­ments from the Yaml file as you’ve also shown above.

    I do have one ques­tion for you, I’d like to put all my con­fig­u­ra­tion files together into one big file. Is there a way to get Yaml to only read one sec­tion (or one doc­u­ment)? What I’m try­ing to avoid is hav­ing all the objects con­structed by pro­grams that don’t care about or use those objects when read­ing their own con­fig­u­ra­tion section.

    Thanks in advance!
    Doug

  • vsapre

    Hi Jesse,

    I have been using YAML for a few projects for my users as con­fig­u­ra­tion files and YAML cer­tainly rocks. Inter­est­ingly I came to YAML much the same way you’ve described your journey.

    Just as a side remark, I’ve found that we read yaml more than we dump it. And so since its a one shot load (there is no SAX approach to read­ing a YAML file, as yet), ‘syck’ seems to be bet­ter suited for it…as long as you stay with YAML 1.0. Since ‘syck’ is C, its fast and load­ing it using ‘syck’ cer­tainly shows up as com­pared to using PyYAML to do the same.

    On the other hand, ‘dump’ is best done using PyYAML, espe­cially because of its dump flags that can help you restore the exact file as it is.

    Thought of shar­ing this with you.

    BTW: Thanks for the mul­ti­pro­cess­ing mod­ule and your talks at PyCon 2009 were great !!

    Thanks and best regards,
    Vishal Sapre



You are currently reading YAML ain’t Markup Language | Completely Different at jessenoller.com.
