Chroot and Python discussion and random pyc thoughts
Since I'm finally back from another exciting edition of almost-labor at the hospital and catching up, I thought I would point out a discussion on python-dev about chroot jails and python. Interesting information and tangentially related to some of my thoughts on the .pyc location stuff.
The conversation is going on here and you can view some other information.
Of course, there's a shout-out to Brett's security work too.
Here is the wiki page referenced in the thread: How can I run an untrusted Python script safely (i.e. Sandbox)
A lot of the utilities are interesting, but I'm still interested in the byte-code location of things.
Some thoughts on my pyc thing:
- One thing to note is that if the user running the Python interpreter does not have write (+w) access to the directory the imported .py is located in, the .pyc/.pyo file is not written. A .pyc file is an optimization for module loading only.
Since this is the case, is worrying about layered filesystems or storing .pyc files in "other" directories really that hot of an issue for me? Maybe. I'd still like to see if I can get the wherewithal to drive PEP 304 forward, because I'd still like to be able to control where things are put. But if you use compileall and ship the .pyo/.pyc files, *or* you just make sure the daemon that's invoking the interpreter does not have +w on its script/binary directory (which it shouldn't), you could be OK.
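To make the "precompile and ship" option concrete, here is a minimal sketch using the standard library's compileall module. The helper names (precompile_tree, directory_is_writable, disable_bytecode_writes) and the throwaway demo module are made up for illustration. Note that current Python 3 puts the compiled files in a __pycache__ subdirectory rather than next to the source, and sys.dont_write_bytecode is a later addition to the interpreter that has the same effect as taking away +w.

```python
import compileall
import os
import sys
import tempfile


def precompile_tree(source_dir):
    """Byte-compile every .py under source_dir ahead of time.

    Returns True when every file compiled cleanly.
    """
    return bool(compileall.compile_dir(source_dir, quiet=1))


def directory_is_writable(path):
    """Rough check for the +w condition described above."""
    return os.access(path, os.W_OK)


def disable_bytecode_writes():
    """Ask the interpreter not to write bytecode files on import at all."""
    sys.dont_write_bytecode = True


# Demo on a throwaway directory (hypothetical module, not a real project).
demo_dir = tempfile.mkdtemp()
demo_source = os.path.join(demo_dir, "hello.py")
with open(demo_source, "w") as handle:
    handle.write("GREETING = 'hi'\n")

compiled_ok = precompile_tree(demo_dir)
writable = directory_is_writable(demo_dir)
```

Run as part of a release step, this means the daemon's user never needs write access to the script directory.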


June 28th, 2007 at 2:12 pm
‘Could’ is the operative word here. This is a huge issue for grid environments where multiple versions of python access the same ‘release’ of python code.
The network traffic just to find out that there are no +w permissions on a directory can be huge for micro-tasks on a grid. If there is +w access (as many times there needs to be for sandbox development), then you have all those writes and cross writes, lock misses, race conditions, etc. This is why SOE (Sony Online Entertainment) has their own custom Python which does not do .pyc or .pyo at all. We do our own custom hacks, but being able to specify the ‘build’ directory at runtime is a feature us grid folks would love.
As a workaround we have our own special Python compile code which ensures the full path to the .py is compiled into the .pyc, then we move the .pyc/.pyo off to a special build directory, and run directly from the .pyc/.pyo’s. This means that the exception stacks are correct, but no .pyc/.pyo building occurred for ‘released’ versions. This can also be done for development, but it means that an extra ‘build’ step is required. There are other extensive hacks done with custom import hooks to reduce the PYTHONPATH searching, which is also network intensive.
Theoretical example: 1000 machines, 8 Python processes each, just one network drive on the Python path (yeah, right), and say only 25 modules. Searching for .so, .pyd, .pyc, .pyw, and .py comes to 1 million network lookups in under a second. Each lookup can actually take up to 12 network operations/transactions, using over 256 bytes each. Total network traffic for just FINDING the .py files (not loading them) in this modest example would be 256 Meg/sec.
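The commenter's actual compile code is not shown, so the following is only a sketch of the same idea using the standard py_compile module: the dfile argument sets the file name recorded in the bytecode (and so shown in tracebacks), and cfile sets where the .pyc is written. The function name compile_to_build_dir, the module grid_task.py, and the path /release/lib/grid_task.py are all hypothetical.

```python
import importlib.machinery
import os
import py_compile
import tempfile


def compile_to_build_dir(source_path, build_dir, released_path):
    """Compile source_path into build_dir, recording released_path
    as the file name that will appear in exception tracebacks."""
    os.makedirs(build_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(source_path))[0]
    target = os.path.join(build_dir, base + ".pyc")
    py_compile.compile(source_path, cfile=target, dfile=released_path,
                       doraise=True)
    return target


# Demo with a throwaway module and made-up paths.
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, "grid_task.py")
with open(source, "w") as handle:
    handle.write("def run():\n    return 42\n")

released = "/release/lib/grid_task.py"  # hypothetical 'released' location
pyc_path = compile_to_build_dir(source, os.path.join(work_dir, "build"),
                                released)

# Load the code object straight from the .pyc, with no .py involved,
# and read back the file name that was compiled into it.
loader = importlib.machinery.SourcelessFileLoader("grid_task", pyc_path)
recorded_name = loader.get_code("grid_task").co_filename
```

The resulting .pyc can also be handed straight to the interpreter (python build/grid_task.pyc), which is the "run directly from the .pyc" part of the workaround.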
Just looking for .py files is 1/4th of your theoretical max gigabit backbone. Yes there are ways around this (some described above), but they are not simple or elegant.
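For anyone who wants to check the arithmetic, here it is worked through as a small script. It uses 256 bits per lookup, following the commenter's own correction at the end of the thread, and assumes every process starts within the same second and that gigabit means 10^9 bits per second; the variable names are just for illustration.

```python
machines = 1000
processes_per_machine = 8
modules = 25
suffixes = (".so", ".pyd", ".pyc", ".pyw", ".py")

# One lookup per process, per module, per candidate file suffix.
lookups = machines * processes_per_machine * modules * len(suffixes)

# Per-lookup cost in bits (the commenter's corrected figure).
bits_per_lookup = 256
traffic_bits = lookups * bits_per_lookup

# Share of a gigabit backbone if all of this happens in one second.
gigabit = 10 ** 9
backbone_share = traffic_bits / gigabit

print(lookups, traffic_bits, backbone_share)
```

That comes out to 1,000,000 lookups and 256,000,000 bits, or roughly a quarter of the backbone, which matches the figures in the comment.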
So yes, this is a ‘hot’ issue for some people :-)
June 28th, 2007 at 2:15 pm
Nice article.
June 28th, 2007 at 2:24 pm
I had not even thought about the grid ramifications of this - I’m in a distributed system, but the cluster is comprised of individual nodes with no shared back end (and even if it *is* shared, it’s fiber to a SAN).
Wanna help with driving 304 forward?
Also, you win the “best comment anywhere ever” award.
June 28th, 2007 at 2:45 pm
Yes I do want to help, but I have some serious backlog I need to resolve first. I don’t want to commit to something until I am sure I can keep that commitment. I will be sending you an e-mail with more details soon.
Thanks for the award, but I don’t think I deserve it :-)
June 28th, 2007 at 4:19 pm
Douglas,
This is really interesting. Is there a chance you could discuss it more on your blog or somewhere public? I’m looking at potentially large grids like that in the future and would definitely be interested in your experience (modulo any NDA).
This is even more interesting considering the large number of projects relying on eggs and setuptools that pollute sys.path with each single egg directory in the path.
Thanks for sharing, anyhow.
June 28th, 2007 at 4:23 pm
This is even more interesting considering the large number of projects relying on eggs and setuptools that pollute sys.path with each single egg directory in the path.
Ugh. Don’t remind me about that. Avoiding putting/installing things in the main system library is one of my goals. I’ve started using a sitecustomize.py file that points to /Users/jesse/python/modules and installing everything I can there.
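For reference, a sitecustomize.py along those lines could look roughly like the sketch below. The real file is not shown here, so the add_module_dir helper and the use of site.addsitedir (which also processes any .pth files in the directory) are assumptions; only the directory name comes from the comment above.

```python
import os
import site
import sys


def add_module_dir(path):
    """Add path to sys.path as a site directory and return the
    absolute form that was added."""
    absolute = os.path.abspath(path)
    site.addsitedir(absolute)
    return absolute


# Directory named in the comment above; skipped quietly when it
# does not exist on this machine.
PERSONAL_MODULES = "/Users/jesse/python/modules"
if os.path.isdir(PERSONAL_MODULES):
    add_module_dir(PERSONAL_MODULES)
```

Python imports a module named sitecustomize automatically at startup if it can find one, so the file only has to live somewhere already on the path.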
June 28th, 2007 at 5:15 pm
Sylvain,
It’s on my list. I am hoping to submit a talk proposal on it for PyCon 2008, but that is so far in the future as to be never.
I will add it to my list of things to blog about. I hope to get pygments integration before then. There are some NDA concerns, but not too many, as long as I stay away from copyrighted code and actual grid configurations. Something geared towards Amazon’s S3 should work well.
NOTE: that should read 256 bits, not 256 bytes, above. The math does not work otherwise.