February 7, 2012

Python Memoize Decorator with TTL Argument

Filed under: blogging,linux,python — jonEbird @ 8:51 pm

I have been working on a troubleshooting effort on and off over the past couple of weeks. In the process, I ended up writing a script to query and report on unread buffered data for sockets across the system, and today I wanted to correlate the sockets to a particular set of processes. That turned out to be not so bad, knowing I could look down /proc/<pid>/fd/ for a list of the file descriptors the particular process currently has open. Here is my winning Python implementation.

import os

def get_pid_socket_nodes(pid):
    r"""Return a list of socket device numbers for the given process ID (pid)
    pid may be an integer or a string
    Akin to: ls -l /proc/<pid>/fd/ | sed -n 's/^.*socket:\[\([0-9]*\)\]$/\1/p'
    """
    nodes = []
    for fd in os.listdir('/proc/%s/fd' % pid):
        link = os.readlink('/proc/%s/fd/%s' % (pid, fd))
        if link.startswith('socket:['):
            # keep only the number between the brackets, as the sed expression does
            nodes.append(link[len('socket:['):-1])
    return nodes

What I haven’t told you is that the overall program may be scanning /proc/net/{tcp,udp} a lot, and with each scan I’ll want to correlate the values found with a particular process’s sockets. That means a straightforward implementation could mean calling my helper function get_socket_devices() at each interval for each process of interest. Also, the interval is currently controlled by a sleep interval specified by the caller. If the caller wants to monitor for socket information at a 0.1s sleep interval for five processes, I’ll be scanning /proc/<pid>/fd/ entries 50 times a second. Sure, it won’t crash the machine, but it’s certainly a waste of resources, assuming the processes you are interested in aren’t closing and opening sockets at an alarming rate.
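As a rough sketch of that correlation step (this is not the actual script, which is not shown here): each data row of /proc/net/tcp carries the socket's inode and its receive-queue size, so the unread byte count can be keyed by inode and matched against the numbers found under /proc/<pid>/fd/. The helper names and the sample row below are made up for illustration, and the column positions assume the usual Linux layout.

```python
def parse_proc_net_row(row):
    """Return (inode, rx_queue_bytes) for one data row of /proc/net/tcp or udp.

    Assumes the usual Linux layout: field 4 is 'tx_queue:rx_queue' in hex
    and field 9 is the socket inode in decimal. The inode is kept as a
    string so it can be compared with the text inside 'socket:[...]'.
    """
    fields = row.split()
    tx_hex, rx_hex = fields[4].split(':')
    return fields[9], int(rx_hex, 16)


def unread_by_inode(rows):
    """Map inode -> unread bytes, skipping the header row and blank lines."""
    result = {}
    for row in rows:
        if not row.strip() or row.lstrip().startswith('sl'):
            continue
        inode, rx_queue = parse_proc_net_row(row)
        result[inode] = rx_queue
    return result


# Hypothetical sample row, shaped like a line from /proc/net/tcp.
SAMPLE = ("   1: 0100007F:0019 0100007F:D2A4 01 00000000:0000002A "
          "00:00000000 00000000  1000        0 48213 1 0000000000000000 20 4 30 10 -1")
```

With a mapping like this in hand, the per-process question becomes a simple lookup of each inode returned for a pid, which is why the /proc/<pid>/fd/ scan is the part worth caching.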

I wanted to reduce the amount of /proc/<pid>/fd/ scans but I also did not want to clutter the code. “Ah ha, what I want is a memoize decorator”, I told myself, but I want to be able to specify a time-to-live (ttl) value to my decorator. Admittedly, I have only ever needed to write decorators without arguments, so this was a first for me. I also don’t write enough decorators to be able to write one correctly without looking at a sample implementation for a refresher. My implementation is basically a merge between Memoize and Cached Properties from the Python Decorator Library wiki page. (I also spent some time re-reading Bruce Eckel’s Decorator Arguments writeup as well as Elf Sternberg’s Decorators With Arguments writeup.)

import time

class memoized_ttl(object):
    """Decorator that caches a function's return value each time it is called within a TTL
    If called within the TTL and the same arguments, the cached value is returned,
    If called outside the TTL or a different value, a fresh value is returned.
    """
    def __init__(self, ttl):
        self.cache = {}
        self.ttl = ttl
    def __call__(self, f):
        def wrapped_f(*args):
            now = time.time()
            try:
                value, last_update = self.cache[args]
                # a ttl of zero or less means the cached value never expires
                if self.ttl > 0 and now - last_update > self.ttl:
                    raise AttributeError
                #print('DEBUG: cached value')
                return value
            except (KeyError, AttributeError):
                value = f(*args)
                self.cache[args] = (value, now)
                #print('DEBUG: fresh value')
                return value
            except TypeError:
                # uncachable -- for instance, passing a list as an argument.
                # Better to not cache than to blow up entirely.
                return f(*args)
        return wrapped_f

Now let’s put the decorator into use and test it. (My testing is uncommenting the “print DEBUG” lines from the memoized_ttl decorator.) To use the decorator, you use the standard decorator calling syntax and specify your desired ttl. I am restricting the /proc/<pid>/fd/ scans to a minute interval which, based on my experience, will drop the load of the script down to below the radar.

@memoized_ttl(60)
def get_pid_socket_nodes(pid):
    r"""Return a list of socket device numbers for the given process ID (pid)
    pid may be an integer or a string
    Akin to: ls -l /proc/<pid>/fd/ | sed -n 's/^.*socket:\[\([0-9]*\)\]$/\1/p'
    """
    nodes = []
    for fd in os.listdir('/proc/%s/fd' % pid):
        link = os.readlink('/proc/%s/fd/%s' % (pid, fd))
        if link.startswith('socket:['):
            # keep only the number between the brackets, as the sed expression does
            nodes.append(link[len('socket:['):-1])
    return nodes

And finally, here is what it looks like looking at a couple of processes of mine with open sockets:

>>> from buffered_sockets import *
>>> pid = 23283
>>> pid2 = 23279
>>> get_socket_devices(pid)
DEBUG: fresh value
>>> get_socket_devices(pid)
DEBUG: cached value
>>> get_socket_devices(pid2)
DEBUG: fresh value
>>> get_socket_devices(pid2)
DEBUG: cached value
>>> time.sleep(60)
>>> get_socket_devices(pid)
DEBUG: fresh value
>>> get_socket_devices(pid)
DEBUG: cached value
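
If you would rather not wait out a real minute to see the expiry, one option is a variant of the decorator that accepts its time source as a parameter. The following is only a sketch under that assumption: the names ttl_memoize, clock, lookup and fake_now are illustrative and not part of the code above, and unlike memoized_ttl it does not treat a ttl of zero as "cache forever".

```python
import functools
import time


def ttl_memoize(ttl, clock=time.time):
    """Cache results per positional-argument tuple for `ttl` seconds.

    `clock` is an illustrative extra that lets a test supply a fake
    time source instead of sleeping.
    """
    def decorator(func):
        cache = {}

        @functools.wraps(func)
        def wrapper(*args):
            now = clock()
            try:
                value, stamp = cache[args]
            except KeyError:
                pass                      # never computed for these args
            except TypeError:
                return func(*args)        # unhashable args: skip caching
            else:
                if now - stamp <= ttl:
                    return value          # still fresh
            value = func(*args)
            cache[args] = (value, now)
            return value
        return wrapper
    return decorator


# Demonstration with a hand-cranked clock and a call counter.
fake_now = [0.0]
calls = []


@ttl_memoize(60, clock=lambda: fake_now[0])
def lookup(pid):
    calls.append(pid)
    return 'sockets-of-%s' % pid


lookup(1)
lookup(1)                     # served from the cache, lookup() body not run
fake_now[0] = 61.0
lookup(1)                     # TTL expired, so the body runs again
```

After those three calls the counter holds two entries, mirroring the fresh, cached, fresh pattern in the session above.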

A handy decorator to keep around which I suspect will get a lot of mileage from other scripts I end up authoring. Aside from the helpful blog posts listed above, I also found Will McGugan’s Timed Caching Decorator page after writing this entire blog post. Apparently I need to improve my google searching skills because I certainly didn’t want to re-invent what others have already completed. On the other hand, if you spend this much time creating something for the first time, you tend to remember the lessons better. I’ll take that.

December 23, 2011

Installing Pithos on Fedora within a Virtualenv

Filed under: adminstration,blogging,linux,python,usability — jonEbird @ 12:40 pm

I listen to a lot of music while at home. I am a Pandora user and have been very happy with my Pandora One subscription for over two years now. The machine used for playing my music is what I call my “media PC”. It is called that because this machine sits in my entertainment stand and is connected to my Sony receiver via HDMI, making the multimedia experience as good as I can get. If you put those two facts together, you can see that I am staring at my desktop a lot, and I thought it would be nice to integrate my TV into the rest of the decor of the house. I primarily do that by being very selective in finding desktop pictures and generally clearing off the desktop of any clutter. Think of the large 47″ LCD television as one big painting for the living room.

Which leads me to my one, sole problem with Pandora: I like to look up and read the Artist and Title of the track being played, but I don’t want the browser to also consume my visual space. (I also don’t want to mess around with Adobe Air for the desktop version of Pandora.) Enter Pithos. By this point, I should point out that my media PC is running Fedora Core 15 and I’m a Gnome user (let’s not talk about Gnome3). That is important because Pithos was written for Gnome users.

Pithos is great. It has a simple UI design, still allows for normal Pandora song control, offers an easy drop-down for my stations, and can still star (thumbs up) songs, all the while being small and unobtrusive. And now we are to the subject of this blog post: Installing Pithos on a Fedora Core machine.

This installation guide will follow my other guides in the same “copy & paste” format. That is, below you should be able to simply open a shell, copy the block of shell code, paste it into your terminal and be ready to launch Pithos. The one configurable item I left in there is whether or not you’d like to install Pithos within a virtualenv. I won’t go into detail about what virtualenv is for this discussion, but suffice it to say that you’d choose it if you want to install Pithos in an alternative path that you own instead of /usr/local/bin/. Below, when you copy & paste the instructions to install Pithos, you can simply leave out the variable "I_LOVE_VIRTUALENV" or change the value to anything but “yes” to install the “normal” way. I choose to install via virtualenv to 1. keep my system site-packages clean and 2. also keep /usr/local uncluttered. When I do this, I mostly only have to worry about backing up my home directory between rebuilds.

Again: If you’d like to use virtualenv, keep the "I_LOVE_VIRTUALENV" variable set to “yes”.
Furthermore, using virtualenv you can control the env path via setting the VIRTUALENV variable. Some people have a separate directory for their virtualenv’s. E.g. VIRTUALENV=virtualenvs/pithos
(Copy and paste away!)

# Keep this variable to install within a virtualenv.
#   otherwise, skip this line or change from "yes" to anything else.
I_LOVE_VIRTUALENV="yes"
VIRTUALENV="" # Set this to control where your virtualenv is created
# --- Rest is pure copy & paste gold ---
sudo yum -y install python pyxdg pygobject2 \
  gstreamer-python notify-python pygtk2 dbus-python \
  gstreamer-plugins-good gstreamer-plugins-bad \
  bzr python-virtualenv
# FYI, those last two are not direct requirements but tools to complete this
cd; bzr branch lp:pithos pithos
if [ "${I_LOVE_VIRTUALENV}" == "yes" ]; then
  virtualenv ${VIRTUALENV:-pithos_venv}
  source ${VIRTUALENV:-pithos_venv}/bin/activate
  # The money shot... fingers crossed
  cd pithos; python setup.py install
else
  cd pithos; sudo python setup.py install --prefix=/usr/local
fi

And there you have it. A clean, aesthetically pleasing music experience. Enjoy.
Desktop Shot with Pithos

November 15, 2010

Socket Option Defaults

Filed under: linux,python — jonEbird @ 10:02 pm

Working closely with the operating system, as an engineer or administrator, you often get odd questions about what particular OS settings were used. Oftentimes, the oddest questions come from application owners who don’t have a solid handle on their app and are looking for excuses for why their application is misbehaving. Naturally, their knowledge of the operating system is equally lacking, if not more so.

Today’s question: What is the default OS setting for the SO_LINGER socket option?

I started off by explaining that there are no operating system configuration files where you go and adjust default socket option values, and that if they were concerned with how the specific SO_LINGER option was being used, they needed to keep their focus on the application. Should they be concerned with particular values being set, it’s going to be the application applying that setting via the setsockopt() system call. Their application is Java based, running on top of an application server, so there are several layers of abstraction involved here. I do not mean to piss off any Java developers here, but more often than not they are not intimate with the lower level interactions of their JVMs within the OS.

Having adequately quelled that line of questioning, in terms of troubleshooting their application, I started to think why not go ahead and produce the values for all of the socket options? How about a python script for the answer?

#!/usr/bin/env python

import socket

s = socket.socket()
# sorted by option number, which is the order of the output below
socket_options = sorted([ (getattr(socket, opt), opt) for opt in dir(socket) if opt.startswith('SO_') ])
for num, opt in socket_options:
    try:
        val = s.getsockopt(socket.SOL_SOCKET, num)
        print('%s(%d) defaults to %d' % (opt, num, val))
    except socket.error as e:
        print('%s(%d) can\'t help you out there: %s' % (opt, num, str(e)))

Running that on my Fedora Core 13 build, I get:

$ ./
SO_DEBUG(1) defaults to 0
SO_REUSEADDR(2) defaults to 0
SO_TYPE(3) defaults to 1
SO_ERROR(4) defaults to 0
SO_DONTROUTE(5) defaults to 0
SO_BROADCAST(6) defaults to 0
SO_SNDBUF(7) defaults to 16384
SO_RCVBUF(8) defaults to 87380
SO_KEEPALIVE(9) defaults to 0
SO_OOBINLINE(10) defaults to 0
SO_NO_CHECK(11) defaults to 0
SO_PRIORITY(12) defaults to 0
SO_LINGER(13) defaults to 0
SO_BSDCOMPAT(14) defaults to 0
SO_PASSCRED(16) defaults to 0
SO_PEERCRED(17) defaults to 0
SO_RCVLOWAT(18) defaults to 1
SO_SNDLOWAT(19) defaults to 1
SO_RCVTIMEO(20) defaults to 0
SO_SNDTIMEO(21) defaults to 0
SO_SECURITY_AUTHENTICATION(22) can't help you out there: [Errno 92] Protocol not available
SO_SECURITY_ENCRYPTION_TRANSPORT(23) can't help you out there: [Errno 92] Protocol not available
SO_SECURITY_ENCRYPTION_NETWORK(24) can't help you out there: [Errno 92] Protocol not available
SO_BINDTODEVICE(25) can't help you out there: [Errno 92] Protocol not available
SO_ATTACH_FILTER(26) can't help you out there: [Errno 92] Protocol not available
SO_DETACH_FILTER(27) can't help you out there: [Errno 92] Protocol not available
SO_PEERNAME(28) can't help you out there: [Errno 107] Transport endpoint is not connected
SO_TIMESTAMP(29) defaults to 0
SO_ACCEPTCONN(30) defaults to 0
SO_PEERSEC(31) can't help you out there: [Errno 34] Numerical result out of range
SO_SNDBUFFORCE(32) can't help you out there: [Errno 92] Protocol not available
SO_RCVBUFFORCE(33) can't help you out there: [Errno 92] Protocol not available
SO_PASSSEC(34) defaults to 0
SO_TIMESTAMPNS(35) defaults to 0

And I like to show the actual system calls being performed since I didn’t write the program in C.

$ strace -vall -f ./ 2>&1 | egrep '^(socket|getsock|setsock)'
getsockopt(3, SOL_SOCKET, SO_DEBUG, [0], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_REUSEADDR, [0], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_TYPE, [1], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_DONTROUTE, [0], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_BROADCAST, [0], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_KEEPALIVE, [0], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_OOBINLINE, [0], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_NO_CHECK, [0], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_PRIORITY, [0], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_LINGER, [0], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_BSDCOMPAT, [0], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_PASSCRED, [0], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_PEERCRED, [0], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_RCVLOWAT, [1], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_SNDLOWAT, [1], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_RCVTIMEO, [0], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_SNDTIMEO, [0], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_SECURITY_AUTHENTICATION, 0xbfd0fb0c, 0xbfd0fb08) = -1 ENOPROTOOPT (Protocol not available)
getsockopt(3, SOL_SOCKET, SO_SECURITY_ENCRYPTION_TRANSPORT, 0xbfd0fb0c, 0xbfd0fb08) = -1 ENOPROTOOPT (Protocol not available)
getsockopt(3, SOL_SOCKET, SO_SECURITY_ENCRYPTION_NETWORK, 0xbfd0fb0c, 0xbfd0fb08) = -1 ENOPROTOOPT (Protocol not available)
getsockopt(3, SOL_SOCKET, SO_BINDTODEVICE, 0xbfd0fb0c, 0xbfd0fb08) = -1 ENOPROTOOPT (Protocol not available)
getsockopt(3, SOL_SOCKET, SO_ATTACH_FILTER, 0xbfd0fb0c, 0xbfd0fb08) = -1 ENOPROTOOPT (Protocol not available)
getsockopt(3, SOL_SOCKET, SO_DETACH_FILTER, 0xbfd0fb0c, 0xbfd0fb08) = -1 ENOPROTOOPT (Protocol not available)
getsockopt(3, SOL_SOCKET, SO_PEERNAME, 0xbfd0fb0c, 0xbfd0fb08) = -1 ENOTCONN (Transport endpoint is not connected)
getsockopt(3, SOL_SOCKET, SO_TIMESTAMP, [0], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_ACCEPTCONN, [0], [4]) = 0
getsockopt(3, SOL_SOCKET, SO_PEERSEC, 0xbfd0fb0c, 0xbfd0fb08) = -1 ERANGE (Numerical result out of range)
getsockopt(3, SOL_SOCKET, 0x20 /* SO_??? */, 0xbfd0fb0c, 0xbfd0fb08) = -1 ENOPROTOOPT (Protocol not available)
getsockopt(3, SOL_SOCKET, 0x21 /* SO_??? */, 0xbfd0fb0c, 0xbfd0fb08) = -1 ENOPROTOOPT (Protocol not available)
getsockopt(3, SOL_SOCKET, 0x22 /* SO_??? */, [0], [4]) = 0
getsockopt(3, SOL_SOCKET, 0x23 /* SO_??? */, [0], [4]) = 0

That’s a whole lot of answers when I just needed to say that the operating system doesn’t automatically apply a SO_LINGER value by default on your newly created sockets, but it was fun.
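
One caveat worth a sketch: SO_LINGER is really a two-field struct linger (l_onoff, l_linger), and reading it as a plain integer only shows the first field. Assuming Linux and the standard socket and struct modules, something like the following reads both fields and shows that a non-default value only appears once the application asks for it. The helper name linger_setting and the 5-second value are made up for the example.

```python
import socket
import struct


def linger_setting(sock):
    """Return (l_onoff, l_linger) as stored in the socket's struct linger."""
    size = struct.calcsize('ii')           # two C ints: l_onoff, l_linger
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, size)
    return struct.unpack('ii', raw)


sock = socket.socket()
default_linger = linger_setting(sock)      # lingering is off on a new socket

# Turning it on is an explicit application decision, here a 5-second linger.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 5))
explicit_linger = linger_setting(sock)
sock.close()
```

This is the same setsockopt() path a JVM or application server would take on the application's behalf.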

February 9, 2010

Deciphering Caught Signals

Filed under: adminstration,linux,python — jonEbird @ 6:49 pm

Have you ever wondered which signal handlers a particular process has registered? A friend of mine was observing different behavior when spawning a new process from his Python script vs. invoking the command in the shell. Actually, he was consulting me about finding the best way to shut down the process after spawning it from his Python script. You see, the program is actually just a shell wrapper which then kicks off the real program. His program would learn the process id (pid) of the wrapper, but sending a kill signal to that pid only terminated the wrapper and left the actual program running. By comparison, I asked him what happens in the shell when he tries to kill the program. Unlike being spawned in the Python script, this time the program and wrapper together would shut down cleanly. My initial question was, “Are there different signals being caught between the two scenarios?” He wasn’t sure, and our dialog afterwards is what I’d like to explain to you now.

A pretty straightforward way to query what signal handlers a process has is to use “ps”. Let’s use my shell as an example:

$ ps -o pid,user,comm,caught -p $$
  PID USER     COMMAND                   CAUGHT
 3508 jon      bash            000000004b813efb

My shell is currently catching the signals represented by the signal mask of 0x000000004b813efb. Pretty straightforward, right? Yeah, unless you haven’t done much C programming, like my friend. He was not used to seeing hexadecimal numbers where each bit represents an on/off flag for each available signal. To follow along, make sure you understand binary representation of numbers first and learn that our number 0x000000004b813efb is represented in binary as 01001011100000010011111011111011. Now, viewing that number and reading from right (least significant bit) to left, note which bit positions hold a one. You can see that they are the 1st, 2nd, 4th, 5th, etc. Now all we have to do is associate those positions with the signals they represent. The easiest way to see which numeric values are assigned to which signals is to use the “kill” command:

$ kill -l
 1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL
 5) SIGTRAP      6) SIGABRT      7) SIGBUS       8) SIGFPE
 9) SIGKILL     10) SIGUSR1     11) SIGSEGV     12) SIGUSR2
13) SIGPIPE     14) SIGALRM     15) SIGTERM     16) SIGSTKFLT
17) SIGCHLD     18) SIGCONT     19) SIGSTOP     20) SIGTSTP
21) SIGTTIN     22) SIGTTOU     23) SIGURG      24) SIGXCPU
29) SIGIO       30) SIGPWR      31) SIGSYS      34) SIGRTMIN

Armed with this knowledge, you can now provide a human-readable report of which signals my shell is catching: It has signal handlers set up for SIGHUP(1), SIGINT(2), SIGILL(4), SIGTRAP(5), etc.
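
The bit-walking itself is mechanical, so here is a minimal sketch of it in Python. The function name caught_signal_numbers is made up for this example; it only turns the mask into signal numbers, and pairing those with names is still a job for the “kill -l” table.

```python
def caught_signal_numbers(mask):
    """Return the 1-based positions of the set bits in an integer mask."""
    numbers = []
    position = 1
    while mask:
        if mask & 1:                 # lowest bit set: this signal is caught
            numbers.append(position)
        mask >>= 1                   # move on to the next signal number
        position += 1
    return numbers


shell_mask = int('000000004b813efb', 16)   # the CAUGHT value reported by ps
first_four = caught_signal_numbers(shell_mask)[:4]   # [1, 2, 4, 5]
```

Note that signal 3 is absent from the result, which matches the zero in the third position from the right.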

A quick note about signal handlers. A signal handler is basically a jump location for your program to go to after receiving a particular signal. Think of it as an asynchronous function call, or more succinctly as a callback. That is, your program’s execution will jump to the function you’ve registered as your signal handler immediately upon receiving said signal, and it does not matter where in your program’s execution you currently are. Since the call is asynchronous, a lot of people will have a signal handler merely toggle a global flag, let their program resume its processing, and check on that flag at a more convenient time.
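
The flag-toggling pattern looks roughly like this in Python. It is a sketch for Unix-like systems only, it must run in the main thread, and the names (shutdown_requested, request_shutdown, run_until_signalled) are invented for the example. The process signals itself as a stand-in for somebody running kill from another terminal.

```python
import os
import signal

shutdown_requested = False


def request_shutdown(signum, frame):
    """Handler: do the bare minimum and let the main loop react later."""
    global shutdown_requested
    shutdown_requested = True


def run_until_signalled(max_iterations=1000):
    """Do units of 'work' until the flag is set; return iterations completed."""
    done = 0
    for done in range(1, max_iterations + 1):
        if done == 3:
            # Stand-in for an outside `kill -USR1 <pid>`: signal ourselves.
            os.kill(os.getpid(), signal.SIGUSR1)
        if shutdown_requested:       # checked at a convenient point
            break
    return done


signal.signal(signal.SIGUSR1, request_shutdown)
iterations = run_until_signalled()
```

The handler never touches the work loop directly; it only records that a shutdown was asked for, and the loop decides when it is safe to stop.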

Now that we know how to see which signals are being caught by a program, and what signal handlers are, let’s create a new signal handler for my shell and note the changed signal mask. Again, reviewing my currently caught signals, I notice I’m not doing anything for the 3rd signal of SIGQUIT. I want to assign a signal handler on this signal so we can see the changed signal mask. I’m going to have the shell execute a simple function upon receipt of the SIGQUIT signal.

$ function sayhi { echo "hi there"; }
$ trap sayhi 3
$ trap sayhi SIGQUIT # same thing as the number 3
$ kill -QUIT $$
hi there

Now, how about our signal mask. Has it changed?

$ ps -o pid,user,comm,caught -p $$
  PID USER     COMMAND                   CAUGHT
 3508 jon      bash            000000004b813eff

The signal mask has changed from 0x000000004b813efb to 0x000000004b813eff. The new signal mask, converting from hexadecimal to binary, is 1001011100000010011111011111111. Notice how our 3rd bit from the right is now a “1” and before it was “0”.

Understanding how the signal masks are represented is good, but it’s still a pain if you want to quickly compare the signals being caught between two different processes. Per that point, I created a little Python script to do the work for me:

#!/usr/bin/env python

import sys, signal

def dec2bin(N):
    binary = ''
    while N:
        N, r = divmod(N,2)
        binary = str(r) + binary
    return binary

def sigmask(binary):
    """Take a string representation of a binary number and return the signals associated with each bit.
       E.g. '10101' => ['SIGHUP','SIGQUIT','SIGTRAP']
            This is because SIGHUP is 1, SIGQUIT is 3 and SIGTRAP is 5
    """
    # map signal number -> name, skipping non-signal names such as SIG_DFL and SIG_IGN
    sigmap = dict([ (getattr(signal, sig), sig) for sig in dir(signal) if (sig.startswith('SIG') and '_' not in sig) ])
    # bit n (counting from the right, starting at 0) stands for signal number n+1
    signals = [ sigmap.get(n+1,str(n+1)) for n, bit in enumerate(reversed(binary)) if bit == '1' ]
    return signals

if __name__ == '__main__':

    if sys.argv[1].startswith('0x'):
        N = int(sys.argv[1], 16)
    else:
        N = int(sys.argv[1])

    binstr = dec2bin(N)
    print('"%s" (0x%x,%d) => %s; %s' % (sys.argv[1], N, N, binstr, ','.join(sigmask(binstr)) ))

To use my program, copy it to a file, make it executable and run it, passing the signal mask of your program.

$ wget -O ~/bin/
$ chmod 755 ~/bin/ # assuming ~/bin is in your PATH
$ "0x$(ps --no-headers -o caught -p $$)"
"0x000000004b813eff" (0x4b813eff,1266761471) => 1001011100000010011111011111111;

Now back to my friend and his program problem. I asked him to fire off the program both from his Python script and then again directly from the shell. Each time I asked him to check on the caught signal mask of both the wrapper program and the actual binary and report the signal masks to me. As for the wrapper, it was consistently catching only SIGINT and SIGCLD, but the story was not as clear for the binary.
When kicked off via Python, the binary was catching the following signals:


whereas when invoked directly from the shell, the binary was catching:


Initially, I thought, “Ah ha, see it’s catching SIGINT in addition to the other signals when invoked from the shell!”, but quelled my excitement as I realized it didn’t help to explain why both wrapper and binary were shutting down in the shell. If you sent a SIGINT to the wrapper via “kill -INT <wrapperpid>”, nothing happened. Any other signal that the wrapper was not catching, such as SIGTERM (which is the default sent via “kill” when you do not specify a signal), would cause the wrapper to terminate and orphan the binary, which remained running.

The explanation lies within the shell code. We went through the various cases and, when the behavior wasn’t explained by the wrapper handling some signal and shutting down the binary, I was left with presuming the interactive shell was doing something unique. I initially observed this by running strace against the binary and seeing the SIGINT arrive, and then later confirmed the behavior by consulting the bash source code. When you hit control-c in the shell, a SIGINT is sent to both processes because they are in the same process group (pgrp). I literally downloaded the bash source code to confirm this; quoting from a comment in the source code, “keyboard signals are sent to process groups”.* That means a SIGINT is sent to both the wrapper and the binary. When that happens, the wrapper does nothing, as seen from prior experiments, but the binary catches it and does a clean shutdown, which then allows the wrapper to complete and exit as well.
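
For a script that spawns such a wrapper, one way to get similar group-wide delivery is to give the wrapper its own process group and signal the group rather than the single pid. The following is a sketch under assumptions, not what my friend’s script actually did: it needs Python 3 on a Unix-like system, and the “sh -c 'sleep 30 & wait'” command is just a stand-in for his wrapper and binary.

```python
import os
import signal
import subprocess

# A stand-in "wrapper": a shell that starts a long-running child and waits.
wrapper = subprocess.Popen(
    ['sh', '-c', 'sleep 30 & wait'],
    start_new_session=True,        # child calls setsid(): new session and group
)

# In a fresh session the wrapper is the group leader, so pgid == its pid.
group_id = os.getpgid(wrapper.pid)

# Signal every process in the group, wrapper and child alike.
os.killpg(group_id, signal.SIGTERM)
exit_status = wrapper.wait()       # a negative value means "killed by signal"
```

Which signal to send depends on what the real binary handles cleanly; SIGTERM here is only the example’s choice.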

– Jon Miller

* How to efficiently root through source code is a subject for another blog. Within the bash-3.2.48.tar.gz source bundle, look at line 3230 in jobs.c.

September 27, 2009

Presenting at Inaugural CoPUG

Filed under: hadoop,python — jonEbird @ 8:34 pm

Tomorrow I will be presenting an Introduction to Hadoop: Driven by Python for the inaugural Central Ohio Python Users Group or just CoPUG for short.

I have high hopes for CoPUG. The organizer, Eric Floehr, appears to be a well-organized, competent individual, although I have only exchanged emails and have yet to meet him in person. While in Atlanta last year for PyWorks, I learned of the very strong PyAtl group led by none other than the current editor of the Python Magazine, Brandon Rhodes. Although I am not sure, I wonder if their Python group has something to do with PyCon coming to Atlanta in 2010. Can I dream of PyCon someday coming to Columbus?

My Introduction to Hadoop: Driven by Python slides are provided under the Creative Commons Attribution 3.0 United States License.

August 10, 2009

Hadoop Elephant Makes a Big Splash

Filed under: blogging,hadoop,python — jonEbird @ 5:27 pm

Big news in the world of Hadoop today. My Running Large Python Tasks With Hadoop is published in the July Edition of Python Magazine. This marks my second article with the magazine and I had a lot of fun doing it. My interest in the anti-rdbms will continue as I continue to find interesting ways to organize data in the enterprise.

While providing a gentle introduction to Hadoop, my article also introduces readers to my HadoopCalculator, which you can install a couple of different ways. The first way is via git, where you can pull my HadoopUtils repo from github via:

git clone git://

That will bring a few more scripts than just my HadoopCalculator. The second way to install is to use the Python setuptools utility easy_install or pull down the source package from the Cheese Shop.

Thank you for reading this far. I lied. The big news today in the Hadoop world is Doug Cutting joining Cloudera. Had you going, didn’t I? Recently, while Doug was still with Yahoo!, the Microsoft and Yahoo Partnership had people wondering what impact that would have on the Hadoop ecosystem. Today, Yahoo! is the largest Hadoop user and for obvious reasons contributed a lot to the community. Cloudera was already a well known player in the Hadoop community but their stock has risen immensely with the addition of Doug Cutting. If they were selling stock, I’d buy.

November 16, 2008

Pyworks In Summation

Filed under: blogging,PHP,python — jonEbird @ 7:10 pm

I sit in the Atlanta Airport reminiscing over the events of PyWorks ’08. This was the first year for PyWorks but MTA combined the conference with PHP Architect and I believe everyone was happy with the combination. At a minimum, people had engaging conversations between the groups and a significant number of them cross-attended the sessions. I attended two PHP sessions and one neutral session and then the rest Python. Some people were a bit disappointed in the lack of Python attendees and it is true that we didn’t make up a large part of the total 148 attendees of the conference. But with the quality of talks staying superbly high, not having a full room wasn’t a bad thing.

The quality of the talks was superb, indeed. Probably over half of the presenters are either principal developers on high profile projects, or have written a book, or own their own consulting company. On day zero, where there were 3hr long tutorial sessions, I spent the morning in Mark Ramm’s TurboGears session, but then I switched over to the PHP side in the afternoon to catch Scott MacVicar and Helgi Þormar Þorbjörnsson’s Caching for Cash.

At the start of day one, the first day of the normal sessions, I think everyone was expecting a lot more people. There were, in fact, more people, but not as many as I was expecting, and again that’s perfectly okay. This day was a full one, starting off with the keynote by Kevin Dangoor about Growing your Community. After a break I then attended Decorators are Fun by Matt Wilson and learned that he is not that far away from me in Cleveland. Next I attended another Mark Ramm talk about WSGI where he was explaining how easy it was to build a web framework. It was given a bit “tongue in cheek” since he is the primary maintainer of TurboGears. Following that, I attended a middle track session about Distributed version control with GIT by Travis Swicegood. Travis had just finished writing a book about using GIT called Pragmatic Version Control Using Git and not surprisingly gave an authoritative explanation of using GIT. Following lunch, I attended another PHP track presentation, but it could have been in the neutral middle track. The talk was Map, Filter, Reduce In the Small and in the Cloud by Sebastian Bergmann, where he explained the functional programming techniques popularized by Google for computing large quantities of data. Sebastian gave me another reason to check out Hadoop, and in fact I’m now thinking of another Python Magazine article about using hadoop with Jython. For the last session of the day I decided to attend Michael Foord’s talk about IronPython. I didn’t think I’d ever check out IronPython on my own, so I thought I’d get a crash course from Michael, who also just finished work on his book IronPython in Action.

Still not done with day one. After all of the normal presentations concluded, we had happy hour while gearing up for the Pecha Kucha competition sessions. Pecha Kucha is where you provide 20 slides and set them to auto switch every 20 seconds, making your session a little over six minutes. Apparently people have found that you can get the same quality bits of information in that format as compared to a full hour session. At least that is what the Japanese have concluded. As for PHP/PyWorks, we mostly had fun with the sessions. There were talks about web security, general ranting, LOLCode, and many others which I’m having a problem remembering. At the end, our judges awarded the prize of the Xbox 360 gaming system to the LOLCode talk, and if you’d really like to see what was going on, you may be able to watch streamed video captured by Travis Swicegood’s iPhone. Before I went to bed, I rehearsed my presentation one more time.

By the time day two started, it felt like I had been there a full week and yet we still had a full day of presentations again. I started the morning in Chris Perkins’s talk about the Sphinx Documentation System. We all understand the importance of documentation and it’s not always fun, but again I thought investing 45min catching up on some of the Python “best practices” for documentation would be well worth the time. Afterwards, I stayed in the same room for Jacob Taylor’s talk about Exploring Artificial Intelligence with Python. Jacob didn’t get around to showing any Python code but he had good attendance for being a founder of SugarCRM. Next, the highlight of the conference, my presentation about LDAP and Python. The number of attendees for my presentation was average for the Python sessions, and by this point I felt like I knew everyone, which removed any pressure or nervousness. We’ll see how interested people were by seeing who downloads my and/or scripts. After lunch, I attended Kevin Dangoor’s Paver talk where he explained the motivations for Paver and showed numerous examples of what pain points it solves. Finally, the last session I attended at PyWorks was Jonathan LaCour’s talk about Elixir, the Python module which makes introduction into SQLAlchemy an easy one. Elixir helps kick start your DB code by simplifying SQLAlchemy, making a lot of sane choices for you as well as providing other conveniences. Jonathan had to work hard to get all of his content into his hour, mostly because he gave a decent overview of SQLAlchemy and then his Elixir module.

As with the previous day, this day concluded with another happy hour while waiting for our closing keynote. The closing keynote was given by Jay Pipes about “living in the gray areas” and not sticking to the extreme black and white of our technologies. He praised the joint efforts being made by the PHP and Python folks and criticized people who are too biased to learn from the other communities. Jay is working on Drizzle, while working for Sun, where they are challenging all of the preconceived notions being made by the MySQL community. Drizzle is basically a fork of MySQL and their goals are to provide a much more streamlined version of a database. Jay explained that forks are good (as well as “sporks”) because they keep people on their toes and keep the level of competition up. Finally, Jay’s last point was that we need to spend more time listening to other people and less time preaching our biased opinions.

I overheard PHP and Python people echoing Jay’s message after the keynote. I’m glad to have participated in such a successful conference where I truly believe boundaries were crossed. With as much time as I spent with the PHP folks, I was repeatedly asked, “So, you coming over to the PHP side?” I think the last time I was asked that was in the hotel pool, where again I was playing the role of the “token Python guy” amongst the PHP folks. To be honest, those PHP folks know how to have fun, and if my criterion for choosing a programming language was the amount of fun the community had, I would be doing PHP development. I definitely want to attend next year’s PyWorks and PHP conference, and I have an entire year to come up with my presentation proposals.

October 25, 2008

PyWorks Stuff

Filed under: adminstration,python,usability — jonEbird @ 12:00 am

For the 2008 PyWorks convention, I will be presenting about LDAP and Python. The presentation is really about demystifying LDAP and encouraging people to use and extend LDAP for their config file needs. In an effort to make my point, the last half of my presentation will be a time for a demo. This entry is your basic landing point where you can download the scripts, presuming you are looking for a copy of the scripts and/or slides after seeing my presentation. (Oh! Never mind, your google search landed you here.)

PyWorks Speakers Badge

For the demo, I will be leveraging the fail2ban project. It is a python based application which scans typical application logs for security failures and bans IPs from being able to connect again. It also uses the builtin ConfigParser module for reading its 30+ config files, which is why I have chosen to use it. For the demo, I have created two scripts:

The first one is used to process a set of config files and automatically generate LDAP schema as well as LDIF data.

Next, I have my module where I extended the ConfigParser module to support making queries to LDAP. I am basically overriding the read() method only and leaving the rest of the module alone. This way the only modifications to the fail2ban application are how it is instantiating the ConfigParser and I won’t have to become a full time fail2ban developer if I want to centralize the configuration data in LDAP.
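
To give a feel for the shape of that override without the LDAP plumbing, here is a minimal sketch. It is not the module from the presentation: it assumes Python 3 (where the module is spelled configparser), and the class name RemoteConfigParser, the fetch callable and the sample “ssh” section are all hypothetical stand-ins, with fetch playing the part of the LDAP query.

```python
import configparser


class RemoteConfigParser(configparser.ConfigParser):
    """ConfigParser whose read() also merges settings from a remote source.

    `fetch` is a hypothetical callable returning {section: {option: value}};
    in a real deployment it would wrap the directory lookup.
    """

    def __init__(self, fetch=None, **kwargs):
        super().__init__(**kwargs)
        self._fetch = fetch

    def read(self, filenames, encoding=None):
        # Local files first, so the application sees its usual behavior...
        loaded = super().read(filenames, encoding=encoding)
        # ...then centrally managed values override the local ones.
        if self._fetch is not None:
            self.read_dict(self._fetch())
        return loaded


def fake_directory_lookup():
    """Stand-in for a directory query; the values are made up."""
    return {'ssh': {'enabled': 'true', 'maxretry': '5'}}


parser = RemoteConfigParser(fetch=fake_directory_lookup)
parser.read([])                    # no local files in this toy run
```

Because only read() changes, the rest of the application keeps calling get(), getint() and friends exactly as before.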

And that is really the main point of my presentation: The power of centralizing your configuration data and how it can drastically change how you administer your large scale server farm.


LDAP + Python Slides. script to auto-generate LDAP schema and LDIF from ConfigParser compatible config files. python module which inherits the ConfigParser and supports optionally pulling config data from LDAP.