jonEbird

February 9, 2010

Deciphering Caught Signals

Filed under: adminstration, linux, python — jonEbird @ 6:49 pm

Have you ever wondered which signal handlers a particular process has registered? A friend of mine was observing different behavior when spawning a new process from his Python script vs. invoking the command in the shell. Actually, he was consulting me about finding the best way to shut down the process after spawning it from his Python script. You see, the program is actually just a shell wrapper which then kicks off the real program. His script would learn the process id (pid) of the wrapper, but sending a kill signal to that pid was effectively terminating the wrapper and leaving the actual program running. By comparison, I asked him what happens in the shell when he tries to kill the program. Unlike being spawned in the Python script, this time the program and wrapper together would shut down cleanly. My initial question was, “Are there different signals being caught between the two scenarios?” He wasn’t sure and our dialog afterwards is what I’d like to explain to you now.

A pretty straightforward way to query which signals a process is catching is to use “ps”. Let’s use my shell as an example:

$ ps -o pid,user,comm,caught -p $$
  PID USER     COMMAND                   CAUGHT
 3508 jon      bash            000000004b813efb

My shell is currently catching the signals represented by the signal mask 0x000000004b813efb. Pretty straightforward, right? Yeah, unless you haven’t done much C programming, like my friend. He was not used to seeing hexadecimal numbers where each bit represents an on/off flag for each available signal. To follow along, make sure you understand binary representation of numbers first and learn that our number 0x000000004b813efb is represented in binary as 01001011100000010011111011111011. Now, viewing that number and reading from right (least significant bit) to left, note whether each nth bit is a one or not. You can see that the 1st, 2nd, 4th, 5th, etc. bits are set. Now all we have to do is associate those positions with the signals they represent. The easiest way to see which numeric values are assigned to which signals is to use the “kill” command:

$ kill -l
 1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL
 5) SIGTRAP      6) SIGABRT      7) SIGBUS       8) SIGFPE
 9) SIGKILL     10) SIGUSR1     11) SIGSEGV     12) SIGUSR2
13) SIGPIPE     14) SIGALRM     15) SIGTERM     16) SIGSTKFLT
17) SIGCHLD     18) SIGCONT     19) SIGSTOP     20) SIGTSTP
21) SIGTTIN     22) SIGTTOU     23) SIGURG      24) SIGXCPU
25) SIGXFSZ     26) SIGVTALRM   27) SIGPROF     28) SIGWINCH
29) SIGIO       30) SIGPWR      31) SIGSYS      34) SIGRTMIN
35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3  38) SIGRTMIN+4
39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12
47) SIGRTMIN+13 48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14
51) SIGRTMAX-13 52) SIGRTMAX-12 53) SIGRTMAX-11 54) SIGRTMAX-10
55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7  58) SIGRTMAX-6
59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
63) SIGRTMAX-1  64) SIGRTMAX

Armed with this knowledge, you can now provide a human-readable report of which signals my shell is catching: It has signal handlers set up for SIGHUP(1), SIGINT(2), SIGILL(4), SIGTRAP(5), etc.

A quick note about signal handlers. A signal handler is basically a jump location for your program to go to after receiving a particular signal. Think of it as an asynchronous function call, or more succinctly as a callback. That is, your program’s execution will jump to the function you’ve registered as your signal handler immediately upon receiving said signal, and it does not matter where in your program’s execution you currently are. Since the call is asynchronous, a lot of people will have a signal handler merely toggle a global flag and let their program resume its processing and check on that flag at a more convenient time.

Now that we know how to see which signals are being caught by a program, and what signal handlers are, let’s create a new signal handler for my shell and note the changed signal mask. Again, reviewing my currently caught signals, I notice I’m not doing anything for the 3rd signal of SIGQUIT. I want to assign a signal handler on this signal so we can see the changed signal mask. I’m going to have the shell execute a simple function upon receipt of the SIGQUIT signal.

$ function sayhi { echo "hi there"; }
$ trap sayhi 3
$ trap sayhi SIGQUIT # same thing as the number 3
$ kill -QUIT $$
hi there

Now, how about our signal mask. Has it changed?

$ ps -o pid,user,comm,caught -p $$
  PID USER     COMMAND                   CAUGHT
 3508 jon      bash            000000004b813eff

The signal mask has changed from 0x000000004b813efb to 0x000000004b813eff. The new signal mask, converting from hexadecimal to binary, is 1001011100000010011111011111111. Notice how our 3rd bit from the right is now a “1” and before it was “0”.
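If you would rather not eyeball the bits, the comparison can be done with a little bit arithmetic. This is only a sketch; the helper names caught_signals and newly_caught are made up for illustration, and the two masks are the ones from the ps output above.

```python
import signal

def caught_signals(mask):
    """Return the signal numbers whose bits are set (bit 0 is signal 1)."""
    return [n + 1 for n in range(mask.bit_length()) if mask >> n & 1]

def newly_caught(before, after):
    """Signal numbers that are set in `after` but not in `before`."""
    return caught_signals(after & ~before)

before = 0x4b813efb
after = 0x4b813eff
changed = newly_caught(before, after)
print(changed)  # [3]
# Map the numbers back to names using the signal module of this platform.
print([signal.Signals(n).name for n in changed])
```

Masking with `after & ~before` keeps only the bits that were switched on, which here is the single bit for signal 3.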

Understanding how the signal masks are represented is good, but it’s still a pain if you want to quickly compare the signals being caught between two different processes. Per that point, I created a little Python script to do the work for me:

#!/usr/bin/env python

import sys, signal

def dec2bin(N):
    binary = ''
    while N:
        N, r = divmod(N,2)
        binary = str(r) + binary
    return binary

def sigmask(binary):
    """Take a string representation of a binary number and return the signals associated with each bit.
       E.g. '10101' => ['SIGHUP','SIGQUIT','SIGTRAP']
            This is because SIGHUP is 1, SIGQUIT is 3 and SIGTRAP is 5
    """
    sigmap = dict([ (getattr(signal, sig), sig) for sig in dir(signal) if (sig.startswith('SIG') and '_' not in sig) ])
    signals = [ sigmap.get(n+1,str(n+1)) for n, bit in enumerate(reversed(binary)) if bit == '1' ]
    return signals

if __name__ == '__main__':

    if sys.argv[1].startswith('0x'):
        N = int(sys.argv[1], 16)
    else:
        N = int(sys.argv[1])

    binstr = dec2bin(N)
    # Parenthesized form works as a statement in Python 2 and a function call in Python 3
    print('"%s" (0x%x,%d) => %s; %s' % (sys.argv[1], N, N, binstr, ','.join(sigmask(binstr))))

To use my signals.py program, copy it to a file, make it executable and run it, passing the signal mask of your program.

$ wget -O ~/bin/signals.py http://jonebird.com/signals.py
$ chmod 755 ~/bin/signals.py # assuming ~/bin is in your PATH
$ signals.py "0x$(ps --no-headers -o caught -p $$)"
"0x000000004b813eff" (0x4b813eff,1266761471) => 1001011100000010011111011111111;
 SIGHUP,SIGINT,SIGQUIT,SIGILL,SIGTRAP,SIGIOT,SIGBUS,SIGFPE,SIGUSR1,SIGSEGV,SIGUSR2,
 SIGPIPE,SIGALRM,SIGCLD,SIGXCPU,SIGXFSZ,SIGVTALRM,SIGWINCH,SIGSYS

Now back to my friend and his program problem. I asked him to fire off the program both from his Python script and then again directly from the shell. Each time I asked him to check on the caught signal mask of both the wrapper program and the actual binary and report the signal masks to me. As for the wrapper, it was consistently catching only SIGINT and SIGCLD, but the story was not as clear for the binary.
When kicked off via Python, the binary was catching the following signals:

  SIGQUIT,SIGBUS,SIGFPE,SIGSEGV,SIGTERM

whereas when invoked directly from the shell, the binary was catching:

  SIGINT,SIGQUIT,SIGBUS,SIGFPE,SIGSEGV,SIGTERM

Initially, I thought, “Ah ha, see it’s catching SIGINT in addition to the other signals when invoked from the shell!”, but quelled my excitement as I realized it didn’t help to explain why both wrapper and binary were shutting down in the shell. If you sent a SIGINT to the wrapper via “kill -INT <wrapperpid>” nothing happens. Any other signal that the wrapper was not catching, such as SIGTERM (which is the default sent via “kill” when you do not specify a signal), would cause the wrapper to terminate and orphan the binary, which would remain running.

The explanation lies within the shell code. We went through the various cases and when it wasn’t explained by the wrapper handling some signal and shutting down the binary, I was left with presuming the interactive shell was doing something unique. I initially observed this by running a strace against the binary and seeing the SIGINT interrupt and then later confirmed the behavior by consulting the bash source code. When you hit control-c in the shell, a SIGINT is sent to both processes because they are in the same process group (pgrp). Strictly speaking, it is the terminal driver that delivers the signal to the whole foreground process group; the shell’s part is placing the job’s processes in that group. I literally downloaded the bash source code to confirm this and, quoting from a comment in the source code, “keyboard signals are sent to process groups”.* That means a SIGINT is sent to both the wrapper and the binary. When that happens, the wrapper does nothing, as seen from prior experiments, but the binary catches it and does a clean shutdown which then allows the wrapper to complete and exit as well.
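The same process group idea can be used from Python. What follows is a sketch under assumptions, not what my friend’s script actually did: the wrapper is stood in for by a hypothetical `sh -c "sleep 30 & wait"`, and the helper names are made up. The child is started in its own session (and therefore its own process group), and the signal is sent to the group rather than to the wrapper’s pid alone.

```python
import os
import signal
import subprocess
import time

def spawn_in_own_group(argv):
    """Start argv as the leader of a new session and process group."""
    return subprocess.Popen(argv, start_new_session=True)

def terminate_group(proc, sig=signal.SIGTERM):
    """Send sig to every process in proc's process group, then reap proc."""
    os.killpg(os.getpgid(proc.pid), sig)
    return proc.wait()

def demo():
    # Hypothetical stand-in for the wrapper: a shell that backgrounds
    # the "real" program (sleep) and waits for it.
    wrapper = spawn_in_own_group(["sh", "-c", "sleep 30 & wait"])
    time.sleep(0.2)  # give the shell a moment to start its child
    return terminate_group(wrapper)

if __name__ == "__main__":
    # A negative return code means the wrapper was killed by that signal.
    print(demo())
```

Whether SIGTERM or SIGINT is the right signal depends on what the real binary catches; per the masks above, this particular binary handled SIGTERM either way.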

– Jon Miller

* How to efficiently root through source code is a subject for another blog. Within the bash-3.2.48.tar.gz source bundle, look at line 3230 in jobs.c.

September 16, 2009

Server Death

Filed under: adminstration, blogging, linux — jonEbird @ 7:41 pm

I often joke that the only people that read my weblog are bots, so it shouldn’t bother me if my site is down but it does. Last week the server, which was also doubling as a workstation for the wife, died. “The computer is not working”, the Wife explained. I didn’t check it out immediately as I just assumed that X had crashed or something else preventing her from using firefox. Like I said, I’m not too overly concerned with my site’s uptime.

But when I finally did check it out, sure enough, it was not looking good. Absolutely no display on the monitor. Considering I had replaced my video card not too long ago and I could no longer ssh into the machine, I am thinking that either the CPU and/or the motherboard are dead.

[ Photo: DeadPC, with Hercules taking a look ]

After Hercules and I surveyed the situation, we decided to pull the sheet over its head. It’s had a nice long life (pc years) since 2004.

I headed to Micro Center today to check out what kind of motherboards, CPUs and even memory they had on sale. If you consider my last machine was running with only 756M of memory and an ageing AMD 2GHz processor on an Abit KV8 motherboard, while happily serving my website and handling the Wife’s facebook usage, then you can understand I was looking for the smallest, cheapest solution I could find. That solution was looking to be somewhere around $225.

Not willing to rush into a $200+ investment, I instead bought an IDE enclosure which is capable of serving my data via USB for a mere $21.

Now for the Restoration of my Website

I really shouldn’t even be talking about this. I should have had regular MySQL dumps along with full web content backed off to another machine. Aside from a laptop, the other “real” pc in the house is an Acer I bought as a media machine which sits in my entertainment center. It was never intended to be running 24×7, so I only did on-demand backups of my important files, which were actually outside of my website. Another justification for not having regular backups was that I had two internal Seagate drives configured in a software mirror. I always figured if I had some sort of hardware problem, I’d be able to replace it and in the worst case never really lose my data.

So I have my hard drive and am now looking to get my Wordpress site back online with the pc in the living room. After plugging in the hard drive, I need to activate the MD device and mount my filesystem:

[jon@pc ~]$ sudo mdadm --assemble --scan
mdadm: /dev/md/0_0 has been started with 1 drive (out of 2).
[jon@pc ~]$ cat /proc/mdstat
Personalities : [raid1]
md127 : active raid1 sdd1[1]
      241665664 blocks [2/1] [_U]

unused devices: <none>
[jon@pc ~]$ sudo mount /dev/md127 /mnt
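In the mdstat output above, “[2/1]” means the mirror expects two members and has one active, and “[_U]” shows the first slot missing and the second up. As a small aside, a check like that is easy to script. The sketch below parses text in the /proc/mdstat layout; the function name degraded_arrays is made up, and the sample is copied from my session, so treat the parsing as an assumption about this layout rather than a complete mdstat parser.

```python
import re

SAMPLE = """\
Personalities : [raid1]
md127 : active raid1 sdd1[1]
      241665664 blocks [2/1] [_U]

unused devices: <none>
"""

def degraded_arrays(mdstat_text):
    """Return {array: (expected, active)} for arrays that are missing members."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S*)\s*:", line)
        if header:
            # Remember which array the following status line belongs to.
            current = header.group(1)
            continue
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if counts and current:
            expected, active = int(counts.group(1)), int(counts.group(2))
            if active < expected:
                result[current] = (expected, active)
    return result

print(degraded_arrays(SAMPLE))  # {'md127': (2, 1)}
```

On a live system you would pass in the contents of /proc/mdstat instead of SAMPLE.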

My two machines were off from each other by two Fedora releases. I wondered if I could do a chroot, startup MySQL and get a fresh, clean dump of the database…

[jon@pc ~]$ sudo su -
[root@pc ~]# chroot /mnt
[root@pc /]# ls
bin  boot  dev  etc  home  lib  lib64  lost+found  media  mnt
opt  proc  root  sbin  selinux  srv  sys  tmp  usr  var

[root@pc ~]# mount -t proc none /proc
[root@pc ~]# /etc/init.d/mysqld status
mysqld dead but subsys locked
[root@pc ~]# /etc/init.d/mysqld restart
Stopping MySQL:                                            [  OK  ]
Starting MySQL:                                            [  OK  ]
[root@pc ~]# /etc/init.d/mysqld status
mysqld (pid 9394) is running...
[root@pc ~]# mysqldump -u root -p wordpress > wordpress.mysqldump
Enter password:
[root@pc ~]# wc -l wordpress.mysqldump
354 wordpress.mysqldump

Cool!

The rest of the migration involved an rsync of /var/www/html/ content, adjustments of the default Apache config, granting access for my Wordpress user to use the database and finally updating my router to now direct requests for port 80 to my media pc.

At this point, I guess I’ll be running this site from the living room until I decide what to do about my server / workstation. I’ve always wanted to build a slimmed down, efficient virtual server to host my website and then migrate it between server and laptop during maintenance / patching of my machines, but my AMD processor didn’t support the Virtualization assistance, so it was painfully slow. I think I’ll keep an eye out for a used, server-class machine. Let me know if you find any, bots. Thanks. ;-)

June 15, 2009

Intern Regiment

Filed under: adminstration, blogging, linux — jonEbird @ 10:06 pm

Today was Patrick Shuff’s first day with our team. He is our intern for the summer and I actually recommended that we steal him from another team after meeting him last year. From my half day assessment of him last year, I thought he was much more suited to work with our Linux team vs. the Windows provisioning team. He found my gnu screen, emacs and script automation tricks fascinating and right there just invalidated himself as a legit Windows guy. The Windows experience he picked up last year was no doubt useful, but it’s not something you enjoy returning to. It’s just like how it’s very useful to have learned C as your first programming language, since it gives you a solid understanding of the computing innards, but you don’t want to return to it after programming in Python.

I have been trying to brainstorm good ideas for him to work on in the team. I suppose the main reason I want to see his experience be as positive as possible is because I myself was an intern for about three years. One thought I had was to turn the three month time schedule into an intense one-assignment-per-week ordeal where I throw a new task at him intended to inject new insights into all facets of becoming a well-rounded Linux administrator. Of course, one week is not enough time to properly study most of the topics I was thinking of, but it would have a nice organized structure and would be nearly guaranteed to provide an intense experience worthy of writing home about. Okay, if he ends up writing home about it, then we know he’s a dork, but it would also mean he’s probably found a career in which he’d never have to work a day of his life because it’s enjoyable.

I started brainstorming my categories of areas in a quick outline mode. Of course, this list is subject to change, and if we actually end up going through with this I’ll naturally have to report back on the actual topics covered in each week and what the assignments were. If nothing else, it should keep my weblog busier than normal, which isn’t hard. So, here is what I’m thinking constitutes a well-rounded Linux Administrator:

  • Ever improve efficiencies
    • editor
      - pick one: emacs or vim. Just don’t settle at being able to modestly edit text.
    • shell
      - An essential, stereotypical Linux Admin skill. And yes, it is important. Study up.
  • organizational skills
      - Cannot be underestimated. Aren’t we always improving our organizational skills?
      - Develop consistent habits in note taking. Try reading Getting Things Done.

    • project notes
    • meeting notes
    • hallway conversations
    • company hierarchy
  • technical expertise
    • operating systems
    • programming languages
    • architectural design
    • applications administration
  • staying current
    • awesome rss feeds
    • key social article sharing sites

      - Looking at you, reddit.
    • magazines
    • books
  • soft skills
    • working within a team
    • speech / presentation
    • written communication

      - tech writing, effective email communication
  • career, career stuff
    • resume writing
    • networking
    • staying driven
    • finding your path

Sorry for the lack of details on each of the items but it’s kind of silly to populate it further now. For now it remains an idea for a summer internship. Only once the plan comes to fruition will I report back with juicier details.

November 4, 2008

Management Tools for Multi-Vendors

Filed under: adminstration, blogging — jonEbird @ 4:44 pm

Trying to build a tool which manages multiple vendors and platforms by way of piggybacking off their technology is a losing battle. Be it provisioning, patching, monitoring, etc., it doesn’t matter. To choose such a tool, you end up paying big bucks for other people to constantly watch and react to what various vendors are doing. Combine that realization with the fact that a tool will almost never perfectly suit the unique requirements of your business and you’d be in denial to not realize that it sucks. Beyond the sheer cost of the endeavour, you are also wasting the time of your associates, which will probably not get recouped.

I will never say anything is impossible. You can build such a tool and it can have the necessary hooks to allow your associates to customize it to suit your needs. My point is, that work is much harder to pull off than the naive observer might realize. Imagine you are abstracting the details of Suse’s automated installer “AutoYast”. But let’s say the OpenSuse project decides to make a drastic change to how the unattended installer works. Their efforts, no doubt, will be motivated by improving their end user’s experience by presumably making it quicker, simpler and overall a better product. Depending on how drastic the change is, it could represent an entirely different philosophical approach to OS installs. As the tool builder, trying to provide a layer of abstraction, you have just stuck yourself with a large endeavour to refactor those pieces of your application to handle the radical changes being made. It’s a given risk, if that is what you’re providing. My point is, as a customer, just don’t buy that product.

To purchase such a product, you are basically stating that you believe the particular team of developers is going to continue to accurately and intuitively abstract those details for you. Don’t forget you’re still paying a lot of money for this. But this is how management thinks, “I’m going to buy this tool and allow my associates to use one tool and spend their time elsewhere.” It doesn’t happen. Instead, the associates shift their energies to learning a new tool, figuring out how to customize it for their needs and probably end up with one FTE dedicated to maintaining it.

Please, don’t waste your time and money. Spend your time collaborating with teammates. Decide upon OS and install standards. Each OS installer provides the ability to perform basic configuration of disk, network, software, etc. and then allows for a final post-install hook. That hook will then lean upon your team’s efforts. You will end up spending the same amount of work creating your post-install scripts as it takes to merely install and train folks on an “all in one” tool. The big difference with “rolling your own” is that you now own the tool set, it already exactly meets your needs, everyone knows and understands how it works, updates are easy, the knowledge and skill gained is more widely recognized and all the while you haven’t spent more money.

Now for the counter-point: You have to have a good team to pull this off. Team members will require enough experience to demonstrate the proper discernment in building out a quality framework. So what if you maintain a Solaris Jumpstart, RedHat kickstart, Suse autoyast, etc. all together? Keep your data and configs centrally managed together. Parallel the concepts between each one, maintain like directory hierarchies, and write straightforward documentation on using and performing builds. Doesn’t it make sense to be proficient in the OS tool which comes directly from the vendor, at least from a personal development perspective?

October 25, 2008

PyWorks Stuff

Filed under: adminstration, python, usability — jonEbird @ 12:00 am

For the 2008 PyWorks convention, I will be presenting about LDAP and Python. The presentation is really about demystifying LDAP and encouraging people to use and extend LDAP for their config file needs. In efforts to make my point, the last half of my presentation will be a time for a demo. This entry is your basic landing point where you can download the scripts, presuming you are looking for a copy of the scripts and/or slides after seeing my presentation? (oh! nevermind, your google search landed you here)

PyWorks Speakers Badge

For the demo, I will be leveraging the fail2ban project. It is a Python-based application which scans typical application logs for security failures and bans IPs from being able to connect again. It also uses the builtin ConfigParser module for reading its 30+ config files, which is why I have chosen to use it. For the demo, I have created two scripts:

The first one, configparser2ldap.py is used to process a set of config files and automatically generate LDAP schema as well as LDIF data.

Next, I have my ldapconfig.py module where I extended the ConfigParser module to support making queries to LDAP. I am basically overriding the read() method only and leaving the rest of the module alone. This way the only modifications to the fail2ban application are how it is instantiating the ConfigParser and I won’t have to become a full time fail2ban developer if I want to centralize the configuration data in LDAP.
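To give a feel for the “override read() only” approach, here is a rough sketch. It is not the ldapconfig.py module itself: it uses the Python 3 configparser spelling, the class name DirectoryConfigParser is made up, and fetch_from_directory is a hypothetical stand-in that returns canned data where a real version would query LDAP.

```python
import configparser

def fetch_from_directory(name):
    """Hypothetical stand-in for an LDAP query.

    Returns {section: {option: value}} for the requested config name.
    """
    fake_directory = {
        "jail.conf": {"ssh": {"enabled": "true", "maxretry": "3"}},
    }
    return fake_directory.get(name, {})

class DirectoryConfigParser(configparser.ConfigParser):
    """ConfigParser whose read() pulls from a lookup function instead of files."""

    def __init__(self, *args, fetch=fetch_from_directory, **kwargs):
        super().__init__(*args, **kwargs)
        self._fetch = fetch

    def read(self, filenames, encoding=None):
        # Same calling convention as ConfigParser.read(): accept one name
        # or a list, and return the names that were successfully loaded.
        if isinstance(filenames, str):
            filenames = [filenames]
        loaded = []
        for name in filenames:
            data = self._fetch(name)
            if data:
                self.read_dict(data, source=name)
                loaded.append(name)
        return loaded

parser = DirectoryConfigParser()
parser.read("jail.conf")
print(parser.get("ssh", "maxretry"))  # 3
```

Because everything except read() is inherited, the application keeps calling get(), getboolean(), sections() and friends exactly as before.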

And that is really the main point of my presentation: The power of centralizing your configuration data and how it can drastically change how you administer your large scale server farm.

Downloads

LDAP + Python Slides.

configparser2ldap.py script to auto-generate LDAP schema and LDIF from ConfigParser compatible config files.

ldapconfig.py python module which inherits the ConfigParser and supports optionally pulling config data from LDAP.

November 13, 2007

Reverse Engineering Buddy

Filed under: adminstration, linux, usability — jonEbird @ 10:34 pm

An Idea for a helpful Admin Tool

What if you got a page and/or ticket for an obscure server’s particular service? The unique problem is that your environment is huge, you’re still relatively new to the company, co-workers are not there to help you and you have never heard of this server. When logging in, you’re hoping that the person has a nice RC script under /etc/init.d/, that you can find the app via a “lsof -i:<port>”, find the application’s home and locate some log files. But what if the application install was not that nice and did not conform to the norms that you are used to?

To either a small or very large degree, you will be reverse engineering this application. If you’re really unlucky, the team that supports the application also has no idea about it and knows nothing about Unix-like machines. So, what if there was an application which, upon your logging into the server, told you, “In case you are looking for the application binX, which typically listens on port XX, it was most likely started last time by issuing the script /path/to/funky/path/binX.sh”? I’m guessing it would freak you out and immediately flood your emotions with confusion, gratitude and curiosity.

So, would such an application be difficult to write?

  • Poll any events for read/write/access under key dirs, such as /etc/init.d/, /etc/*conf ? (use the inotify API, which was merged in Linux kernel 2.6.13)
  • Track users logging into the system (could correlate later)
  • Watch for any new ports being listened on, then record the binary name.
  • Reverse engineer this application to automatically collect interesting data on it.
  • Intelligently parse an strace (note to self, checkout: http://subterfugue.org/)
  • Utilize systemtap for Linux and DTrace for Solaris. pseudo code { observe new socket being opened, so show me the last 10 files opened and executed. correlate application with startup script }
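As a taste of the “watch for new listening ports” item, here is a small sketch that works from text in the /proc/net/tcp layout, where the fourth column holds the socket state and 0A means LISTEN. The function names and the sample lines are made up for illustration; mapping a port back to the owning binary (via the socket inode and /proc/<pid>/fd) is left out.

```python
SAMPLE_BEFORE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 11111 1
   1: 0100007F:0016 0100007F:C350 01 00000000:00000000 00:00000000 00000000     0        0 22222 1
"""

SAMPLE_AFTER = SAMPLE_BEFORE + (
    "   2: 00000000:1F90 00000000:0000 0A 00000000:00000000 "
    "00:00000000 00000000  1000        0 33333 1\n"
)

def listening_ports(proc_net_tcp_text):
    """Return the set of local ports in LISTEN state (st column == 0A)."""
    ports = set()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local, state = fields[1], fields[3]
        if state == "0A":
            # local_address is HEXIP:HEXPORT
            ports.add(int(local.rsplit(":", 1)[1], 16))
    return ports

def new_listeners(previous_text, current_text):
    """Ports that are listening now but were not in the previous snapshot."""
    return listening_ports(current_text) - listening_ports(previous_text)

print(sorted(new_listeners(SAMPLE_BEFORE, SAMPLE_AFTER)))  # [8080]
```

A real collector would read /proc/net/tcp (and tcp6) on a timer and record each new port along with a timestamp for later correlation.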

Now, if your data was collected in an easily usable format, you could collect similar data from other machines and start to make broader correlations.

The whole idea is really about automating the process of reverse engineering an application. I do that a lot. I believe others would like an application which aided or performed the entire reverse engineering for them.

June 21, 2007

Stripping Another Process of its Signal Handler

Filed under: adminstration, linux — jonEbird @ 11:38 pm

Have you ever wanted to send a signal, which normally produces a core file, but the process has one of those annoying signal handlers set up to catch the signal you’re sending? The nerve of that application trying to intelligently handle signals! I actually have a real need to remove the signal handler of a process which I’ll describe shortly. Normally, it is a bad idea to remove another process’s signal handler and under normal circumstances I do not suggest following the procedure which I am going to describe.

I have been struggling with a production issue at work with a process which has been less than cooperative. You see, I have a java process which goes crazy and starts consuming CPU cycles. When you run a strace against the process the only system call you will see is a sched_yield() call. The java thread is most likely stuck on a spinlock in user space and the process/thread which owns the lock has died or something else, but all my runaway process cares about is checking for its lock and yielding execution back to the kernel to schedule another task. Of course, it just gets the CPU again and continues to pound it.

My company pays a lot of money for support and we have had a case open for eight months now. The problem is we are unable to gather sufficient data for their level2 and level3 support teams. They would like a javacore to be generated, which can be done by sending a signal 3 to the java process. In addition to a javacore, they recommend sending a signal 11 (SEGV) to the process to prompt the generation of a normal binary core file. Either one would be invaluable for the support team in ascertaining what is going wrong. Unfortunately, it seems that once the process is stuck in this tight sched_yield() loop, any of the signals we send to it are being ignored. In short, that is my problem.

During my Linux Kernel Internals training with RedHat, I had an idea of writing a kernel module to strip the signal handler from the java process so I can finally generate that elusive core file. The kernel module sets up an entry under /proc named stripsignal_pid. If you read the value, it will tell you a quick one-liner about using this interface. To use the module, you write a process ID into that file and that process’s SIGABRT signal handler will be reset to SIG_DFL. At this point, if you send a SIGABRT signal to the process, the result will be the writing out of its core file.
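For anyone unsure what “reset to SIG_DFL” means, the following Python sketch shows the same disposition change from inside a single process. It is only an illustration of the concept: it is not the kernel module, it cannot touch another process, and the names noisy_handler and strip_handler are made up.

```python
import signal

def noisy_handler(signum, frame):
    """A handler that swallows the signal instead of letting the process die."""
    print("caught", signum)

def strip_handler(signum):
    """Reset signum to its default disposition; return the handler it replaced."""
    return signal.signal(signum, signal.SIG_DFL)

# Install a handler for SIGABRT, then strip it again.
signal.signal(signal.SIGABRT, noisy_handler)
replaced = strip_handler(signal.SIGABRT)

print(replaced is noisy_handler)
print(signal.getsignal(signal.SIGABRT) == signal.SIG_DFL)
```

With the default disposition back in place, a SIGABRT would terminate the process and, core size limits permitting, produce a core file, which is the effect the module is after.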

Download the source along with a helper test program here: stripsignal.tar.gz.

But if all you are interested in is reviewing the short source code, then feel free to browse the stripsignal.c source online.

I tell you what, the best thing I learned from the class was just familiarizing myself with the source code and actually learning some new emacs tricks for navigating large source code projects. Next writeup will be about my experiences with the GNU global tagging system.

January 15, 2007

Python Admin vs. Java Developers

Filed under: adminstration, usability — jonEbird @ 5:31 pm

What is the best programming language for a system administrator? Cue the language war, please. The typical arguments are “your language can’t do this”, “this library doesn’t have a consistent naming convention”, well “my language is faster”, yeah and “your syntax is hideous to read much less use”, blah blah blah. No, I’m not a professional developer but I do spend significant time doing development as a systems administrator. My programs are not huge year-long projects, will probably never reach a million lines of code and usually never need superb speed. For administrators, the most important aspects of the language of choice are productivity and maintainability.

When choosing your language, I recommend picking one that has a decent user community, is available on numerous platforms, has had significant time to mature and prove itself and has extensive module/library support. Meeting these requirements will leave you using a language that should keep you efficiently producing solutions to your administrative tasks.

First let’s eliminate some languages based on maintainability. Goodbye Haskell, Lisp, Scheme, Erlang and any other functional languages you have used or know of. I’d venture to say that less than 2% of system administrators are comfortable using any one of those languages. And you obviously cannot choose a language which only you are going to be able to maintain. Aside from staying away from the obscure, the program should be intuitive to read. People can argue about the virtues of their favorite language and why it lends itself to writing maintainable code, but writing maintainable code is truly a skill. You can write obfuscated code in any language. It takes practice and a conscious effort to keep your code clean and well organized. Here, “practice makes perfect” is the key.

Secondly, and in my opinion the most important aspect of the language of choice, is staying efficient. Ideally, each program should be succinct and to the point. I no longer use C/C++ regularly, even though that’s the language I started with, because you simply have to write much more code for something another language can do with half the work or less. Try looking at the ‘P’s of the LAMP stack and see which fits you better and which you can see yourself being productive in. That is, evaluate Python, Perl, PHP and Ruby (okay, not a ‘P’ but whatever). Don’t use a language that doesn’t make sense to you. Don’t waste your time.

And finally, time to explain this title and tell a little story where some customer data was delayed during one day’s production incident. One day, we had a production issue where messages were accidentally dequeued from an IBM WebSphere MQSeries queue. A tool which was used to grab just one message dequeued all of the messages. To top it all off, the same tool kept segfaulting while trying to requeue the same messages. The solution left to us was to manually parse out each of the discrete messages into separate files. Once in that state, we had another known tool which could upload the messages separately. There were three developers and myself on the phone and we were all racing to the solution. My language of choice was Python and the developers used the language that they use professionally, Java. So who reached the solution first? Well I wouldn’t be writing this if I hadn’t won, would I? For me, Python makes sense and I can efficiently write code which I like to think other people will be able to understand and update. That is what is most important for your language of choice.

[ As un-entertaining as it is, you can view the Python solution. ]

November 6, 2006

Scripting Best Practices

Filed under: adminstration, usability — jonEbird @ 6:35 pm

Nothing too fancy here. Just a list of the most common things I find desirable while writing shell scripts.

  1. Use meaningful variable names.
    This point is strictly for the sake of readability. Too often when trying to read somebody’s script I’ll actually do various search & replaces of their variables because they used names like “w”, “w2”, “w3”. It was quick and dirty for the author, but whoever inherits that script would appreciate more meaningful variable names.

  2. Comment your code.
    This goes without saying, really…

  3. Visually separate any optional settings sections.
    Don’t know about you, but sometimes I get lazy and don’t feel like using getopts. Instead, I’ll put what would be optional arguments at the top of my script as hard-coded variables. I think this is fine, but you’ll want to visually segregate these optional variables from the rest of the script.
    I like to use a dashed line of about 50-70 characters and even put the words “do not modify beyond this point” to further emphasize what you’re encouraged to change and what shouldn’t normally be touched.

  4. Use paths relative to the script’s own location when accessing its files.
    Never assume the user’s cwd is the same as the script’s directory and use “./” to run or source another file. I like to set a variable REL_DIR=`dirname $0` and use it to reference the directory the script itself lives in.
    E.g. if you have a functions script you’d like to source, then with that REL_DIR variable you would “. ${REL_DIR}/<some-file>”.
    I’m actually surprised at how often scripts get this wrong.

  5. Always print a usage statement for improper usage and/or when the -h option is used.
    My code excerpt typically looks like:

    USAGE="Usage: `basename $0` <my options here>"
    # Complain on STDERR and bail out when a required argument is missing
    if [ -z "$SOME_ARG" ]; then
        echo "$USAGE" 1>&2
        exit 1
    fi

  6. Conscientiously use STDOUT vs. STDERR in different scenarios.
    Not a script faux pas really, but it can help during the development process. Use STDOUT only for informational messages and/or optional debugging info, and STDERR only for errors. That way, when running the script you can optionally discard stdout (1>/dev/null) and easily check that nothing was printed to STDERR. When the output is mixed you’ll have a greater chance of missing the error.
    One example of this technique in action is the tar command. Try leaving out the verbose (‘v’) option when creating or extracting your archive; then you can easily see when you might have had a permissions issue or some other problem.

  7. Keep all required variables defined in the script.
    Define the required variables at the top of the script. Even mention that they are REQUIRED. A good example of this is any script that uses Sybase’s isql utility. Anytime I run isql, I like to set something like:

    # required variables for isql
    SYBASE=/some/path/to/sybase
    LD_LIBRARY_PATH=$SYBASE/lib
    # export them so that isql itself sees the values
    export SYBASE LD_LIBRARY_PATH

    What you want to avoid is a situation where the script works only because the required variable happens to be set in your env by one of your dot files.

  8. Cron’ed scripts.
    Two common principles I like to emphasize here:
    1. Keep all required variable/env settings in the script! cron does NOT source your dot files.
    2. Redirect stdout, but leave stderr unmanaged. This is a cheap technique, but whenever I don’t have time to test for all possible errors I simply set up my .forward file and let cron email me the output produced by the cron script. Though, to be complete, you should really manage your stderr in other fashions.

  9. Keep your exit/return codes categorized.
    Not always important for small scripts, but a good practice.
    For any sort of error checking your script might perform, use a unique error code for each situation in which you decide to exit the shell script. That will make invocations of your script more manageable.

  10. Avoid “magic” numbers.
    Anywhere you are comparing a value to some seemingly arbitrary number, go ahead and assign that number to a meaningfully named variable. Then your comparison reads a lot better.
    Using “$CURRENT_VALUE -gt $THRESHOLD” is much better than finding “$CURRENT_VALUE -gt 83” buried in some script and not having any clue what the number 83 signifies aside from the surrounding code.

  11. Use unique temporary files.
    Never do this: /path/to/some/command -option > command.out.
    You are assuming that you are sitting in a directory where you have permissions to create a temporary file and, secondly, that no one will ever be running the same script at the same time you are.
    Many systems provide a mktemp command that makes creating temporary files easy. I typically employ a convention where I define my temporary file space as “TEMP=/tmp/.myshellname$$_”. Then let’s say I need a temp file to capture the output from ps. I might redirect it to ${TEMP}raw_ps.
    And finally, at the end of the script, or defined in a shell function, you can clean up every temporary file with one line: rm -f ${TEMP}*.
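
To pull several of these points together, here is a rough sketch of how they might look in one small script. Everything specific in it is an assumption made up for illustration: the task (counting the entries in a directory), the names MAX_ENTRIES, err, count_entries and main, the optional settings.sh file, and the exit code values. It is written for a plain POSIX sh, assumes the mktemp utility is available, and is not meant as the one right way to lay out a script.

```sh
#!/bin/sh
# Hypothetical example: count the entries in a directory and complain
# when there are more than a threshold. All names are illustrative.

# ---------------- optional settings: edit as needed ----------------
MAX_ENTRIES=100     # a named threshold instead of a bare "100"
# ---------------- do not modify beyond this point ------------------

REL_DIR=$(dirname -- "$0")    # directory this script lives in
PROG=${0##*/}                 # script name without its path
USAGE="Usage: $PROG [-h] <directory>"

# One exit code per failure category (values are arbitrary).
E_USAGE=64
E_BAD_DIR=65
E_TOO_MANY=66

# Optional overrides kept next to the script (settings.sh is hypothetical).
if [ -f "$REL_DIR/settings.sh" ]; then
    . "$REL_DIR/settings.sh"
fi

# Errors go to STDERR; informational messages go to STDOUT.
err() { echo "$PROG: $*" 1>&2; }

# $1 = directory to list, $2 = scratch file for the raw listing
count_entries() {
    ls -A "$1" > "$2" || return 1
    wc -l < "$2" | tr -d ' '
}

main() {
    if [ "$#" -eq 1 ] && [ "$1" = "-h" ]; then
        echo "$USAGE"
        return 0
    fi
    if [ "$#" -ne 1 ]; then
        echo "$USAGE" 1>&2
        return "$E_USAGE"
    fi
    dir=$1
    if [ ! -d "$dir" ]; then
        err "not a directory: $dir"
        return "$E_BAD_DIR"
    fi

    # Unique scratch space, removed again as soon as we are done with it.
    tmp_dir=$(mktemp -d "${TMPDIR:-/tmp}/$PROG.XXXXXX") || return 1
    if ! entries=$(count_entries "$dir" "$tmp_dir/raw_ls"); then
        rm -rf "$tmp_dir"
        err "cannot list $dir"
        return "$E_BAD_DIR"
    fi
    rm -rf "$tmp_dir"

    echo "$dir contains $entries entries"
    if [ "$entries" -gt "$MAX_ENTRIES" ]; then
        err "more than $MAX_ENTRIES entries in $dir"
        return "$E_TOO_MANY"
    fi
    return 0
}
```

The sketch only defines its functions; a real script would end with a line such as main "$@"; exit $? so that the return code of main becomes the exit code of the script. Keeping the work inside main is a matter of taste, but it does keep the settings, the error codes and the logic in clearly separate places.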

In general, well-written code/scripts should read well and be well organized. Every principle discussed above has one purpose: maintainability.

October 19, 2006

Automation in Three Phases

Filed under: adminstration — jonEbird @ 6:05 pm


My previous post talked about my passion for automation in administration work. My claim was that once you started automating your tasks you wouldn’t look back. So now, here is my proposal for how you might develop those wonderful habits. You develop your automation in three phases:

Phase I.

  This is your initial attempt at installing the software or conducting some other procedure. Keep an editor open and keep track of every explicit step you take… each change directory, file edit, useradd, chmod, limits update, kernel parameter. The finished document can double as an appendix section to your Disaster Recovery (DR) documentation detailing every explicit step and detail.

Phase II.

  The next phase doesn’t come along until you need to repeat the same procedure again. At this point, you retrieve your notes from the initial work. Chances are, there are slight differences with this iteration of the work, such as executing on another machine, using a different account, installing into a different directory, or using a different database. What you realize is that these differences are merely cosmetic and don’t really pertain to the piece of software, per se, but are instead mere environmental changes. When you notice this you suddenly realize that you can take that first document and create a script out of it. There is not much need to get fancy at this point. Start by creating variables for all of those environmental values and move those definitions to the top of your script. E.g. if in your first doc you executed a “useradd myapp1”, you’ll change it to setting “username=myapp1” and update the command to be “useradd $username”. That’s it. Again, no need to get fancy at this point.
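
  A phase II script built from the useradd example above might look something like the sketch below. Only “useradd myapp1” comes from that example; the install directory, the mkdir and chown steps, and the run helper with its DRY_RUN switch are assumptions, added so the sketch can be run safely without creating a real account.

```sh
#!/bin/sh
# Phase II sketch: the phase I notes turned into a script.
# All values below are hypothetical placeholders.

# ---- environment-specific values: the only part that changes per run ----
username=myapp1
install_dir=/opt/myapp1
DRY_RUN=yes    # "yes" prints each command instead of running it

# Run a command, or just show it when DRY_RUN is "yes".
run() {
    if [ "$DRY_RUN" = "yes" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# The steps copied from the phase I notes, with the literal values
# replaced by the variables defined above.
install_steps() {
    run useradd "$username"
    run mkdir -p "$install_dir"
    run chown "$username" "$install_dir"
}

install_steps
```

  Left as is, the script only prints the three commands it would run. Once the values at the top match the new environment, setting DRY_RUN=no makes it execute them for real.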

Phase III.

  Just like the last phase, this phase doesn’t begin until you need to perform the same task yet again. You retrieve the script used in phase II, but this time, before executing it, you might spot some key improvements that can be made to the script. It is the same principle as in writing papers; you need to give yourself some time after writing your first draft before you’ll see the problems with the paper. In this case, hopefully you see some key areas which can use improvement, such as error checking or abstracting the process even further.

I have used this technique for nearly every application I have to install, from the simple Apache to the very error-prone Oracle RAC install. In each case, I have a script which performs each preparation step as well as the final install. The beauty of approaching your automation in three steps is that you can ease your way into it. The time between iterations can also help you realize key points for the final script. My first Oracle RAC installation took days. Now, I can install a multi-node Oracle RAC cluster in a couple of hours. Aside from the obvious speed benefit, I am also getting a consistent result. Consistency leads to predictability, and predictability leads to easier, brain-dead administration, which is what we are really trying to accomplish. Right?
