Saturday, December 16, 2006

Unix command in 'awk'

uname -a: Linux MyMachine01 2.4.21-37.ELsmp #1 SMP Wed Sep 7 13:32:18 EDT 2005 x86_64 unknown unknown GNU/Linux
i.e. Red Hat Enterprise Linux 3 running on AMD64. Shell: tcsh

Today's tip is about how to make awk run Unix commands on its input.

As happened to me today, sometimes we want 'awk' to run a UNIX command on a field of its input. A Google search turned up the answer without much trouble, but it is worth recording here nevertheless.

What I wanted to do was look for a pattern in all the files in all the subdirectories of the current working directory (CWD) and, wherever the pattern is present, print the name of the directory that contains the file.

Let's say the pattern was: PaTtErN; and the directory structure:
Top/:
Top/Dir_1: Top/Dir_2: Top/Dir_3: Top/Dir_4 ...
Top/Dir_1/a.out Top/Dir_2/a.out Top/Dir_3/a.out ...
Top/Dir_1/b.out Top/Dir_2/b.out Top/Dir_3/b.out ...
Top/Dir_1/c.out Top/Dir_2/c.out Top/Dir_3/c.out ...


This is how I proceeded:
  1. Let's start with a simple grep command:
    grep PaTtErN Top/Dir_*/*
    The output of this command, for every line that matches in any of the files, is of the format:
    <FILENAME>:<MATCHING-LINE>
  2. Now let's extract the filenames from that list. Here comes the use of awk, to get a particular column:
    grep PaTtErN Top/Dir_*/* | awk -F':' '{print $1}'
    This prints the first column ($1) of grep's output, treating ':' as the column separator (-F':').
    That gives us the names of the files containing the pattern we are searching for.

  3. The problem with this output is that each directory may hold several files, and each file may contain the pattern several times. What we really care about is the parent directory name, for example, Top/Dir_2. Here I would introduce a lesser-known but very useful command: dirname. I would also encourage you to look at the man page of a similar command: basename. These two come in really handy sometimes, as we'll see shortly.
    grep PaTtErN Top/Dir_*/* | awk -F':' '{print "dirname " $1 }' | sh
    Here awk does not run anything itself: for every match it prints one line of text of the form 'dirname <First-Column-Of-grep-Output>', and piping those lines to the shell (sh) executes them, which gives us the directory name of each file that contains the pattern.
    The problem is that this prints the directory name once for every match in every file, but we want each directory only once.

  4. No problem, sorting the output and dropping the duplicates takes care of that:
    grep PaTtErN Top/Dir_*/* | awk -F':' '{print "dirname " $1 }' | sh | sort | uniq | less
  5. And lo and behold, we have what we wanted.
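
For comparison, here is a rough sketch of two other ways to get the same list. These are not what I ran above; they are alternatives, tried on a throwaway copy of the example tree (the file contents are made up for the demo). The first lets awk run dirname itself through its system() function instead of printing commands for sh. The second skips awk altogether: grep -l prints each matching file name once, and xargs hands the names to dirname one at a time. Both assume the file names contain no spaces, colons or shell metacharacters, and the script is written for sh, not tcsh.

```shell
#!/bin/sh
# Build a throwaway copy of the example tree (file contents are made up).
tmp=$(mktemp -d)
cd "$tmp" || exit 1
mkdir -p Top/Dir_1 Top/Dir_2 Top/Dir_3
printf 'PaTtErN one\nPaTtErN two\n' > Top/Dir_1/a.out   # two matches in one file
printf 'nothing here\n'             > Top/Dir_2/a.out   # no match
printf 'PaTtErN\n'                  > Top/Dir_3/a.out
printf 'PaTtErN\n'                  > Top/Dir_3/b.out   # two matching files in one directory

# Variant 1: awk runs dirname itself, once per matching line, via system().
result_system=$(grep PaTtErN Top/Dir_*/* | awk -F':' '{ system("dirname " $1) }' | sort -u)

# Variant 2: no awk. grep -l lists each matching file once;
# xargs -n 1 calls dirname with one file name at a time.
result_grep_l=$(grep -l PaTtErN Top/Dir_*/* | xargs -n 1 dirname | sort -u)

echo "$result_system"
echo "$result_grep_l"

# Clean up the demo tree.
cd / && rm -rf "$tmp"
```

Each variant prints Top/Dir_1 and Top/Dir_3. Here sort -u does the job of sort | uniq in a single step.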

Tip: It is always a good idea to pipe the output of a command through less, especially when you are not sure how big the output is going to be!
