sed Stream Editor

Features

The sed or “stream editor” program is a Unix/Linux command-line utility, developed at Bell Labs in 1973, with the following features:

Examples

Here’s a very simple but very useful example. This command matches every line of fileone that contains the pattern “John Doe” and, on each such line, replaces the first “John” with “Jane”. The result is written to filetwo.

sed '/John Doe/ s/John/Jane/' fileone >filetwo
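To see what the /John Doe/ address does, here is a small sketch run against made-up input (the three sample lines are illustrative only). Only the line containing “John Doe” is changed; “John Smith” is left alone because the address does not match that line.

```shell
# Sample input is hypothetical; any text with these names would do.
out=$(printf '%s\n' 'John Doe, accounting' 'John Smith, sales' 'Mary Doe, legal' |
      sed '/John Doe/ s/John/Jane/')
printf '%s\n' "$out"
# Jane Doe, accounting
# John Smith, sales
# Mary Doe, legal
```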

OK, you might admit this is probably faster than loading fileone into an interactive editor like emacs or vim, doing a global search and replace, and writing the result. It gets much better, though. Wrap a for loop around that statement and you can make that change on every text file in the current directory:

for f in *.txt; do sed -i '/John Doe/ s/John/Jane/' "$f"; done
The -i option (an extension found in GNU sed, not part of POSIX) says to replace the original file with the modified one. If you are not fully confident in your script, it can also make backups; follow the -i with the suffix to use:

for f in *.txt; do sed -i.bak '/John Doe/ s/John/Jane/' "$f"; done
Note this will create xxxxxx.txt.bak for every file it processes, even if the .txt file was unchanged.
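The backup behavior can be checked safely with a sketch like the following, which works in a throwaway directory. The file names and contents are invented for the demonstration, and it assumes a sed that accepts -i with an attached suffix, as GNU sed does.

```shell
# Work in a throwaway directory so nothing real is modified.
dir=$(mktemp -d)
printf 'John Doe\n'   > "$dir/a.txt"
printf 'Mary Smith\n' > "$dir/b.txt"   # no match: content stays the same

for f in "$dir"/*.txt; do
    sed -i.bak '/John Doe/ s/John/Jane/' "$f"
done

changed=$(cat "$dir/a.txt")        # Jane Doe
backup=$(cat "$dir/a.txt.bak")     # John Doe
ls "$dir"                          # lists a.txt, a.txt.bak, b.txt, b.txt.bak
```

Note that b.txt.bak appears even though b.txt never contained the pattern.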

Even more impressively, if you are using zsh instead of bash, then three extra characters cause this one line to do its thing on every .txt file in the entire directory tree!

for f in **/*.txt; do sed -i -e '/John Doe/ s/John/Jane/' "$f"; done
(That “**/” notation is the signal for zsh to recursively descend all directories below and including the current one, looking for files that match “*.txt”. Bash 4 and later accept the same notation once you run shopt -s globstar.)
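For shells without “**/”, a roughly equivalent sketch lets find walk the tree instead. This is an alternative technique, not the zsh method above, and the directory layout below is invented for the demonstration.

```shell
# Throwaway tree: one file at the top, one two levels down.
top=$(mktemp -d)
mkdir -p "$top/docs/old"
printf 'John Doe\n' > "$top/readme.txt"
printf 'John Doe\n' > "$top/docs/old/notes.txt"

# find hands every matching path to sed, which edits each file in place
# and keeps a .bak copy next to it.
find "$top" -type f -name '*.txt' -exec sed -i.bak '/John Doe/ s/John/Jane/' {} +

shallow=$(cat "$top/readme.txt")          # Jane Doe
deep=$(cat "$top/docs/old/notes.txt")     # Jane Doe
```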

If you have a thousand files to modify, the savings in time and effort would be enormous compared to using vim, e.g., to edit the files one by one. (This is not a far-fetched example. Suppose that your company’s advertising department decides that your FoobarCalc product is hereby to be called WonderCalc II instead, and you have the job of changing all the corporate documentation and web sites.)

If the script is short enough, you can put it on the command line even if it has multiple commands. Just separate them with semicolons. For example:

sed -r -e 's/  */ /g ; s/\. /.  /g' test.txt
The first statement s/  */ /g reduces all runs of more than one blank to a single blank. The second statement s/\. /.  /g reverses the first in a special case — it changes period-blank to period-blank-blank. Given the file
Line one.  A       second        sentence. Next sentence.
Line two.     Third sentence.
the output will be:
Line one.  A second sentence.  Next sentence.
Line two.  Third sentence.

That earlier combination of a for loop (in bash or zsh) and recursion (with “**/”) is a very powerful feature. Here’s another zsh example of exactly the same structure. It creates a thumbnail of at most 100x100 pixels for every .jpg file in your directory tree:

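for i in **/*.jpg; do convert -geometry 100x100 "$i" "${i:h}/th-${i:t}"; done
The zsh modifiers :h and :t split each path into its directory and its file name, so the thumbnail for photos/cat.jpg is written to photos/th-cat.jpg. (Writing th-$i instead would produce th-photos/cat.jpg, a path in a directory that does not exist.) convert is part of the ImageMagick suite of command-line utilities, which is installed by default on many Linux distributions. If it isn’t there for some reason, then on a Debian-based system execute (as root):
apt-get install imagemagick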

AWK

Features

The AWK programming language was developed in 1977 at Bell Labs by Alfred Aho, Peter Weinberger, and Brian Kernighan, hence the name. It was influenced by the pre-existing sed and grep; its major purpose is producing formatted reports. Among the features of AWK are the following:

Examples

  1. The following simple AWK program will output the number of lines, words, and characters in a document. Since there is no condition, the main block is applied to every line.
    {
       c = c + length($0) + 1   # $0 is the current line; +1 counts the newline.
       w += NF                  # NF is the number of fields (words); += is C shorthand.
    }
    END {print NR, w, c }
    
    If the above program was stored in the file wc.awk, it would be invoked with:
    awk -f wc.awk textfile
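    or, alternatively, the script can be given an interpreter line and made executable (chmod +x wc.awk):
    
    #!/usr/bin/awk -f
    {
       c = c + length($0) + 1   # $0 is the current line; +1 counts the newline.
       w += NF                  # NF is the number of fields (words); += is C shorthand.
    }
    END {print NR, w, c }
    
    and then invoked as
    ./wc.awk textfile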
  2. The program
    {
       print $2 ",", $1
    }
    would take a file in the form:
    John Dierdorf
    Don Jones
    Elmer Fudd
    Amy Johnson
    ...
    and put it out with name reversed and comma inserted:
    Dierdorf, John
    Jones, Don
    Fudd, Elmer
    Johnson, Amy
    ...
    Again, with no condition, it is applied to every line. This program is short enough that it is easier to simply type it on the command line:
    awk '{print $2 ",", $1}' namefile
    If you want to print only names containing “John”, then provide a condition:
    awk '/John/ {print $2 ",", $1}' namefile
    which prints:
    Dierdorf, John
    Johnson, Amy
    To match only lines that begin with “John”, anchor the pattern with ^:
    awk '/^John/ {print $2 ",", $1}' namefile
    which prints:
    Dierdorf, John
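
Both AWK examples can be tried without creating any files by piping sample text in. The sketch below uses the same programs as above on made-up input; the two-line inputs are assumptions chosen to keep the arithmetic easy to check by hand.

```shell
# Same program as wc.awk, given inline; the two input lines are made up.
counts=$(printf 'one two\nthree\n' |
         awk '{ c += length($0) + 1; w += NF } END { print NR, w, c }')
echo "$counts"      # 2 3 14  (lines, words, characters)

# A condition restricts the action to matching lines; ^ anchors the
# pattern to the start of the line.
names=$(printf 'John Dierdorf\nAmy Johnson\n' |
        awk '/^John/ { print $2 ",", $1 }')
echo "$names"       # Dierdorf, John
```

The first result should agree with what wc reports for the same input.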

Extended Regular Expressions

Extended regular expressions, used by AWK, egrep, sed (when given the -r or -E option), and, in their own dialects, vim and Perl, add very useful features to “standard” REs:

One or More
/a+b/ will match “ab”, “aaab”, etc.
Zero or One
/ab?/ will match “a” or “ab”.
Alternation
/big|small/ matches either “big” or “small”.
POSIX Character Classes
These are locale-aware ways of specifying characters. The unwary might use /[a-z]/ to match a lower-case letter. This will miss é, ö, ñ, and so on. To prevent this error, use the POSIX classes instead:
  • [:alnum:] an Alphanumeric character in the current locale.
  • [:alpha:] an Alphabetic char, ditto.
  • [:digit:] a Numeric digit, ditto.
  • [:xdigit:] a Hexadecimal digit.
  • [:upper:], [:lower:], [:punct:], [:cntrl:], etc.

Note that these are character classes, which go inside the square brackets of a bracket expression. Therefore, the POSIX way to match a lower-case character is [[:lower:]], while matching an upper-case character or digit at a particular position is done using [[:upper:][:digit:]].
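
The operators above can be tried quickly with grep -E. This is only a sketch: the input lines are made up, -x is used so that the whole line has to match, and -c counts the matching lines.

```shell
# grep -E selects extended regular expressions; -x requires the whole
# line to match, and -c counts matching lines.
plus=$(printf 'ab\naaab\nb\n'     | grep -Ecx 'a+b')        # 2: ab, aaab
opt=$(printf 'a\nab\nabb\n'       | grep -Ecx 'ab?')        # 2: a, ab
alt=$(printf 'big\nsmall\nhuge\n' | grep -Ecx 'big|small')  # 2: big, small
cls=$(printf 'A\n7\nq\n'          | grep -Ecx '[[:upper:][:digit:]]')  # 2: A, 7
echo "$plus $opt $alt $cls"       # 2 2 2 2
```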

Lazy Matching
Normally, /a.*b/ will match the longest possible string between “a” and “b”. This behavior is called “greedy” matching, because the RE gobbles up everything it can. The syntax /a.*?b/ will match the shortest, but it is a Perl-style extension: POSIX tools such as sed, AWK, and egrep do not support it.
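
Since sed’s POSIX regular expressions have no lazy quantifier, a common workaround is a negated character class that forbids the closing character inside the match. The sketch below contrasts the two on a made-up string; the square brackets in the replacement only mark what was matched.

```shell
# Greedy: .* runs to the LAST b on the line.
greedy=$(echo 'xaYbZbx' | sed -E 's/a.*b/[&]/')       # x[aYbZb]x
# Shortest match without lazy syntax: forbid b inside the match.
short=$(echo 'xaYbZbx' | sed -E 's/a[^b]*b/[&]/')     # x[aYb]Zbx
echo "$greedy"
echo "$short"
```

This trick only works when the match ends at a single character; it is not a general replacement for lazy matching.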

Note that Perl has an even richer set of RE syntax, including exotic stuff like Look-Ahead (match “a” only if followed by “b”, but do not actually process the “b”). Python, Ruby, Microsoft .NET, and Java have all adopted the Perl extensions.

A Useful GREP Trick

In either bash or zsh, you can insert the output of a command into the line you are building by enclosing the command in backquotes or inside $(........). Either of these

echo Tomorrow at this time it will be `date -u --date="tomorrow"` in London.
echo Tomorrow at this time it will be $(date -u --date="tomorrow") in London.
will produce something like:
Tomorrow at this time it will be Mon Apr 16 03:46:19 UTC 2012 in London.
(The $(....) form is preferred because it makes the shell’s quoting rules much simpler.)
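
One place the difference shows is nesting. In this sketch, which uses an example path string that is never read from disk, the $(...) form nests as written, while the backquote form needs its inner backquotes escaped.

```shell
# The path is only an example string; nothing is read from disk.
path=/usr/local/bin/tool

# $(...) nests without extra escaping.
parent=$(basename "$(dirname "$path")")

# The backquote form needs the inner backquotes escaped.
parent_bq=`basename \`dirname $path\``

echo "$parent $parent_bq"    # bin bin
```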

One of the most useful applications of this technique is combining it with grep. Grep’s -l option outputs only the names of the files that contain a match for the given RE. Therefore, you can reduce the number of files processed by sed, AWK, or any other program by using a loop like this:

for i in $(grep -l ABC *.txt); do sed -f sedfile $i; done
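Sed will only work on those .txt files in the current directory which contain the string “ABC”. Note that grep also has a “recursive” switch. With GNU grep, --include limits the search to .txt files and the trailing . starts it at the current directory, so
for i in $(grep -lr --include='*.txt' ABC .); do sed -f sedfile $i; done
will find all text files containing ABC anywhere in the tree below the current directory. (Writing grep -lr ABC *.txt would not do this: the shell expands *.txt in the current directory only, so subdirectories are never searched unless their own names end in .txt.)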
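
One caveat with the $(...) loop is that it splits file names at spaces. A hedged alternative is to read grep’s output one line at a time. The sketch below uses a throwaway directory with invented file names, and an inline s/ABC/XYZ/ script stands in for the sedfile of the example above.

```shell
# Throwaway directory with a file name that contains a space.
dir=$(mktemp -d)
printf 'ABC here\n' > "$dir/my notes.txt"
printf 'nothing\n'  > "$dir/other.txt"

# grep -l prints one name per line; read -r takes each line whole,
# so the space in "my notes.txt" does not split the name.
hits=$(cd "$dir" && grep -l ABC *.txt | while IFS= read -r f; do
           sed 's/ABC/XYZ/' "$f"
       done)
echo "$hits"    # XYZ here
```

File names that contain newlines would still break this loop.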


Last modified: Sun Apr 15 23:25:12 CDT 2012