sed Stream Editor
The sed or “stream editor” program is a
Unix/Linux command-line utility, developed at Bell Labs in
1973, with the following features:
sed <inputfile >outputfile
sed -f scriptfile <inputfile >outputfile
A very simple sed program can be written directly on the
command line:
sed 'program' <inputfile >outputfile
sed makes a
single pass through its input file, reading a line at a time, applying
the script to that line, and writing the line out. The flow is
therefore: read a line, apply the script, write the line, then repeat with the next line.
A typical action is the substitution
s/RE/newdata/, where newdata is substituted
for the matching regular expression, if found.
A sed script can contain many lines, each of which
specifies another action. Each is executed in order against the
current data line. After all actions have taken place, the
modified line is written out. If none of the actions match, the line
is normally written out unchanged and the next line is read.
Linux
provides GNU sed (and GNU AWK); from the
user perspective the main improvement is that both support extended
regular expressions (GNU sed when invoked with -r or -E), as do vim, GNU egrep,
Perl, etc.
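As a small illustration of several actions running in order against the current line, the sketch below pipes two made-up lines through a two-action script. The sample input is an assumption chosen only for the demonstration.

```shell
# Two substitution actions, applied in order to each input line.
# The second action sees the result of the first, so "cat" becomes
# "dog" and then "bird"; a line that starts as "dog" also ends as "bird".
result=$(printf 'cat\ndog\n' | sed -e 's/cat/dog/' -e 's/dog/bird/')
printf '%s\n' "$result"
```

Because each action works on whatever the previous one left behind, the order of the lines in a script matters.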
Here’s a very simple but very useful example. This command will
match every line of fileone which contains the pattern
“John Doe” and when found, executes a substitution action to
replace “John” with “Jane”. The resulting
file is written to filetwo.
sed '/John Doe/ s/John/Jane/' fileone >filetwo
OK, you might admit this is probably faster than loading
fileone into an interactive editor like emacs or
vim, doing a global search and replace, and writing the
result. It gets much better, though. Wrap a for
loop around that statement and you can make that change on
every text file in the current directory:
for f in *.txt; do sed -i '/John Doe/ s/John/Jane/' "$f"; done
The -i option (a GNU sed extension) says to replace the
original file with the modified one. If you are not fully confident in the script, it
can also make backups; follow the -i with a suffix to use:
for f in *.txt; do sed -i.bak '/John Doe/ s/John/Jane/' "$f"; done
Note this will create xxxxxx.txt.bak for every file it processes,
even if the .txt file was unchanged.
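To try the backup behavior without touching real files, something like the following sketch can be run in a scratch directory. It assumes GNU sed and a mktemp that supports -d; the directory variable and file contents are hypothetical.

```shell
# Build a throwaway directory with one file that matches and one that does not.
scratch=$(mktemp -d)
printf 'John Doe\nJohn Smith\n' > "$scratch/a.txt"
printf 'nothing to change\n'    > "$scratch/b.txt"

# Edit in place, keeping the original of every processed file as *.txt.bak.
for f in "$scratch"/*.txt; do
    sed -i.bak '/John Doe/ s/John/Jane/' "$f"
done

cat "$scratch/a.txt"   # edited copy
ls "$scratch"          # a.txt a.txt.bak b.txt b.txt.bak
```

Only the line containing "John Doe" is changed; "John Smith" is left alone because the address /John Doe/ does not select it.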
Even more impressively, if you are using zsh instead of
bash, then three extra characters cause this one
line to do its thing on every .txt file in the entire
directory tree!
for f in **/*.txt; do sed -i -e '/John Doe/ s/John/Jane/' $f; done
(That “**/” notation is the signal for zsh to recursively
descend all directories below and including the current one, looking
for files that match “*.txt”.)
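In a shell where the “**/” glob is not available, roughly the same effect can be had with find. This is a sketch rather than the method used above: it assumes GNU sed for -i, and the scratch directory and file names are made up for the demonstration.

```shell
# Create a small directory tree to experiment on.
tree=$(mktemp -d)
mkdir -p "$tree/sub/deeper"
printf 'John Doe\n' > "$tree/top.txt"
printf 'John Doe\n' > "$tree/sub/deeper/nested.txt"

# find walks the whole tree and hands every regular *.txt file to sed.
find "$tree" -type f -name '*.txt' -exec sed -i '/John Doe/ s/John/Jane/' {} +

cat "$tree/sub/deeper/nested.txt"
```

The quoting around '*.txt' keeps the shell from expanding the pattern before find sees it.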
If you have a thousand files to modify, the savings in time and effort
would be enormous compared to using vim, e.g., to
edit the files one by one. (This is not a far-fetched example.
Suppose that your company’s advertising department decides that
your FoobarCalc product is hereby to be called
WonderCalc II instead, and you have the job of
changing all the corporate documentation and web sites.)
If the script is short enough, you can put it on the command line even if it has multiple commands. Just separate them with semicolons. For example:
sed -r -e 's/  */ /g ; s/\. /.  /g' test.txt
The first statement s/  */ /g reduces all runs
of one or more blanks to a single blank. The second statement
s/\. /.  /g reverses the first in a special
case: it changes period-blank to period-blank-blank. Given the
file
Line   one.   A second    sentence.  Next sentence.
Line two.    Third   sentence.
the output will be:
Line one.  A second sentence.  Next sentence.
Line two.  Third sentence.
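That earlier combination of a for loop (in
bash or zsh) and recursion (zsh
only) is a very powerful feature. Here’s another
zsh example of exactly the same structure. It creates a
100x100 pixel (at most) thumbnail for every .jpg file in your
directory tree, writing each thumbnail next to its original with a th- prefix:
for i in **/*.jpg; do convert -geometry 100x100 $i ${i:h}/th-${i:t}; done
(The zsh modifiers :h and :t give the directory part and the file-name part of $i, so the th- prefix lands on the file name rather than on the front of the path.)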
convert is part of the imagemagick suite of
command-line utilities, which is commonly installed on Linux
systems. If it isn’t there, then on a Debian-based system execute:
apt-get install imagemagick
The AWK programming language was developed in 1977 at
Bell Labs by Alfred Aho, Peter Weinberger, and Brian Kernighan, hence
the name. It was built on the pre-existing sed; its
major purpose is producing formatted reports. Among the features of
AWK are the following:
Several versions of AWK have appeared over the years:
mawk, nawk (new awk),
etc. On many Linux distributions, invoking awk will get gawk,
the GNU version.
Unlike sed, AWK scripts
can use a complete programming language. The basic flow is the same
as shown above, except that AWK provides jumps,
user-defined functions, loops, etc.
An AWK program is a series of statements of the form
condition {action}. A condition is normally a RE used to
select lines to operate on, and an action is a series of commands.
Other conditions are BEGIN and END to mark
actions to be taken before any line is read or after the last one is
processed. If there is no condition, the action is applied to each
line of input.
AWK has user-defined variables, which can be strings,
numbers, or arrays. Math functions are available for numeric
variables.
AWK automatically divides input lines into
fields, which are available (for each line) in variables $1,
$2, ...
AWK also has many built-in functions; for example
length() returns the number of characters in the
current line.
Unlike sed, AWK’s default behavior
is to output text only when specifically requested, usually by the
print command.
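The features listed above can be seen together in one short sketch. The item/quantity data is hypothetical; the program uses a BEGIN block, a RE condition, the automatic fields $1 and $2, the built-in length(), a user-defined variable, and an END block.

```shell
# Sample data (hypothetical): an item name and a quantity on each line.
report=$(printf 'apple 3\nbanana 5\napricot 2\n' | awk '
BEGIN { print "item length qty" }                 # runs before any input is read
/^a/  { print $1, length($1), $2; total += $2 }   # only lines starting with "a"
END   { print "total", total }                    # runs after the last line
')
printf '%s\n' "$report"
```

The "banana" line produces no output at all, since nothing asked for it to be printed.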
The following AWK program will output the
number of lines, words, and characters in a document. Since there is
no condition, the main block is applied to every line.
{
c = c + length($0) + 1 # $0 is the current line.
w += NF # same, using C shorthand
}
END {print NR, w, c }
If the above program was stored in the file wc.awk, it would be
invoked with:
awk -f wc.awk textfile
or, alternatively, the script can be made executable:
#!/usr/bin/awk -f
{
c = c + length($0) + 1 # $0 is the current line.
w += NF # same, using C shorthand
}
END {print NR, w, c }
invoked (after chmod +x wc.awk) as
./wc.awk textfile
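A quick way to gain confidence in the counting logic is to run it on a tiny file and compare with wc. The sketch below inlines the same logic on the command line; the sample text is made up, and the comparison only holds for plain ASCII input, since length() counts characters while wc counts bytes by default.

```shell
# A two-line sample file: 2 lines, 3 words, 14 characters including newlines.
sample=$(mktemp)
printf 'one two\nthree\n' > "$sample"

# Same logic as wc.awk: add 1 per line for the newline that AWK strips off.
counts=$(awk '{ c += length($0) + 1; w += NF } END { print NR, w, c }' "$sample")
printf '%s\n' "$counts"

# For comparison; wc pads its columns and appends the file name.
wc "$sample"
```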
As another example, the program
{
print $2 ",", $1
}
would take a file in the form:
John Dierdorf
Don Jones
Elmer Fudd
Amy Johnson
...
and put it out with the names reversed and a comma inserted:
Dierdorf, John
Jones, Don
Fudd, Elmer
Johnson, Amy
...
Again, with no condition, it is applied to every line. This program is short enough that it would be easier to simply type it on the command line:
awk '{print $2 ",", $1}' namefile
If you want to print only names containing “John”, then
provide a condition:
awk '/John/ {print $2 ",", $1}' namefile
Dierdorf, John
Johnson, Amy
Anchoring the pattern with ^ selects only lines that begin with “John”:
awk '/^John/ {print $2 ",", $1}' namefile
Dierdorf, John
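A condition does not have to test the whole line. As a sketch, the standard AWK match operator ~ can restrict the RE to a single field, here the last name in $2. The name list is the one from the example above, supplied inline so that the snippet is self-contained.

```shell
# The sample name list, one "First Last" pair per line.
names=$(printf 'John Dierdorf\nDon Jones\nElmer Fudd\nAmy Johnson\n')

# Select only lines whose second field (the last name) contains "John".
by_last=$(printf '%s\n' "$names" | awk '$2 ~ /John/ {print $2 ",", $1}')
printf '%s\n' "$by_last"
```

John Dierdorf is skipped this time, because "John" appears in his first field only.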
Extended regular expressions, used by sed, AWK, vim, Perl, egrep, and other programs, add very useful features to “standard” REs:
/a+b/ will match “ab”, “aaab”, etc.
/ab?/ will match “a” or “ab”.
/big|small/ matches either “big” or
“small”.
It is tempting to use /[a-z]/ to match a lower-case letter. This
will miss é, ö, ñ, and so on. To prevent this
error, use the POSIX classes instead:
[:alnum:] an Alphanumeric character in the current
locale.
[:alpha:] an Alphabetic char, ditto.
[:digit:] a Numeric digit, ditto.
[:xdigit:] a Hexadecimal digit.
[:upper:], [:lower:], [:punct:], [:cntrl:], etc.
Note that these are character classes, which go inside square
brackets. Therefore, the POSIX way to match a lower-case character is
[[:lower:]], while matching an upper-case character or
digit at a particular position is done using
[[:upper:][:digit:]].
/a.*b/ will match the longest
possible string between “a” and “b”;
this behavior is called “greedy” matching, because the RE
gobbles up everything it can. The Perl-style syntax /a.*?b/ will
match the shortest, but it is not part of POSIX extended REs, so sed and AWK do not accept it.
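The extended-RE features in this list are easy to try from the command line. The sketch below uses grep -E and sed -E (both accepted by the GNU tools) on made-up input; the variable names are only there to keep the four results apart.

```shell
# + : one or more of the preceding item ("b" alone does not match).
plus=$(printf 'ab\naaab\nb\n' | grep -E 'a+b')

# | : alternation.
alt=$(printf 'big\nsmall\nmedium\n' | grep -E 'big|small')

# POSIX classes inside a bracket expression: upper-case letter or digit first.
classes=$(printf 'a1\nB2\n3c\n' | grep -E '^[[:upper:][:digit:]]')

# Greedy matching: a.*b runs from the first "a" to the LAST "b" on the line.
greedy=$(echo 'xaXbYbz' | sed -E 's/a.*b/[&]/')

printf '%s\n' "$plus" "$alt" "$classes" "$greedy"
```

In the last command the & in the replacement stands for the matched text, so the brackets show exactly how much of the line the RE consumed.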
Note that Perl has an even richer set of RE syntax,
including exotic stuff like Look-Ahead (match
“a” only if followed by “b”, but do not
actually process the “b”). Python, Ruby,
Microsoft .NET, and Java have all adopted the Perl
extensions.
In either bash or zsh, you can insert the
output of a command into the line you are building by
enclosing the command in backquotes or inside $(........). Either of these
echo Tomorrow at this time it will be `date -u --date="tomorrow"` in London.
echo Tomorrow at this time it will be $(date -u --date="tomorrow") in London.
will produce something like:
Tomorrow at this time it will be Mon Apr 16 03:46:19 UTC 2012 in London.
(The $(....) form is preferred because it makes the shell’s
quoting rules much simpler.)
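One place where the simpler quoting shows is nesting. In the sketch below one substitution sits inside another, each with its own double quotes, and no backslashes are needed; the path is just an arbitrary example string and does not have to exist.

```shell
# Nested substitution: the inner command runs first, and its output
# becomes the argument of the outer command.
parent=$(basename "$(dirname /usr/local/bin)")
echo "The parent directory of /usr/local/bin is named $parent"
```

With backquotes, the inner pair would have to be escaped with backslashes to get the same effect.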
One of the most useful applications of this technique is combining it
with grep. Grep’s -l option outputs only the file
names where there is a match to the given RE. Therefore, you can
reduce the number of files processed by sed,
AWK, or any other program by using a loop like this:
for i in $(grep -l ABC *.txt); do sed -f sedfile $i; done
Sed will only work on those .txt files in the current directory which
contain the string “ABC”. Note that grep
also has a “recursive” switch, so
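for i in $(grep -lr --include='*.txt' ABC .); do sed -f sedfile $i; done
will find all text files containing ABC anywhere in the tree below the
current directory. (--include is a GNU grep option; with a plain *.txt argument, -r would descend only into directories whose own names end in .txt.)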
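A loop over $(grep -l ...) splits the file list on blanks, so it misbehaves when a file name contains a space. One possible variation, sketched here with hypothetical file names and an inline script in place of sedfile, reads grep's output one line at a time instead.

```shell
# Scratch files: one contains ABC and has a space in its name.
work=$(mktemp -d)
printf 'ABC here\n' > "$work/has abc.txt"
printf 'nothing\n'  > "$work/plain.txt"

# grep -l prints one matching file name per line; read -r takes each
# line whole, so embedded spaces survive.
out=$(grep -l ABC "$work"/*.txt | while IFS= read -r f; do
    sed 's/ABC/XYZ/' "$f"
done)
printf '%s\n' "$out"
```

This still assumes that no file name contains a newline, which is rarely a problem in practice.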