Awk notes

Latest revision as of 19:31, 15 July 2023

📃 These are primarily notes, intended to be a collection of useful fragments, that will probably never be complete in any sense.

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Intro

Awk is useful for simple transforms of text, particularly text structured by fields, such as comma or tab separated columns.

Awk does matching (grep-like), splits records into fields that you can match on, and reformats output (cut/paste/sed/printf/whatnot-like).

It's actually a fairly minimal-syntax programming language (with few types - most things are strings) that for some basic tasks is more succinct than shell scripting, and sometimes it can be handier than sed.

...but if you need any real logic you may be better off skipping awk/sed/shell altogether and going for a more serious scripting language.

Basics

At its most basic, awk parses and walks whitespace-separated (by default) fields within newline-separated records.

For example, it could selectively present parts of the output of ps:

$ ps aux | awk '{ print "User "$1" has been running "$11" since "$9"." }'
User root has been running bash since Nov24.
User me has been running -bash since 10:24.
...

You can prepend a regexp, like /regexp/ {command}, often for grep-like filtering.

For example:

"When passwd's shell field (7) contains /bin/false or /nologin" (you can also match against $0, the full current record, to make it act rather like egrep):

$ awk -F':' '$7~/(\/bin\/false|\/nologin)/ {print $1" is not a real login"}' /etc/passwd
daemon is not a real login
...

"If the home dir field (6) doesn't have 'home' in it":

$ awk -F':' '$6!~/home/ {print $1" is a non-regular user -- homedir is "$6}' /etc/passwd
mysql is a non-regular user -- homedir is /var/lib/mysql
...
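A pattern does not have to be a regexp, and an action is optional. The sketch below (the wrapper name patterns_demo and the input lines are made up for illustration) shows a regexp pattern with no action, which just prints matching records, next to a numeric comparison used as a pattern:

```shell
# patterns_demo is a hypothetical wrapper; the input lines are made up.
patterns_demo() {
  printf 'root:0\nalice:1000\nnobody:65534\n' |
    awk -F':' '
      # regexp pattern without an action: matching records are printed as-is
      /^r/
      # expression pattern: numeric comparison on field 2
      $2 >= 1000 { print $1 " has UID " $2 }
    '
}
patterns_demo
```

This should print root:0 unchanged, followed by one "has UID" line each for alice and nobody.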

Seeing who is currently logged in, and since when:

$ last | awk '/still logged/ {print $1"\tsince "$4" "$5" "$6"\t(on "$2")" }'
root    since Wed Dec 5 (on pts/0)
root    since Aug 10 00:06      (on :0)
root    since Aug 9 03:38       (on tty1)

Input parsing, output formatting

Variables related to input, and output: (There are others that are sometimes useful in the middle of processing)

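FS: Field Separator (on input). The default is a single space, which is a special case: fields are split on runs of spaces and tabs, and leading and trailing whitespace is ignored.

can also be handed in with -F on the command line. When FS is longer than one character it is treated as a regular expression, so "[/~]" splits on either character, and "[ ]+" on runs of spaces.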

RS: Record Separator (on input, initially \n)

OFS: Output Field Separator (initially a single space; placed between the comma-separated arguments of print)
ORS: Output Record Separator (initially \n)

For example, a list of all users on one line:

$ awk 'BEGIN {FS=":"; ORS=", "} {print $1}' /etc/passwd

root, daemon, bin, sys, sync, games, man, lp, mail, ...
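Note that OFS only comes into play when print is given comma-separated arguments; the examples above concatenate strings, which bypasses it. A minimal sketch of the difference, plus a regexp field separator (the wrapper names show_ofs and split_on_class and the sample input are made up for illustration):

```shell
# Hypothetical wrappers around made-up input, so each piece can be re-run.
show_ofs() {
  printf 'alice:x:1000:1000\nbob:x:1001:1001\n' |
    awk 'BEGIN { FS=":"; OFS=" | " }
         # commas: the arguments are joined with OFS
         { print $1, $3 }
         # no comma: plain string concatenation, OFS is not used
         { print $1 $3 }'
}

split_on_class() {
  # FS longer than one character is a regexp: split on "/" or "~"
  echo 'usr/local~bin' | awk -F'[/~]' '{ print NF ": " $1 "," $2 "," $3 }'
}

show_ofs
split_on_class
```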

Blocks, processing in separate steps

You can define multiple blocks:

a BEGIN block, executed before any input is read

one or more main blocks (see below), which are unmarked and run for each record

an END block, executed after all input has been processed

This helps do setup, processing, and reporting, respectively.
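As a bare-bones sketch of that setup/processing/reporting split (sum_sizes is a hypothetical wrapper name, and the two-column input is made up):

```shell
# sum_sizes is a hypothetical wrapper; input is made-up "name size" pairs.
sum_sizes() {
  printf 'a.txt 10\nb.txt 32\nc.txt 8\n' |
    awk '
      # setup: runs once, before any input is read
      BEGIN { print "name size" }
      # processing: runs for every record
      { total += $2; print $1, $2 }
      # reporting: runs once, after the last record
      END { print "total", total, "in", NR, "records" }
    '
}
sum_sizes
```

The variable total needs no initialisation: an unset awk variable acts as 0 in numeric context.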

User summary:

cat /etc/passwd | awk 'BEGIN {FS=":"; OFS=""; ORS="\n\n"} $7!~/false/ {print "User: "$1"  (UID "$3", in group "$4")\n  Shell:     "$7"\n  Home dir:  "$6"\n  Name:      "$5  }'
 User: backup  (UID 34, in group 34)
   Shell:     /bin/sh
   Home dir:  /var/backups
   Name:      backup
 ...

Using associative arrays to summarize how many users use each shell:

cat /etc/passwd | awk '
   BEGIN {FS=":"} 
   {shells[$7]++}  
   END { for (shell in shells) printf("%15s users: %d\n", shell, shells[shell]) }' 

    /bin/false users: 11
       /bin/sh users: 16
     /bin/bash users: 3
     /bin/sync users: 1

#!/bin/bash
# mentions users that use more than negligible CPU and/or memory
ps --no-headers -axeo user,%cpu,%mem | \
  awk '{usercpu[$1]+=$2; usermem[$1]+=$3} 
       END { for (u in usercpu) { if (usercpu[u]>5 || usermem[u]>5) 
             printf("%15s using  %4d%% CPU  and %4d%% resident memory\n", 
                        u, usercpu[u], usermem[u]) }  }'

       postgres using     4% CPU  and   83% resident memory
       liquids+ using    60% CPU  and    1% resident memory
       www-data using     4% CPU  and   27% resident memory
           root using    52% CPU  and   11% resident memory

Using several main blocks makes sense when each has its own filter.

Example from munin:

netstat -s | awk '
/active connections ope/  { print "active.value " $1 }
/passive connection ope/  { print "passive.value " $1 }
/failed connection/       { print "failed.value " $1 }
/connection resets/       { print "resets.value " $1 }
/connections established/ { print "established.value " $1 }'

passive.value 181432543
failed.value 91976
resets.value 2954810
established.value 142

Seeing what programs are listening for connections, and what connections are currently established. (can be put into a shell script as-is):

 netstat -pan | egrep -v '\bunix\b' | awk -F'[/ ]+' '
   $0~/LISTEN/      {printf("%20s  listening on  %-20s  (PID %s) \n", $8, $4, $7)}
   $0~/ESTABLISHED/ {printf("%30s  <--connection-->  %s \n", $4, $5)}'

Or some more passwd related summary stuff:

#!/usr/bin/awk -f
BEGIN {
   FS=":"
   printf("Looking at line ")
}

{
   printf("%d, ", NR)
   shells[$7]++;
   if ($3<1000) subthousand++
   else         supthousand++
}
$6~/\/home/  { home++ }
$6!~/\/home/ { nothome++ }

END {
   printf("Done.\n")
   for (shell in shells) {
      printf("%15s users: %d\n", shell, shells[shell])
   }
   printf("Users with /home directories: %d.  Users based elsewhere: %d\n", home, nothome)
   printf("Users with UID <1000: %d,   >=1000: %d\n", subthousand, supthousand)
}

Save this into a file named 'passwdparser' and chmod +x it. Now you can run it like ./passwdparser /etc/passwd.

(...NR is the number of records seen so far, which is frequently used as line numbering)
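For instance, a minimal line-numbering sketch (number_lines is a hypothetical wrapper name and the input is made up; NF, the number of fields in the current record, is a built-in like NR):

```shell
# number_lines is a hypothetical wrapper around made-up input.
number_lines() {
  printf 'one two\nthree\n' |
    awk '{ print NR ": " $0 " (" NF " fields)" }'
}
number_lines
```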

Language notes

AWK has while, do-while, for loops, for-in loops, associative arrays (all arrays are associative; functions like split() fill them with indices starting at 1), a bunch of operators (~ and !~ seem to be the only unusual ones), and more.
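A small sketch of arrays and both loop styles (array_demo is a hypothetical wrapper name, the input is made up). Keep in mind that for-in visits keys in an unspecified order, so use a counting loop when order matters:

```shell
# array_demo is a hypothetical wrapper around made-up input.
array_demo() {
  echo 'red,green,blue' |
    awk '{
      # split() fills parts[1] .. parts[n] and returns n
      n = split($0, parts, ",")
      # counting loop: visits the elements in index order
      for (i = 1; i <= n; i++) print i, parts[i]
      # for-in loop: visits every key, in no particular order
      for (k in parts) count++
      print count " elements"
    }'
}
array_demo
```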

String functions include length, sub and gsub (replacing), substr, match, index, split, and sprintf (printf is strictly a statement rather than a function).
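A quick tour of a few of these, as a sketch (string_demo is a hypothetical wrapper name, the input is made up). Note that gsub modifies its third argument in place and returns the number of replacements, and match sets RSTART and RLENGTH:

```shell
# string_demo is a hypothetical wrapper around made-up input.
string_demo() {
  echo 'hello world' |
    awk '{
      print length($0)              # number of characters
      print substr($0, 1, 5)        # 5 characters starting at position 1
      print index($0, "world")      # 1-based position of a substring
      s = $0
      n = gsub(/o/, "0", s)         # replace every "o" in s, count them
      print n, s
      if (match($0, /w[a-z]+/)) print RSTART, RLENGTH
      print sprintf("%-6s|%03d", "id", 7)
    }'
}
string_demo
```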

Math functions include sin, cos, log, etc., though not min or max.

You can define your own functions.
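Since min and max are not built in, they make a natural first example. A minimal sketch (max_demo is a hypothetical wrapper name, and the input numbers are made up):

```shell
# max_demo is a hypothetical wrapper around made-up input.
max_demo() {
  printf '3\n17\n5\n' |
    awk '
      # user-defined function: returns the larger of two values
      function max(a, b) { return (a > b) ? a : b }
      # seed with the first value, so negative-only input also works
      NR == 1 { m = $1 }
      { m = max(m, $1) }
      END { print "largest:", m }
    '
}
max_demo
```

Function definitions go at the top level of the program, next to the blocks rather than inside them.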