Find and xargs and parallel

From Helpful


xargs

xargs reads arguments from stdin, builds commands with it, and executes those commands.


In typical use you feed it filenames, and frequently get those from find, to run batch jobs.

It allows dealing with spaces in filenames (which is harder with just shell-fu).

It allows running a pool of processes (note that parallel is sometimes nicer).

It is one solution to the "argument list too long" problem.


For example, to print all filenames under a path (because xargs's default command is /bin/echo):

 find /proc | xargs

If you expect spaces (or, for good habit, always), use:

 find /proc -print0 | xargs -0
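To see why -print0 and -0 matter, here is a minimal sketch that counts how many arguments xargs ends up with for a filename containing a space. The scratch directory and file names are made up for illustration.

```shell
# Scratch directory with one awkward filename (hypothetical names).
tmp=$(mktemp -d)
touch "$tmp/plain.txt" "$tmp/with space.txt"

# Whitespace-split input: "with space.txt" is seen as two separate arguments.
unsafe_count=$(find "$tmp" -type f | xargs -n 1 echo | wc -l)

# NUL-split input: every filename stays one argument.
safe_count=$(find "$tmp" -type f -print0 | xargs -0 -n 1 echo | wc -l)

echo "arguments without -0: $unsafe_count, with -0: $safe_count"
rm -rf "$tmp"
```

Two files go in; only the NUL-separated pipeline hands exactly two arguments on.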


There are some useful arguments to know:

  • -n number
    - build each command with at most this many arguments. By default xargs packs in as many as fit within the command-line length limit.
    - some programs want a single input file, in which case you need -n 1
  • -P number
    - spawn a pool of up to this many subprocesses
    - you then usually want -n with a lowish number, which effectively means 'hand blocks this large to the processes'
    - and sometimes specifically 1, particularly when handing more than one filename to a command has meaning (like cmd readfromthis writetothis)
    - see also parallel, which is sometimes simpler
  • -t
    - print the commands before they are run
    - can be useful for debugging. So can putting echo at the start of the command.
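A small sketch of -n and -t, using echo as a stand-in for a real command. The inputs are arbitrary letters.

```shell
# -n 2: at most two arguments per command, so five inputs need three commands.
batches=$(printf '%s\n' a b c d e | xargs -n 2 echo | wc -l)

# -t: each command line is written to stderr before it runs.
# stdout is discarded here so that only the trace is captured.
trace=$(printf '%s\n' x y | xargs -t -n 1 echo 2>&1 >/dev/null)

echo "commands run: $batches"
printf '%s\n' "$trace"
```

The exact formatting of the -t trace may differ between xargs versions, so treat it as something to read rather than to parse.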


(Note that you can do most of the above (safe filenames, avoiding the argument list too long problem) without xargs, by using find's -exec, e.g.
find /proc -exec ls '{}' \;
but I find its syntax hard to get right.)
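For comparison, a sketch of the two -exec terminators: \; runs the command once per file, while + appends many files to one command, much like xargs does. The file names are hypothetical.

```shell
tmp=$(mktemp -d)
touch "$tmp/a b.txt" "$tmp/c.txt"

# '\;' form: one echo per file, so one output line per file.
per_file=$(find "$tmp" -type f -exec echo {} \; | wc -l)

# '+' form: both names go to a single echo, so a single output line.
batched=$(find "$tmp" -type f -exec echo {} + | wc -l)

echo "per-file runs: $per_file, batched runs: $batched"
rm -rf "$tmp"
```

Neither form needs -print0, because find passes each name directly as one argument.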


Multiple commands on each filename



I sometimes remember that parallel is nicer for these things (and can sometimes be cleaner without find)

ls | parallel 'ls -l {}; vmtouch {};'


It's easier to remember to wrap in a shell, like:

ls | xargs --replace=@  sh -c 'ls -l "@" ; vmtouch "@"'

...because in the immediate sense it's really just a single command, which happens to run multiple others.

The doublequotes there are necessary to keep filenames with spaces in one piece, whether you use -0 or not.

This is one of a few reasons a simple script is sometimes easier.
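A variant worth knowing (not what the example above does): instead of substituting the filename into the script text, pass it to sh -c as a positional argument and refer to it as "$1". As far as I can tell this also survives filenames that contain quote characters, which text substitution does not. The file name below is made up, and the `_` only fills the $0 slot.

```shell
tmp=$(mktemp -d)
touch "$tmp/it's here.txt"

# The script text is constant; the filename arrives as "$1".
out=$(find "$tmp" -type f -print0 |
  xargs -0 -n 1 sh -c 'ls -l "$1"; wc -c < "$1"' _)

printf '%s\n' "$out"
rm -rf "$tmp"
```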

read as a xargsalike

Bash has a builtin called read, which reads one line from stdin into one or more variables.

It does word splitting according to IFS.

If no variable name(s) are given, it stores the whole line in the shell variable REPLY (not POSIX, but you can still assume it), without word-splitting.

(Keep in mind you may want to double-quote "$REPLY" when you use it to avoid whitespace stripping)

A slightly more standard way would be to empty IFS and set a single variable, as in IFS= read -r line.


Combined with while, you can use this to stream output into a bash code block. For example:

find . -print0 | while IFS= read -r -d '' line ; do
  echo "${line}";
done

This can sometimes be less weird than xargs (e.g. escaping-wise) when you want to do nontrivial things.


Notes:

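  • you probably always want -r. Without it, read treats backslashes in the input as escape characters and removes them.
  • you can get null-delimited behaviour by doing -d '' (works for indirect reasons: the empty string makes bash use NUL as the delimiter)
    - you probably want to empty IFS (IFS=) as well when you do this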
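A quick sketch of what -r changes, using an input line that contains a backslash. The variable names are arbitrary.

```shell
# With -r the backslash is kept as data.
raw=$(printf '%s\n' 'a\tb' | { read -r v; printf '%s' "$v"; })

# Without -r the backslash acts as an escape character and is dropped.
cooked=$(printf '%s\n' 'a\tb' | { read v; printf '%s' "$v"; })

printf 'with -r: %s\nwithout: %s\n' "$raw" "$cooked"
```

For filenames and most other data you want the first behaviour, which is why -r is the usual default choice.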


For some basic read intro, consider:

var="one two three"

read -r col1 col2 col3 <<< "$var"
echo $col1
echo $col2
echo $col3

read -r <<< "$var"
echo $REPLY
cat /etc/passwd | while IFS=":" read -r name pw uid gid gecos home shell; do echo "${uid} ${name}"; done



Parallel

The most basic use is something like:

find . -name '*.html' | parallel gzip --best

The examples in the manual are good for many fancier uses.


The main comparison is to xargs:

  • parallel is also prepared to run things on multiple computers (...requires SSH keypairs and a nodefile)
  • xargs is likely to be there, parallel less so
  • parallel sometimes has nicer syntax in more complex tasks (varies. see its examples)
  • parallel can be told about files itself, instead of via stdin (means you can avoid having to think about NUL)
  • a little more control
    • of the output
    • easier replacement
    • defaults to one process per core (while xargs makes you think about -n and -P)


Keep in mind that

  • things like shell expansion and redirection should be quoted if you want parallel to execute them, instead of the shell before it hands anything to parallel.
  • things that are disk bound in a single thread will often be slower with high parallelism, particularly on platter disk




find tricks

File and path

# gunzip everything in this tree
find . -name '*.gz' -print0 | xargs -0 gunzip
The -name (case sensitive) and -iname (case insensitive) arguments allow use of globs evaluated by find (...assuming the shell hasn't already done so, hence the single-quoting in the example).

Matching on complete path instead of basename:

-wholename
-iwholename


Regular expressions:

-regex
-iregex
...but I find find's regexp behaviour confusing so often try to use grep instead.
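A side-by-side sketch of the name tests on a scratch tree, assuming GNU find. The directory layout is invented. Note that -wholename and -regex both match against the whole path, not just the basename.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/docs"
touch "$tmp/docs/a.GZ" "$tmp/docs/b.gz" "$tmp/c.gz"

# Basename glob, case sensitive: b.gz and c.gz.
name_hits=$(find "$tmp" -name '*.gz' | wc -l)

# Basename glob, case insensitive: also a.GZ.
iname_hits=$(find "$tmp" -iname '*.gz' | wc -l)

# Glob against the full path: only the lowercase .gz under docs/.
path_hits=$(find "$tmp" -wholename '*/docs/*.gz' | wc -l)

# The regex has to match the entire path, hence the leading '.*/'.
regex_hits=$(find "$tmp" -regex '.*/[a-c]\.gz' | wc -l)

echo "$name_hits $iname_hits $path_hits $regex_hits"
rm -rf "$tmp"
```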


Date/time

File's last modification, status-change, and access time (assuming no noatime in play)


in days using -atime, -ctime and -mtime
in minutes using -amin, -cmin and -mmin
...relative to current system time. Can be anchored to midnight instead (-daystart).


fractional parts in the actual comparison are ignored (floor'd)
  • -number means 'within the last ...'
  • +number means 'older than ...'
  • number means 'this old' (after ignoring fraction)

in more human terms, that means e.g.:

-mtime -1 means 'younger than 24 hours'
-mtime +1 means 'older than 48 hours'
-mtime 1 due to rounding effectively means 'age 24 to 48 hours'


For example:

# files changed (content or metadata) within the last day
find /etc -type f -ctime -1
# files not changed for at least two days (+1 ignores the fractional day)
find /etc -type f -ctime +1
 
# files modified within the last ten minutes
find /etc -type f -mmin -10

# files not changed in two weeks (e.g. to look for old temporary files)
find /dev/shm /tmp -type f -ctime +14
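The rounding is easiest to see with files of known age. This sketch assumes GNU touch (for the relative -d dates) and uses made-up file names.

```shell
tmp=$(mktemp -d)
touch "$tmp/fresh"
touch -d '36 hours ago' "$tmp/day_and_a_half"
touch -d '3 days ago' "$tmp/old"

# Age in whole days: fresh=0, day_and_a_half=1, old=3.
younger=$(find "$tmp" -type f -mtime -1 | wc -l)   # only 'fresh'
exactly=$(find "$tmp" -type f -mtime 1 | wc -l)    # only 'day_and_a_half'
older=$(find "$tmp" -type f -mtime +1 | wc -l)     # only 'old'

echo "-1: $younger   1: $exactly   +1: $older"
rm -rf "$tmp"
```

The 36-hour-old file is not matched by +1, even though it is clearly older than one day.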

Permissions, ownership

Permissions

Specify a bitmask, and how to use it.


There are two styles of creating that bitmask: symbolic and octal. For example, g+w,o+w and 022 are equivalent.

Remember that in octal permissions, r=4, w=2, x=1


"has at least the mentioned bits set"

# writeable by group AND other:
find /path -perm -g+w,o+w
find /path -perm -go+w
find /path -perm -022


"has any of the mentioned bits"

# writeable by group OR other   (which is possibly the test you wanted instead of the previous)
find /path -perm /g+w,o+w   
# executable by anyone:
find /path -perm /ugo+x


"has exactly these permissions", i.e. bitwise equality.

find /path -perm ugo+rwx 
find /path -perm 777


"missing these permission bits" has to be done indirectly (note: something like g-w would just test with 000)

Instead, you match all files that you don't want (that do have the bit set), and then negate the whole filter

# So if you want to test "missing group-write":
find /path ! -perm /g+w
find /path ! -perm /020

Keep in mind how the any/all logic combines with inversion when you have more than one bit. For example:

# "missing g+w OR o+w"     (logical inversion of  "has one AND the other")
find /path ! -perm -022
# "missing g+w AND o+w"    (logical inversion of  "has one OR the other")
find /path ! -perm /022
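The four combinations can be checked on three scratch files with known modes. This is a sketch assuming a find that supports the /mode form (GNU find does); the file names just describe which write bits are set.

```shell
tmp=$(mktemp -d)
touch "$tmp/both" "$tmp/group_only" "$tmp/neither"
chmod 666 "$tmp/both"
chmod 664 "$tmp/group_only"
chmod 644 "$tmp/neither"

all_bits=$(find "$tmp" -type f -perm -022 | wc -l)    # both
any_bit=$(find "$tmp" -type f -perm /022 | wc -l)     # both, group_only
not_all=$(find "$tmp" -type f ! -perm -022 | wc -l)   # group_only, neither
not_any=$(find "$tmp" -type f ! -perm /022 | wc -l)   # neither

echo "$all_bits $any_bit $not_all $not_any"
rm -rf "$tmp"
```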



Ownership

Say you administer your web content, and sometimes do things as root. You'll inevitably have some files owned by root, not the web server's users. When that matters to you (e.g. around suid/sgid CGI)...


The more drastic version:

# listing all things where owning user isn't apache
find /var/www ! -user apache -print0 | xargs -0    
# and if you want to change:
find /var/www ! -user apache -print0 | xargs -0 chown apache

(Note that if you just want everything owned by user apache and group apache, then there is no need for filtering, and a recursive chown like chown -R apache:apache /var/www is a lot simpler)


A slightly more careful version might want to change just from root, and e.g. from unknown users (because both may come from tar), with something like:

find /var/www \( -group root -o -nogroup \) -print0 | xargs -0 chown :apache

In other words, "make the group apache in case it was in group 'root' or it had none".
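When combining -o with an explicit action such as -print0, grouping matters: the implicit AND binds tighter than -o, so without parentheses the action only belongs to the last test. A sketch with name tests, which can be tried without being root (file names are hypothetical):

```shell
tmp=$(mktemp -d)
touch "$tmp/a.txt" "$tmp/b.log"

# Parsed as: -name '*.txt'  OR  ( -name '*.log' AND -print )
ungrouped=$(find "$tmp" -name '*.txt' -o -name '*.log' -print | wc -l)

# Parsed as: ( -name '*.txt' OR -name '*.log' ) AND -print
grouped=$(find "$tmp" \( -name '*.txt' -o -name '*.log' \) -print | wc -l)

echo "ungrouped: $ungrouped, grouped: $grouped"
rm -rf "$tmp"
```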

-nouser and -nogroup match files whose IDs do not map to a known user or group. This (and files belonging to the wrong users) happens mostly when you extract from tarfiles that come from another system, because the numeric IDs stored in the archive may not exist locally. After a userdel you may also have unknown users/groups.

other security

When there are files that don't have a valid user or group (these show up as numbers instead of names), this often comes from unpacking with tar (which restores the stored user IDs) and not having things extracted as the unpacking user. It could also mean that you clean /tmp less often than you remove users, or even that someone has intruded. Try:

find /path -nouser -or -nogroup


A file's mode consists not only of permissions but also includes entry type and things like:

1000 is sticky
2000 is sgid
4000 is suid

For example, "look for files that have SUID or SGID set":

find /usr/*bin -perm /6000 -type f -ls

Type

You can ask find for a specific type of entry.

Mostly you'll use
  • -type f for file
  • -type d for directory
  • occasionally -type l for symlink (has footnotes, see below)

The others: p for pipe, s for socket, c for character device, b for block device, D for door (solaris)


Example:

  • "What does the /proc directory tree look like, ignoring those PID-process things?"
find /proc -type d | grep -Ev '/proc/[0-9]+($|/)' | less
  • I want to hand files to tar, but avoid directories because that would recurse them (see backup for example).



On symlinks:

  • symlink resolving
    - -P (default) does not follow symlinks, so combining -P -type l will list symlinks
    - -L follows all symlinks, before any tests (so e.g. -L -type l will never report a symlink, except if it is broken)
    - -follow is similar to -L (the difference seems to be that symlinks specified on the command line before it will not be dereferenced (verify))
    - -H only resolves symlinks given as command line arguments, not those met while traversing
  • symlink testing:
    - -type l tests whether the thing is itself a symbolic link. Note: never true when using -L or -follow, unless the link is broken
    - -xtype is like -type, but if the thing is a symbolic link, reports what the thing it points to is


Broken symlinks can be found by asking "is the thing still a link after you (failed to) follow it?"

find -L . -type l -ls
 
# if you prefer the (default) -P behaviour, then you probably want:
find . -type l -xtype l
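Both approaches can be tried on a scratch directory with one working and one dangling link. A sketch, assuming GNU find (for -xtype), with invented names:

```shell
tmp=$(mktemp -d)
touch "$tmp/target"
ln -s "$tmp/target" "$tmp/good_link"
ln -s "$tmp/missing" "$tmp/broken_link"

# Default (-P): both links are reported as links.
all_links=$(find "$tmp" -type l | wc -l)

# -L: the good link is followed and looks like a file, the broken one stays a link.
broken_via_L=$(find -L "$tmp" -type l | wc -l)

# -P with -xtype: a link whose target type is still 'link', i.e. unresolvable.
broken_via_xtype=$(find "$tmp" -type l -xtype l | wc -l)

echo "$all_links $broken_via_L $broken_via_xtype"
rm -rf "$tmp"
```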


To find symlinks pointing to some path you know (for example because you're changing that target path), try something like:

find / -type l -lname '/mnt/oldname*'

Size

Size is useful in combinations like...

# find large log files
find /var/log      -size +100M -ls

# temporary files smaller than 100 bytes (see note on c below)
 find /tmp -size -100c -ls

# find largeish temporary files that haven't changed in a month
find /tmp /var/tmp -size +30M -mtime +30 -ls

# "I can't remember what my small script file was called 
#   but want to grep as little of my filesystem for it as possible,"
find / -type f -size -10k -print0 2>/dev/null | xargs -0 grep -E '\bsomeknowncontent\b'


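The units in GNU find are c (bytes), w (two-byte words), b (512-byte blocks), k, M and G.

Without a unit, the number is in 512-byte blocks. Use c to ask for (small amounts of) bytes.

Sizes are rounded up to whole units before the comparison, so a 100-byte file counts as 1k, and -size -1k matches only empty files.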


A range test is a basic implication of filters ANDing (the default). So for example, between 10kB and 20kB:

find /tmp -size +10k -size -20k
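A sketch of the range test and of unit rounding, on files of known size. It assumes GNU find behaviour, where sizes are rounded up to whole units before comparing; the file names are arbitrary.

```shell
tmp=$(mktemp -d)
head -c 100   /dev/zero > "$tmp/tiny"
head -c 15000 /dev/zero > "$tmp/medium"
head -c 50000 /dev/zero > "$tmp/large"

# Between 10k and 20k: only 'medium' (15000 bytes rounds up to 15k).
in_range=$(find "$tmp" -type f -size +10k -size -20k | wc -l)

# Byte units compare exactly: only 'tiny' is under 200 bytes.
under_200c=$(find "$tmp" -type f -size -200c | wc -l)

# Rounding gotcha: 'tiny' rounds up to 1k, which is not less than 1k.
under_1k=$(find "$tmp" -type f -size -1k | wc -l)

echo "$in_range $under_200c $under_1k"
rm -rf "$tmp"
```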

Unsorted

Parallel jobs

You can tell xargs to divide the chunks that -n implies over a number of processes.

For example, -P 4 -n 1 gives 4 workers and hands out individual things to them.

For simple jobs on many pieces of input, a large -n makes sense to avoid the overhead of starting a process per item. For complex jobs on few inputs, -n 1 is the surest way to actually use all your workers.

(-n is basically necessary: until xargs sees a need to chunk things up, you won't see more than one process)
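The effect of leaving out -n is easy to demonstrate with echo as the worker, since each process prints one line. A minimal sketch:

```shell
# Without -n, four short inputs fit on one command line: one process, one line.
one_batch=$(printf '%s\n' 1 2 3 4 | xargs -P 4 echo | wc -l)

# With -n 1 every input becomes its own command, so up to four run at once.
# Output order depends on scheduling, hence the sort.
per_item=$(printf '%s\n' 1 2 3 4 | xargs -P 4 -n 1 echo | sort | tr '\n' ' ')

echo "lines without -n: $one_batch"
echo "items with -n 1: $per_item"
```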


find's output

When you're not going for xargs, you may want to get detailed results from find. Using find -ls will give you output similar to ls -l.

You can customise it with -printf. Example:

# find the most recently changed files in a subtree:
find . -type f -printf '%TY-%Tm-%Td %TT   %p\n' | sort
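If you only need the single newest file, a numeric timestamp sorts more simply than a formatted date. A sketch using GNU find's %T@ (modification time as seconds since the epoch) and GNU touch's relative dates; the file names are made up.

```shell
tmp=$(mktemp -d)
touch -d '2 days ago' "$tmp/older"
touch "$tmp/newer"

# Print "<epoch seconds> <path>", sort numerically, keep the last path.
newest=$(find "$tmp" -type f -printf '%T@ %p\n' | sort -n | tail -n 1 | cut -d ' ' -f 2-)

echo "newest: $newest"
rm -rf "$tmp"
```

Like any line-based pipeline, this assumes no newlines in the filenames.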


Non-recursive
You'll probably want -maxdepth 1 (and sometimes combinations with -mindepth).

(note that -maxdepth 0 works only on the given arguments -- can be useful when using find only for its tests)


If your use case is more selective filtering, you are looking for -prune, which allows for selective avoidance of subdirectories (pruning subtrees).

However, you can only use it sensibly once you understand both it and find's and/or logic, so it will take reading.
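As a starting point, here is a sketch of both approaches on a scratch tree. The layout (a .git directory to skip) is only an illustration. The usual -prune idiom is "test for the directory, prune it, OR do the real work".

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/src/deep" "$tmp/.git/objects"
touch "$tmp/top.txt" "$tmp/src/main.c" "$tmp/src/deep/util.c" "$tmp/.git/objects/blob"

# -maxdepth 1: only the direct children of the start directory.
shallow=$(find "$tmp" -maxdepth 1 -type f | wc -l)

# -prune: when a directory named .git is met, do not descend into it;
# everything else falls through to the right-hand side of -o.
pruned=$(find "$tmp" -name .git -prune -o -type f -print | wc -l)

echo "shallow: $shallow, pruned: $pruned"
rm -rf "$tmp"
```

The explicit -print matters: without it, the default print applies to the whole expression and the pruned directory itself is listed too.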







Fancy constructions

With some shell-fu you can get creative, such as "grep text and html files - but not index.html - and report things that contain the word elevator in four or more lines"

find . -type f -print0 | grep -EiazZ '(\.txt|\.html?)$' | grep -vazZ 'index.html' | \
     xargs -n 1 -0 grep -c -Hi elevator | grep -Ev ':[0123]$'


Issues

Number of files per command, argument ordering

Xargs hands a fair number of arguments to the same command. This is faster than running it once per filename because of various overheads in running.


This does mean that it only works with programs that take arguments in the form

command input input input

A good number of programs work like that, but certainly not all. Some take one of:

command input1 input2 .. inputn output
command singleinput 
command singleinput output

In these cases, xargs's default of adding many arguments will cause errors in the best case, or overwrite files in the worst (think of cp and mv)

The simplest solution is to force xargs to do an execution for each input:

find . | xargs -n 1 echo


In the case where something predetermined has to come last, like cp or mv to a directory, you can do something like:

find / -iname '*.txt' | xargs --replace=@ cp @ /tmp/txt

You could use any character (instead of @ used here) that you don't otherwise use in the command to xargs (whether it appears in the filename data it hands around is irrelevant).
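The same idea in a form you can try safely, using -I (the short, more portable spelling of --replace) together with NUL separation. The source and destination directories are scratch paths invented for the example.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/src" "$tmp/dest"
touch "$tmp/src/a.txt" "$tmp/src/b c.txt"

# Each filename replaces @, so the destination directory can come last.
find "$tmp/src" -type f -name '*.txt' -print0 | xargs -0 -I @ cp @ "$tmp/dest"

copied=$(find "$tmp/dest" -type f | wc -l)
echo "copied: $copied"
rm -rf "$tmp"
```

With -I, xargs runs one command per input item, so there is no need to add -n 1.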

Character robustness

Unix filenames can have almost any character, including spaces and even newlines, which command line tools don't always like. For robustness, hand xargs filenames separated not with spaces or newlines, but with null (0x00) characters. Find, xargs, grep and some other utilities can do this with the appropriate arguments:

find /etc -print0 | xargs -0 file


Utilities that deal with filenames may have options to deal with nulls, but not all.

Grep can work, since you can tell it to treat its input and output as null-delimited (--null-data, or -z) and to end any filenames it prints with a null as well (--null, or -Z). You should also disable the 'is this binary data?' guess with -a, since it would otherwise sometimes say "binary file ... matches" instead of giving you matching filenames:

find /etc -print0 | grep -azZ test | xargs -0 file

Notes:

  • For xargs, -0 is the shorter form of --null. For other tools, look for -0, --null, and -z
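A sketch of such a NUL-separated pipeline on scratch files, assuming GNU grep. The pattern is anchored to the last path component so that it cannot accidentally match a directory name; the file names are hypothetical.

```shell
tmp=$(mktemp -d)
touch "$tmp/test one.conf" "$tmp/other.conf" "$tmp/test_two.conf"

# grep -z filters NUL-terminated records; with -z, '$' is the end of a record.
matched=$(find "$tmp" -type f -print0 |
  grep -az '/test[^/]*$' |
  xargs -0 -n 1 basename |
  wc -l)

echo "matched: $matched"
rm -rf "$tmp"
```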


Null aware convenience aliases

I like convenience shortcuts for things I regularly want (so that I'm not tempted to skip the null handling and cheat), such as:

_find() { `which find` "$@" -print0; };
alias find0="_find "
alias xargs0="xargs -0 -n 1 "
alias grep0="grep -azZ "
alias egrep0="grep -azZE "
_locate() { `which locate` "$@" | tr '\n' '\0'; };
alias locate0="_locate "
alias sort0="sort -z "

You can use these like their originals:

find0 /dev | grep0 sd | xargs0 file

Notes:

  • ...these definitions probably need tweaking in terms of arguments, particularly for find0
  • _find has to be a function because the print0 needs to come after the expression (I'm not sure whether there are side effects; I haven't really used bash functions before). The which is there to be sure we use the actual find executable.
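  • find0 and locate0 are defined indirectly (an alias pointing at a function) because I assumed function names don't like digits while aliases don't mind. Bash does accept digits in function names as long as the name doesn't start with one, so the functions could be named find0 and locate0 directly.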
  • You could overwrite the standard utilities behaviour to have them always be null-handling (e.g. alias xargs="xargs -0") but:
    • it may break other use (scripts are mostly safe, since bash only expands aliases in interactive shells unless expand_aliases is set), but more pressingly:
    • you need to always worry about whether you have them in place on a particular computer. I'd rather have it tell me xargs0 doesn't exist than randomly use safe and unsafe versions depending on what host and what account I run it.
  • I added one-at-a-time behaviour on xargs (-n 1) because it's saner and safer for a few cases (primarily commands that expect 'inputfile outputfile' arguments). And you can always add another -n in the actual command. (I frankly think this should be the default, because as things are, in the worst case you can destroy half your argument files, which I dislike even if it's unlikely.)
  • I use locate fairly regularly, and it doesn't seem to know about nulls and apparently assumes newlines don't appear in filenames (which is not strictly a guarantee, but neither is it a strange assumption). Hence the locate0 above, even if it's sort of an afterthought.
  • sort0 is there pretty much because it's easier to remember the added 0 than to remember whether the command needs to be told -0, -z, -Z, --null, or --null-something.
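A function-only variant of the same shortcuts, as a sketch rather than a drop-in replacement: functions also work in scripts, where aliases usually do not. It uses `command find` instead of `which find` to bypass any function or alias of the same name. The scratch file names are made up.

```shell
# Hypothetical function versions of find0, xargs0 and grep0.
find0()  { command find "$@" -print0; }
xargs0() { xargs -0 -n 1 "$@"; }
grep0()  { grep -az "$@"; }

tmp=$(mktemp -d)
touch "$tmp/sda one" "$tmp/ttyS0"

# Same shape as the pipeline above, on scratch files instead of /dev.
found=$(find0 "$tmp" -type f | grep0 '/sd[^/]*$' | xargs0 basename)

echo "found: $found"
rm -rf "$tmp"
```

As with the alias version, find0 only works when -print0 may simply be appended, so expressions that use -o or their own actions still need care.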

I'm open to further suggestions.