Find and xargs and parallel


xargs

xargs reads arguments from stdin, builds commands with them, and executes them.


It is one solution to the 'argument list too long' problem, is frequently used in combination with find, and is also one way to easily and safely handle batch jobs on filenames that contain characters like spaces.


For example, to print all filenames (because xargs's default program is /bin/echo):

 find /proc | xargs

If you expect spaces (or, as a good habit, always), use:

 find /proc -print0 | xargs -0


There are some useful arguments to know:

  • -n number
    - each command is built with at most this many arguments. (The default is as many as the underlying system allows(verify).)
    Some programs want a single input file, in which case you need -n 1.
  • -P number
    - spawn a pool of this many subprocesses.
    You usually also want to specify -n with a moderate number -- effectively 'hand around blocks this large'.
    See also parallel.
  • -t
    - print each command before it is run.
    Can be useful for debugging, though so can putting echo at the start of the command.
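For a quick feel of -n and -t, a harmless sketch (seq and echo are just stand-ins for your input and command):

# hand echo two arguments at a time, printing each command before it runs
seq 1 5 | xargs -t -n 2 echo

This runs echo three times: twice with two arguments, once with the leftover one.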


(Note that you can get much of that (safe filenames, avoiding the 'argument list too long' problem) with just find, using its -exec, e.g.
find /proc -exec ls '{}' \;
but I find its syntax hard to get right.)


Multiple commands on each filename

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Robust and easy-ish to remember is to wrap in a shell, like:

ls | xargs --replace=@  sh -c 'ls -l @ ; vmtouch @'

...because in the immediate sense it's really just a single command, which happens to run multiple others.

If things get more complicated, write a simple script.
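For example, a minimal sketch (the script name is hypothetical; vmtouch as in the example above):

#!/bin/sh
# dothings.sh - run several commands on the single file given as $1
ls -l "$1"
vmtouch "$1"

...used as:

find . -type f -print0 | xargs -0 -n 1 ./dothings.sh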


I sometimes remember that parallel is nicer for these things (and can sometimes be cleaner without find)

ls | parallel 'ls -l {}; vmtouch {};'


read as an xargs-alike

Bash has a builtin called read, which reads one line from stdin into one or more variables.

It does word splitting (see Command line and bash notes) according to IFS.

If no variable name(s) are given, it will read into the shell variable REPLY (not POSIX, but you can usually assume it), without word-splitting.

(Keep in mind you may want to double-quote "$REPLY" when you use it to avoid whitespace stripping)

A slightly more standard way would be to empty IFS and read into a single named variable.


Combined with while, you can use this to stream output into a bash code block. For example:

find . -print0 | while IFS= read -r -d '' line ; do
  echo "${line}";
done

This can sometimes be less weird than xargs (e.g. escaping-wise) when you want to do nontrivial things.
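For example, a per-file rename that needs shell parameter expansion, which is awkward to push through xargs (the extension names are just an example):

# rename *.jpeg to *.jpg, robust to spaces and newlines
find . -name '*.jpeg' -print0 | while IFS= read -r -d '' f ; do
  mv -- "$f" "${f%.jpeg}.jpg"
done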


Notes:

  • you probably always want -r: without it, read treats backslashes as escape characters and eats them
  • you can get null-delimiter behaviour with -d '' (for indirect reasons: bash strings can't contain NUL bytes, so an empty delimiter string effectively means 'delimit on NUL')
    you probably want to empty IFS= as well when you do this


For some basic read intro, consider:

var="one two three"

read -r col1 col2 col3 <<< "$var"
echo "$col1"
echo "$col2"
echo "$col3"

read -r <<< "$var"
echo "$REPLY"
cat /etc/passwd | while IFS=":" read -r name pw uid gid gecos home shell; do echo "${uid} ${name}"; done



Parallel

The examples in the manual are decent (if a little overcomplete).


The main comparison is to xargs:

  • parallel can run things on multiple computers
    (requires SSH keypairs, and uses a nodefile)
  • xargs is pretty likely to be there, parallel less so
  • syntax for more complex tasks can be less messy with parallel
  • parallel can be handed files itself (and when avoiding find, that avoids the need for thinking about NUL stuff)
  • parallel gives a little more control:
    • of the output
    • easier replacement
    • defaults to one process per core (while xargs makes you think about -n and -P)


Keep in mind that

  • things like shell expansion and redirection should be quoted if you want parallel to execute them, instead of the shell doing so before anything is handed to parallel.
  • things that are disk-bound in a single thread will be slower with high parallelism
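As a minimal sketch of both points (gzip, wc, and the globs are stand-ins) - parallel can take its inputs itself after :::, defaulting to one job per core, and quoting decides who handles redirection:

# one gzip per core; no find or NUL handling needed
parallel gzip ::: *.log

# quoted, so parallel (not your shell) does the redirection, once per input
parallel 'wc -l {} > {}.count' ::: *.log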




find tricks

File and path

# gunzip everything in this tree
find . -name '*.gz' -print0 | xargs -0 gunzip

The -name (case sensitive) and -iname (case insensitive) tests allow you to use globs, which are evaluated by find (...assuming the shell hasn't already expanded them, hence the single-quoting)


Full path (globs)

-wholename, -iwholename

Like -name/-iname, but matches against the whole path as find constructs it from its starting-point argument (so not necessarily absolute). -path is the more portable spelling of -wholename.
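For example (the paths are made up):

# anything under a directory named cache, wherever it sits in the tree
find /var/www -wholename '*/cache/*'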

Date/time

You can check a file's last-access, status-change, or data-modification time

  • in days using -atime, -ctime and -mtime
  • in minutes using -amin, -cmin and -mmin
  • Negative numbers mean 'in the last ...' (relative to current system time; GNU find's -daystart anchors to midnight instead)
  • Positive numbers mean 'older than ...'

For example:

# files whose status changed within the last day
find /etc -type f -ctime -1

# files modified within the last ten minutes
find /etc -type f -mmin -10

# files not changed in two weeks (e.g. to look for old temporary files)
find /dev/shm /tmp -type f -ctime +14


Permissions, ownership

Permissions

Specify a bitmask, and how to use it.


There are two styles of creating that bitmask: symbolic and octal. For example, g+w,o+w and 022 are equivalent.

Remember that in octal permissions, r=4, w=2, x=1


"has at least the mentioned bits set"

# writeable by group AND other:
find /path -perm -g+w,o+w
find /path -perm -go+w
find /path -perm -022


"has any of the mentioned bits"

# writeable by group OR other   (which is possibly the test you wanted instead of the previous)
find /path -perm /g+w,o+w   
# executable by anyone:
find /path -perm /ugo+x


"has exactly these permissions", i.e. bitwise equality.

find /path -perm ugo+rwx 
find /path -perm 777


"missing these permission bits" has to be done indirectly (and something like g-w would just test with 000)

Instead, you match all files that you don't want (that do have the bit set), and then negate the whole filter

# So if you want to test "missing group-write":
find /path ! -perm /g+w
find /path ! -perm /020

Keep in mind how the any/all logic combines with inversion when you have more than one bit. For example:

# "missing g+w OR o+w"     (logical inversion of  "has one AND the other")
find /path ! -perm -022
# "missing g+w AND o+w"    (logical inversion of  "has one OR the other")
find /path ! -perm /022



Ownership

Say you administer your web content as root. You'll inevitably have some things owned not by the web server's users but by root. This doesn't always matter - web servers often just read content, and files often have world-read permissions. Still, if you have any CGI scripts, there may be suid/sgid bits and you may as well play safe.

A simple and drastic check: everything owned by "anything that isn't apache":

find /var/www ! -user apache -print0 | xargs -0        # inspect first; append chown apache to actually fix

(Note that if you just want everything owned by user apache and group apache, then there is no need for filtering, and a recursive chown like chown -R apache:apache /var/www is a lot simpler)


You may also want to deal with unknown users, with something like

find /var/www \( -group root -o -nogroup \) -print0 | xargs -0 chown :apache

In other words, "make the group apache in case it was in group 'root' or it had none".

-nouser and -nogroup are for files whose IDs do not map to a known user or group. This (and files belonging to the wrong users) happens mostly when you extract tarfiles that come from another system, because tar stores UIDs, not names. After a userdel you may also be left with unknown users/groups.


other security

When there are files that don't have a valid user or group (these items show up as numbers instead of names), it often comes from unpacking a tar file as root (tar stores only numeric IDs, and unpacking as root preserves them rather than mapping everything to the unpacking user). It could also mean that you clean /tmp less often than you remove users, or even that someone has intruded. Try:

find /path -nouser -or -nogroup


A file's mode consists not only of permissions but also includes entry type and things like:

1000 is sticky
2000 is sgid
4000 is suid

For example, "look for files that have SUID or SGID set":

find /usr/*bin -perm /6000 -type f -ls

Type

You can ask find for a specific type of entry.

Example:

  • "What does the /proc directory tree look like, ignoring those PID-process things?"
find /proc -type d | egrep -v '/proc/[0-9]*($|/)' | less
  • I want to hand files to tar, but avoid directories because that would recurse them (see backup for example).


Mostly you'll use -type f for file or -type d for directory, and occasionally -type l for symlink (which has footnotes, see below).

The rest are p for named pipe, s for socket, c for character device, b for block device, and D for door (Solaris).


On symlinks:

  • symlink resolving:
    -P (the default) does not follow symlinks,
    so combining -P with -type l will list symlinks.
    -L follows all symlinks, before any tests (so e.g. -L -type l will never report a symlink - except a broken one).
    (-follow is similar to -L; the difference seems to be that symlinks specified on the command line before it will not be dereferenced(verify))
    -H only resolves symlinks in the command line arguments(verify), not while walking.
  • symlink testing:
    -type l tests whether the thing is itself a symbolic link
    (note: never true when using -L or -follow).
    -xtype is like -type, but if the thing is a symbolic link, it reports what the thing it points to is.


Broken symlinks can be found by asking "is the thing still a link after you (failed to) follow it?"

find -L -type l

# if you prefer the (default) -P behaviour, then you probably want:
find . -type l -xtype l

Size

Size can be used in queries like "find large logs" and "find large temporary files that haven't changed in a month"

find /var/log -size +10M -ls
find /tmp /var/tmp -size +30M -mtime +31 -ls

In cases like these, getting the file listing with the -ls option is useful and saves a -print0 | xargs -0 ls -l.


"I can't remember what my small script file was called but want to grep as little of my filesystem for it as possible," that is, "grep only regular files smaller than, say, 10KB":

find / -type f -size -10k 2>/dev/null -print0 | xargs -0 egrep '\bsomeknowncontent\b'


Without k, M or G as a byte-implying unit, the number is in 512-byte sectors. You can explicitly indicate bytes by using c. This is useful (necessary, even) to test for particularly small files, like:

find /tmp -size -100c


A range test falls out of find's default of ANDing tests. For example, between 10kB and 20kB:

find /tmp -size +10k -size -20k


links

To find symlinks pointing to some path you know -- for example because you're changing that target path -- try something like:

find / -type l -lname '/mnt/oldname*'


Broken links can also be found without find's -L / -xtype behaviour (see above), by asking file and grepping its output:

find / -type l -print0 | xargs -0 file | grep broken


Unsorted

Parallel jobs

You can tell xargs to divide the chunks that -n implies among a number of parallel processes.

For example, -P 4 -n 1 for 4 workers, handing out individual items to them.

For simple jobs on many pieces of input, a large -n makes sense, to avoid the overhead of running one process per item. For complex jobs on few inputs, -n 1 is the surest way to actually use all your workers.

(-n is basically necessary - until xargs sees a need to chunk things up, you won't see more than one process)
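A sketch of the complex-job case (bzip2 standing in for something actually CPU-heavy):

# four workers, each handed one file at a time
find . -name '*.dump' -print0 | xargs -0 -P 4 -n 1 bzip2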


find's output

When you're not going for xargs, you may want more detailed results from find itself. Using find -ls will give you output similar to ls -l.

You can customise the output with -printf. For example:

# find the most recently changed files in a subtree:
find . -type f -printf '%TY-%Tm-%Td %TT   %p\n' | sort


Non-recursive

You'll probably want -maxdepth 1 (and sometimes combinations with -mindepth).

(note that -maxdepth 0 works only on the given arguments -- can be useful when using find only for its tests)


If your use case is more selective filtering, you are looking for -prune, which allows for selective avoidance of subdirectories (pruning subtrees).

However, you can only use it sensibly once you understand both it and find's and/or logic, so it will take reading.
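As a sketch of the idiom (the directory name is just an example): skip .git subtrees entirely while listing regular files:

find . -name .git -prune -o -type f -print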







Fancy constructions

With some shell-fu you can get creative, such as "grep text and html files - but not index.html - and report things that contain the word elevator in four or more lines"

find . -type f -print0 | egrep -iazZ '(\.txt|\.html?)$' | grep -vazZ 'index.html' | \
     xargs -n 1 -0 grep -c -Hi elevator | egrep -v ':[0123]$'


Issues

Number of files per command, argument ordering

Xargs hands a fair number of arguments to the same command. This is faster than running it once per filename, because of the overhead of starting each process.


This does mean that it only works with programs that take arguments in the form

command input input input

A good number of programs work like that, but certainly not all. Some take one of:

command input1 input2 .. inputn output
command singleinput 
command singleinput output

In these cases, xargs's default of adding many arguments will cause errors in the best case, or overwrite files in the worst (think of cp and mv)

The simplest solution is to force xargs to do an execution for each input:

find . | xargs -n 1 echo


In cases where something predetermined has to come last, like cp or mv to a directory, you can do something like:

find / -iname '*.txt' | xargs --replace=@ cp @ /tmp/txt

You could use any character (instead of @ used here) that you don't otherwise use in the command to xargs (whether it appears in the filename data it hands around is irrelevant).

Character robustness

Unix filenames can contain almost any character, including spaces and even newlines, which command line tools don't always like. For robustness, hand xargs filenames split not by spaces or newlines, but by NUL (0x00) characters. find, xargs, grep, and some other utilities can do this with the appropriate arguments:

find /etc -print0 | xargs -0 file


Utilities that deal with filenames may have options to deal with nulls, but not all.

Grep can work, since you can tell it to treat the input as null-delimited (--null-data, or -z, which also null-delimits its data output) and to null-terminate any filename output (--null, or -Z). You should also disable the 'is this binary data?' guess with -a, since it would otherwise sometimes say "binary file ... matches" instead of giving you the matching data:

find /etc -print0 | grep -azZ test | xargs -0 file

Notes:

  • For xargs, -0 is the shorter form of --null. For other tools, look for -0, --null, and -z


Null aware convenience aliases

I like convenience shortcuts for things I regularly want (and to remove the temptation of skipping the safe way), such as:

_find() { `which find` "$@" -print0; };
alias find0="_find "
alias xargs0="xargs -0 -n 1 "
alias grep0="grep -azZ "
alias egrep0="grep -azZE "
_locate() { `which locate` "$@" | tr '\n' '\0'; };
alias locate0="_locate "
alias sort0="sort -z "

You can use these like their originals:

find0 /dev | grep0 sd | xargs0 file

Notes:

  • ...these definitions probably need tweaking in terms of arguments, particularly for find0
  • _find has to be a function because the -print0 needs to come after the expression (I'm not sure whether there are side effects; I haven't really used bash functions much before). The which is there to be sure we use the actual find executable.
  • find0 and locate0 go through functions plus aliases because function definitions seemed to dislike digits in their names, while aliases don't mind
  • You could overwrite the standard utilities behaviour to have them always be null-handling (e.g. alias xargs="xargs -0") but:
    • it may break other use (possibly scripts? There was something about aliases only being expanded in interactive bash, which probably excludes scripts), but more pressingly:
    • you need to always worry about whether you have them in place on a particular computer. I'd rather have it tell me xargs0 doesn't exist than randomly use safe and unsafe versions depending on what host and what account I run it.
  • I added one-at-a-time behaviour to xargs0 (-n 1) because it's saner and safer for a few cases (primarily commands that expect 'inputfile outputfile' arguments), and you can always add another -n in the actual command. (I frankly think this should be the default, because as things are, in the worst case you can destroy half your argument files, which I dislike even if it's unlikely.)
  • I use locate fairly regularly, and it historically didn't know about nulls, apparently assuming newlines don't appear in filenames (not strictly a guarantee, but not a strange assumption either). Hence the locate0 above, even if it's sort of an afterthought. (Newer locate implementations do seem to have a -0/--null option, which would make the tr unnecessary.)
  • sort0 is there pretty much because it's easier to remember the added 0 than to remember whether the command needs to be told -0, -z, -Z, or --null-something.

I'm open to further suggestions.