Find and xargs and parallel

xargs

xargs

  • reads arguments from stdin
  • builds commands with them
  • executes those commands
  • and can optionally run those commands in a pool of parallel processes, though have a look at parallel (covered below), which is sometimes nicer


While you can feed in any text, people often specifically feed filenames to it to run batch jobs.

And that's often output from find.


This is preferable to using output from ls, because:

  • ls (or a shell glob) can hit the 'argument list too long' problem, e.g. when you run it in a directory with a million files. find avoids that by writing filenames into a pipe, and xargs avoids it by knowing the longest command length a single exec call allows and doing multiple execs when necessary.
  • output from ls makes it harder to deal with filenames containing spaces and other interesting characters. find and xargs can instead hand filenames around NUL-delimited:
-print0 on find
-0 on xargs

The same is possible with plain shell-fu, but harder.


(And yes, find has its own -exec (e.g. find /proc -exec ls '{}' \;), which is also safe around filenames and also avoids the argument-list-too-long problem, so you often don't need xargs at all -- but I find find's syntax finicky, and xargs adds the parallelism)



For example, to print all filenames under a path (xargs's default command is /bin/echo):

 find /proc | xargs

If you expect spaces (or, as a good habit, always), use:

 find /proc -print0 | xargs -0


There are some useful arguments to know:

  • -n number - each command is built with at most this many arguments (the default seems to be as many as the underlying exec call allows(verify))
some programs want to be run with a single file specified, in which case you need -n 1
or two names, e.g. programs of the form cmd readfromthis writetothis, in which case you often want -n 1 and some careful formatting of the command (look e.g. at xargs's --replace (-i))
  • -P number - spawn a pool of this many subprocesses
You then usually want -n with a lowish number, because xargs hands that many filenames to each of the parallel processes (and again, sometimes -n 1)
see also parallel, which is sometimes simpler
  • -t - print each command before it is run
can be useful for debugging.
I often forget this and stick echo at the start of what xargs should do
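
For example, a sketch combining -n and -t (the path and file pattern are just for illustration):

 # gzip files two per gzip invocation, echoing each constructed command first
 find /tmp/reports -name '*.csv' -print0 | xargs -0 -t -n 2 gzip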



Multiple commands on each filename

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


I sometimes remember that parallel exists, and is nicer for these things (and can sometimes be cleaner without find)

ls | parallel 'ls -l {}; vmtouch {};'


With xargs, you probably want to explicitly wrap it in a subshell, like:

ls | xargs --replace=@  sh -c 'ls -l "@" ; vmtouch "@"'

...because in the immediate sense it's really just a single command, which happens to run multiple others.

The doublequotes there are one way to deal with arguments that have spaces, whether you use -0 or not.

Note that handing a single argument to a simple script is sometimes less error prone.
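
A sketch of a somewhat safer variant of the xargs form, passing each filename to the subshell as a positional parameter instead of splicing it into the command string (vmtouch as in the example above):

 find . -type f -print0 | xargs -0 -n 1 sh -c 'ls -l "$1"; vmtouch "$1"' _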

read as an xargs-alike

Bash has a builtin called read, which reads one line from stdin into one or more variables. It does word splitting (see Command_line_and_bash_notes#word_splitting) according to IFS.

If no variable name(s) are given, it will set into shell variable REPLY (not POSIX, but you can still assume it), without word-splitting.

(Keep in mind you may want to double-quote "$REPLY" when you use it to avoid whitespace stripping)

A slightly more standard way is to empty IFS and read into a single named variable.


Combined with while, you can use this to stream output into a bash code block. For example:

find . -print0 | while IFS= read -r -d '' line ; do
  echo "${line}";
done

This can sometimes be less weird than xargs (e.g. escaping-wise) when you want to do nontrivial things.


Notes:

  • you probably always want -r, which stops read from treating backslashes as escape characters (they are kept literally instead)
  • you can get NUL-delimited behaviour with -d '' (an empty delimiter argument effectively means NUL, for somewhat indirect reasons)
you probably want to empty IFS (IFS=) as well when you do this, as in the example above


For some basic read intro, consider:

var="one two three"

read -r col1 col2 col3 <<< "$var"
echo $col1
echo $col2
echo $col3

read -r <<< "$var"
echo $REPLY
cat /etc/passwd | while IFS=":" read -r name pw uid gid gecos home shell; do echo "${uid} ${name}"; done



Parallel

A basic use would be something like:

find . -name '*.html' | parallel gzip --best

The examples in the manual cover varied uses.


The obvious comparison is to xargs, so some notes:

  • xargs is very likely installed, parallel is less commonly present
  • parallel is also prepared to run things on multiple computers
(this requires SSH keypairs and a nodefile)
  • parallel seems to default to 'one per core', xargs needs an explicit -P to get parallelism
  • xargs will shove many arguments into a single command
which is faster when correct (a lot fewer processes), but sometimes constructs a command that is wrong (and may kill data if you're not thinking), and you need -n 1 to avoid that
...whereas parallel only does that when you use -m
  • parallel offers
    • a little more control over the output (and some thought has gone into formatting filenames)
    • easier replacement in commands
    • a default of one process per core (while xargs makes you think about -n and -P combinations)
  • parallel can be told about files itself, instead of via stdin from find or similar (see the sketch after this list)
which means you can avoid having to think about NUL delimiting
though it's a new thing to learn if you've used find a lot anyway
  • in some complex tasks, parallel can have nicer syntax
varies, really. see its examples
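
As a sketch of the 'told about files itself' point: parallel can take its inputs after a ::: separator instead of from stdin (the glob is just for illustration).

 parallel gzip --best ::: *.log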



Keep in mind that

  • (like with any shell command) things like shell expansion and redirection need to be quoted if you want parallel to execute them, rather than your own shell doing so before anything reaches parallel (see the sketch below)
  • things that are disk-bound in a single thread will often get slower with high parallelism, particularly on platter disks
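
A sketch of that quoting point (filenames illustrative). In the first command your own shell handles the redirection, so the output of all jobs ends up in one file; in the second, parallel runs the redirection per job:

 # the shell redirects: all output ends up in one file
 ls *.txt | parallel wc -l {} > all_counts
 # quoted, so parallel runs the redirection: one .count file per input
 ls *.txt | parallel 'wc -l {} > {}.count'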




find tricks

File and path

# gunzip everything in this tree
find . -name '*.gz' -print0 | xargs -0 gunzip

You may frequently use:

-name (case sensitive) and
-iname (case insensitive) arguments
...these take globs that find evaluates itself (assuming the shell hasn't already expanded them, hence the single-quoting in the example above)

Matching on complete path instead of basename:

-wholename
-iwholename


Regular expressions:

-regex
-iregex
...but I find find's regexp behaviour confusing, so I often use grep instead.
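
A sketch of that grep approach, matching on the full paths that find prints (the pattern is just an example):

 find . -type f | egrep -i '\.(jpe?g|png)$'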


Date/time

File's last modification, status-change, and access time (assuming no noatime in play)


in days using -atime, -ctime and -mtime
in minutes using -amin, -cmin and -mmin
...relative to the current system time. They can be anchored to midnight instead (see -daystart).


fractional parts in the actual comparison are ignored (floor'd)
-number means 'within the last ...'
+number means 'older than ...'
number means 'this old' (after ignoring fraction)

in more human terms, that means e.g.:

-mtime -1 means 'younger than 24 hours'
-mtime +1 means 'older than 48 hours'
-mtime 1, due to the rounding, effectively means 'between 24 and 48 hours old'


For example:

# files changed within the last day
find /etc -type f -ctime -1
# files last changed more than two days ago (see the rounding note above)
find /etc -type f -ctime +1

# files modified within the last ten minutes
find /etc -type f -mmin -10

# files not changed in two weeks (e.g. to look for old temporary files)
find /dev/shm /tmp -type f -ctime +14

Permissions, ownership

Permissions

Specify a bitmask, and how to use it.


There are two styles of creating that bitmask: symbolic and octal. For example, g+w,o+w and 022 are equivalent.

Remember that in octal permissions, r=4, w=2, x=1


"has at least the mentioned bits set"

# writeable by group AND other:
find /path -perm -g+w,o+w
find /path -perm -go+w
find /path -perm -022


"has any of the mentioned bits"

# writeable by group OR other   (which is possibly the test you wanted instead of the previous)
find /path -perm /g+w,o+w   
# executable by anyone:
find /path -perm /ugo+x


"has exactly these permissions", i.e. bitwise equality.

find /path -perm ugo+rwx 
find /path -perm 777


"missing these permission bits" has to be done indirectly (note: something like g-w would just test with 000)

Instead, you match all files that you don't want (that do have the bit set), and then negate the whole filter

# So if you want to test "missing group-write":
find /path ! -perm /g+w
find /path ! -perm /020

Keep in mind how the any/all logic combines with inversion when you have more than one bit. For example:

# "missing g+w OR o+w"     (logical inversion of  "has one AND the other")
find /path ! -perm -022
# "missing g+w AND o+w"    (logical inversion of  "has one OR the other")
find /path ! -perm /022



Ownership

Say you administer your web content, and sometimes do things as root. You'll inevitably have some files owned by root, not the web server's users. When that matters to you (e.g. around suid/sgid CGI)...


The more drastic version:

# listing all things where owning user isn't apache
find /var/www ! -user apache -print0 | xargs -0    
# and if you want to change:
find /var/www ! -user apache -print0 | xargs -0 chown apache

(Note that if you just want everything owned by user apache and group apache, then there is no need for filtering, and a recursive chown like chown -R apache:apache /var/www is a lot simpler)


A slightly more careful version might want to change just the things with group root, or with an unknown group (both of which may come from tar), with something like:

find /var/www \( -group root -o -nogroup \) -print0 | xargs -0 chown :apache

In other words, "make the group apache in case it was in group 'root' or it had none".

-nouser and -nogroup are for files whose owner/group IDs do not map to a known user or group. This (and files belonging to the wrong users) happens mostly when you extract tar files that come from another system, because tar stores numeric IDs, not names. After a userdel you may also end up with unknown users or groups.

other security

When there are files that don't have a valid user or group (these show up as numbers instead of names in listings), this often comes from unpacking a tar file (which stores only numeric IDs) from another system while preserving ownership. It could also mean that you clean /tmp less often than you remove users, or even that someone has intruded. Try:

find /path -nouser -or -nogroup


A file's mode consists not only of permissions but also includes entry type and things like:

1000 is sticky
2000 is sgid
4000 is suid

For example, "look for files that have SUID or SGID set":

find /usr/*bin -perm /6000 -type f -ls

Type

You can ask find for a specific type of entry.

Mostly you'll use

-type f for file
-type d for directory
occasionally -type l for symlink (has footnotes, see below)

The others: p for pipe, s for socket, c for character device, b for block device, D for door (Solaris)


Example:

  • "What does the /proc directory tree look like, ignoring those PID-process things?"
 find /proc -type d | egrep -v '/proc/[0-9]*($|/)' | less
  • I want to hand files to tar, but avoid directories because that would recurse them (see backup for example).



On symlinks:

  • symlink resolving
-P (default) does not follow symlinks
so combining -P -type l will list symlinks
-L follows all symlinks, before any tests. (so e.g. -L -type l will never report a symlink - except if it is broken)
(-follow is similar to -L; the difference seems to be that symlinks specified on the command line before it will not be dereferenced(verify))
-H only resolves symlinks among the command line arguments(verify), not while descending
  • symlink testing:
-type l tests whether the thing is itself a symbolic link
note: never true when using -L or -follow
-xtype is like -type, except that for a symbolic link it tests the type of what the link points to.


Broken symlinks can be found by asking "is the thing still a link after you (failed to) follow it?"

# if you prefer the (default) -P behaviour, then you probably want:
find    . -type l -xtype l

# alternatively:
find -L . -type l  -ls


To find symlinks pointing to some path you know (for example because you're changing that target path), try something like:

find / -type l -lname '/mnt/oldname*'

Size

Size is useful in combinations like...

# find large log files
find /var/log      -size +100M -ls

# temporary files smaller than 100 bytes (see note on c below)
 find /tmp -size -100c -ls

# find largeish temporary files that haven't changed in a month
find /tmp /var/tmp -size +30M -mtime +31 -ls

# "I can't remember what my small script file was called 
#   but want to grep as little of my filesystem for it as possible,"
find / -type f -size -10k 2>/dev/null -print0 | xargs -0 egrep '\bsomeknowncontent\b'


The units include c (bytes), k, M, and G.

Without a unit, the number is in 512-byte blocks. Use c when you want to specify (small amounts of) bytes.


A range test follows from the fact that multiple tests AND together by default. So for example, between 10kB and 20kB:

 find /tmp -size +10k -size -20k

Unsorted

Parallel jobs

You can tell xargs to divide the chunks that -n implies over a number of parallel processes.

For example, -P 4 -n 1 means four workers, each handed individual items.

For simple jobs on many pieces of input, a large -n makes sense, to avoid the overhead of starting many processes. For complex jobs on few inputs, -n 1 is the surest way to actually use all your workers.

(-n is basically necessary - until xargs sees a need to chunk things up, you won't see more than one process)
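
A sketch of that tradeoff (paths and commands are just illustrative):

 # heavier per-file work: one file per invocation keeps all four workers busy
 find . -name '*.wav' -print0 | xargs -0 -P 4 -n 1 flac --best
 # cheap per-file work: larger batches avoid per-process startup overhead
 find . -name '*.txt' -print0 | xargs -0 -P 4 -n 100 gzip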


find's output

When you're not going for xargs, you may want detailed results from find itself. Using find -ls will give you output similar to ls -l.

You can also customise the output with -printf. For example:

# find the most recently changed files in a subtree:
find . -type f -printf '%TY-%Tm-%Td %TT   %p\n' | sort


Non-recursive

You'll probably want -maxdepth 1 (and sometimes combinations with -mindepth)

(note that -maxdepth 0 works only on the given arguments -- can be useful when using find only for its tests)
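
A quick sketch (the path is just an example): list regular files directly in /var/log without descending into subdirectories.

 find /var/log -maxdepth 1 -type f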


If your use case is more selective filtering, you are looking for -prune, which allows for selective avoidance of subdirectories (pruning subtrees).

However, you can only use it sensibly once you understand both it and find's and/or logic, so it will take reading.
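
A minimal -prune sketch (the directory name is just an example): skip one subtree entirely while printing all other regular files.

 # print regular files, but don't descend into ./.git
 find . -path './.git' -prune -o -type f -print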







Fancy constructions

With some shell-fu you can get creative, such as "grep text and html files - but not index.html - and report things that contain the word elevator in four or more lines"

 find . -type f -print0 | egrep -iazZ '(\.txt|\.html?)$' | grep -vazZ 'index.html' | \
     xargs -n 1 -0 grep -c -Hi elevator | egrep -v ':[0123]$'


Issues

find: missing argument to `-exec'

the argument to -exec needs to be terminated by either

; meaning once per file
(and because ; is interpreted by the shell in most contexts, that means \;)
+ meaning as few invocations as possible, packing in as many filenames per run as fit
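
A sketch of the two forms (the pattern is just an example):

 # one gzip process per file
 find . -name '*.log' -exec gzip {} \;
 # as many files per gzip invocation as fit
 find . -name '*.log' -exec gzip {} +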


Number of files per command, argument ordering

Xargs by default hands a fair number of arguments to the same command invocation. This is faster than running the command once per filename, because of per-process startup overhead.


This does mean that it only works with programs that take arguments in the form

command input input input

A good number of programs work like that, but certainly not all. Some take one of:

command input1 input2 .. inputn output
command singleinput 
command singleinput output

In these cases, xargs's default of adding many arguments will cause errors in the best case, or overwrite files in the worst (think of cp and mv)

The simplest solution is to force xargs to do an execution for each input:

 find . | xargs -n 1 echo


In cases where something predetermined has to come last, like cp or mv to a directory, you can do something like:

 find / -iname '*.txt' | xargs --replace=@ cp @ /tmp/txt

You can use any character (instead of the @ used here) that you don't otherwise use in the command you give to xargs (whether it appears in the filename data being handed around is irrelevant).

Character robustness

Unix filenames can contain almost any character, including spaces and even newlines, which command line tools don't always like. For robustness, hand xargs filenames delimited not by spaces or newlines but by NUL (0x00) characters. find, xargs, grep and some other utilities can do this, given the appropriate arguments:

find /etc -print0 | xargs -0 file


Utilities that deal with filenames may have options to deal with NULs, but not all do.

Grep can cooperate, since you can tell it to treat the input as NUL-delimited (--null-data, or -z) and to print that way too (--null, or -Z). You should also disable the 'is this binary data?' guess with -a, since it would otherwise sometimes say "binary file ... matches" instead of giving you the matching filenames:

find /etc -print0 | grep -azZ test | xargs -0 file

Notes:

  • For xargs, -0 is the short form of --null. For other tools, look for options like -0, --null, and -z


Null aware convenience aliases

I like convenience shortcuts for things I regularly want (and to remove the temptation of skipping the null-safety), such as:

_find() { `which find` "$@" -print0; };
alias find0="_find "
alias xargs0="xargs -0 -n 1 "
alias grep0="grep -azZ "
alias egrep0="grep -azZE "
_locate() { `which locate` "$@" | tr '\n' '\0'; };
alias locate0="_locate "
alias sort0="sort -z "

You can use these like their originals:

find0 /dev | grep0 sd | xargs0 file

Notes:

  • ...these definitions probably need tweaking in terms of arguments, particularly for find0
  • _find has to be a function because the -print0 needs to come after the expression (I'm not sure whether there are side effects; I haven't really used bash functions much before). The which is there to be sure we use the actual find executable.
  • find0 and locate0 are defined indirectly (a function plus an alias) because function definitions didn't seem to like the digit in the name, while aliases don't mind
  • You could overwrite the standard utilities behaviour to have them always be null-handling (e.g. alias xargs="xargs -0") but:
    • it may break other use (possibly scripts? There was something about aliases only being expanded in interactive bash, which probably excludes scripts), but more pressingly:
    • you need to always worry about whether you have them in place on a particular computer. I'd rather have it tell me xargs0 doesn't exist than randomly use safe and unsafe versions depending on what host and what account I run it.
  • I added one-at-a-time behaviour to xargs0 (-n 1) because it's saner and safer in a few cases (primarily commands that expect 'inputfile outputfile' arguments), and you can always add another -n in the actual command. (I frankly think this should be the default, because as things are, in the worst case you can destroy half your argument files, which I dislike even if it's unlikely.)
  • I use locate fairly regularly, and it doesn't seem to know about NULs, apparently assuming newlines don't appear in filenames (which is not strictly a guarantee, but neither is it a strange assumption). Hence the locate0 above, even if it's sort of an afterthought.
  • sort0 is there pretty much because it's easier to remember the added 0 than to remember whether the command needs to be told -0, -z, -Z, or --null-something.

I'm open to further suggestions.