Wget notes
From Helpful
This article/section is a stub: probably a pile of half-sorted notes and assertions, some of which may well be wrong, and not verified as a whole. Feel free to add or refine.
Note to admins: if you want efficient and easier mirroring, consider setting up rsync.
Site downloading and mirroring
You can enable recursion with -r and set the depth with -l (default is 5). You may as well use the mirror option, though:
- -m (--mirror): turns on a set of options relevant to mirroring. Currently these are:
- -r: recursive fetching, i.e. it follows links (see also -np)
- -N: timestamping (see below)
- -l inf: fetch to any depth, instead of using the default depth of 5
- --no-remove-listing: keep the .listing files that wget generates from FTP directory listings
Also useful:
- -np (--no-parent): don't go to parent directories - useful when fetching just part of a site
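As a rough sketch of the options above, mirroring just one subtree of a site might look like the following. The URL and the function name mirror_subtree are placeholders made up for illustration.

```shell
# Hypothetical starting point; substitute the subtree you actually want.
url="http://example.com/docs/"

mirror_subtree() {
    # -m  : shorthand for -r -N -l inf --no-remove-listing
    # -np : never climb above the starting directory
    wget -m -np "$1"
}

# Usage (hits the network, so it is left commented out here):
# mirror_subtree "$url"
```

Wrapping the command in a function is only a convenience for reuse; the wget line works the same when typed directly.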
Continuing, timestamping
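- -c (--continue): continue partially downloaded files (if the server supports it). This is a general feature, not specific to mirroring, and it also helps avoid re-fetching what you already have. It assumes the local file is an unchanged prefix of the remote one: a local file of the same size is left alone, and a shorter one just gets the remaining bytes appended, so a file that changed on the server ends up garbled. Whether adding -N changes this is unclear (verify)
- -N (--timestamping): only re-download a file when the server's copy is newer than the local one (going by the Last-Modified header) or differs in size. wget sets the date of downloaded files from Last-Modified by default, which is what lets later runs be clever about fetching only files that have actually changed.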
- -nc (--no-clobber) will:
- disable adding .1, .2 etc to files that were already downloaded
- not download any new versions of files that are already here (but see notes below)
This also means that recursive fetches will parse the local HTML files to see what has not been fetched yet. This makes -nc useful for continuing an abruptly stopped fetch without much redundant checking, but not for updating something that may have changed. For updates:
- -r without -nc (and without -N) simply overwrites existing files with the freshly downloaded copy
- -r with -N overwrites a file only when the timestamp says the server's copy is newer
- -nc and -N contradict each other, and wget refuses to run when given both
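The three common cases in this section (resuming one file, picking up an interrupted recursive fetch, updating an existing copy) can be sketched as below. The URLs and function names are hypothetical.

```shell
# Hypothetical URLs, for illustration only.
big_file="http://example.com/pub/big.iso"
site="http://example.com/docs/"

resume_download() {
    # -c: ask the server only for the bytes missing from the local file
    wget -c "$1"
}

pick_up_interrupted_fetch() {
    # -r -nc: recurse, but never re-download a file that already exists locally
    wget -r -np -nc "$1"
}

update_copy() {
    # -r -N: recurse, re-fetching only files whose remote copy is newer
    wget -r -np -N "$1"
}

# Usage (network access, so commented out):
# resume_download "$big_file"
# update_copy "$site"
```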
Filters
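- -A (--accept): fetch only files matching this accept list. The list is comma-separated; an element containing a wildcard character (*, ?, [ or ]) is treated as a shell-style pattern, anything else as a file name suffix, eg. -A 'jpg,png' or -A 'report-*.pdf'
- -R (--reject): skip files matching this reject list (same syntax as -A)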
Also sometimes useful are:
- -I specifies paths to fetch eg. -I '/doc,/gallery'
- -X excludes paths, eg. -X '/cgi-bin'
When updating a local copy with a stricter filter than before, files that no longer match may be removed (verify).
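A few filter combinations, as a sketch. The site URL, the suffix lists and the function names are assumptions chosen for the example; the directory lists reuse the ones mentioned above.

```shell
site="http://example.com/"   # hypothetical

fetch_images_only() {
    # -A with plain suffixes: accept only file names ending in these
    wget -r -np -A 'jpg,jpeg,png' "$1"
}

fetch_without_archives() {
    # -R with wildcards (each element is then a pattern, not a suffix),
    # -X to skip a directory entirely
    wget -r -np -R '*.zip,*.tar.gz' -X '/cgi-bin' "$1"
}

fetch_some_directories() {
    # -I: follow links only into these directories
    wget -r -I '/doc,/gallery' "$1"
}

# Usage (network access, so commented out):
# fetch_images_only "$site"
```

Quoting the lists keeps the shell from expanding the wildcards before wget sees them.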
Alterations to apply to the local copy
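- -k (--convert-links): after the download finishes, rewrites links in the fetched HTML so that links to files that were downloaded become relative links to the local copies, and links to files that were not downloaded become absolute URLs pointing at the original site. This makes mirrored copies work regardless of where they are placed, and avoids links into the online copy for anything you have locally. (Apparently this doesn't play too well with timestamping, though (verify))
- -E (--html-extension, called --adjust-extension in newer versions): appends .html to the local names of files served with an HTML MIME type that do not already end in .html or .htm. This is useful when the copy on the filesystem/webserver has to be browsable (the browser/web server may not figure out that it should treat things originally served from URLs ending in .asp, .php, .cgi and whatnot as HTML pages)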
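Put together, a copy meant for offline browsing could be fetched roughly like this. The URL and function name are hypothetical, and this is a minimal sketch, not a complete recipe for every site.

```shell
site="http://example.com/manual/"   # hypothetical

offline_copy() {
    # -r -np : recurse, staying below the starting directory
    # -k     : rewrite links so the local copy is browsable
    # -E     : add .html to pages served as HTML under other extensions
    wget -r -np -k -E "$1"
}

# Usage (network access, so commented out):
# offline_copy "$site"
```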
Local directory structure
- -nH (--no-host-directories): don't create directory for host in URL
- --cut-dirs=n: cut n leading path parts. Useful to avoid creating deep local directories when downloading from deep locations on a server.
- -x (--force-directories): create the local directory structure even for single-file downloads
- -nd (--no-directories): download all files to one directory (not usually that useful)
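To see what -nH and --cut-dirs do to the local path in a recursive fetch, here is a small emulation in plain shell. It is not wget's own code, just a sketch of the behaviour as I understand it from the manual; local_dir and its arguments are made up for this example.

```shell
# local_dir URL NOHOST CUT
#   NOHOST: 1 to emulate -nH, 0 otherwise
#   CUT:    number of leading directories to drop, as with --cut-dirs=CUT
# Prints the directory (relative to the current one) a file would land in.
local_dir() {
    rest=${1#*://}              # strip the scheme
    host=${rest%%/*}            # host name
    path=${rest#"$host"}        # everything after the host
    path=${path#/}
    case "$path" in
        */*) path=${path%/*} ;; # keep only the directory part
        *)   path="" ;;         # file directly under the root
    esac
    n=$3
    while [ "$n" -gt 0 ] && [ -n "$path" ]; do
        case "$path" in
            */*) path=${path#*/} ;;  # drop one leading component
            *)   path="" ;;
        esac
        n=$((n - 1))
    done
    if [ "$2" -eq 1 ]; then
        dir=$path
    else
        dir=$host${path:+/$path}
    fi
    printf '%s\n' "${dir:-.}"
}

# local_dir http://example.com/pub/tools/readme.txt 0 0   -> example.com/pub/tools
# local_dir http://example.com/pub/tools/readme.txt 1 0   -> pub/tools
# local_dir http://example.com/pub/tools/readme.txt 1 1   -> tools
# local_dir http://example.com/pub/tools/readme.txt 0 1   -> example.com/tools
# local_dir http://example.com/pub/tools/readme.txt 1 2   -> .
```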
Examples
wget -r -np -A 7z -o log http://static.wikipedia.org/downloads/current/en
...downloads the static English Wikipedia dump. It doesn't follow the browsing link up to previous/other dumps, it fetches only the .7z files (you don't need the lst files or the HTML index pages), and it writes wget's log output to the file named log.
Networking
- -t n (--tries=n): retry a download up to n times on temporary errors (default 20; 0 or inf means retry forever). Fatal errors such as 404 and 'connection refused' are not retried
- --retry-connrefused: treat "connection refused" as a temporary error, and try again
- --waitretry=n: a wait that applies only to retry attempts. It uses linear backoff: 1 second after the first failure on a file, 2 after the second, and so on up to n seconds
- -w n (--wait=n): wait n seconds between retrievals (the suffixes m, h and d give minutes, hours and days). It's a way of being nice to the server when you're in no particular hurry.
- --random-wait: randomizes the --wait interval, to between 0.5 and 1.5 times what you specified (older versions: between 0 and 2 times). The option was not meant to allow people to be asshats, but to avoid the server setting up very wide IP blocks because of just you. Play nice.
- --limit-rate=20k limits the download speed to 20KB/s (technique is rough and bursty, though)
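A server-friendly recursive fetch combining these options might look like the sketch below. The specific numbers (5 tries, 30 and 2 seconds) and the names are arbitrary choices for illustration.

```shell
site="http://example.com/docs/"   # hypothetical

polite_fetch() {
    # -t 5 --retry-connrefused : up to 5 tries, also after "connection refused"
    # --waitretry=30           : back off 1s, 2s, ... up to 30s between retries
    # -w 2 --random-wait       : pause around 2s (randomized) between retrievals
    # --limit-rate=20k         : cap the download speed at roughly 20KB/s
    wget -r -np -t 5 --retry-connrefused --waitretry=30 \
         -w 2 --random-wait --limit-rate=20k "$1"
}

# Usage (network access, so commented out):
# polite_fetch "$site"
```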
HTTP
Authentication
cookies
Misc
--delete-after deletes files right after they are downloaded. This is intended to cause proxies to pre-fetch, and can be useful for stress tests.
--spider just checks whether a page is there, without downloading it (perhaps useful to automate link breakage checking on your site with a cronjob).
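A minimal sketch of such a check, relying on wget's exit status (zero on success, non-zero when the request fails, for example on a 404). The function name and the output format are made up here; a cronjob would call it for each URL it cares about.

```shell
check_url() {
    # --spider: request the page without saving it; -q: stay quiet
    if wget -q --spider "$1"; then
        echo "ok      $1"
    else
        echo "BROKEN  $1"
    fi
}

# Usage (network access, so commented out):
# check_url "http://example.com/"
```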
Semi-sorted
wgetting to stdout:
wget -q -O - http://example.com/foo.tar

