Wget notes

This article/section is a stub — probably a pile of half-sorted notes, probably a first version, and not well checked, so it may have incorrect bits. (Feel free to ignore, or tell me.)


Site downloading and mirroring

If you're looking for mirroring, look at rsync.


But if you only have an HTTP interface, then wget or similar can do decently.


You can get recursion with -r and specify the depth with -l (default is 5). You may as well use the mirror options, though:

  • -m: sets a collection of options relevant to mirroring (see the example below). Currently these are:
    • -r does recursive fetching - it follows links (note: consider -np)
    • -N: timestamp files (see below)
    • -l inf (fetch all, instead of using the default depth of 5)
    • --no-remove-listing (keep the .listing files that FTP retrievals generate)

Also useful:

  • -np (--no-parent): don't go to parent directories, only follow links under the provided path. Useful when fetching just part of a site
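
For example, a sketch (hostname and path are hypothetical) that mirrors one part of a site without wandering up into the rest of it:

wget -m -np http://example.com/docs/

Since -m implies -N and -l inf, re-running the same command later should only re-fetch pages that have changed.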


Continuing, timestamping

  • -c (--continue): continue partial files (if the server supports range fetches). Also helps avoid re-fetching files you already have. Assumes file contents did not change (verify) unless you also use -N (verify)
  • -N (--timestamping) sets the date on downloaded files according to the Last-Modified header (verify). This allows later wget invocations to be semi-clever about only downloading files that have actually changed.
  • -nc (--no-clobber) will:
    • not download any new versions of files that are already here (but see notes below)
    • disable adding .1, .2 etc to files that were already downloaded

This also means that recursive fetches will use local HTML files to see what's not yet fetched. This makes it useful for continuing an abruptly stopped download without much redundant checking - but not for updating something that may have changed, unless:

  • combining -nc with -r means files will be overwritten (under what conditions? (verify))
  • combining -nc with -N (and possibly also -r) means they may be overwritten when the timestamp says they are newer - though note that recent wget versions refuse to combine -nc with -N (verify)
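
As a sketch (hypothetical URLs): resume a partial single-file download, and separately, update an earlier recursive copy based on timestamps:

wget -c http://example.com/files/big.iso
wget -r -np -N http://example.com/files/
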

Filters

  • -A includes only things in this accept list (a comma-separated list of filename suffixes, or glob patterns when an entry contains wildcard characters like *, ?, [ ])
  • -R excludes things in this reject list (same format)

Also sometimes useful are:

  • -I specifies paths to fetch, e.g. -I '/doc,/gallery'
  • -X excludes paths, e.g. -X '/cgi-bin'


When updating a local copy, stricter filtering than before may remove files that an earlier run fetched (verify).
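
For example, a sketch (hypothetical site) that fetches only images from part of a site while skipping CGI paths:

wget -r -np -A 'jpg,png,gif' -X '/cgi-bin' http://example.com/gallery/

Note that with -r, HTML pages are still fetched so links can be followed, then deleted afterwards when they don't match the accept list.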

Alterations to apply to the local copy

Note that these are meant to create more convenient local copies, and are not so useful for mirroring -- they don't interact very well with timestamping (verify), incremental downloads, and such. (See the example after the list.)

  • -k (--convert-links) will alter HTML files to change the absolute links they contain into relative links, so that they point to your local copy and not the online copy it came from. Only necessary when the original uses absolute URLs, which in theory is best avoided in the first place.
  • -E (--html-extension; newer wget versions call it --adjust-extension) renames files served with an HTML MIME type to end in .html (and the links to these files? (verify)). This is useful when the copy on the filesystem/webserver has to be browsable (the browser/web server may not figure out that it should view things originally served with URL extensions like .asp, .php, .cgi and whatnot as HTML pages)
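
A sketch (hypothetical URL) of making a browsable offline copy of part of a site:

wget -r -np -k -E http://example.com/docs/

-E gives the saved files .html extensions where appropriate, and -k rewrites the links afterwards so they work when opened from local disk.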

Local directory structure

  • -nH (--no-host-directories): don't create directory for host in URL
    • (normally you'll get something like ./example.com/contents, which works better when traversing related sites, e.g. www.example.com/, img.example.com/, but in other cases it can just be a redundant directory)
  • --cut-dirs=n: cut n leading path parts.
    • Consider wget -r -np -nH example.com/name/app/img/stuff/ - you'll get ./name/app/img/stuff/ locally. Adding --cut-dirs=3 means you'll get ./stuff (full example below).
  • -x (--force-directories): create the local directory structure even for a single-file download (usually when you specify a file it downloads into the current directory, and when you specify a directory it mirrors that directory)
  • -nd (--no-directories): download all files to one directory (not usually that useful)
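
For example (hypothetical URL), combining these to drop both the hostname and the leading path parts:

wget -r -np -nH --cut-dirs=2 http://example.com/name/app/stuff/

...which saves the contents under ./stuff/ instead of ./example.com/name/app/stuff/.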

Examples

wget -r -np -A 7z -o log http://static.wikipedia.org/downloads/current/en

...downloads the static Wikipedia dump for English. It doesn't follow the browsing link up to previous/other dumps (because of -np), it only fetches the .7z files (you don't need the .lst files - or the HTML index pages), and writes a log to the file log (because of -o log).

Networking

  • -t n (--tries=n) retries downloading on temporary errors (but not on 404 or 'connection refused')
  • --retry-connrefused: Assume "connection refused" is a temporary error, and try again
  • --waitretry=n sets a wait that applies only to retry attempts, backing off linearly (1 second after the first failure, 2 after the second, and so on) up to n seconds
  • -w n (--wait=n) waits that many seconds (suffixes m (minutes), h (hours), d (days) also work) between retrievals. It's a way of being nice to the server when you're in no particular hurry.
  • --random-wait waits a random amount between 0 and 2* the wait interval you specified. The option was not meant to let people be asshats, but to keep the server from setting up very wide IP blocks because of just you. Play nice.


  • --limit-rate=20k limits the download speed to 20KB/s (technique is rough and bursty, though)
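
Putting these together, a sketch (hypothetical URL) of a polite, retry-tolerant fetch:

wget -r -np -w 2 --random-wait --limit-rate=50k -t 5 --retry-connrefused http://example.com/files/
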


HTTP

This article/section is a stub — probably a pile of half-sorted notes, probably a first version, and not well checked, so it may have incorrect bits. (Feel free to ignore, or tell me.)

Authentication
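
As a minimal sketch (hypothetical host and credentials), HTTP basic auth can be given on the command line:

wget --http-user=alice --http-password=secret http://example.com/private/

(Keep in mind that credentials given this way end up in your shell history and in the process list.)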

Cookies
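
A sketch (hypothetical file and URL) of reusing a cookie file and keeping session cookies between invocations:

wget --load-cookies cookies.txt --save-cookies cookies.txt --keep-session-cookies http://example.com/members/
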


Misc

--delete-after deletes files right after they are downloaded. This is intended to make proxies pre-fetch, and can be useful for stress tests.
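
For example (hypothetical URL), warming a proxy cache or stress-testing without keeping anything on disk:

wget -r --delete-after http://example.com/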

--spider just checks whether a page is there, without downloading it (perhaps useful to automate link-breakage checking on your site with a cronjob)
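
A sketch of that idea (hypothetical site and log name): recursively spider your own site, then dig through the log for 404s (the exact grep depends on your wget's log format):

wget --spider -r -np -o spider.log http://example.com/
grep ' 404 ' spider.log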

Semi-sorted

wgetting to stdout:

wget -q -O - http://example.com/foo.tar
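
The point is usually to pipe it into something else, e.g. unpacking an archive without saving it first (hypothetical URL):

wget -q -O - http://example.com/foo.tar | tar xf -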

See also