
Killer Recursive Download Command


I made this command to download a series of websites, including all their files, without being stopped by any automated protection measures (e.g. robots.txt, request frequency analysis, …). It served its purpose well. The command is this:

    wget -o log -erobots=off -r -t 3 -N -w 2 --random-wait --retry-connrefused --protocol-directories --ignore-length --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.8.1.2) Gecko/20060601 Firefox/2.0.0.2 (Ubuntu-edgy)" -l 100 -E -k -K -p http://web.site.here/

Quite long, don’t ya think? Read on for a description of what it does (that is, if you don’t have the wget manpage memorized…)

Basically it does this:

  • -o log --> outputs messages to the log file “log” instead of the terminal. Allows for easier debugging (and you can tail -F it in another terminal anyway…)
  • -erobots=off --> tells wget NOT to respect the robots.txt file - i.e. abuse webservers. This is naughty, but was necessary for the sites I wanted.
  • -r --> recursive
  • -t 3 --> retry each URL 3 times
  • -N --> turn on timestamping
  • -w 2 --> wait 2 seconds between requests
  • --random-wait --> instead of waiting exactly 2 seconds, wait a random time that averages out to 2 seconds (helps get past request frequency analysis)
  • --retry-connrefused --> for sites that go down frequently (e.g. ones hosted on home computers), retry if the connection is refused rather than just skipping the URL
  • --protocol-directories --> makes files from http://web.site.here/dir/ be laid out as ./http/web.site.here/dir/ - e.g. puts https files in a different directory to http files
  • --ignore-length --> if the webserver sends a wrong Content-Length header, ignore it
  • --user-agent="…" --> pretend to be Firefox rather than wget - prevents connections being refused by anti-spidering measures that filter purely by user agent
  • -l 100 --> recurse up to 100 levels deep
  • -E --> put .html on the end of files that are HTML but whose names don’t end in .html. This is useful so that you can download a dynamic site, stick it on a static webserver and it still works
  • -k --> convert links - VERY ADVANCED - edits the HTML files and changes the links to be relative and point to the right file names (e.g. adds the .html from the previous option). NOTE: this only works for HTML files - you will have to modify the CSS or JavaScript yourself (see the sketch after this list).
  • -K --> when doing -k, keep a backup of each original file, without links converted.
  • -p --> download everything needed to properly display the pages you download, e.g. all style sheets, JS files, images, …
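
One gotcha from the -k option above: wget won’t rewrite absolute URLs inside downloaded CSS (or JavaScript), so you may have to patch those yourself. Here is a rough sketch of one way to do it for CSS, assuming the mirror ended up under ./http/web.site.here/ (as per --protocol-directories) and you’ll serve that directory as the document root - adjust the domain and pattern to your own site:

    # Hypothetical clean-up step, not done by wget itself: rewrite absolute
    # url(...) references in the downloaded CSS to root-relative ones.
    find http/web.site.here -name '*.css' -print0 | \
        xargs -0 sed -i 's|url(http://web\.site\.here/|url(/|g'

JavaScript is trickier (URLs can be built up at runtime), so that one really is a fix-by-hand job.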

So as you can see, a lot of work went into making it (I read the entire wget man page), but I think it was worth it. I now keep it in a file so that whenever I need to do the same thing again, I have it to hand. And now that I have a blog, I thought “why not share?!” :-)
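
For reference, here is a sketch of one way to wrap it up as a script - the mirror-site.sh name and the URL argument are illustrative additions, but the wget line is exactly the command above:

    #!/bin/sh
    # mirror-site.sh (name is just an example) - reusable wrapper around the command above.
    # Usage: ./mirror-site.sh http://web.site.here/
    URL="$1"
    if [ -z "$URL" ]; then
        echo "usage: $0 <url>" >&2
        exit 1
    fi

    wget -o log -erobots=off -r -t 3 -N -w 2 --random-wait --retry-connrefused \
        --protocol-directories --ignore-length \
        --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.8.1.2) Gecko/20060601 Firefox/2.0.0.2 (Ubuntu-edgy)" \
        -l 100 -E -k -K -p "$URL"

Run it, then tail -F log in another terminal to watch the progress.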

Enjoy. And please tell me if you find it useful.
