Do, or do not. There is no 'try'

Archive for November 24th, 2008

Wget

# You have a file that contains the URLs you want to download? Use the `-i' switch:

wget -i file

If you specify `-' as the file name, the URLs will be read from standard input.
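A hand-maintained URL list often picks up blank lines and notes. The sketch below filters those out before the list reaches wget. The `clean_urls` helper and the example.com URLs are made up for illustration, and treating `#` lines as comments is this sketch's own convention, not something wget is assumed to do.

```shell
# Print the usable URLs from a list file, skipping blank lines and # comments.
# clean_urls is a hypothetical helper, not part of wget.
clean_urls() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1"
}

# Build a small sample list in a temporary file (placeholder URLs).
list=$(mktemp)
cat > "$list" <<'EOF'
# files to fetch
http://www.example.com/a.txt

http://www.example.com/b.txt
EOF

clean_urls "$list"

# To download them, pipe the result into wget on standard input:
#   clean_urls "$list" | wget -i -
```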

# Create a five levels deep mirror image of the GNU web site (five is the default recursion depth), with the same directory structure the original has, with only one try per document, saving the log of the activities to `gnulog':

wget -r -t1 http://www.gnu.org/ -o gnulog

# The same as the above, but convert the links in the HTML files to point to local files, so you can view the documents off-line:

wget --convert-links -r http://www.gnu.org/ -o gnulog

# Retrieve only one HTML page, but make sure that all the elements needed for the page to be displayed, such as inline images and external style sheets, are also downloaded. Also make sure the downloaded page references the downloaded links.

wget -p --convert-links http://www.server.com/dir/page.html

The HTML page will be saved to `www.server.com/dir/page.html', and the images, stylesheets, etc., somewhere under `www.server.com/', depending on where they were on the remote server.

# The same as the above, but without the `www.server.com/' directory. In fact, I don't want to have all those random server directories anyway; just save all those files under a `download/' subdirectory of the current directory.

wget -p --convert-links -nH -nd -Pdownload \
     http://www.server.com/dir/page.html
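To see how `-nH', `-nd' and `-P' change where a file ends up, here is a rough sketch that computes the local path from a URL. The `local_path` function is hypothetical and only approximates what wget does: it assumes a plain URL with a file path and ignores query strings and trailing-slash `index.html' handling.

```shell
# local_path URL [PREFIX] [NO_HOST] [NO_DIRS]
# Approximates the save path wget picks; a sketch, not wget's actual logic.
local_path() {
    url=$1
    prefix=${2:-.}      # what -P would set; "." means the current directory
    no_host=${3:-no}    # "yes" stands for -nH
    no_dirs=${4:-no}    # "yes" stands for -nd

    rest=${url#*://}    # drop the scheme
    host=${rest%%/*}    # host name
    path=${rest#*/}     # remote path below the host

    if [ "$no_dirs" = yes ]; then
        path=${path##*/}        # keep only the file name
    elif [ "$no_host" != yes ]; then
        path=$host/$path        # default: host directory first
    fi

    if [ "$prefix" = . ]; then
        echo "$path"
    else
        echo "$prefix/$path"
    fi
}

url=http://www.server.com/dir/page.html
local_path "$url"                     # prints www.server.com/dir/page.html
local_path "$url" . yes               # prints dir/page.html
local_path "$url" download yes yes    # prints download/page.html
```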

# Retrieve the index.html of `www.lycos.com', showing the original server headers:

wget -S http://www.lycos.com/

# Save the server headers with the file, perhaps for post-processing.

wget --save-headers http://www.lycos.com/
more index.html

# Retrieve the first two levels of `wuarchive.wustl.edu', saving them to `/tmp'.

wget -r -l2 -P/tmp ftp://wuarchive.wustl.edu/

# You want to download all the GIFs from a directory on an HTTP server. You tried `wget http://www.server.com/dir/*.gif', but that didn't work because HTTP retrieval does not support globbing. In that case, use:

wget -r -l1 --no-parent -A.gif http://www.server.com/dir/

More verbose, but the effect is the same. `-r -l1' means to retrieve recursively (see section 3. Recursive Retrieval of the Wget manual), with maximum depth of 1. `--no-parent' means that references to the parent directory are ignored (see section 4.3 Directory-Based Limits), and `-A.gif' means to download only the GIF files. `-A "*.gif"' would have worked too.
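The reason `-A.gif' and `-A "*.gif"' behave alike is that an accept-list entry containing a wildcard (`*', `?', `[' or `]') is treated as a pattern, while anything else is treated as a file name suffix. The sketch below imitates that rule in plain shell. The `accepts` function is hypothetical and covers only the name matching, not the rest of wget's recursive retrieval.

```shell
# accepts ACCLIST NAME: succeed if NAME would pass a comma-separated accept list.
# Hypothetical helper that mimics the suffix-or-pattern rule.
accepts() {
    acclist=$1
    name=$2
    old_ifs=$IFS
    IFS=,
    set -f                                  # do not expand * against local files
    for entry in $acclist; do
        case $entry in
            *\**|*\?*|*\[*|*\]*) pattern=$entry ;;     # wildcard: use as a pattern
            *)                   pattern="*$entry" ;;  # plain text: treat as a suffix
        esac
        case $name in
            $pattern) IFS=$old_ifs; set +f; return 0 ;;
        esac
    done
    IFS=$old_ifs
    set +f
    return 1
}

for f in logo.gif index.html photo.jpg; do
    if accepts ".gif,*.jpg" "$f"; then
        echo "$f: keep"
    else
        echo "$f: skip"
    fi
done
```

Several suffixes can be combined the same way on a real command line, for example `-A gif,jpg'.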

# Suppose you were in the middle of downloading, when Wget was interrupted. Now you do not want to clobber the files already present. It would be:

wget -nc -r http://www.gnu.org/

# If you want to encode your own username and password to HTTP or FTP, use the appropriate URL syntax (see section 2.1 URL Format).

wget ftp://hniksic:mypassword@unix.server.com/.emacs

Note, however, that this usage is not advisable on multi-user systems because it reveals your password to anyone who looks at the output of ps.
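One way to keep the password off the command line is a `.netrc' file, which wget consults for credentials. The sketch below writes such an entry with owner-only permissions. The `add_netrc_entry` helper is made up for this example, and it writes to a temporary file as a stand-in for `$HOME/.netrc' so that nothing real is touched.

```shell
# add_netrc_entry FILE HOST USER PASSWORD
# Hypothetical helper: appends one netrc line and keeps the file private.
add_netrc_entry() {
    (
        umask 077    # a newly created file is readable by its owner only
        printf 'machine %s login %s password %s\n' "$2" "$3" "$4" >> "$1"
    )
    chmod 600 "$1"   # tighten permissions if the file already existed
}

netrc=$(mktemp)      # stand-in for "$HOME/.netrc"
add_netrc_entry "$netrc" unix.server.com hniksic mypassword
cat "$netrc"

# With that entry in "$HOME/.netrc", the URL no longer carries the password:
#   wget ftp://unix.server.com/.emacs
```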

# You would like the output documents to go to standard output instead of to files?

wget -O - http://jagor.srce.hr/ http://www.srce.hr/

You can also combine the two options and make pipelines to retrieve the documents from remote hotlists:

wget -O - http://cool.list.com/ | wget --force-html -i -

————————————–

Basics
Wget is one of the most powerful tools available for downloading things from the internet. You can do a lot with it, but its basic use is downloading files.

To download a file just type

wget http://your-url-to/file

But downloaded this way, a broken download cannot be resumed. Use the -c option so an interrupted download can be continued:

wget -c http://your-link-to/file
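On a flaky connection, -c pairs naturally with a small retry loop: each new attempt picks up where the last one stopped. A minimal sketch follows; the `retry` function and the `RETRY_DELAY` variable are made up for this example, and the real wget call is left as a comment so the snippet runs without a network.

```shell
# retry MAX COMMAND [ARGS...]
# Run COMMAND until it succeeds, giving up after MAX attempts.
# Hypothetical helper; RETRY_DELAY (seconds between attempts) defaults to 5.
retry() {
    max=$1
    shift
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-5}"
    done
}

# Intended use:
#   retry 5 wget -c http://your-link-to/file

# Offline demonstration with a stand-in command that fails twice, then succeeds.
RETRY_DELAY=0
calls=0
flaky() {
    calls=$((calls + 1))
    [ "$calls" -ge 3 ]
}
retry 5 flaky && echo "succeeded after $calls attempts"
```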

You can also make wget identify itself as a web browser using -U.
This helps when a site doesn't allow download managers.

wget -c -U Mozilla http://your-link-to/file

Download Entire Website
You can download an entire website using the -r option.

wget -r http://your-site.com

But be careful: it really does download the entire website. Because recursive retrieval can put a large load on servers, wget obeys robots.txt. You can mirror a site on your local drive using the -m option.

wget -m http://your-site.com

You can choose how many levels deep wget digs into the site using the -l option.

wget -r -l3 http://your-site.com

This will download only up to 3 levels. Suppose you want to download only the subfolders under a website URL: use the --no-parent option. With this option wget downloads only the subfolders and ignores the parent folders.

wget -r --no-parent http://your-site.com/subfldr/subfolder

Now for the terrible ideas. To hell with webmasters who do not allow downloading their site: to ignore robots.txt, type

wget -r -U Mozilla -erobots=off http://url-to-site/

P.S. I have heard online that masquerading as a browser is a crime in some countries, or something like that.

Fooling the Webmasters
Do you think the webmaster cannot stop you when you use the above command? To fool him, use

wget -r -U Mozilla -erobots=off -w 5 --limit-rate=20k http://url-to-site/

Here -w 5 tells wget to wait 5 seconds before downloading another file, and --limit-rate=20k caps the download speed at 20 KB/s (without the k suffix the value is read as bytes per second). So you can fool the webmaster.
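Slowing wget down has a cost, and a little arithmetic shows how much. The sketch below estimates the duration of such a crawl. The file count and total size are invented inputs, and the result is only a rough lower bound, since connection setup and server delays are ignored.

```shell
# Rough estimate of crawl time with -w 5 and a 20 KB/s rate cap.
# The first two figures are made-up inputs, not measurements.
files=100          # number of files to fetch (assumed)
total_kb=10240     # total size in KB (assumed: 10 MB)
wait_s=5           # the value given to -w
rate_kb=20         # the rate cap in KB/s

wait_total=$(( (files - 1) * wait_s ))      # pauses between consecutive files
transfer_total=$(( total_kb / rate_kb ))    # time spent moving data at the cap
estimate=$(( wait_total + transfer_total ))

echo "waiting: ${wait_total}s, transferring: ${transfer_total}s, total: ${estimate}s"
# prints: waiting: 495s, transferring: 512s, total: 1007s
```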

Download all PDFs
You can download all files of a particular format, such as all the PDFs listed on a webpage:

wget -r -l1 -A.pdf --no-parent http://url-to-webpage-with-pdfs/

This is most useful for students: when they find a professor's webpage with course files, they can use this command to download all the PDFs or lecture notes.
————————————————————————————————————————

Keep wget working after logging out of an ssh connection

I usually connect through ssh to my office (which has better ADSL than my home) and download files there overnight; the next day I bring them home.

I do not want to leave my home PC on all night, so to make wget continue working after I log out, the command is:

wget -b http://some.server.com/file

Logging the output to a file

This is useful when wget is running in the background: to find out what went wrong if anything does, use the -o option and specify a file to store the logs.

wget http://some.server.com/file -o $HOME/log.txt

Of course you can combine the options, and put something like this:

wget -b -c http://some.server.com/file --limit-rate=20K -o $HOME/log.txt
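When started with -b, wget prints a line with its process ID before detaching, which is handy for checking on the download later. The sketch below pulls the number out of that line. The `background_pid` helper and the sample pid are invented, and the message wording is assumed to be the English one; a localized wget may print something different.

```shell
# Extract the process ID from the startup line wget -b prints.
# Hypothetical helper; assumes the English "Continuing in background, pid N." message.
background_pid() {
    sed -n 's/.*pid \([0-9][0-9]*\).*/\1/p'
}

# Sample line for illustration only; the pid is invented.
sample='Continuing in background, pid 12345.'
pid=$(printf '%s\n' "$sample" | background_pid)
echo "wget is running as process $pid"

# Real use (not run here):
#   pid=$(wget -b -c http://some.server.com/file -o "$HOME/log.txt" | background_pid)
#   kill -0 "$pid" 2>/dev/null && echo "still downloading"
#   tail -f "$HOME/log.txt"
```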
