
Sunday, January 9, 2011

Wget - downloading web page content

I needed to download all the HTML content from Wikipedia's American inventors category for later processing.

A few requirements:

  • don't overload the server with too many requests at once
  • retrieve all the relevant content with as little irrelevant content as possible

http://en.wikipedia.org/wiki/Category:American_inventors

Tools: Windows PowerShell and wget

Besides the regular PowerShell console there is also the PowerShell ISE, which stands for Integrated Scripting Environment. It offers syntax highlighting for PowerShell scripts and lets you run them and see the results in the same window. I liked the environment very much :)

The list of inventor pages, copied from the Wikipedia category page and cleaned up in Notepad++:

inventors.txt:
--------------
http://en.wikipedia.org/wiki/Anthony_R._Barringer
http://en.wikipedia.org/wiki/Mark_B._Barron
http://en.wikipedia.org/wiki/George_Bartholomew
http://en.wikipedia.org/wiki/Hans_Baruch
[....]
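
The Notepad++ clean-up could probably also be scripted. Below is a minimal sketch, not what I actually ran: the function name Get-WikiArticleUrls and the file name category.html are made up for illustration, and it assumes the category page was saved by hand. It keeps only /wiki/ links whose title has no colon, so Category:, File: and similar pages are skipped.

```powershell
# Hypothetical helper: pull article links out of saved category-page HTML.
function Get-WikiArticleUrls {
    param([string]$Html)

    # Match href="/wiki/Title"; titles containing ':' or '#' are rejected,
    # which drops Category:, File:, Special: and anchor links.
    $pattern = 'href="(/wiki/[^":#]+)"'

    [regex]::Matches($Html, $pattern) |
        ForEach-Object { 'http://en.wikipedia.org' + $_.Groups[1].Value } |
        Select-Object -Unique
}

# Assumed usage, with category.html saved from the browser:
# $html = (Get-Content 'category.html') -join ' '
# Get-WikiArticleUrls $html | Set-Content 'inventors.txt'
```

The output would still need a quick manual check, because navigation links such as the main page match the same pattern.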


script:


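# Download every page listed in inventors.txt (one URL per line).
foreach ($webpage in Get-Content "inventors.txt")
{
    # Call wget.exe directly: in newer PowerShell versions a bare "wget" is an alias for Invoke-WebRequest,
    # and the call operator passes the URL as a single argument without re-parsing it.
    & wget.exe --recursive -l 3 -U Mozilla --wait 1 $webpage
}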

as simple as that :)

TIP: to enable execution of PowerShell scripts in Windows 7, run this from a PowerShell prompt started as administrator:
Set-ExecutionPolicy RemoteSigned


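In wget:
--recursive : follow the links found on each downloaded page
-l 3 : maximum link depth; wget follows links up to 3 levels away from the starting page (for HTTP it walks them breadth-first, not depth-first)
-U Mozilla : send "Mozilla" as the User-Agent string
--wait 1 : wait 1 second between requests, so as not to flood the server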
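
If only the listed articles themselves are needed, a non-recursive run may fit the "as little irrelevant as possible" requirement better, since depth 3 pulls in many linked pages. This is a sketch of that alternative, not the command I used; the output folder name "pages" is an assumption.

```powershell
# Sketch: fetch only the pages listed in inventors.txt, without recursion.
# --input-file reads one URL per line, so no foreach loop is needed.
# --wait=1 still pauses one second between requests.
# --directory-prefix saves everything under the (assumed) folder "pages".
& wget.exe --input-file=inventors.txt --wait=1 --user-agent=Mozilla --directory-prefix=pages
```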
