
Sunday, January 9, 2011

WGET - downloading WEB page content

I needed to download all the HTML content from Wikipedia's American inventors category for later processing.

A few requirements:

  • don't flood the server with too many requests at once
  • retrieve all the relevant information with as little irrelevant content as possible

Tools: Windows PowerShell & wget

Remember that besides the normal "PowerShell" console there is something called "PowerShell ISE", which stands for Integrated Scripting Environment. It offers syntax highlighting for PowerShell scripts and lets you run them and see the results in the same window. I liked the environment very much :)

The list of pages, inventors.txt (one URL per line), was taken from the wiki and cleaned up in Notepad++. The script reads it like this:



# Crawl each URL listed in inventors.txt (one per line)
foreach ($webpage in Get-Content "inventors.txt") {
    Invoke-Expression "wget --recursive -l 3 -U Mozilla --wait 1 ""$webpage"""
}

as simple as that :)
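If you would rather not clean the list by hand, the same step can be roughly sketched in PowerShell. This is only an illustration: the function name Get-WikiLinks, the file name category.html, the base URL and the href pattern are all my assumptions about a saved copy of the category page, not what I actually ran. Navigation links such as the main page will still slip through, so the result needs a quick look before use.

```powershell
# Hypothetical helper: pulls /wiki/ article links out of saved category-page HTML.
# The href pattern and the default base URL are assumptions about the page markup.
function Get-WikiLinks {
    param(
        [string]$Html,
        [string]$BaseUrl = "https://en.wikipedia.org"
    )
    # Match href="/wiki/Title". The character class excludes ':' and '#',
    # so namespaced pages (Category:, Special:, File:) and anchors are skipped.
    $pattern = 'href="(/wiki/[^":#]+)"'
    [regex]::Matches($Html, $pattern) |
        ForEach-Object { $BaseUrl + $_.Groups[1].Value } |
        Select-Object -Unique    # drop repeated links, keep first-seen order
}

# Usage, assuming the category page was saved as category.html:
# Get-WikiLinks (Get-Content "category.html" | Out-String) | Set-Content "inventors.txt"
```

The output is one full URL per line, which is the shape the foreach loop above expects.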

TIP: to enable execution of PowerShell scripts in Windows 7, run this from a PowerShell window started as administrator:
Set-ExecutionPolicy RemoteSigned

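In wget:
-l 3 : maximum link depth; wget follows links up to 3 levels away from the starting page (for HTTP it walks the link graph breadth-first, level by level)
-U Mozilla : the User-Agent string sent with each request
--wait 1 : wait 1 second between requests, so as not to overload the server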
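Here is a slightly more defensive sketch of the same loop, with the flags kept in an array and wget started through the call operator (&) instead of a command string. Get-WgetArgs, Invoke-Crawl and their parameters are names I made up for illustration. The default of "wget.exe" is deliberate: in Windows PowerShell 3.0 through 5.1 the bare name wget is an alias for Invoke-WebRequest, and adding .exe sidesteps it.

```powershell
# Hypothetical helper: builds the wget argument list used in this post.
function Get-WgetArgs {
    param([string]$Url, [int]$Depth = 3, [int]$WaitSeconds = 1)
    # One array element per command-line token, so no manual quoting is needed.
    @("--recursive", "-l", "$Depth", "-U", "Mozilla", "--wait", "$WaitSeconds", $Url)
}

# Hypothetical wrapper: runs wget once per non-empty line of the list file.
function Invoke-Crawl {
    param([string]$ListPath = "inventors.txt", [string]$WgetPath = "wget.exe")
    Get-Content $ListPath |
        Where-Object { $_.Trim() -ne "" } |    # skip blank lines
        ForEach-Object {
            $wgetArgs = Get-WgetArgs $_.Trim()
            & $WgetPath $wgetArgs              # array elements become separate arguments
        }
}

# Usage, assuming wget.exe is on the PATH:
# Invoke-Crawl -ListPath "inventors.txt"
```

Passing the URL as its own array element means a page address containing spaces or quotes does not break the command line.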


If you like this post, please leave a comment :)