A few requirements:
- don't overload the server with too many requests at once
- retrieve all the relevant info with as little irrelevant data as possible
Starting page: http://en.wikipedia.org/wiki/Category:American_inventors
Tools: Windows PowerShell & wget
Besides the regular PowerShell console there is also "PowerShell ISE", which stands for Integrated Scripting Environment. It offers syntax highlighting for PowerShell scripts and lets you run them and see the results in the same window. I liked the environment very much :)
The list of inventor pages was taken from the wiki and cleaned up in Notepad++:
inventors.txt:
--------------
http://en.wikipedia.org/wiki/Anthony_R._Barringer
http://en.wikipedia.org/wiki/Mark_B._Barron
http://en.wikipedia.org/wiki/George_Bartholomew
http://en.wikipedia.org/wiki/Hans_Baruch
[....]
script:
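# Download every page listed in inventors.txt, following links up to 3 levels deep.
foreach ($webpage in Get-Content "inventors.txt")
{
    # Call wget.exe directly instead of going through Invoke-Expression.
    # In newer Windows PowerShell versions a bare "wget" is an alias for Invoke-WebRequest.
    & wget.exe --recursive -l 3 -U Mozilla --wait 1 $webpage
}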
as simple as that :)
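One thing to keep in mind: --wait only spaces out the requests inside a single wget run, so a loop that starts wget once per URL gets no pause between runs. Below is a minimal sketch of a helper that adds one. The name Invoke-Throttled and its parameters are made up for this example, and it runs as a dry run that only prints the commands:

```powershell
# Hypothetical helper (not part of PowerShell or wget):
# runs $Action once per non-empty item and sleeps between runs.
function Invoke-Throttled {
    param(
        [string[]]$Items,
        [scriptblock]$Action,
        [int]$DelaySeconds = 1
    )
    $first = $true
    foreach ($item in $Items) {
        $url = "$item".Trim()
        # Skip blank lines, e.g. an empty last line in inventors.txt.
        if ($url -eq "") { continue }
        # Pause before every run except the first one.
        if (-not $first) { Start-Sleep -Seconds $DelaySeconds }
        $first = $false
        & $Action $url
    }
}

# Dry run on two URLs from inventors.txt plus a blank line:
# the action returns the command text instead of downloading anything.
$sample = @(
    "http://en.wikipedia.org/wiki/Anthony_R._Barringer",
    "",
    "http://en.wikipedia.org/wiki/Hans_Baruch"
)
$commands = @(Invoke-Throttled -Items $sample -DelaySeconds 0 -Action {
    param($url)
    "wget.exe --recursive -l 3 -U Mozilla --wait 1 $url"
})
$commands
```

For a real download, the action body would call wget.exe instead of returning a string, and $Items would come from Get-Content "inventors.txt".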
TIP: to enable execution of PowerShell scripts in Windows 7, run this in a PowerShell window started as administrator:
Set-ExecutionPolicy RemoteSigned
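In wget:
--recursive : follow the links found in each downloaded page
-l 3 : maximum link depth; wget walks the link graph breadth-first (not DFS) and stops 3 links away from the starting page
-U Mozilla : send "Mozilla" as the User-Agent string
--wait 1 : wait 1 second between requests, so as not to overload the server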
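To make the depth limit more concrete, here is a small sketch of a depth-limited breadth-first walk over a toy link graph. This is not how wget is implemented, just an illustration of the idea. The page names A to G and the function Get-PagesWithinDepth are invented for the example:

```powershell
# Toy link graph (made-up page names): each page maps to the pages it links to.
$links = @{
    "A" = @("B", "C")
    "B" = @("D")
    "C" = @("D", "E")
    "D" = @("F")
    "E" = @()
    "F" = @("G")
    "G" = @()
}

# Hypothetical helper: breadth-first walk that stops following links at $MaxDepth.
function Get-PagesWithinDepth {
    param([hashtable]$Graph, [string]$Start, [int]$MaxDepth)

    # page -> number of links between it and the start page
    $depthOf = @{}
    $depthOf[$Start] = 0
    $queue = New-Object System.Collections.Queue
    $queue.Enqueue($Start)

    while ($queue.Count -gt 0) {
        $page = $queue.Dequeue()
        # Pages at the depth limit are kept, but their links are not followed.
        if ($depthOf[$page] -ge $MaxDepth) { continue }
        foreach ($next in $Graph[$page]) {
            # Visit every page only once.
            if (-not $depthOf.ContainsKey($next)) {
                $depthOf[$next] = $depthOf[$page] + 1
                $queue.Enqueue($next)
            }
        }
    }
    return $depthOf
}

$result = Get-PagesWithinDepth -Graph $links -Start "A" -MaxDepth 3
$result.GetEnumerator() | Sort-Object Key | ForEach-Object { "{0} depth {1}" -f $_.Key, $_.Value }
```

With a limit of 3, page F (three links from A) is still reached, while G (four links away) is never visited.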