Scrape and download PDF files from Google or Bing using PowerShell

Here is a very simple script that you can run in PowerShell ISE. It could probably be written much better, but it works. The script simply leans on the Google search engine by querying for a specific file type with the filetype: operator, and the same approach should also work with the Bing search engine.
To make the script work, make sure the directory C:\temp\dwnld\ exists. You can also easily change the regular expression pattern (for example, to match https links as well) and the keywords.

Comments with modifications to the script are always welcome 😉

# keywords to search for and the file type we want
$keywords = @("manual", "microsoft", "powershell")
# regular expression that matches direct links to PDF files
$pattern = 'http://(.*?)\.pdf'
# downloaded files end up here; make sure this directory exists
$storageDir = "C:\temp\dwnld\"
$filetype = "pdf"
$rand = New-Object System.Random

$keywords | ForEach-Object {
    # build the search URL with the filetype: operator
    $urlToScrapeWithKeyword = "http://www.google.be/search?hl=nl&tbo=d&biw=1229&bih=677&output=search&sclient=psy-ab&q={0}+filetype%3A{1}&btnK=" -f $_, $filetype
    $urlToScrapeWithKeyword | Out-Default
    # collect every unique link on the results page
    # (Sort-Object -Unique instead of Get-Unique, which only drops adjacent duplicates)
    (Invoke-WebRequest -UseBasicParsing -Uri $urlToScrapeWithKeyword).Links | Select-Object -ExpandProperty href | Sort-Object -Unique | ForEach-Object {
        if ($_ -match $pattern) {
            $Matches[0] | Out-Default
            try {
                # download the matched PDF into the storage directory
                Start-BitsTransfer $Matches[0] $storageDir
                "Download ok" | Out-Default
            } catch [Exception] {
                "Download failed:" | Out-Default
                $_.Exception.Message | Out-Default
            }
            # wait a random 20-42 seconds between downloads so we don't hammer the search engine
            "Sleeping" | Out-Default
            Start-Sleep -Seconds $rand.Next(20, 43)
        }
    }
}

Enjoy 😉
Don’t forget, web scraping can be illegal! Use it with care!

Web Scraping with Perl

For a school project we need to scrape data from some websites with Perl.

Here is a simple script that I used to test the Web::Scraper package, which can be found on CPAN.

This is how the code works:

First you have to find a website that contains the data you want. I used the UCI ProTour website (the exact URL is in the code below).

If you look at the source code you will notice that my data is stored in a table:
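
Roughly, the relevant markup looks like this (a sketch reconstructed from the selectors and the index filter used below, not copied verbatim from the page). Each team apparently contributes three links, and the second one carries the team name:

 <table id="UCITeamList">
   <tr>
     <td><a href="...">...</a></td>
     <td><a href="/some/team/detail/url">Team Name</a></td>
     <td><a href="...">...</a></td>
   </tr>
   <!-- one row like this per team -->
 </table>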

The data doesn’t have to be in a table, but it just makes life easier for this example.

This is a part of the Perl code:

 # we will save the urls from the teams
 process "table#UCITeamList > tr > td > a", 'urls[]' => '@href';
 # we will save the team names
 process "table#UCITeamList > tr > td > a", 'teams[]' => 'TEXT';

This is the essential part of using the Web::Scraper module. The #UCITeamList refers to the ID in the HTML (<table id=UCITeamList …). If you want to locate a piece of HTML by its class instead of its ID, then you should use a dot (.) instead of a hash (#). It works exactly like IDs and classes in CSS selectors.

Example:

process "table.WithBorder > ...

The next thing is getting the actual data:

@ = an attribute

Example:

#  <img src="www.test.com/myfile.jpg" title="Hello" />
  'titles[]' => '@title';

TEXT = the text between the opening and closing tag

Example:

#  <p>Hello, hey</p>
  'paragra[]' => 'TEXT';
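
To see both in action, here is a tiny self-contained test. The HTML string and the key names are my own, made up for illustration; as far as I know, Web::Scraper can scrape a plain HTML string as well as a URI:

#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# made-up HTML fragment, just for this test
my $html = '<img src="www.test.com/myfile.jpg" title="Hello" /><p>Hello, hey</p>';

my $s = scraper {
    process "img", 'titles[]'  => '@title'; # @title -> value of the title attribute
    process "p",   'paragra[]' => 'TEXT';   # TEXT   -> the text between <p> and </p>
};

my $res = $s->scrape($html);
print $res->{titles}[0], "\n";  # prints: Hello
print $res->{paragra}[0], "\n"; # prints: Hello, hey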

The rest of the code just loops over the arrays with my scraped data, prints them to the screen and saves them into a file.

Each scraped URL in the array is then scraped again: those pages contain the team details with all the riders.

Summary:

I scrape one page that contains all the teams and links to the team details. After that first scrape round I scrape each of those URLs to get the team detail data. Just like a mini crawler 🙂

Web-Scraper (or Web::Scraper) is a very powerful package, but don’t abuse it: web scraping can be illegal!

For more information, documentation and examples check out CPAN.

(By the way: if you want to test Web::Scraper, please use another website instead of the UCI site, because I guess they will not like this 😉 😀 )

Here is the code:

#!/usr/bin/perl
use warnings;
use strict;
use URI;
use Web::Scraper;

open FILE, ">file.txt" or die $!;

# website to scrape
my $urlToScrape = "http://www.uciprotour.com/templates/UCI/UCI2/layout.asp?MenuId=MTU4MzI&LangId=1";

# prepare data
my $teamsdata = scraper {
    # we will save the urls from the teams
    process "table#UCITeamList > tr > td > a", 'urls[]' => '@href';
    # we will save the team names
    process "table#UCITeamList > tr > td > a", 'teams[]' => 'TEXT';
};
# scrape the data
my $res = $teamsdata->scrape(URI->new($urlToScrape));

# print the team names; only every third link (index % 3 == 1) is a team name
for my $i (0 .. $#{$res->{teams}}) {
    if ($i % 3 == 1) {
        print $res->{teams}[$i];
        print "\n";
        print FILE $res->{teams}[$i];
        print FILE "\n";
    }
}

print FILE "\n";

# loop over every team URL and scrape all the riders from each team
for my $i (0 .. $#{$res->{urls}}) {
    if ($i % 3 == 1) {
        print "\n\n";
        print $res->{teams}[$i];
        print "\n------------------\n";
        print FILE "\n\n";
        print FILE $res->{teams}[$i];
        print FILE "\n------------------\n";

        # prepare the scraper for the team detail page
        my $rennersdata = scraper {
            # rider name
            process "table#TeamRiders > tr > td.RiderCol > a", 'renners[]' => 'TEXT';
            # rider country
            process "table#TeamRiders > tr > td.CountryCol > a", 'landrenner[]' => 'TEXT';
            # rider birthdate
            process "table#TeamRiders > tr > td.DOBCol > a", 'geboortedatums[]' => 'TEXT';
            # team address
            process "table#TeamLeft > div.AddLine", 'AddressLines[]' => 'TEXT';
        };
        # scrape the team detail page
        my $res2 = $rennersdata->scrape(URI->new($res->{urls}[$i]));

        for my $j (0 .. $#{$res2->{renners}}) {
            # print rider name
            print $res2->{renners}[$j];
            print "\n";
            print FILE $res2->{renners}[$j];
            print FILE "\n";
        }
    }
    # DON'T FORGET THIS: the sleep makes your script slow, but without it
    # you would be "attacking" the web server, and they don't like that
    sleep(3);
}

# close the file
close FILE;

Enjoy 😉

Update 18/02/2013:
User Wisnoskij suggested an optimization. (Thanks!)
This code:

 # we will save the urls from the teams
 process "table#UCITeamList > tr > td > a", 'urls[]' => '@href';
 # we will save the team names
 process "table#UCITeamList > tr > td > a", 'teams[]' => 'TEXT';

can be replaced with a single line:

 # we will save the urls from the teams and the team names
 process "table#UCITeamList > tr > td > a", 'urls[]' => '@href', 'teams[]' => 'TEXT';
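
As far as I can tell this behaves exactly like the two separate rules: both keys are filled from the same matched <a> elements, so urls[] and teams[] stay index-aligned, and the selector is simply matched once instead of twice.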