Web Scraping with PowerShell

PowerShell v3 introduces some useful new cmdlets that let you download and parse a web page.
The code in this post demonstrates some very basic scripts to get you started with web scraping.

If you are not sure whether you have PowerShell v3, use this command to find out:

get-host
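
Note that Get-Host reports the version of the host application, which usually, but not always, matches the engine version. A more direct check is the engine's own version table:

$PSVersionTable.PSVersion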

The first script to get you started with web scraping:

$site = Invoke-WebRequest -UseBasicParsing -Uri www.bing.com
$site.Links | Out-GridView

This will give you all the links from the given page in a grid view.
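
If you only need the URLs themselves rather than the whole link objects, a small variation (reusing the same $site variable) keeps just the href property and drops duplicates:

$site.Links | Select-Object href -Unique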

The next script will give you all the email addresses that appear in mailto: anchors:

$site = Invoke-WebRequest -UseBasicParsing -Uri www.mywebsite.net
$site.Links | foreach {
    if ($_.href.ToLower().StartsWith("mailto:")) {
        # strip the 7-character "mailto:" prefix, keeping only the address
        $_.href.Substring(7) | Out-Default
    }
}

It just so happens that 'mywebsite.net' has anchors using the mailto: prefix.
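
If a page repeats the same address several times, you could also filter and deduplicate in a single pipeline; a minimal sketch, reusing the same $site variable:

$site.Links |
    Where-Object { $_.href -like "mailto:*" } |
    ForEach-Object { $_.href.Substring(7) } |
    Sort-Object -Unique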

The last script is a very cool one from Stack Overflow; I only modified the URL to make sure the script also works from several European countries:

function Get-FlightStatus {
    param($query)
    $url = "http://www.bing.com?cc=us&q=flight status for $query"
    $result = Invoke-WebRequest $url
    # Bing renders its answer box with the class "ans"; grab the first match
    $result.AllElements |
        Where Class -eq "ans" |
        Select -First 1 -ExpandProperty innerText
}

Use it like this (to test, you can just paste this line after the function in the Windows PowerShell ISE):

Get-FlightStatus LH3102

It will give you a result similar to this:

Flight status for Lufthansa 3102 
flightstats.com · 2 minutes ago   

Departing on time at 5:35 PM from HAM 
FROMHAM 
Hamburg5:35 PM 
12/30/2012Terminal 2 
TOVIE 
Vienna7:05 PM 
12/30/2012

PS C:\>

Don’t forget, web scraping can be illegal!

Have fun ;-)

Take a look at “Web Scraping with Perl” below.

Web Scraping with Perl

For a school project, we needed to scrape data (web scraping) from some websites with Perl.

Here is a simple script I used to test the Web::Scraper module, which can be found on CPAN.

This is how the code works:

First you have to find a website that contains the data you want. I used the UCI ProTour website.

If you look at the page source, you will notice that the data is stored in a table.

The data doesn't have to be in a table, but it makes life easier for this example.

This is part of the Perl code:

 # we will save the urls from the teams
 process "table#UCITeamList > tr > td > a", 'urls[]' => '@href';
 # we will save the team names
 process "table#UCITeamList > tr > td > a", 'teams[]' => 'TEXT';

This is the essential part of using the Web::Scraper module. The #UCITeamList refers to the ID in the HTML (<table id="UCITeamList" …>). If you want to locate a piece of HTML that has a class instead of an ID, then you should use a dot (.) instead of a hash (#). It works the same way as IDs and classes in CSS.

Example:

process "table.WithBorder > ...

The next thing is getting the actual data:

@ = an attribute of the matched tag

Example:

#  <img src="www.test.com/myfile.jpg" title="Hello" />
  'titles[]' => '@title';

TEXT = the text between the opening and closing tags

Example:

#  <p>Hello, hey</p>
  'paragra[]' => 'TEXT';
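
To see both mappings in action, here is a minimal, self-contained sketch; the HTML snippet is made up for illustration, and it relies on scrape() also accepting a plain HTML string instead of a URI:

#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# hypothetical HTML, just to demonstrate '@attr' and 'TEXT'
my $html = '<p><img src="myfile.jpg" title="Hello" /><a href="/x">Hey</a></p>';

my $s = scraper {
    process "img", 'titles[]' => '@title';   # @title = the title attribute
    process "a",   'links[]'  => 'TEXT';     # TEXT = the anchor text
};

my $res = $s->scrape($html);
print $res->{titles}[0], "\n";   # prints: Hello
print $res->{links}[0],  "\n";   # prints: Hey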

The rest of the code just loops over the arrays with the scraped data, prints them to the screen, and saves them to a file.

Each scraped URL in the array is then scraped again; those pages contain the team details with all the riders.

Summary:

I scrape one page that contains all the teams and links to the team detail pages. After that first scrape round, I want the team detail data, so I scrape each URL. Just like a mini crawler :)

Web::Scraper is a very powerful module, but don't abuse it: web scraping can be illegal!

For more information, documentation and examples check out CPAN.

(By the way: if you want to test Web::Scraper, please use a website other than the UCI site, because I guess they would not like it ;) :D )

Here is the code:

#!/usr/bin/perl
use warnings;
use strict;
use URI;
use Web::Scraper;

# output file for the scraped data
open FILE, ">file.txt" or die $!;

# website to scrape
my $urlToScrape = "http://www.uciprotour.com/templates/UCI/UCI2/layout.asp?MenuId=MTU4MzI&LangId=1";

# prepare data
my $teamsdata = scraper {
 # we will save the urls from the teams
 process "table#UCITeamList > tr > td > a", 'urls[]' => '@href';
 # we will save the team names
 process "table#UCITeamList > tr > td > a", 'teams[]' => 'TEXT';
};
# scrape the data
my $res = $teamsdata->scrape(URI->new($urlToScrape));

# print the second field (the team name)
for my $i (0 .. $#{$res->{teams}}) {
 # keep only the second link of every group of three ($i % 3 == 1); that one holds the team name
 if ($i % 3 != 0 && $i % 3 != 2) {
  print $res->{teams}[$i];
  print "\n";
  print FILE $res->{teams}[$i];
  print FILE "\n";
 }
}

print FILE "\n";

# loop over every team URL and scrape all the riders from each team
for my $i (0 .. $#{$res->{urls}}) {
 if ($i % 3 != 0 && $i % 3 != 2) {
  print "\n\n";
  print $res->{teams}[$i];
  print "\n------------------\n";
  print FILE "\n\n";
  print FILE $res->{teams}[$i];
  print FILE "\n------------------\n";

  # prepare data
  my $rennersdata = scraper {
   # rider name
   process "table#TeamRiders > tr > td.RiderCol > a", 'renners[]' => 'TEXT';
   # rider country
   process "table#TeamRiders > tr > td.CountryCol > a", 'landrenner[]' => 'TEXT';
   # rider birthdate
   process "table#TeamRiders > tr > td.DOBCol > a", 'geboortedatums[]' => 'TEXT';
   # team address
   process "table#TeamLeft > div.AddLine", 'AddressLines[]' => 'TEXT';
  };
  # scrape
  my $res2 = $rennersdata->scrape(URI->new($res->{urls}[$i]));

  for my $j (0 .. $#{$res2->{renners}}) {
   # print rider name
   print $res2->{renners}[$j];
   print "\n";
   print FILE $res2->{renners}[$j];
   print FILE "\n";
  }
 }
 # DON'T FORGET THIS: it makes your script slower,
 # but without it you would be "attacking" the web server, and they don't like that
 sleep(3);
}

# close the file
close FILE;

Enjoy ;)

Update 18/02/2013:
User Wisnoskij suggested an optimization. (Thanks!)
This code:

 # we will save the urls from the teams
 process "table#UCITeamList > tr > td > a", 'urls[]' => '@href';
 # we will save the team names
 process "table#UCITeamList > tr > td > a", 'teams[]' => 'TEXT';

can be replaced with a single line:

 # we will save the urls from the teams and the team names
 process "table#UCITeamList > tr > td > a", 'urls[]' => '@href', 'teams[]' => 'TEXT';
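
The combined call fills the same urls and teams arrays in $res, so the rest of the script should work unchanged.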