Web Scraping with Perl

For a school project we need to scrape data from some websites with Perl.

Here is a simple script I used to test the Web::Scraper module from CPAN.

This is how the code works:

First you have to find a website that contains the data you want. I used the UCI ProTour website.

If you look at the page's source code you will notice that the data is stored in a table.

The data doesn't have to be in a table, but it makes life easier for this example.

Here is part of the Perl code:

 # we will save the urls from the teams
 process "table#UCITeamList > tr > td > a", 'urls[]' => '@href';
 # we will save the team names
 process "table#UCITeamList > tr > td > a", 'teams[]' => 'TEXT';

This is the essential part of using the Web::Scraper module. The #UCITeamList refers to the id attribute in the HTML (<table id="UCITeamList" …>). If you want to locate a piece of HTML by its class instead of its ID, then you should use a dot (.) instead of #. It works just like ID and class selectors in CSS.

Example:

process "table.WithBorder > ...

The next thing is getting the actual data:

@ = an attribute of the matched tag

Example:

#  <img src="www.test.com/myfile.jpg" title="Hello" />
  'titles[]' => '@title';

TEXT = the text between the opening and closing tags

Example:

#  <p>Hello, hey</p>
  'paragraphs[]' => 'TEXT';
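
Putting @attributes and TEXT together, here is a small self-contained sketch you can run without hitting a live site (the HTML snippet and the key names are made up for illustration):

#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

my $html = <<'HTML';
<img src="www.test.com/myfile.jpg" title="Hello" />
<p>Hello, hey</p>
HTML

my $s = scraper {
    process "img", 'titles[]'     => '@title';  # @title = the title attribute
    process "p",   'paragraphs[]' => 'TEXT';    # TEXT = the text between the tags
};

my $res = $s->scrape($html);
print "$_\n" for @{ $res->{titles} };      # Hello
print "$_\n" for @{ $res->{paragraphs} };  # Hello, hey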

The rest of the code just loops over the arrays with the scraped data, prints them to the screen, and saves them to a file. On this page every team row yields three links, so the modulo test in the loop keeps only the second one, which is the team name.

Each scraped URL in the array is then scraped again: those URLs lead to the team detail pages with all the riders.

Summary:

I scrape one page that contains all the teams and links to the team details. After that first scraping round I want the team detail data, so I scrape each URL. Just like a mini crawler :)

Web::Scraper is a very powerful module, but don't abuse it; web scraping can be illegal!

For more information, documentation and examples check out CPAN.

(By the way: if you want to test Web::Scraper, please use a website other than the UCI site, because I guess they would not like this ;) :D )

Here is the full code:

#!/usr/bin/perl
use warnings;
use strict;
use URI;
use Web::Scraper;

# output file for the scraped data
open my $fh, '>', 'file.txt' or die $!;

# website to scrape
my $urlToScrape = "http://www.uciprotour.com/templates/UCI/UCI2/layout.asp?MenuId=MTU4MzI&LangId=1";

# prepare the scraper
my $teamsdata = scraper {
    # we will save the urls from the teams
    process "table#UCITeamList > tr > td > a", 'urls[]' => '@href';
    # we will save the team names
    process "table#UCITeamList > tr > td > a", 'teams[]' => 'TEXT';
};
# scrape the data
my $res = $teamsdata->scrape(URI->new($urlToScrape));

# print the second field of every row (the team name)
for my $i (0 .. $#{$res->{teams}}) {
    if ($i % 3 != 0 && $i % 3 != 2) {
        print $res->{teams}[$i], "\n";
        print $fh $res->{teams}[$i], "\n";
    }
}

print $fh "\n";

# loop over every team url and scrape all the riders from each team
for my $i (0 .. $#{$res->{urls}}) {
    if ($i % 3 != 0 && $i % 3 != 2) {
        print "\n\n", $res->{teams}[$i], "\n------------------\n";
        print $fh "\n\n", $res->{teams}[$i], "\n------------------\n";

        # prepare the scraper for the team detail page
        my $rennersdata = scraper {
            # rider name
            process "table#TeamRiders > tr > td.RiderCol > a", 'renners[]' => 'TEXT';
            # rider country
            process "table#TeamRiders > tr > td.CountryCol > a", 'landrenner[]' => 'TEXT';
            # rider birthdate
            process "table#TeamRiders > tr > td.DOBCol > a", 'geboortedatums[]' => 'TEXT';
            # team address
            process "table#TeamLeft > div.AddLine", 'AddressLines[]' => 'TEXT';
        };
        # scrape the team detail page
        my $res2 = $rennersdata->scrape(URI->new($res->{urls}[$i]));

        # print every rider name
        for my $j (0 .. $#{$res2->{renners}}) {
            print $res2->{renners}[$j], "\n";
            print $fh $res2->{renners}[$j], "\n";
        }

        # DON'T FORGET THIS: it makes the script slower, but without it
        # you would be "attacking" the web server, and they don't like that
        sleep(3);
    }
}

# close the file
close $fh;

Enjoy ;)

Update 18/02/2013:
User Wisnoskij suggested an optimization. (Thanks!)
This code:

 # we will save the urls from the teams
 process "table#UCITeamList > tr > td > a", 'urls[]' => '@href';
 # we will save the team names
 process "table#UCITeamList > tr > td > a", 'teams[]' => 'TEXT';

can be replaced with a single line:

 # we will save the urls from the teams and the team names
 process "table#UCITeamList > tr > td > a", 'urls[]' => '@href', 'teams[]' => 'TEXT';

24 thoughts on "Web Scraping with Perl"

  1. Is there any way to change the user agent Web::Scraper uses to Mozilla? The site I'm trying to scrape has blocked bots in its .htaccess.
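
    • Web::Scraper fetches pages with LWP::UserAgent under the hood, so one option is to do the fetch yourself with whatever agent string you like and hand the HTTP::Response straight to scrape(), which accepts it. A minimal sketch (the agent string and URL are placeholders):

      #!/usr/bin/perl
      use strict;
      use warnings;
      use LWP::UserAgent;
      use Web::Scraper;

      # fetch the page ourselves with a browser-like agent string
      my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0' );
      my $resp = $ua->get('http://example.com/');
      die $resp->status_line unless $resp->is_success;

      # scrape() accepts an HTTP::Response object as well as a URI
      my $s = scraper { process 'title', 'title' => 'TEXT'; };
      print $s->scrape($resp)->{title}, "\n";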

  2. A simpler example:

     my $teamsdata = scraper {
         process '//table[@id="UCITeamList"]//a', 'urls[]' => '@href',
             'teams[]' => 'TEXT';
     };

     my $rennersdata = scraper {
         process '//table[@id="TeamRiders"]//td[@class="RiderCol"]/..', 'riders[]' => scraper {
             process './td[@class="RiderCol"]',   'renners'        => 'TEXT'; # rider name
             process './td[@class="DOBCol"]',     'geboortedatums' => 'TEXT'; # rider birthdate
             process './td[@class="CountryCol"]', 'landrenner'     => 'TEXT'; # rider country
         };
         process '//div[@class="AddrLine"]', 'AddressLines[]' => 'TEXT'; # team address
     };

     You can use a Firefox add-on called XPath Browser to "cook" your process.
     Hope this helps.

  3. I want to continue scraping on the next page, but the link I extract is … a href = "#" …, and applying that link would obviously not work. How do I proceed?

     Results by page
     1

     Next

     Last

     Also, how do I scrape a website that requires a login? (I have a username and password.) Once logged in, I want to scrape a specific page. Should I use WWW::Mechanize, or HTTP::Cookies (a cookie jar)?

     • For handling next pages you need to know how the URL is created. Once you find the URL structure, you can write some logic to change the page index, and you can always test whether you receive a 404 error before you scrape. Just be creative and try to figure out how the URL works: if it is like site.com/page/1, site.com/page/2, then it is very easy, you just change site.com/page/NUMBER. Maybe the page links are visible at the bottom and you can get the next URL from there; it depends on how the site was created. A minimal sketch of that idea follows below.

       About the login: I haven't done such things yet; if you find a good way, please post it here.
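
       A sketch of the page-index loop (the site.com/page/NUMBER pattern and the selector are hypothetical):

       #!/usr/bin/perl
       use strict;
       use warnings;
       use LWP::UserAgent;
       use Web::Scraper;

       my $ua = LWP::UserAgent->new;
       my $s  = scraper { process 'a', 'links[]' => '@href'; };

       my $page = 1;
       while (1) {
           # hypothetical URL pattern: bump the page index each round
           my $resp = $ua->get("http://site.com/page/$page");
           last unless $resp->is_success;   # stop on a 404 or any other error

           my $res = $s->scrape($resp);     # scrape() accepts an HTTP::Response
           # ... do something with $res->{links} ...

           $page++;
           sleep(3);                        # stay polite to the server
       }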

  4. Hi, thanks for the nice article. Does it make any difference whether I am extracting text from a div ID, a div class, a span, or plain text? Can you please post sample code for that?

     For example, if my HTML contains:

     Emp

     com: MYTEST TEXT

     How do I extract MYTEST TEXT?

     Please throw some light on this.
