Hobbits would make great programmers

The “Lord of the Rings” characters embody what Wall sees as the three virtues of good programmers: laziness, impatience, and hubris.

[ wikipedia ]

Additional info about Perl Web Scraping

In case you need some additional information about Web Scraping with Perl (aka Screen Scraping) with the Web::Scraper module:

😉

Web Scraping with Perl

For a school project we need to scrape data from some websites with Perl.

Here is a simple script that I used to test the Web::Scraper package, which can be found on CPAN.

This is how the code works:

First you have to find a website that contains the data you want. I used the UCI ProTour website:

If you look at the source code you will notice that my data is stored in a table:

The data doesn’t have to be in a table, but it makes life easier for this example.

This is a part of the Perl code:

 # we will save the urls from the teams
 process "table#UCITeamList > tr > td > a", 'urls[]' => '@href';
 # we will save the team names
 process "table#UCITeamList > tr > td > a", 'teams[]' => 'TEXT';

This is the essential part of using the Web::Scraper module. The #UCITeamList refers to the ID in the HTML (<table id=UCITeamList …). If you want to locate a piece of HTML by class instead of by ID, then you should use a dot (.) instead of #. It works just like IDs and classes in CSS.

Example:

process "table.WithBorder > ...

The next thing is getting the actual data:

@ = an attribute

Example:

#  <img src="www.test.com/myfile.jpg" title="Hello" />
  'titles[]' => '@title';

TEXT = the part between two tags

Example:

#  <p>Hello, hey</p>
  'paragra[]' => 'TEXT';
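
To try these rules without hitting a live site, the same ideas can be run against a string of HTML. This is only a minimal sketch, assuming Web::Scraper is installed from CPAN; the HTML snippet, the gallery ID and the note class are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Made-up HTML, only for illustration
my $html = <<'HTML';
<html><body>
<div id="gallery">
  <img src="a.jpg" title="Hello" />
  <img src="b.jpg" title="World" />
</div>
<p class="note">Hello, hey</p>
<p>not selected</p>
</body></html>
HTML

my $demo = scraper {
    # '#gallery' selects by ID, '@title' takes the title attribute
    process 'div#gallery img', 'titles[]' => '@title';
    # '.note' selects by class, 'TEXT' takes the text between the tags
    process 'p.note', 'notes[]' => 'TEXT';
};

# scrape() also accepts a string of HTML instead of a URI object
my $res = $demo->scrape($html);

print join(", ", @{ $res->{titles} }), "\n";
print join(", ", @{ $res->{notes} }), "\n";
```

Because nothing is downloaded here, this is a safe way to experiment with selectors before pointing the scraper at a real page.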

The rest of the code just loops over the arrays with my scraped data, prints it to the screen and saves it to a file.

Each scraped URL in my array is also scraped in turn. These URLs point to the team detail pages, which list all riders.

Summary:

I scrape one page that contains all the teams and links to the team details. After that first round I want the team detail data, so I scrape each URL. Just like a mini crawler 🙂

Web-Scraper (or Web::Scraper) is a very powerful package, but don’t abuse it. Web scraping can be illegal!

For more information, documentation and examples check out CPAN.

(By the way: if you want to test Web::Scraper, please use another website instead of the UCI site, because I guess they will not like this 😉 😀 )

Here is the code:

#!/usr/bin/perl
use warnings;
use strict;
use URI;
use Web::Scraper;

# open the output file (lexical filehandle, three-argument open)
open(my $fh, '>', 'file.txt') or die "Cannot open file.txt: $!";

# website to scrape
my $urlToScrape = "http://www.uciprotour.com/templates/UCI/UCI2/layout.asp?MenuId=MTU4MzI&LangId=1";

# prepare data
my $teamsdata = scraper {
    # we will save the urls from the teams
    process "table#UCITeamList > tr > td > a", 'urls[]' => '@href';
    # we will save the team names
    process "table#UCITeamList > tr > td > a", 'teams[]' => 'TEXT';
};
# scrape the data
my $res = $teamsdata->scrape(URI->new($urlToScrape));

# print the second field (the team name)
for my $i (0 .. $#{$res->{teams}}) {
    if ($i%3 != 0 && $i%3 != 2) {
        print $res->{teams}[$i];
        print "\n";
        print {$fh} $res->{teams}[$i];
        print {$fh} "\n";
    }
}

print {$fh} "\n";

# loop over every team url and scrape all the riders from each team
for my $i (0 .. $#{$res->{urls}}) {
    if ($i%3 != 0 && $i%3 != 2) {
        print "\n\n";
        print $res->{teams}[$i];
        print "\n------------------\n";
        print {$fh} "\n\n";
        print {$fh} $res->{teams}[$i];
        print {$fh} "\n------------------\n";

        # prepare data
        my $rennersdata = scraper {
            # rider name
            process "table#TeamRiders > tr > td.RiderCol > a", 'renners[]' => 'TEXT';
            # rider country
            process "table#TeamRiders > tr > td.CountryCol > a", 'landrenner[]' => 'TEXT';
            # rider birthdate
            process "table#TeamRiders > tr > td.DOBCol > a", 'geboortedatums[]' => 'TEXT';
            # team address
            process "table#TeamLeft > div.AddLine", 'AddressLines[]' => 'TEXT';
        };
        # scrape
        my $res2 = $rennersdata->scrape(URI->new($res->{urls}[$i]));

        for my $j (0 .. $#{$res2->{renners}}) {
            # print rider name
            print $res2->{renners}[$j];
            print "\n";
            print {$fh} $res2->{renners}[$j];
            print {$fh} "\n";
        }
    }
    # DON'T FORGET THIS, this will make your script slow
    # but if it's not there you will be "attacking" the webserver and they don't like that
    sleep(3);
}

# close the file
close $fh;
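
The `$i%3 != 0 && $i%3 != 2` test in both loops keeps only the indices where `$i % 3 == 1`, in other words the second of every three links. Here is a minimal sketch of the same filter written with grep and an array slice. The sub name second_of_three and the sample list are made up, and the "three links per team" layout is my assumption about the page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep the second item of every group of three (indices 1, 4, 7, ...).
sub second_of_three {
    my @items  = @_;
    my @wanted = grep { $_ % 3 == 1 } 0 .. $#items;
    return @items[@wanted];
}

# Made-up data: assume each team yields three links and the name is the second one
my @links = ('logo A', 'Team A', 'site A', 'logo B', 'Team B', 'site B');
print "$_\n" for second_of_three(@links);
```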

Enjoy 😉

Update 18/02/2013:
User Wisnoskij suggested an optimization. (Thanks!)
This code:

 # we will save the urls from the teams
 process "table#UCITeamList > tr > td > a", 'urls[]' => '@href';
 # we will save the team names
 process "table#UCITeamList > tr > td > a", 'teams[]' => 'TEXT';

can be replaced with a single line:

 # we will save the urls from the teams and the team names
 process "table#UCITeamList > tr > td > a", 'urls[]' => '@href', 'teams[]' => 'TEXT';

I Love Errors #13

It’s been a while but I’ve collected some fails:

Keep on failing 😀

Perl: Storing arrays in hashes

Let’s start with a quote from ‘Learning Perl’:

What Is a Hash?

A hash is a data structure, not unlike an array in that it can hold any number of values and retrieve them at will. But instead of indexing the values by number, as we did with arrays, we’ll look up the values by name. That is, the indices (here, we’ll call them keys ) aren’t numbers, but instead they are arbitrary unique strings


Our goal: We want to store an array as a value in the hash instead of saving a single value.

Let’s write some code:

#!/usr/bin/perl
use strict;
use warnings;

my %hash = ();
my @array = (1, 2, 3, 4, 5);
my $aref = \@array;
$hash{"Testing"} = $aref;
push(@{$hash{"Testing"}},6);
print join " ",@{$hash{"Testing"}};

This piece of Perl code stores an array as a value in a hash and adds a new element to the array.

Now let’s analyze the code.

First we create an empty hash and we create an array and initialize it.

my %hash = ();
my @array = (1, 2, 3, 4, 5);

Now we need a reference to our array (so that we can store it as a value in the hash):

my $aref = \@array;

In the next step we actually do what we wanted: storing an array in a hash value. We create a key in our hash called “Testing” and store the reference to our array as its value.

$hash{"Testing"} = $aref;

The reference ($aref) points to our array, so we can reach the array through the hash. 🙂
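
One consequence worth seeing in code: because the hash holds a reference, changes made through the hash also change the original array. A minimal sketch contrasting a shared reference with an independent copy (the key names shared and copy are made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @array = (1, 2, 3);
my %hash;

$hash{shared} = \@array;      # reference to the original array
$hash{copy}   = [ @array ];   # reference to a new, independent copy

push @{ $hash{shared} }, 4;   # also changes @array
push @{ $hash{copy} },   5;   # leaves @array alone

print "array:  @array\n";
print "shared: @{ $hash{shared} }\n";
print "copy:   @{ $hash{copy} }\n";
```

If the array should live only inside the hash, the copy form with [ ] is usually the safer choice.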

This code adds a new element with value '6' to our array:
push(@{$hash{"Testing"}},6);


Explanation:

# Gives the value of key Testing, it's the reference to the array
$hash{"Testing"}

# Dereference it with @{ } to access the array (that's why there is an @ at the beginning)
@{$hash{"Testing"}}
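
The same push syntax also works when the key does not exist yet: Perl creates the array reference on first use (autovivification). A minimal sketch that groups values by key this way; the sub name group_pairs and the sample data are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group values by key; each hash value is a reference to an array.
sub group_pairs {
    my @pairs = @_;    # list of [key, value] pairs
    my %groups;
    for my $pair (@pairs) {
        my ($key, $value) = @$pair;
        # the array reference is created automatically on the first push
        push @{ $groups{$key} }, $value;
    }
    return \%groups;
}

# Made-up sample data
my $groups = group_pairs(
    [ fruit => 'apple' ], [ veg => 'leek' ], [ fruit => 'pear' ],
);

for my $key (sort keys %$groups) {
    print "$key: ", join(", ", @{ $groups->{$key} }), "\n";
}
```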

I hope that helped 🙂

Perl: The spaceship operator

At school we learned a new operator for Perl called the “spaceship operator”. It’s just this: <=>
Let your imagination speak 🙂

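The operator compares two numbers and returns -1, 0 or 1, depending on whether the left operand is smaller than, equal to or larger than the right one.
Inside a sort block the two elements being compared are always available as $a and $b. You can’t name them $x and $y there: these two variable names are predefined by Perl for sort.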
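
A quick way to see the three possible results is to call the operator on ordinary values outside of sort. This is just a small sketch; the helper name compare_numbers is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Thin wrapper around <=> so the result is easy to print
sub compare_numbers {
    my ($x, $y) = @_;
    return $x <=> $y;
}

for my $pair ([3, 7], [7, 3], [5, 5]) {
    my ($x, $y) = @$pair;
    print "$x <=> $y gives ", compare_numbers($x, $y), "\n";
}
```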

In the example below I sort 2 arrays:

#!/usr/bin/perl
use strict;
use warnings;

my (@array1, @array1_sorted, @array2, @array2_sorted);

@array1 = @array2 = (10, 5, -2, -11, -35, 7, 1, 0, 6);

# ----------------------------------------------------
# Method 1
# @sortedArray = (sort { condition(s) } @arrayToSort);
# ----------------------------------------------------
@array1_sorted = (sort
		  {
		    if ($a < $b)
		    {
		      return -1;
		    }
		    elsif ($a > $b)
		    {
		      return 1;
		    }
		    else
		    {
		      return 0;
		    }
		  }
		  @array1);

print join "\n",@array1_sorted;

print "\n\n";

# ----------------------------------------------------
# Method 2
# Using the <=> operator.
# ----------------------------------------------------
@array2_sorted = (sort {$a <=> $b} @array2);

print join "\n",@array2_sorted;

The Output:

-35
-11
-2
0
1
5
6
7
10

-35
-11
-2
0
1
5
6
7
10

When you want to sort an array of strings just use cmp instead of <=>.
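
A minimal sketch of the string variant, plus two variations that come up often: ignoring case and reversing the order. The sample lists are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @names = ('pear', 'Banana', 'apple');

# cmp compares strings by character codes, so uppercase letters sort first
my @by_string = sort { $a cmp $b } @names;

# lowercase both sides for a case-insensitive order
my @ignore_case = sort { lc($a) cmp lc($b) } @names;

# swap $a and $b to reverse the order (works for cmp as well)
my @descending = sort { $b <=> $a } (10, 5, -2, 7);

print "@by_string\n";
print "@ignore_case\n";
print "@descending\n";
```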

 

Update: line 40 (the last print) originally printed @array1_sorted instead of @array2_sorted, but the output is the same. Thanks, Fuss 😉


Perl and numbers

Try this in Perl:

#!/usr/bin/perl
use strict;
use warnings;

my $a = "11.123456789";
my $b = "10.123456789";

my $c = "235.123456789";
my $d = "10.123456789";

print $a - sprintf("%.0f", $a)."\n";
print $b - sprintf("%.0f", $b)."\n";

print $c - sprintf("%.0f", $c)."\n";
print $d - sprintf("%.0f", $d)."\n";

if ( ($a - sprintf("%.0f", $a)) eq ($b - sprintf("%.0f", $b)))
{
 print "\nsame: for small numbers it's ok for a and b";
}
else
{
 print "\ndifferent";
}

if ( ($c - sprintf("%.0f", $c)) eq ($d - sprintf("%.0f", $d)))
{
 print "\nsame";
}
else
{
 print "\ndifferent: for a bigger number (c) it fails";
}

This is the output:

0.123456789
0.123456789
0.123456788999988
0.123456789

same: for small numbers it's ok for a and b
different: for a bigger number (c) it fails

Why do we get 0.123456788999988 when the floating point variable is a bit larger (235 in my case)?
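
The short answer is binary floating point. A double holds roughly 15 to 17 significant decimal digits in total, so the larger the integer part, the fewer digits are left for the fraction. 235.123456789 is stored with a slightly larger absolute error than 10.123456789, and subtracting the integer part exposes that error in the digits Perl prints. Below is a minimal sketch of two common workarounds: rounding before comparing, and comparing with a tolerance. The helper names fraction and nearly_equal are made up, and int() is used here to drop the integer part instead of the sprintf("%.0f") call from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fractional part of a number, rounded to a fixed number of decimals
sub fraction {
    my ($number, $decimals) = @_;
    return sprintf("%.${decimals}f", $number - int($number));
}

# True when two numbers differ by less than a small tolerance
sub nearly_equal {
    my ($x, $y, $tolerance) = @_;
    return abs($x - $y) < $tolerance;
}

my ($c, $d) = ("235.123456789", "10.123456789");

print fraction($c, 9), "\n";
print fraction($d, 9), "\n";
print "same after rounding\n" if fraction($c, 9) eq fraction($d, 9);
print "nearly equal\n"
    if nearly_equal($c - int($c), $d - int($d), 1e-9);
```

Which tolerance or number of decimals is right depends on the data; 9 decimals and 1e-9 are only chosen to match the numbers in this post.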