Scrape and download PDF files from Google or Bing using PowerShell

Here is a simple script that you can run in PowerShell ISE. It could probably be written better, but it works. The script relies on the Google search engine's filetype: operator to find files of a specific type. The same approach should also work with the Bing search engine.
To make the script work, make sure the directory C:\temp\dwnld\ exists. You can also easily change the regular expression pattern and the keywords.
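
As a rough illustration of how such a filetype search URL could be put together for either engine, here is a minimal sketch. The function name Get-FileTypeSearchUrl and its parameters are hypothetical and not part of the script in this post, and the bare /search?q= URLs are a simplified assumption compared with the longer Google URL the script uses.

```powershell
# Hypothetical helper: builds a search URL that restricts results to one file type.
function Get-FileTypeSearchUrl {
    param(
        [Parameter(Mandatory)][string]$Keyword,
        [string]$FileType = 'pdf',
        [ValidateSet('google', 'bing')][string]$Engine = 'google'
    )
    # Escape the whole query so spaces and the colon in "filetype:" are URL-safe.
    $query = [uri]::EscapeDataString("$Keyword filetype:$FileType")
    switch ($Engine) {
        'google' { "https://www.google.com/search?q=$query" }
        'bing'   { "https://www.bing.com/search?q=$query" }
    }
}

Get-FileTypeSearchUrl -Keyword 'powershell manual' -Engine bing
# -> https://www.bing.com/search?q=powershell%20manual%20filetype%3Apdf
```

Escaping the query in one place keeps multi-word keywords from breaking the URL.
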
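
If you would rather not create the download directory by hand, something along these lines could do it. This is only a sketch; the function name New-DirectoryIfMissing is made up for illustration and is not part of the script.

```powershell
# Hypothetical helper: creates the directory only when it does not exist yet.
function New-DirectoryIfMissing {
    param([Parameter(Mandatory)][string]$Path)
    if (-not (Test-Path -LiteralPath $Path -PathType Container)) {
        # -Force also creates missing parent directories; Out-Null hides the returned object.
        New-Item -ItemType Directory -Path $Path -Force | Out-Null
    }
}
```

Calling New-DirectoryIfMissing -Path "C:\temp\dwnld\" once before the script starts should be enough; calling it again on an existing directory does nothing.
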

Comments with modifications to the script are always welcome 😉

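# Keywords to search for; each one triggers a separate search query.
$keywords = @("manual", "microsoft", "powershell")
# Matches http or https links ending in .pdf (lazy match up to the first ".pdf").
$pattern = 'https?://(.*?)[.]pdf'
# Target directory; it must exist before the script runs.
$storageDir = "C:\temp\dwnld\"
$filetype = "pdf"
$rand = New-Object System.Random

$keywords | foreach {
    $urlToScrapeWithKeyword = "http://www.google.be/search?hl=nl&tbo=d&biw=1229&bih=677&output=search&sclient=psy-ab&q={0}+filetype%3A{1}&btnK=" -f $_, $filetype
    $urlToScrapeWithKeyword | Out-Default
    # Sort-Object -Unique drops duplicates even when they are not adjacent (Get-Unique needs sorted input).
    (Invoke-WebRequest -UseBasicParsing -Uri $urlToScrapeWithKeyword).Links | select -ExpandProperty href | Sort-Object -Unique | foreach {
        if ($_ -match $pattern) {
            $Matches[0] | Out-Default
            try {
                # -ErrorAction Stop turns transfer errors into terminating errors so the catch block runs.
                Start-BitsTransfer -Source $Matches[0] -Destination $storageDir -ErrorAction Stop
                "Download ok" | Out-Default
            } catch [exception] {
                "Download failed:" | Out-Default
                $_.Exception.Message | Out-Default
            }
            # Pause for a random 20 to 42 seconds between downloads.
            "Sleeping" | Out-Default
            Start-Sleep -Seconds ($rand.Next(20, 43))
        }
    }
}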
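
To see what the pattern match actually pulls out of a link, here is a small sketch you can run on its own. The function name Get-PdfUrlFromHref and the sample href are made up for illustration, and the default pattern is a variant of the script's $pattern that accepts both http and https.

```powershell
# Hypothetical helper: returns the first PDF URL found inside an href, or nothing.
function Get-PdfUrlFromHref {
    param(
        [Parameter(Mandatory)][string]$Href,
        [string]$Pattern = 'https?://(.*?)[.]pdf'
    )
    # -match fills the automatic $Matches table; index 0 holds the full match.
    if ($Href -match $Pattern) { $Matches[0] }
}

# Made-up href, shaped like a search-result redirect link.
Get-PdfUrlFromHref -Href '/url?q=http://example.com/docs/manual.pdf&sa=U'
# -> http://example.com/docs/manual.pdf
```

Because the match is lazy, it stops at the first ".pdf" it finds, so trailing query parameters are cut off. Percent-encoded characters inside the link are left as they are.
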

Enjoy 😉
Don’t forget, web scraping can be illegal! Use it with care!
