Scrape Alexa rank with PHP

This tutorial shows how to scrape the Alexa rank of any website. I will use two different methods
for scraping the rank. The first one uses regular expressions and the second one uses phpQuery, a
PHP library modeled on jQuery.
Parse Alexa rank using regular expressions
The first thing to do is to inspect the page source and find the HTML code where the rank is
displayed:
<a href="/siteowners/certify?ax_atid=b2383bed-44cb-4edb-b1c5-dda06e67bfeb&site=newexception.com">84,685</a>
From this chunk of HTML we can construct the following regular expression:
<a href\=\"\/siteowners\/certify\?[^>]+>([0-9\,]+)<\/a>
Where:
'[^>]+' - matches one or more characters that are not '>', i.e. the rest of the opening tag
'([0-9\,]+)' - matches and captures one or more digits and commas (the rank itself)
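To see the pattern at work before wiring it into a function, here is a minimal sketch that runs it against the sample link shown above. The variable names ($html, $rank) are only illustrative, and casting the result to an integer is my own choice here.

```php
<?php
// Sample markup copied from the snippet above.
$html = '<a href="/siteowners/certify?ax_atid=b2383bed-44cb-4edb-b1c5-dda06e67bfeb&site=newexception.com">84,685</a>';

// The regular expression from above, wrapped in "/" delimiters with the "i" flag.
$regex = '/<a href\="\/siteowners\/certify\?[^>]+>([0-9\,]+)<\/a>/i';

$rank = null;
if (preg_match($regex, $html, $match)) {
    // $match[0] is the whole link, $match[1] is the captured group ("84,685").
    $rank = (int) str_replace(',', '', $match[1]);
}

var_dump($rank); // int(84685)
```

The commas are removed before the cast because (int) "84,685" would stop at the first comma and give 84.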
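<?php
function alexa_rank($domain){
    // Download the Alexa page for the domain.
    $data = file_get_contents( "http://www.alexa.com/siteinfo/" . $domain );
    if( $data === false ){
        return false;
    }
    // Ranked site: capture the number and strip the thousands separators.
    $regex = "/<a href\=\"\/siteowners\/certify\?[^>]+>([0-9\,]+)<\/a>/i";
    if( preg_match( $regex, $data, $match ) ){
        return str_replace( ",", "", $match[1] );
    }
    // Unranked site: the link contains a dash inside a span instead of a number.
    $regex = "/<a href\=\"\/siteowners\/certify\?[^>]+><span[^>]+>\-<\/span><\/a>/i";
    if( preg_match( $regex, $data ) ){
        return 0;
    }
    // Neither pattern matched, so the page layout has probably changed.
    return false;
}
?>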
The second regular expression detects the case where Alexa has no rank for the specified domain. This
lets us tell an unranked site (the function returns 0) apart from a parser that no longer works (the
function returns false).
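Since 0 and false both evaluate as false in a loose comparison, the caller has to check the result with the strict === operator. The sketch below shows one way to do that. describe_rank() is a hypothetical helper that I made up for illustration, and the values passed to it stand in for what alexa_rank() could return.

```php
<?php
// Hypothetical helper: turns a result of alexa_rank() into a readable message.
function describe_rank($result) {
    if ($result === false) {
        // Download failed or neither regular expression matched.
        return "request or parsing failed";
    }
    if ($result === 0) {
        // The page was parsed, but it shows no rank for the domain.
        return "no rank available";
    }
    return "rank: " . $result;
}

echo describe_rank("84685"), "\n"; // rank: 84685
echo describe_rank(0), "\n";       // no rank available
echo describe_rank(false), "\n";   // request or parsing failed
```

A plain if( !$result ) check would not be able to distinguish the last two cases.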

Parse Alexa rank using the phpQuery library
As the documentation says: phpQuery is a CSS3 selector driven Document Object Model API based on
jQuery JavaScript Library. So if you are good with jQuery, you will find this very handy.
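<?php
include('phpQuery.php');
function alexa_rank($domain){
    $data = file_get_contents("http://www.alexa.com/siteinfo/" . $domain);
    if( $data === false ){
        return false;
    }
    // Build a DOM document from the downloaded HTML.
    $doc = phpQuery::newDocument($data);
    // Return (instead of echoing) the text of the link inside the 'metricsUrl' element.
    return trim( pq('.metricsUrl a')->text() );
}
?>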
Because I couldn't find a unique identifier for the 'a' tag, I used the 'metricsUrl' class name, which
belongs to the enclosing 'span' tag.
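To make it clearer what the '.metricsUrl a' selector matches, here is a small sketch that runs the same lookup with PHP's built-in DOM extension instead of phpQuery. This is a swapped-in technique, not the phpQuery method itself, and the sample markup is an assumption based on the description above (a 'span' with the 'metricsUrl' class wrapping the link).

```php
<?php
// Assumed sample markup: a span with class "metricsUrl" wrapping the rank link.
$html = '<div><span class="metricsUrl"><a href="/siteowners/certify?site=example.com">84,685</a></span></div>';

$doc = new DOMDocument();
// The @ hides the warnings that real-world, non-well-formed HTML usually triggers.
@$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// XPath counterpart of the CSS selector ".metricsUrl a":
// any "a" element that is a descendant of an element whose class list contains "metricsUrl".
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' metricsUrl ')]//a";
$nodes = $xpath->query($query);

// Take the text of the first match, or null when nothing matched.
$rank = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;

var_dump($rank); // string(6) "84,685"
```

As with the regular expression version, the commas still have to be stripped if the rank is needed as a number.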
Conclusion
The phpQuery method is considerably slower than the regular expression, because it has to parse the
whole page and load the DOM into memory. Another way to scrape and parse the Alexa rank is to use the
'PHP Simple HTML DOM Parser'.
Source: http://newexception.com/alexa-rank-scraper