Programmatically looking up PageRank

In a recent post I talked about PageRank. If you’ve ever tried to look up the PageRank (PR) of a site programmatically, you’ve probably found a dozen sites with code to do it. That code might work for you, and it might not. If it does, great! If it doesn’t, good luck figuring out what’s actually supposed to happen. I was recently curious about the PR of a site that linked to me here at Moremoo, and I figured I’d look it up directly instead of installing the Google Toolbar. Instead of the response I expected, I got this:

403. That’s an error.

Your client does not have permission to get URL /tbr?features=Rank&client=navclient-auto-ff&ch=XXXXXXXXX&q=info:XXXXXXXXXX%2F' from this server. (Client IP address: X.X.X.X)

Please see Google’s Terms of Service posted at http://www.google.com/terms_of_service.html

If you believe that you have received this response in error, please report your problem. However, please make sure to take a look at our Terms of Service (http://www.google.com/terms_of_service.html). In your email, please send us the entire code displayed below. Please also send us any information you may know about how you are performing your Google searches– for example, “I’m using the Opera browser on Linux to do searches from home. My Internet access is through a dial-up account I have with the FooCorp ISP.” or “I’m using the Konqueror browser on Linux to search from my job at myFoo.com. My machine’s IP address is 10.20.30.40, but all of myFoo’s web traffic goes through some kind of proxy server whose IP address is 10.11.12.13.” (If you don’t know any information like this, that’s OK. But this kind of information can help us track down problems, so please tell us what you can.)

We will use all this information to diagnose the problem, and we’ll hopefully have you back up and searching with Google again quickly!

Please note that although we read all the email we receive, we are not always able to send a personal response to each and every email. So don’t despair if you don’t hear back from us!

Also note that if you do not send us the entire code below, we will not be able to help you.

Best wishes,
The Google Team

Not too helpful.  I looked at the code, and it was entirely undocumented.  Clearly it had worked for someone before, but it wasn’t working for me.  To make a long story short, it turns out the code had been written on a 32-bit version of Perl and the hashing code had never been tried on a 64-bit machine.  In hopes that this saves someone a few hours of their own lives, here’s the toolbar API and how exactly you’re supposed to calculate the hash.

To calculate the hash, you:

  1. load the hash with its initial value (0x01020345)
  2. for each byte of the URL
    1. XOR the nth byte of the URL with the (n mod len)th byte of a static string ("Mining PageRank is AGAINST GOOGLE'S TERMS OF SERVICE. Yes, I'm talking to you, scammer."), where len is the length of that string
    2. XOR that result into the hash
    3. rotate the bits of the 32-bit hash left by 9 bits
  3. prepend "8" to the hexadecimal representation of the hash (zero-padded to 8 digits)
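The steps above can be sketched as a small pure function, separate from any network call. This is only an illustration: the name toolbar_checksum is made up for this sketch, and masking to 32 bits on every rotation is one way of keeping the hash at 32 bits regardless of the integer size of the Perl running it.

```perl
use strict;
use warnings;

# Static string from the steps above (plain ASCII apostrophes).
my $SEED = "Mining PageRank is AGAINST GOOGLE'S TERMS OF SERVICE. "
         . "Yes, I'm talking to you, scammer.";

# toolbar_checksum is a hypothetical name used only in this sketch.
sub toolbar_checksum {
    my ($url) = @_;
    my @seed = unpack 'C*', $SEED;   # byte values of the static string
    my @url  = unpack 'C*', $url;    # byte values of the URL
    my $hash = 0x01020345;           # step 1: initial value
    for my $i (0 .. $#url) {
        # steps 2.1 and 2.2: XOR the URL byte and the seed byte into the hash
        $hash ^= $url[$i] ^ $seed[$i % @seed];
        # step 2.3: rotate left by 9 bits, keeping only the low 32 bits
        $hash = (($hash << 9) | ($hash >> 23)) & 0xffffffff;
    }
    return sprintf '8%08x', $hash;   # step 3: "8" plus 8 hex digits
}

# With an empty URL the loop never runs, so only the initial value is printed.
print toolbar_checksum(''), "\n";    # prints 801020345
```

Keeping the hash in its own function makes it easy to compare results between a 32-bit and a 64-bit machine without sending any requests.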

In the implementation that I found, the author assumed that it would always be on a 32-bit machine, so they rotated the bits like this:

result = ((result >> 23) & 0x1ff) | result << 9;

That’s fine on a 32-bit machine because result << 9 overflows and simply discards the higher bits. On a 64-bit machine, however, they stick around. To get the right answer, you need to mask the result with 0xffffffff so that only the bottom 32 bits are kept. Here’s some working Perl code to do this:

 

use strict;
use warnings;
use LWP::UserAgent;
use URI;

# Look up the toolbar PageRank for $url; returns undef if the request fails.
sub pagerank {
   my ($url) = @_;
   # Byte values of the static string and of the URL
   my @seed = map { ord($_) } split //, "Mining PageRank is AGAINST GOOGLE'S TERMS OF SERVICE. Yes, I'm talking to you, scammer.";
   my @url = map { ord($_) } split //, $url;
   my $result = 0x01020345;
   foreach my $i (0 .. $#url) {
      my $seed_char = $seed[$i % ($#seed + 1)];
      my $url_char = $url[$i];
      $result ^= $seed_char ^ $url_char;
      # Rotate left by 9 bits, keeping only the low 32 bits so that
      # 32-bit and 64-bit Perls produce the same hash
      $result = ((($result >> 23) & 0x1ff) | ($result << 9)) & 0xffffffff;
   }
   my $ua = LWP::UserAgent->new;
   my $uri = URI->new("http://toolbarqueries.google.com/tbr");
   $uri->query_form(
      client => "navclient-auto",
      ch => sprintf("8%08x",$result),
      features => "Rank",
      q => "info:$url",
   );
   my $response = $ua->get($uri);
   if ($response->is_success) {
      # The rank is the third colon-separated field of the response
      my $content = $response->content;
      $content =~ s/\s+\z//;
      my @results = split /:/, $content;
      return $results[2];
   }
   else {
      return;
   }
}
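To see in isolation why the mask matters, here is a small sketch that rotates a single value both ways. The function names rotl9_naive and rotl9_masked are made up for this example; the first mirrors the unmasked rotation from the 32-bit-only code, the second truncates to 32 bits.

```perl
use strict;
use warnings;
use Config;

# Rotation as written for a 32-bit machine: nothing trims the shifted value.
sub rotl9_naive {
    my ($x) = @_;
    return (($x >> 23) & 0x1ff) | ($x << 9);
}

# Same rotation, truncated to the low 32 bits.
sub rotl9_masked {
    my ($x) = @_;
    return ((($x >> 23) & 0x1ff) | ($x << 9)) & 0xffffffff;
}

my $x = 0x80000000;   # only the top bit of a 32-bit word is set
printf "integer size: %d bytes\n", $Config{ivsize};
# On a Perl with 8-byte integers the naive version keeps the bit that was
# shifted past bit 31 and prints 10000000100; with 4-byte integers it prints 100.
printf "naive:  %x\n", rotl9_naive($x);
printf "masked: %x\n", rotl9_masked($x);   # prints 100 either way
```

Running this on both kinds of machine is a quick way to confirm which integer size your Perl was built with and whether a rotation found in the wild depends on it.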

 
