## October 23, 2011

Filed under: data analysis,mathematics,rants — Peter Saveliev @ 12:10 pm

To experiment with ideas like this I’ve created this little spreadsheet. The formula comes from the original Page/Brin paper from 1998.

Suppose we have these ten “pages”: A, B, …, J. They are listed in the table as both rows and columns. The links are the entries in the table.

First we suppose that they link to each other consecutively: A → B → C → … → J.

The PageRank scores, as well as the positions, of the pages are computed and displayed on the right. The scores for A, …, J are increasing in the obvious way: from 0 to .1. In particular, F is #5.

Next, suppose F adds a link to A: F → A. You can see that 1 has appeared in the first column of the table:

Now, suddenly F ranks first!

The result certainly goes against our intuition. It shouldn’t be surprising though considering other problems with the math of PageRank.

This isn’t a new discovery of course. These are just a couple of references for you:

1. The effect of new links on Google PageRank by Konstantin Avrachenkov and Nelly Litvak, Stochastic Models, 22:319–331, 2006.

2. Maximizing PageRank via outlinks by Cristobald de Kerchove, Laure Ninove, and Paul van Dooren, Linear Algebra and its Applications 429 (2008) 1254–1276.

The conclusion is that the appearance of a cycle is what seems to be the reason for this behavior…

You won’t get a straight answer from Google on this. When asked to explain what PageRank is, they say it’s hardly a secret and refer you to the 1998 paper (or rehash its contents like here). However, if you criticize those ideas, the answer is: it’s irrelevant because PageRank isn’t that anymore, it is now composed of 200 various “signals”. So, what’s PageRank anyway? Glad you asked, it’s all explained in that paper from 1998…

Another cycle.