PageRank as a probability distribution
From Intelligent Perception
Briefly, it's not and the probabilistic interpretation of PageRank is invalid.
According to Brin/Page paper from 1998, “The probability that the random surfer visits a page is its PageRank.” And “We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1... Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.”
When I tried to implement the PageRank algorithm, I quickly discovered that it should be $(1-d)/n$ instead of $(1-d)$. That's a typo, fine. But, after fixing this, I wasted half a day trying to figure out what's wrong with my code because the ranks still wouldn't add up to $1$. Turns out, it was all a lie: they don't have to!
If there is only one page, its PageRank has to be $1-d$.
If there are two pages with $T$ linked to $A$ and nothing else, then we can assign: $T=0$ and $A=(1-d)/2$. They don't have to add up to $1$!
Best one can say is this. If one uses the above formula for iteration and starts with a distribution, then yes, you will get a distribution after every iteration... But only if there are no link-less pages!
To get out of this trouble, there have been some tweaks suggested. Hard to tell which one is the sillier:
- normalizing, i.e., dividing by the sum of all PageRanks, and
- assuming that no links means that the page links to every page.
That's what happens when you try to make the data fit the model!
Even though this problem certainly isn't a new discovery, the myth clearly lives on: Wikipedia says “PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page.” It then proceeds to give an example in the next section that isn't a distribution.
The lesson: bad math simply doesn't deserve mathematical analysis. Once it's identified as such, you know it's a dead end!
For a more reasonable interpretation see PageRank as a flow.