July 5, 2011

PageRank is bad math: discussion

Filed under: data analysis,mathematics,reviews — Peter Saveliev @ 10:48 pm

My previous post on the subject was “PageRank is an abomination (mathematically)”. My thesis was:

PageRank, as described by Google, is bad math.

Why? It appears that initially they made an arbitrary choice of the damping constant even though the choice affects the rankings.

The parameter is made-up and hidden from the user.

What has happened since, I don’t know. It’s a secret. I certainly haven’t heard an alternative story so far. And I can rely only on how it is described by Google – in the original papers and on their site. Below is a summary of the reaction to my post.

Re-reading the discussion, I can see now how many times I was distracted from my thesis by the “Google search works great” argument. I should’ve just replied: “PageRank is bad math. Please comment.” I admit though, I went beyond my main thesis and conjectured that PageRank’s problems are the cause of the problems of the whole search algorithm. I’ll try to make a case for that in a future post.

The response at Reddit was mostly along the lines of: “The article is terrible, because Google search works great.” A surprising reaction to my thesis. Digging a bit deeper I realized that these people:

  1. love Google;
  2. assume that Google search == PageRank (or think that I do);
  3. think that “bad math” means 1+1=3.

The crowd is “technical”, I suppose. So, #1 is understandable, but #2 is not, while #3 is very, very common.

The praise for Google was very homogeneous:

“PR-based algorithms seem to work (very) well in the real world”; “yield very good search results”; “Google’s search algorithm works amazingly well”; “Google’s algorithm works extremely well”; “The page rank algorithm is actually extremely impressive, and obviously works well.”

What I see is that this attitude has shown cracks recently. There have been public complaints about Google search results: the JCPenney story, content farms, scrapers, Google’s own properties ranked above others, etc. There are also recent complaints from site owners about the Panda update. And the fact that Google dominates other search engines, which are also based on the PageRank idea, doesn’t make it good math.

There were other arguments against my thesis.

“[H]is main concern is over the fact that Google arbitrarily picked a constant for the decay factor, but that’s not actually a bad thing, and is done all the time in mathematical modeling… in no way upsets me as a mathematician.”

Reddit is anonymous by default, so I don’t know… A mathematician would ask about the effects of such a choice and try to prove (or disprove) that there aren’t any. Unfortunately, the choice does affect the order of pages that you get, as the referenced paper indicated. As for the “all the time” comment, so what? If you do that, to me it means that you can’t come up with a better model (I think there is a way).
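To make the dependence concrete, here is a minimal sketch in Python. Everything in it is my own illustration: the eight-page graph TOY_WEB and the function names are made up, it is not the graph from the referenced paper, and the recurrence is the commonly cited normalized form of PageRank (with pages that have no outgoing links treated as linking to every page), not necessarily what Google runs.

```python
def pagerank(links, d, iterations=200):
    """Power iteration for PR(p) = (1 - d)/n + d * sum(PR(q) / out(q)).

    links maps each page to the list of pages it links to;
    d is the damping factor under discussion.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) / n for p in pages}
        for p in pages:
            # Assumption: a page with no out-links spreads its rank over all pages.
            targets = links[p] or pages
            share = d * rank[p] / len(targets)
            for q in targets:
                new_rank[q] += share
        rank = new_rank
    return rank


def top(rank, k=3):
    """Return the k highest-ranked pages, best first."""
    return sorted(rank, key=rank.get, reverse=True)[:k]


# Hypothetical toy web: two pages link to A, three link to C, and C links to B.
TOY_WEB = {
    "A": [], "B": [], "C": ["B"],
    "m1": ["A"], "m2": ["A"],
    "l1": ["C"], "l2": ["C"], "l3": ["C"],
}

for d in (0.2, 0.5, 0.85):
    print(d, top(pagerank(TOY_WEB, d)))
# 0.2 ['C', 'A', 'B']
# 0.5 ['C', 'B', 'A']
# 0.85 ['B', 'C', 'A']
```

For this particular graph the scores of A, B and C are proportional to 1 + 2d, 1 + d + 3d² and 1 + 3d, so A and B swap places at d = 1/3 and B overtakes C at d = 2/3. Same graph, same links, three different top-three lists.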

“[T]he graph used in the paper [that shows dependence of the rankings on the damping factor] is fairly contrived, and we’re mostly interested in what the algorithm does on real-world data.”

A fair point, but one still has to ask for some evidence. Is there a case study where a large number of websites are analyzed and it is shown that changing the damping factor DOES NOT significantly affect the ranking of these sites? Well, at least this person didn’t claim to be a mathematician… I have to add then that even such a study wouldn’t resolve the bad math issue. What would? You’d need to

  1. define a class of graphs, let’s call them non-contrived;
  2. prove that the PageRank of a non-contrived graph is independent of the damping factor (or at least the top results aren’t);
  3. provide evidence that the Internet, now and in the future, is non-contrived.
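As for the case study itself, here is a rough sketch of how one might measure whether two rankings "significantly" differ. The two functions and the sample scores are hypothetical, written only for illustration: one computes the Kendall rank correlation (+1 for identical order, -1 for reversed order), the other the share of pages common to both top-k lists. A measurement like this is evidence at best, not a proof.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score assignments over the same pages."""
    pages = sorted(scores_a)
    concordant = discordant = 0
    for p, q in combinations(pages, 2):
        sign = (scores_a[p] - scores_a[q]) * (scores_b[p] - scores_b[q])
        if sign > 0:
            concordant += 1      # both rankings order p and q the same way
        elif sign < 0:
            discordant += 1      # the rankings disagree on p versus q
    pairs = len(pages) * (len(pages) - 1) // 2
    return (concordant - discordant) / pairs


def top_k_overlap(scores_a, scores_b, k):
    """Fraction of pages shared by the two top-k lists."""
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    return len(top_a & top_b) / k


# Made-up scores for four pages under two damping factors (illustration only).
low_d = {"A": 1.4, "B": 1.32, "C": 1.6, "D": 1.0}
high_d = {"A": 2.7, "B": 4.0175, "C": 3.55, "D": 1.0}

print(kendall_tau(low_d, high_d))       # 4 concordant, 2 discordant pairs out of 6
print(top_k_overlap(low_d, high_d, 2))  # 0.5
```

Where to draw the line for "significantly" (which correlation, which k) is itself a choice somebody has to make.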

“Pagerank is a starting point; it provides a rough sketch of page importance which is fine tuned by other more specific algorithms”.

Let’s consider this analogy: π = 3 is bad math, but it’s “a very good starting point” for solving many real-life problems. For example, you can build a hut, no problem. But what if you want to do something more sophisticated like building an airplane? With π = 3 your plane will drop like a brick. And this will keep happening, no matter how much you fine-tune your engineering. Suppose now that you replace π = 3 with π = 3.14159265358979. OK, you’ve replaced bad math with better math, or maybe even good enough math (you can build your plane now). But π = 3.14159265358979 is a time bomb! Sooner or later it will fail you when it’s not accurate enough anymore. Sooner or later you will need to understand what π is. Sooner or later you will need good math… (Is this what’s happened to Google?)

More to come

This stuff will end up in the main site under PageRank and, yes, Bad math.
