PageRank

From Wikimization

Revision as of 07:10, 10 August 2010 by Dgleich (Talk | contribs)
Jump to: navigation, search

PageRank is a system to associate an importance score to each page on the web. It was developed by Sergey Brin, Larry Page, Rajeev Motwoni, and Terry Winograd in the late 1990s. The fundamental idea behind the score is that important pages are those where we expect to find web surfers. With this idea, the two key ingredients to PageRank are:

i) A directed graph abstraction of the web; and
ii) A random surfer model of user behavior.

Idea i) implies that PageRank views each page on the web as a node in a directed graph. A link on a particular page corresponds to a directed edge between the nodes. See the illustration below for an example. This idea was not new to PageRank. Other ranking systems such as Kleinberg's HITS also used a directed graph abstraction of the web.

Idea ii) is the key insight to PageRank. Recall the fundamental idea: important pages are those where we expect to find web surfers. PageRank proposes the following model for the behavior of web surfers:

a) at a page, a surfer follows a link to another page uniformly at random
or
b) at a page, a surfer does something else.

Case a) is pretty straightforward. The surfer knows the current page, and merely must make a random choice for an outlink. In case b), we have no information on what the person actually does, and so we assume they randomly find a new page on the web to begin surfing again. Because the surfer behaves in a manner uncoupled with the connectivity of the web, we call this behavior ``jumping or ``teleporting. Also, variants of PageRank modify the behavior in step b) to teleport to a small set of specific nodes on the web. These are usually topic specific pages, for example only sports pages for a sports surfer, or personally interesting pages, for example your bookmarks.

To decide if the surfer follows action a) or b), we toss a weighted coin. Thus, we assign probability LaTeX: \alpha to action a) and probability LaTeX: (1-\alpha) to action b).

At this point, we have a mathematical model for surfer behavior and we can describe aggregate statistics about the behavior of the model. For convenience, we assign an index to each page on the web. The index is a simple label so we can write our equations without always writing out full URLs. To be concrete, instead of writing ``the probability that the surfer is on slashdot.org we assign slashdot.org an arbitrary but unique number LaTeX: i, and then we call LaTeX: x_i ``the probability that the surfer is on slashdot.org. In general, let LaTeX: x_i be the probability of finding the surfer at page LaTeX: i for each page on the web. Then the probability of finding the surfer at node LaTeX: i after one step of surfing is
LaTeX: \displaystyle
y_i = \sum_{\text{page } j \text{ linking to } i } \alpha*x_j/d_j + (1-\alpha)*v_i.
In this expression, LaTeX: d_j is the number of links from page LaTeX: j and LaTeX: v_i is the probability of a random jump to page LaTeX: i. We generally set LaTeX: v_i = 1/n where LaTeX: n is the number of pages on the web, though other choices are possible -- lookup personalized PageRank. Let's continue by running through the components. To get to page LaTeX: i, the surfer either had to follow a link to page LaTeX: i or take a random jump to page LaTeX: i. To get the probability of following a link to page LaTeX: i, we have to see what pages link to page LaTeX: i. In the above equation, this set of pages is represented by the sum of LaTeX: j. For each of these pages, the surfer will arrive at page LaTeX: i with probability LaTeX: x_j/d_j. But then those choices only occur with probability LaTeX: \alpha, thus LaTeX: \alpha x_j/d_j. Alternatively, the surfer could have taken a random jump with probability LaTeX: 1-\alpha, and picked page LaTeX: i with probability LaTeX: v_i.

Now, we just keep running this same equation for all pages LaTeX: i in some sequence until the probabilities of where we find the surfer stop changing. And stop changing they do! Let's show this result by diving into a matrix formulation of PageRank. We are going to look for a fixed-point of the above equation.

Let LaTeX: P be a matrix where the LaTeX: i,jth entry is the probability of moving from page LaTeX: j to page LaTeX: i. (We now need to assume that the page indices LaTeX: i run from 1 to the number of pages on the web, LaTeX: n.) Formally,
LaTeX: \displaystyle
P_{i,j}
</p>
<pre>= \begin{cases}
   \frac{1}{d_j} & \text{ if page } i \text{ links to page } j \\
   0             & \text{ otherwise. } \\
  \end{cases}
</pre>
<p>
Consider the matrix equation
LaTeX: \displaystyle
y = \alpha Px + (1-\alpha) v.
The LaTeX: ith row of this equation is precisely the update for PageRank we described above. In this case, repeatedly applying the formula corresponds to the algorithm:

  1. Set LaTeX: x to be any probability vector.
  2. Compute LaTeX: y = \alpha Px + (1-\alpha) v.
  3. Set LaTeX: x to the probabilities computed in LaTeX: y,

i.e. just copy everything in LaTeX: y to LaTeX: x)

  1. Go back to 2 and repeat.

All we have left to do is show that this iteration converges. If the above algorithm converges, then the probabilities in LaTeX: x and LaTeX: y will be the same. Consequently, we need
LaTeX: \displaystyle
</p>
<pre>x = \alpha Px + (1-\alpha) v.
</pre>
<p>
This is really the linear system:
LaTeX: \displaystyle
</p>
<pre>(I - \alpha P) x = (1-\alpha) v. 
</pre>
<p>
Before we can address convergence, we need to establish whether LaTeX: x is unique. Every column of the matrix LaTeX: P has entries that sum to 1 or 0. Assuming that no page links to itself, then LaTeX: (I - \alpha P) is diagonally dominant when LaTeX: \alpha < 1, and thus, LaTeX: x is unique. Does the iteration find a unique LaTeX: x? Yes. Let LaTeX: x^\ast be the solution and let LaTeX: x and LaTeX: y be adjacency iterations. Then
LaTeX: \displaystyle
x^\ast - y
</p>
<pre>= x^\ast - \alpha P x - (1-\alpha) v
= \alpha P x^\ast + (1-\alpha) v - \alpha P x - (1-\alpha) v
= \alpha P (x^\ast - x).
</pre>
<p>
Now, if we take the 1-norm of both sides of this equation:
LaTeX: \displaystyle
</p>
<pre>\|x^\ast - y \|_1
 = \alpha \| P(x^\ast - x) \|_1
 \le \alpha \| P \|_1 \|(x^\ast - x) \|_1
 \le \alpha \|(x^\ast - x) \|_1.
</pre>
<p> </br> Thus, we reduce the norm to the exact solution by a factor of LaTeX: \alpha at each iteration. This means the iteration converges.

Dgleich 07:10, 10 August 2010 (PDT)

Personal tools