# PageRank


PageRank is a system that assigns an importance score to each page on the web. It was developed by Sergey Brin, Larry Page, Rajeev Motwani, and Terry Winograd in the late 1990s. The fundamental idea behind the score is that important pages are those where we expect to find web surfers. With this idea, the two key ingredients to PageRank are:

i) A directed graph abstraction of the web; and
ii) A random surfer model of user behavior.

Idea i) implies that PageRank views each page on the web as a node in a directed graph. A link on a particular page corresponds to a directed edge between the nodes. See the illustration below for an example. This idea was not new to PageRank. Other ranking systems such as Kleinberg's HITS also used a directed graph abstraction of the web.

Idea ii) is the key insight to PageRank. Recall the fundamental idea: important pages are those where we expect to find web surfers. PageRank proposes the following model for the behavior of web surfers:

a) at a page, a surfer follows a link to another page uniformly at random
or
b) at a page, a surfer does something else.

Case a) is straightforward: the surfer knows the current page and simply picks one of its outlinks uniformly at random. In case b), we have no information about what the person actually does, so we assume they randomly find a new page on the web and begin surfing again. Because this behavior is independent of the connectivity of the web, we call it jumping or teleporting. Variants of PageRank modify the behavior in step b) to teleport to a small set of specific pages: usually topic-specific pages (for example, only sports pages for a sports surfer) or personally interesting pages (for example, your bookmarks).

To decide if the surfer follows action a) or b), we toss a weighted coin. Thus, we assign probability $LaTeX: \alpha$ to action a) and probability $LaTeX: (1-\alpha)$ to action b).

At this point, we have a mathematical model for surfer behavior and we can describe aggregate statistics about the behavior of the model. For convenience, we assign an index to each page on the web. The index is a simple label so we can write our equations without always writing out full URLs. To be concrete, instead of writing the probability that the surfer is on slashdot.org we assign slashdot.org an arbitrary but unique number $LaTeX: i$, and then we call $LaTeX: x_i$ the probability that the surfer is on slashdot.org. In general, let $LaTeX: x_i$ be the probability of finding the surfer at page $LaTeX: i$ for each page on the web. Then the probability of finding the surfer at node $LaTeX: i$ after one step of surfing is
$LaTeX: \displaystyle y_i = \sum_{\text{page } j \text{ linking to } i} \alpha x_j/d_j + (1-\alpha) v_i.$
In this expression, $LaTeX: d_j$ is the number of links from page $LaTeX: j$ and $LaTeX: v_i$ is the probability of a random jump to page $LaTeX: i$. We generally set $LaTeX: v_i = 1/n$, where $LaTeX: n$ is the number of pages on the web, though other choices are possible -- look up personalized PageRank. Let's continue by running through the components. To get to page $LaTeX: i$, the surfer either had to follow a link to page $LaTeX: i$ or take a random jump to page $LaTeX: i$. To get the probability of following a link to page $LaTeX: i$, we have to see which pages link to page $LaTeX: i$. In the above equation, this set of pages is represented by the sum over $LaTeX: j$. From each such page $LaTeX: j$, the surfer arrives at page $LaTeX: i$ with probability $LaTeX: x_j/d_j$. But following a link only occurs with probability $LaTeX: \alpha$, giving $LaTeX: \alpha x_j/d_j$. Alternatively, the surfer could have taken a random jump with probability $LaTeX: 1-\alpha$ and picked page $LaTeX: i$ with probability $LaTeX: v_i$.
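One step of this update is easy to sketch in code. The tiny three-page "web", the variable names, and the value of $LaTeX: \alpha$ below are my own illustrative choices, not from the article:

```python
# One step of the surfer update on a toy 3-page web (hypothetical example).
# out_links[j] lists the pages that page j links to; d_j = len(out_links[j]).
alpha = 0.85                             # probability of following a link
out_links = {0: [1, 2], 1: [2], 2: [0]}  # page 0 -> 1,2; page 1 -> 2; page 2 -> 0
n = 3
v = [1.0 / n] * n                        # uniform teleportation vector
x = [1.0 / n] * n                        # current surfer distribution

# y_i = sum over pages j linking to i of alpha * x_j / d_j, plus (1-alpha) * v_i
y = [(1 - alpha) * v[i] for i in range(n)]
for j, links in out_links.items():
    d_j = len(links)
    for i in links:
        y[i] += alpha * x[j] / d_j

print(y)  # still a probability distribution: the entries sum to 1
```

Note that the update conserves probability: the $LaTeX: \alpha$-weighted link-following terms and the $LaTeX: (1-\alpha)$-weighted jump terms together always sum to 1.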

Now, we just keep running this same equation for all pages $LaTeX: i$ in some sequence until the probabilities of where we find the surfer stop changing. And stop changing they do! Let's show this result by diving into a matrix formulation of PageRank. We are going to look for a fixed-point of the above equation.

Let $LaTeX: P$ be a matrix where the $LaTeX: i,j$th entry is the probability of moving from page $LaTeX: j$ to page $LaTeX: i$. (We now need to assume that the page indices $LaTeX: i$ run from 1 to the number of pages on the web, $LaTeX: n$.) Formally,
$LaTeX: \displaystyle P_{i,j} = \begin{cases} \frac{1}{d_j} & \text{if page } j \text{ links to page } i \\ 0 & \text{otherwise.} \end{cases}$
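Building this matrix from link data is mechanical. The link dictionary below is the same hypothetical toy graph used throughout these sketches, not data from the article:

```python
# Construct the column-stochastic matrix P for a toy link graph (hypothetical).
# P[i][j] = 1/d_j if page j links to page i, else 0.
out_links = {0: [1, 2], 1: [2], 2: [0]}
n = 3
P = [[0.0] * n for _ in range(n)]
for j, links in out_links.items():
    for i in links:
        P[i][j] = 1.0 / len(links)

# Every page in this toy graph has at least one outlink, so each column sums to 1.
col_sums = [sum(P[i][j] for i in range(n)) for j in range(n)]
print(col_sums)  # [1.0, 1.0, 1.0]
```

A column of zeros would correspond to a dangling page with no outlinks, which is why the article says columns sum to 1 or 0.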

Consider the matrix equation
$LaTeX: \displaystyle y = \alpha Px + (1-\alpha) v.$
The $LaTeX: i$th row of this equation is precisely the update for PageRank we described above. In this case, repeatedly applying the formula corresponds to the algorithm:

1. Set $LaTeX: x$ to be any probability vector.
2. Compute $LaTeX: y = \alpha Px + (1-\alpha) v$.
3. Set $LaTeX: x$ to the probabilities computed in $LaTeX: y$ (i.e., just copy everything in $LaTeX: y$ to $LaTeX: x$).
4. Go back to 2 and repeat.
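The steps above can be sketched as a loop, using the same toy graph and matrix-free update as before (the graph, tolerance, and iteration cap are my own illustrative choices):

```python
# Power-iteration sketch of the algorithm above on a hypothetical 3-page web.
alpha = 0.85
out_links = {0: [1, 2], 1: [2], 2: [0]}
n = 3
v = [1.0 / n] * n

x = [1.0 / n] * n                  # step 1: any probability vector
for _ in range(1000):              # steps 2-4: repeat until x stops changing
    y = [(1 - alpha) * v[i] for i in range(n)]
    for j, links in out_links.items():
        for i in links:
            y[i] += alpha * x[j] / len(links)
    if max(abs(y[i] - x[i]) for i in range(n)) < 1e-12:
        break
    x = y                          # step 3: copy y into x

print(x)  # the PageRank vector; x[2] comes out largest in this toy graph
```

Page 2 scores highest here because both of the other pages link to it, directly illustrating the "important pages are where surfers end up" idea.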

All we have left to do is show that this iteration converges. If the above algorithm converges, then the probabilities in $LaTeX: x$ and $LaTeX: y$ will be the same. Consequently, we need
$LaTeX: \displaystyle x = \alpha Px + (1-\alpha) v.$

This is really the linear system:
$LaTeX: \displaystyle (I - \alpha P) x = (1-\alpha) v.$
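We can check numerically that the converged iterate satisfies this linear system. The graph and iteration count below are my own illustrative choices:

```python
# Verify (I - alpha*P) x = (1 - alpha) v at the fixed point, on a toy graph
# (hypothetical example, not data from the article).
alpha = 0.85
out_links = {0: [1, 2], 1: [2], 2: [0]}
n = 3
v = [1.0 / n] * n

P = [[0.0] * n for _ in range(n)]
for j, links in out_links.items():
    for i in links:
        P[i][j] = 1.0 / len(links)

# iterate x <- alpha*P*x + (1-alpha)*v until well converged
x = [1.0 / n] * n
for _ in range(500):
    x = [alpha * sum(P[i][j] * x[j] for j in range(n)) + (1 - alpha) * v[i]
         for i in range(n)]

# the residual (I - alpha*P) x - (1 - alpha) v should be essentially zero
residual = [x[i] - alpha * sum(P[i][j] * x[j] for j in range(n))
            - (1 - alpha) * v[i] for i in range(n)]
print(max(abs(r) for r in residual))
```

For a web-scale problem one would solve this system with a sparse matrix library rather than dense loops, but the algebra is identical.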

Before we can address convergence, we need to establish that $LaTeX: x$ is unique. Every column of the matrix $LaTeX: P$ has entries that sum to 1 or 0. Assuming that no page links to itself, $LaTeX: (I - \alpha P)$ is strictly diagonally dominant when $LaTeX: \alpha < 1$, and thus $LaTeX: x$ is unique. Does the iteration find this unique $LaTeX: x$? Yes. Let $LaTeX: x^\ast$ be the solution, and let $LaTeX: x$ and $LaTeX: y$ be successive iterates, so that $LaTeX: y = \alpha Px + (1-\alpha)v$. Then
$LaTeX: \displaystyle x^\ast - y = x^\ast - \alpha P x - (1-\alpha) v = \alpha P x^\ast + (1-\alpha) v - \alpha P x - (1-\alpha) v = \alpha P (x^\ast - x).$

Now, if we take the 1-norm of both sides of this equation:
$LaTeX: \displaystyle \|x^\ast - y\|_1 = \alpha \|P(x^\ast - x)\|_1 \le \alpha \|P\|_1 \|x^\ast - x\|_1 \le \alpha \|x^\ast - x\|_1.$

Thus, the 1-norm distance to the exact solution shrinks by at least a factor of $LaTeX: \alpha$ at each iteration. This means the iteration converges.
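This geometric error decay is easy to observe numerically. Below is a sketch on the same hypothetical toy graph, treating a long-run iterate as the exact solution $LaTeX: x^\ast$ (the graph, starting vector, and step counts are my own choices):

```python
# Observe the error shrinking by at least a factor of alpha per step
# (hypothetical toy example).
alpha = 0.85
out_links = {0: [1, 2], 1: [2], 2: [0]}
n = 3
v = [1.0 / n] * n

def step(x):
    """One PageRank update: y = alpha*P*x + (1-alpha)*v, matrix-free."""
    y = [(1 - alpha) * v[i] for i in range(n)]
    for j, links in out_links.items():
        for i in links:
            y[i] += alpha * x[j] / len(links)
    return y

# run long enough to treat the result as the exact solution x*
x_star = [1.0 / n] * n
for _ in range(1000):
    x_star = step(x_star)

x = [1.0, 0.0, 0.0]                # a deliberately bad starting vector
errs = []
for _ in range(5):
    x = step(x)
    errs.append(sum(abs(x[i] - x_star[i]) for i in range(n)))

ratios = [errs[k + 1] / errs[k] for k in range(4)]
print(ratios)  # each ratio is at most alpha = 0.85
```

Each successive error ratio is bounded by $LaTeX: \alpha$, exactly as the norm inequality predicts.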

Dgleich 07:10, 10 August 2010 (PDT)