# PageRank



PageRank is a system that associates an importance score with each page on the web. It was developed by Sergey Brin, Larry Page, Rajeev Motwani, and Terry Winograd in the late 1990s. The fundamental idea behind the score is that important pages are those where we expect to find web surfers. With this idea, the two key ingredients of PageRank are:

i) A directed graph abstraction of the web; and
ii) A random surfer model of user behavior.

Idea i) implies that PageRank views each page on the web as a node in a directed graph. A link on a particular page corresponds to a directed edge between the nodes. See the illustration below for an example. This idea was not new to PageRank. Other ranking systems such as Kleinberg's HITS also used a directed graph abstraction of the web.

Idea ii) is the key insight to PageRank. Recall the fundamental idea: important pages are those where we expect to find web surfers. PageRank proposes the following model for the behavior of web surfers:

a) at a page, a surfer follows a link to another page uniformly at random
or
b) at a page, a surfer does something else.

Case a) is pretty straightforward. The surfer knows the current page and merely has to make a random choice of an outlink. In case b), we have no information on what the person actually does, so we assume they randomly find a new page on the web and begin surfing again. Because the surfer behaves in a manner uncoupled from the connectivity of the web, we call this behavior "jumping" or "teleporting". Variants of PageRank modify the behavior in step b) to teleport to a small set of specific nodes on the web. These are usually topic-specific pages (for example, only sports pages for a sports surfer) or personally interesting pages (for example, your bookmarks).

To decide whether the surfer follows action a) or b), we toss a weighted coin. Thus, we assign probability $\alpha$ to action a) and probability $(1-\alpha)$ to action b).
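To make the coin-toss model concrete, here is a minimal Monte Carlo sketch of the surfer. The 3-page graph, the value of $\alpha$, and the step count are all illustrative assumptions, not part of the original article:

```python
import random

# Hypothetical 3-page web (an assumption for illustration):
# page 0 links to pages 1 and 2, page 1 links to 2, page 2 links to 0.
links = {0: [1, 2], 1: [2], 2: [0]}
n = 3
alpha = 0.85          # probability of action a)

random.seed(0)        # fixed seed so the run is reproducible
visits = [0] * n
page = 0
steps = 200_000
for _ in range(steps):
    if random.random() < alpha:      # action a): follow an outlink uniformly
        page = random.choice(links[page])
    else:                            # action b): teleport to a random page
        page = random.randrange(n)
    visits[page] += 1

freq = [v / steps for v in visits]   # empirical fraction of time on each page
print(freq)
```

The long-run visit frequencies approximate the PageRank scores of the three pages.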

At this point, we have a mathematical model for surfer behavior and we can describe aggregate statistics about the behavior of the model. For convenience, we assign an index to each page on the web. The index is a simple label so we can write our equations without always writing out full URLs. To be concrete, instead of writing "the probability that the surfer is on slashdot.org" we assign slashdot.org an arbitrary but unique number $i$, and then we call $x_i$ "the probability that the surfer is on slashdot.org." In general, let $x_i$ be the probability of finding the surfer at page $i$, for each page on the web. Then the probability of finding the surfer at page $i$ after one step of surfing is
$\displaystyle y_i = \sum_{\text{page } j \text{ linking to } i} \alpha\, x_j/d_j + (1-\alpha) v_i$
In this expression, $d_j$ is the number of links from page $j$ and $v_i$ is the probability of a random jump to page $i$. We generally set $v_i = 1/n$, where $n$ is the number of pages on the web, though other choices are possible -- look up personalized PageRank. Let's continue by running through the components. To get to page $i$, the surfer either had to follow a link to page $i$ or take a random jump to page $i$. To get the probability of following a link to page $i$, we have to see which pages link to page $i$. In the above equation, this set of pages is represented by the sum over $j$. For each of these pages, the surfer will arrive at page $i$ with probability $x_j/d_j$. But those choices only occur with probability $\alpha$, thus $\alpha x_j/d_j$. Alternatively, the surfer could have taken a random jump with probability $1-\alpha$ and picked page $i$ with probability $v_i$.
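Here is a sketch of one step of this update with the sum written out as plain loops. The 3-page graph (page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0) is a made-up example:

```python
# One application of y_i = sum over pages j linking to i of alpha*x_j/d_j,
# plus (1-alpha)*v_i, on a hypothetical 3-page graph.
links = {0: [1, 2], 1: [2], 2: [0]}
n = 3
alpha = 0.85
v = [1.0 / n] * n        # uniform random-jump probabilities
x = [1.0 / n] * n        # start: surfer equally likely to be anywhere

d = {j: len(links[j]) for j in links}   # d_j = number of links from page j
y = [0.0] * n
for j in links:              # page j spreads alpha*x_j ...
    for i in links[j]:       # ... evenly over the pages it links to
        y[i] += alpha * x[j] / d[j]
for i in range(n):
    y[i] += (1 - alpha) * v[i]   # plus the random-jump term

print(y)   # still a probability vector: the entries sum to 1
```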

Now, we just keep running this same equation for all pages $i$ in some sequence until the probabilities of where we find the surfer stop changing. And stop changing they do! Let's show this result by diving into a matrix formulation of PageRank. We are going to look for a fixed point of the above equation.

Let $P$ be a matrix where the $(i,j)$ entry is the probability of moving from page $j$ to page $i$. (We now need to assume that the page indices $i$ run from 1 to the number of pages on the web, $n$.) Formally,
$\displaystyle P_{i,j} = \begin{cases} \frac{1}{d_j} & \text{if page } j \text{ links to page } i \\ 0 & \text{otherwise.} \end{cases}$
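As a sketch, here is this matrix built for a made-up 3-page graph (page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0); the graph is an illustrative assumption:

```python
# P[i][j] = 1/d_j when page j links to page i, else 0, so each column of P
# holds the outgoing probabilities of one page.
links = {0: [1, 2], 1: [2], 2: [0]}
n = 3

P = [[0.0] * n for _ in range(n)]
for j, outlinks in links.items():
    for i in outlinks:
        P[i][j] = 1.0 / len(outlinks)

# Every column sums to 1 here; a page with no outlinks would give a zero column.
for j in range(n):
    print(j, sum(P[i][j] for i in range(n)))
```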

Consider the matrix equation
$\displaystyle y = \alpha Px + (1-\alpha) v.$
The $i$th row of this equation is precisely the PageRank update we described above. In this case, repeatedly applying the formula corresponds to the algorithm:

1. Set $x$ to be any probability vector.
2. Compute $y = \alpha Px + (1-\alpha) v$.
3. Set $x$ to the probabilities computed in $y$ (i.e., just copy everything in $y$ to $x$).
4. Go back to 2 and repeat.
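The four steps above can be sketched as follows. The 3-page graph (page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0), $\alpha = 0.85$, and the iteration cap are illustrative assumptions:

```python
# Power-iteration sketch for PageRank on a hypothetical 3-page graph.
links = {0: [1, 2], 1: [2], 2: [0]}
n = 3
alpha = 0.85
v = [1.0 / n] * n

P = [[0.0] * n for _ in range(n)]
for j, outlinks in links.items():
    for i in outlinks:
        P[i][j] = 1.0 / len(outlinks)

x = [1.0 / n] * n                        # step 1: any probability vector
for _ in range(200):
    y = [alpha * sum(P[i][j] * x[j] for j in range(n)) + (1 - alpha) * v[i]
         for i in range(n)]              # step 2: y = alpha*P*x + (1-alpha)*v
    done = max(abs(y[i] - x[i]) for i in range(n)) < 1e-12
    x = y                                # step 3: copy y into x
    if done:                             # stop once the vector stops changing
        break

print(x)   # the PageRank vector
```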

All we have left to do is show that this iteration converges. If the above algorithm converges, then the probabilities in $x$ and $y$ will be the same. Consequently, we need
$\displaystyle x = \alpha Px + (1-\alpha) v.$

This is really the linear system:
$\displaystyle (I - \alpha P) x = (1-\alpha) v.$
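Since this system is tiny for a made-up 3-page graph (page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0), we can sketch solving it directly with Gaussian elimination. This is purely illustrative; a web-scale system would never be solved densely:

```python
# Solve (I - alpha*P) x = (1-alpha)*v for a hypothetical 3-page graph.
links = {0: [1, 2], 1: [2], 2: [0]}
n = 3
alpha = 0.85

A = [[(1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
for j, outlinks in links.items():
    for i in outlinks:
        A[i][j] -= alpha / len(outlinks)     # A = I - alpha*P
b = [(1 - alpha) / n for _ in range(n)]      # b = (1-alpha)*v, uniform v

# Gaussian elimination with partial pivoting, then back substitution.
for k in range(n):
    p = max(range(k, n), key=lambda r: abs(A[r][k]))
    A[k], A[p] = A[p], A[k]
    b[k], b[p] = b[p], b[k]
    for r in range(k + 1, n):
        m = A[r][k] / A[k][k]
        for c in range(k, n):
            A[r][c] -= m * A[k][c]
        b[r] -= m * b[k]
x = [0.0] * n
for k in range(n - 1, -1, -1):
    x[k] = (b[k] - sum(A[k][c] * x[c] for c in range(k + 1, n))) / A[k][k]

print(x)   # the exact PageRank vector, up to roundoff
```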

Before we can address convergence, we need to establish whether $x$ is unique. Every column of the matrix $P$ has entries that sum to 1 or 0. Assuming that no page links to itself, $(I - \alpha P)$ is diagonally dominant when $\alpha < 1$, and thus $x$ is unique. Does the iteration find this unique $x$? Yes. Let $x^\ast$ be the solution and let $x$ and $y$ be consecutive iterates, so $y = \alpha Px + (1-\alpha) v$. Then
$\displaystyle x^\ast - y = x^\ast - \alpha P x - (1-\alpha) v = \alpha P x^\ast + (1-\alpha) v - \alpha P x - (1-\alpha) v = \alpha P (x^\ast - x).$

Now, if we take the 1-norm of both sides of this equation:
$\displaystyle \|x^\ast - y\|_1 = \alpha \|P(x^\ast - x)\|_1 \le \alpha \|P\|_1 \|x^\ast - x\|_1 \le \alpha \|x^\ast - x\|_1.$

Thus, we shrink the distance to the exact solution by at least a factor of $\alpha$ at each iteration. This means the iteration converges.
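This contraction can be checked numerically. Here is a sketch on a made-up 3-page graph (page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0), measuring the 1-norm distance to a numerically converged solution before and after each step:

```python
# Check that ||x* - y||_1 <= alpha * ||x* - x||_1 at every iteration.
links = {0: [1, 2], 1: [2], 2: [0]}
n = 3
alpha = 0.85
v = [1.0 / n] * n

P = [[0.0] * n for _ in range(n)]
for j, outlinks in links.items():
    for i in outlinks:
        P[i][j] = 1.0 / len(outlinks)

def step(x):
    return [alpha * sum(P[i][j] * x[j] for j in range(n)) + (1 - alpha) * v[i]
            for i in range(n)]

x_star = [1.0 / n] * n       # run long enough to approximate the fixed point
for _ in range(2000):
    x_star = step(x_star)

ratios = []
x = [1.0, 0.0, 0.0]          # a deliberately bad starting vector
for _ in range(10):
    y = step(x)
    err_x = sum(abs(x_star[i] - x[i]) for i in range(n))
    err_y = sum(abs(x_star[i] - y[i]) for i in range(n))
    ratios.append(err_y / err_x)
    x = y

print(max(ratios))   # never exceeds alpha, matching the bound above
```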

Dgleich 07:10, 10 August 2010 (PDT)