# PageRank

### From Wikimization

(→links) |
Current revision (17:57, 30 October 2016) (edit) (undo) |
||

(7 intermediate revisions not shown.) | |||

Line 1: | Line 1: | ||

- | + | PageRank is a system to associate an importance score | |

- | + | to each page on the web. It was developed by | |

- | + | Sergey Brin, Larry Page, Rajeev Motwoni, and Terry Winograd in | |

- | + | the late 1990s. The fundamental idea behind the score is | |

- | + | that important pages are those where we expect to find web surfers. | |

- | + | With this idea, the two key ingredients to PageRank are: | |

- | + | <blockquote> | |

- | + | i) A directed graph abstraction of the web; and <br> | |

- | + | ii) A random surfer model of user behavior. | |

- | + | </blockquote> | |

- | + | ||

- | + | ||

- | + | ||

- | + | ||

- | + | ||

- | + | ||

- | + | ||

- | + | ||

- | + | ||

- | + | ||

- | + | ||

- | + | ||

- | + | ||

- | + | ||

- | + | Idea i) implies that PageRank views each page on the web as a node | |

- | + | in a directed graph. A link on a particular page corresponds to | |

- | + | a directed edge between the nodes. See the illustration below for | |

- | + | an example. This idea was not new to PageRank. Other ranking systems | |

- | + | such as Kleinberg's HITS also used a directed graph abstraction of | |

- | + | the web. | |

- | + | ||

- | + | ||

- | + | Idea ii) is the key insight to PageRank. Recall the fundamental idea: | |

- | + | important pages are those where we expect to find web surfers. PageRank | |

+ | proposes the following model for the behavior of web surfers: | ||

+ | <blockquote> | ||

+ | a) at a page, a surfer follows a link to another page uniformly at random<br> | ||

+ | or <br> | ||

+ | b) at a page, a surfer does something else. | ||

+ | </blockquote> | ||

+ | Case a) is pretty straightforward. The surfer knows the current | ||

+ | page, and merely must make a random choice for an outlink. | ||

+ | In case b), we have no information on what the person actually does, | ||

+ | and so we assume they randomly find a new page on the web to begin | ||

+ | surfing again. Because the surfer behaves in a manner uncoupled | ||

+ | with the connectivity of the web, we call this behavior ``jumping'' or | ||

+ | ``teleporting''. Also, variants of PageRank modify the behavior | ||

+ | in step b) to teleport to a small set of specific nodes on the web. | ||

+ | These are usually topic specific pages, for example only sports | ||

+ | pages for a sports surfer, or personally interesting pages, for example | ||

+ | your bookmarks. | ||

- | + | To decide if the surfer follows action a) or b), we toss a weighted coin. | |

+ | Thus, we assign probability <math>\alpha</math> to action a) and | ||

+ | probability <math>(1-\alpha)</math> to action b). | ||

- | + | At this point, we have a mathematical model for surfer behavior | |

+ | and we can describe aggregate statistics about the behavior of the model. | ||

+ | For convenience, we assign an index to each page on the web. The | ||

+ | index is a simple label so we can write our equations without always | ||

+ | writing out full URLs. To be concrete, instead of writing | ||

+ | ``the probability that the surfer is on slashdot.org'' we assign | ||

+ | slashdot.org an arbitrary but unique number <math>i</math>, and then we call | ||

+ | <math>x_i</math> ``the probability that the surfer is on slashdot.org.'' | ||

+ | In general, let <math>x_i</math> be the probability of finding the surfer at page <math>i</math> | ||

+ | for each page on the web. | ||

+ | Then the probability of finding the surfer at node <math>i</math> after one | ||

+ | step of surfing is | ||

+ | <br> | ||

+ | <math>\displaystyle | ||

+ | y_i = \sum_{{\text page } j\,\,{\text linking to } i} \alpha*x_j/d_j + (1-\alpha)*v_i | ||

+ | </math> | ||

+ | <br> | ||

+ | In this expression, <math>d_j</math> is the number of links from page <math>j</math> | ||

+ | and <math>v_i</math> is the probability of a random jump to page <math>i</math>. | ||

+ | We generally set <math>v_i = 1/n</math> where <math>n</math> is the number of pages | ||

+ | on the web, though other choices are possible -- lookup personalized PageRank. | ||

+ | Let's continue by running through the components. To get to page <math>i</math>, | ||

+ | the surfer | ||

+ | either had to follow a link to page <math>i</math> or take a random jump to page <math>i</math>. | ||

+ | To get the probability of following a link to page <math>i</math>, we have to see | ||

+ | what pages | ||

+ | link to page <math>i</math>. In the above equation, this set of pages is represented | ||

+ | by the sum of <math>j</math>. For each of these pages, the surfer will | ||

+ | arrive at page <math>i</math> with probability <math>x_j/d_j</math>. But then those choices | ||

+ | only occur with probability <math>\alpha</math>, thus <math>\alpha x_j/d_j</math>. Alternatively, | ||

+ | the surfer could have taken a random jump with probability <math>1-\alpha</math>, | ||

+ | and picked page <math>i</math> with probability <math>v_i</math>. | ||

- | + | Now, we just keep running this same equation for all pages <math>i</math> in some | |

+ | sequence until the probabilities of where we find the surfer stop changing. | ||

+ | And stop changing they do! Let's show this result by diving into | ||

+ | a matrix formulation of PageRank. We are going to look for a | ||

+ | fixed-point of the above equation. | ||

+ | |||

+ | Let <math>P</math> be a matrix where the <math>i,j</math>th entry is the probability of | ||

+ | moving from page <math>j</math> to page <math>i</math>. (We now need to assume | ||

+ | that the page indices <math>i</math> run from 1 to the number of pages | ||

+ | on the web, <math>n</math>.) Formally, | ||

+ | <br> | ||

+ | <math>\displaystyle | ||

+ | P_{i,j} | ||

+ | = \begin{cases} | ||

+ | \frac{1}{d_j} & {\text{ if page } i \text{ links to page } j} \\ | ||

+ | 0 & {\text otherwise. } \\ | ||

+ | \end{cases} | ||

+ | </math> | ||

+ | <br> | ||

+ | Consider the matrix equation | ||

+ | <br> | ||

+ | <math>\displaystyle | ||

+ | y = \alpha Px + (1-\alpha) v. | ||

+ | </math> | ||

+ | <br> | ||

+ | The <math>i</math>th row of this equation is precisely the update | ||

+ | for PageRank we described above. In this case, repeatedly applying | ||

+ | the formula corresponds to the algorithm: | ||

+ | # Set <math>x</math> to be any probability vector. | ||

+ | # Compute <math>y = \alpha Px + (1-\alpha) v</math>. | ||

+ | # Set <math>x</math> to the probabilities computed in <math>y</math>, | ||

+ | i.e. just copy everything in <math>y</math> to <math>x</math>) | ||

+ | # Go back to 2 and repeat. | ||

+ | |||

+ | All we have left to do is show that this iteration converges. | ||

+ | If the above algorithm converges, then the probabilities in | ||

+ | <math>x</math> and <math>y</math> will be the same. Consequently, we need | ||

+ | <br> | ||

+ | <math>\displaystyle | ||

+ | x = \alpha Px + (1-\alpha) v. | ||

+ | </math> | ||

+ | <br> | ||

+ | This is really the linear system: | ||

+ | <br> | ||

+ | <math>\displaystyle | ||

+ | (I - \alpha P) x = (1-\alpha) v. | ||

+ | </math> | ||

+ | <br> | ||

+ | Before we can address convergence, we need to establish | ||

+ | whether <math>x</math> is unique. | ||

+ | Every column of the matrix <math>P</math> has entries that sum to 1 | ||

+ | or 0. Assuming that no page links to itself, then | ||

+ | <math>(I - \alpha P)</math> is diagonally dominant when <math>\alpha < 1</math>, | ||

+ | and thus, <math>x</math> is unique. | ||

+ | Does the iteration find a unique <math>x</math>? Yes. | ||

+ | Let <math>x^\ast</math> be the solution and let <math>x</math> | ||

+ | and <math>y</math> be adjacency iterations. Then | ||

+ | <br> | ||

+ | <math>\displaystyle | ||

+ | x^\ast - y | ||

+ | = x^\ast - \alpha P x - (1-\alpha) v | ||

+ | = \alpha P x^\ast + (1-\alpha) v - \alpha P x - (1-\alpha) v | ||

+ | = \alpha P (x^\ast - x). | ||

+ | </math> | ||

+ | <br> | ||

+ | Now, if we take the 1-norm of both sides of this equation: | ||

+ | <br> | ||

+ | <math>\displaystyle | ||

+ | \|x^\ast - y \|_1 | ||

+ | = \alpha \| P(x^\ast - x) \|_1 | ||

+ | \le \alpha \| P \|_1 \|(x^\ast - x) \|_1 | ||

+ | \le \alpha \|(x^\ast - x) \|_1. | ||

+ | </math> | ||

+ | <br> | ||

+ | Thus, we reduce the norm to the exact solution by a factor | ||

+ | of <math>\alpha</math> at each iteration. This means the iteration | ||

+ | converges. | ||

+ | |||

+ | [[User:Dgleich|Dgleich]] 07:10, 10 August 2010 (PDT) |

## Current revision

PageRank is a system to associate an importance score to each page on the web. It was developed by Sergey Brin, Larry Page, Rajeev Motwoni, and Terry Winograd in the late 1990s. The fundamental idea behind the score is that important pages are those where we expect to find web surfers. With this idea, the two key ingredients to PageRank are:

i) A directed graph abstraction of the web; and

ii) A random surfer model of user behavior.

Idea i) implies that PageRank views each page on the web as a node in a directed graph. A link on a particular page corresponds to a directed edge between the nodes. See the illustration below for an example. This idea was not new to PageRank. Other ranking systems such as Kleinberg's HITS also used a directed graph abstraction of the web.

Idea ii) is the key insight to PageRank. Recall the fundamental idea: important pages are those where we expect to find web surfers. PageRank proposes the following model for the behavior of web surfers:

a) at a page, a surfer follows a link to another page uniformly at random

or

b) at a page, a surfer does something else.

Case a) is pretty straightforward. The surfer knows the current
page, and merely must make a random choice for an outlink.
In case b), we have no information on what the person actually does,
and so we assume they randomly find a new page on the web to begin
surfing again. Because the surfer behaves in a manner uncoupled
with the connectivity of the web, we call this behavior ``jumping* or*
``teleporting*. Also, variants of PageRank modify the behavior*
in step b) to teleport to a small set of specific nodes on the web.
These are usually topic specific pages, for example only sports
pages for a sports surfer, or personally interesting pages, for example
your bookmarks.

To decide if the surfer follows action a) or b), we toss a weighted coin. Thus, we assign probability to action a) and probability to action b).

At this point, we have a mathematical model for surfer behavior
and we can describe aggregate statistics about the behavior of the model.
For convenience, we assign an index to each page on the web. The
index is a simple label so we can write our equations without always
writing out full URLs. To be concrete, instead of writing
``the probability that the surfer is on slashdot.org* we assign*
slashdot.org an arbitrary but unique number , and then we call
``the probability that the surfer is on slashdot.org.
In general, let be the probability of finding the surfer at page
for each page on the web.
Then the probability of finding the surfer at node after one
step of surfing is

In this expression, is the number of links from page
and is the probability of a random jump to page .
We generally set where is the number of pages
on the web, though other choices are possible -- lookup personalized PageRank.
Let's continue by running through the components. To get to page ,
the surfer
either had to follow a link to page or take a random jump to page .
To get the probability of following a link to page , we have to see
what pages
link to page . In the above equation, this set of pages is represented
by the sum of . For each of these pages, the surfer will
arrive at page with probability . But then those choices
only occur with probability , thus . Alternatively,
the surfer could have taken a random jump with probability ,
and picked page with probability .

Now, we just keep running this same equation for all pages in some sequence until the probabilities of where we find the surfer stop changing. And stop changing they do! Let's show this result by diving into a matrix formulation of PageRank. We are going to look for a fixed-point of the above equation.

Let be a matrix where the th entry is the probability of
moving from page to page . (We now need to assume
that the page indices run from 1 to the number of pages
on the web, .) Formally,

Consider the matrix equation

The th row of this equation is precisely the update
for PageRank we described above. In this case, repeatedly applying
the formula corresponds to the algorithm:

- Set to be any probability vector.
- Compute .
- Set to the probabilities computed in ,

i.e. just copy everything in to )

- Go back to 2 and repeat.

All we have left to do is show that this iteration converges.
If the above algorithm converges, then the probabilities in
and will be the same. Consequently, we need

This is really the linear system:

Before we can address convergence, we need to establish
whether is unique.
Every column of the matrix has entries that sum to 1
or 0. Assuming that no page links to itself, then
is diagonally dominant when ,
and thus, is unique.
Does the iteration find a unique ? Yes.
Let be the solution and let
and be adjacency iterations. Then

Now, if we take the 1-norm of both sides of this equation:

Thus, we reduce the norm to the exact solution by a factor
of at each iteration. This means the iteration
converges.

Dgleich 07:10, 10 August 2010 (PDT)