PageRank
From Wikimization
PageRank is a system that assigns an importance score to each page on the web. It was developed by Sergey Brin, Larry Page, Rajeev Motwani, and Terry Winograd in the late 1990s. The fundamental idea behind the score is that important pages are those where we expect to find web surfers. With this idea, the two key ingredients to PageRank are:
i) A directed graph abstraction of the web; and
ii) A random surfer model of user behavior.
Idea i) implies that PageRank views each page on the web as a node in a directed graph. A link on a particular page corresponds to a directed edge between the nodes. See the illustration below for an example. This idea was not new to PageRank. Other ranking systems such as Kleinberg's HITS also used a directed graph abstraction of the web.
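As a toy illustration (the page names here are made up), such a directed graph can be stored in Python as an adjacency list that maps each page to the pages it links to:

<pre>
# A toy four-page "web": each key is a page (a node) and each value
# lists the pages it links to (the directed edges).
web = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],
    "d.com": ["a.com", "b.com", "c.com"],
}
</pre>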
Idea ii) is the key insight to PageRank. Recall the fundamental idea: important pages are those where we expect to find web surfers. PageRank proposes the following model for the behavior of web surfers:
a) at a page, a surfer follows a link to another page uniformly at random
or
b) at a page, a surfer does something else.
Case a) is pretty straightforward. The surfer knows the current page, and merely must make a random choice for an outlink. In case b), we have no information on what the person actually does, and so we assume they randomly find a new page on the web to begin surfing again. Because the surfer behaves in a manner uncoupled with the connectivity of the web, we call this behavior "jumping" or "teleporting". Also, variants of PageRank modify the behavior in step b) to teleport to a small set of specific nodes on the web. These are usually topic-specific pages, for example only sports pages for a sports surfer, or personally interesting pages, for example your bookmarks.
To decide if the surfer follows action a) or b), we toss a weighted coin. Thus, we assign probability <math>\alpha</math> to action a) and probability <math>1-\alpha</math> to action b).
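To make the model concrete, here is a minimal simulation sketch of the random surfer; the function name and the choice <math>\alpha = 0.85</math> are illustrative, not part of the model's definition:

<pre>
import random

def surf(web, alpha=0.85, steps=100_000, seed=0):
    """Simulate the random surfer; return the fraction of steps spent on each page."""
    rng = random.Random(seed)
    pages = list(web)
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)                 # start anywhere
    for _ in range(steps):
        visits[page] += 1
        out = web[page]
        if out and rng.random() < alpha:
            page = rng.choice(out)           # a) follow an out-link uniformly at random
        else:
            page = rng.choice(pages)         # b) teleport to a page chosen uniformly at random
    return {p: count / steps for p, count in visits.items()}

# e.g., with the toy `web` from the sketch above:
# print(surf(web))
</pre>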
At this point, we have a mathematical model for surfer behavior
and we can describe aggregate statistics about the behavior of the model.
For convenience, we assign an index to each page on the web. The index is a simple label so we can write our equations without always writing out full URLs. To be concrete, instead of writing "the probability that the surfer is on slashdot.org", we assign slashdot.org an arbitrary but unique number <math>i</math>, and then we call <math>x_i</math> "the probability that the surfer is on slashdot.org". In general, let <math>x_j</math> be the probability of finding the surfer at page <math>j</math>, for each page <math>j</math> on the web. Then the probability <math>y_i</math> of finding the surfer at node <math>i</math> after one step of surfing is

<math>\displaystyle y_i = \sum_{\text{page } j \text{ linking to } i} \alpha\, x_j / d_j + (1-\alpha)\, v_i</math>
In this expression, <math>d_j</math> is the number of links from page <math>j</math>, and <math>v_i</math> is the probability of a random jump to page <math>i</math>. We generally set <math>v_i = 1/n</math>, where <math>n</math> is the number of pages on the web, though other choices are possible -- look up personalized PageRank.
Let's continue by running through the components. To get to page <math>i</math>, the surfer either had to follow a link to page <math>i</math> or take a random jump to page <math>i</math>. To get the probability of following a link to page <math>i</math>, we have to see what pages <math>j</math> link to page <math>i</math>. In the above equation, this set of pages is represented by the sum over <math>j</math>. For each of these pages, the surfer will arrive at page <math>i</math> with probability <math>x_j/d_j</math>. But then those choices only occur with probability <math>\alpha</math>, thus <math>\alpha\, x_j/d_j</math>. Alternatively, the surfer could have taken a random jump with probability <math>1-\alpha</math>, and picked page <math>i</math> with probability <math>v_i</math>.
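As a minimal sketch of this update computed directly from an adjacency list (uniform <math>v_i = 1/n</math> as in the text; pages with no out-links are left contributing nothing, which matches a matrix whose corresponding columns sum to 0, as discussed below):

<pre>
def pagerank_step(web, x, alpha=0.85):
    """One surfing step: y_i = sum over pages j linking to i of alpha * x_j / d_j,
    plus the teleportation term (1 - alpha) * v_i with v_i = 1/n."""
    n = len(web)
    y = {p: (1 - alpha) / n for p in web}       # the (1 - alpha) * v_i term
    for j, links in web.items():
        for i in links:
            y[i] += alpha * x[j] / len(links)   # alpha * x_j / d_j for each link j -> i
    return y
</pre>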
Now, we just keep running this same equation for all pages in some
sequence until the probabilities of where we find the surfer stop changing.
And stop changing they do! Let's show this result by diving into
a matrix formulation of PageRank. We are going to look for a
fixed-point of the above equation.
Let <math>P</math> be a matrix where the <math>(i,j)</math>th entry is the probability of moving from page <math>j</math> to page <math>i</math>. (We now need to assume that the page indices run from 1 to the number of pages on the web, <math>n</math>.) Formally,

<math>\displaystyle P_{ij} = \begin{cases} 1/d_j & \text{if page } j \text{ links to page } i \\ 0 & \text{otherwise.} \end{cases}</math>
Consider the matrix equation

<math>\displaystyle y = \alpha P x + (1-\alpha)\, v.</math>
The <math>i</math>th row of this equation is precisely the update for PageRank we described above. In this case, repeatedly applying the formula corresponds to the algorithm:
1. Set <math>x</math> to be any probability vector.
2. Compute <math>y = \alpha P x + (1-\alpha)\, v</math>.
3. Set <math>x</math> to the probabilities computed in <math>y</math>, i.e. just copy everything in <math>y</math> to <math>x</math>.
4. Go back to 2 and repeat.
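In code, the four steps above collapse into a short power iteration; the tolerance and iteration cap are practical additions, not part of the algorithm as stated:

<pre>
import numpy as np

def pagerank(P, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration: repeat y = alpha * P x + (1 - alpha) * v until x stops changing."""
    n = P.shape[0]
    v = np.full(n, 1.0 / n)                      # uniform teleportation vector
    x = v.copy()                                 # 1. start from any probability vector
    for _ in range(max_iter):
        y = alpha * (P @ x) + (1 - alpha) * v    # 2. compute y
        if np.abs(y - x).sum() < tol:            # probabilities stopped changing
            return y
        x = y                                    # 3. copy y into x; 4. repeat
    return x
</pre>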
All we have left to do is show that this iteration converges. If the above algorithm converges, then the probabilities in <math>x</math> and <math>y</math> will be the same. Consequently, we need

<math>\displaystyle x = \alpha P x + (1-\alpha)\, v.</math>

This is really the linear system:

<math>\displaystyle (I - \alpha P)\, x = (1-\alpha)\, v.</math>
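For a graph small enough to store as a dense matrix, this linear system can be solved directly; a sketch (at web scale, one uses the iteration instead):

<pre>
import numpy as np

def pagerank_direct(P, alpha=0.85):
    """Solve (I - alpha * P) x = (1 - alpha) * v for x directly."""
    n = P.shape[0]
    v = np.full(n, 1.0 / n)
    return np.linalg.solve(np.eye(n) - alpha * P, (1 - alpha) * v)
</pre>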
Before we can address convergence, we need to establish whether <math>x</math> is unique. Every column of the matrix <math>P</math> has entries that sum to 1 or 0. Assuming that no page links to itself, every diagonal entry of <math>I - \alpha P</math> is 1, while the off-diagonal entries of each column sum to at most <math>\alpha</math> in absolute value. Hence <math>I - \alpha P</math> is diagonally dominant when <math>\alpha < 1</math>, and thus <math>x</math> is unique.
Does the iteration find this unique <math>x</math>? Yes. Let <math>x^*</math> be the solution, and let <math>x^{(k)}</math> and <math>x^{(k+1)}</math> be adjacent iterates. Subtracting <math>x^* = \alpha P x^* + (1-\alpha)\, v</math> from <math>x^{(k+1)} = \alpha P x^{(k)} + (1-\alpha)\, v</math>, the teleportation terms cancel and

<math>\displaystyle x^{(k+1)} - x^* = \alpha P\, (x^{(k)} - x^*).</math>

Now, if we take the 1-norm of both sides of this equation:

<math>\displaystyle \|x^{(k+1)} - x^*\|_1 \le \alpha\, \|x^{(k)} - x^*\|_1,</math>

since every column of <math>P</math> sums to at most 1. Thus, we reduce the norm of the error to the exact solution by a factor of <math>\alpha</math> at each iteration. This means the iteration converges.
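This contraction is easy to check numerically; the following sketch assumes the toy `web`, `link_matrix`, and `pagerank_direct` from the sketches above, with an illustrative <math>\alpha = 0.85</math>:

<pre>
import numpy as np

alpha = 0.85
P, pages = link_matrix(web)
x_star = pagerank_direct(P, alpha)               # the exact solution
n = P.shape[0]
v = np.full(n, 1.0 / n)
x = v.copy()
for k in range(5):
    err = np.abs(x - x_star).sum()               # 1-norm distance to the solution
    print(f"iteration {k}: error = {err:.3e}")   # shrinks by at least a factor alpha
    x = alpha * (P @ x) + (1 - alpha) * v
</pre>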
Dgleich 07:10, 10 August 2010 (PDT)