PageRank
《The PageRank Citation Ranking: Bringing Order to the Web》
Link: link.
Motivation
In this paper, we take advantage of the link structure of the Web to produce a global “importance” ranking of every web page. This ranking, called PageRank, helps search engines and users quickly make sense of the vast heterogeneity of the World Wide Web.
Ranking web pages by importance.
Model
link structure of the web
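Every web page has forward links and back links, i.e. out-edges and in-edges. Once a page has been downloaded we know all of its forward links, but we can never be sure we have found all of its back links.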
PageRank
Let $u$ denote a web page. $F_{u}$ (forward) is the set of pages that $u$ points to, and $B_{u}$ (back) is the set of pages that point to $u$.
$N_{u}=\left|F_{u}\right|$ is the number of forward links of $u$.
$c$ is a normalization factor (it keeps the total rank of all pages constant).
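To make the notation concrete, here is a minimal sketch that builds $B_u$ and $N_u$ from the forward links of a tiny graph. The three-page graph and the names `F`, `B`, `N` and `back_links` are my own illustration, not something from the paper.

```python
def back_links(forward_links):
    """Build B_u for every page from the forward-link sets F_u."""
    back = {u: set() for u in forward_links}  # pages without backlinks keep an empty set
    for u, targets in forward_links.items():
        for v in targets:
            # u points to v, so u belongs to B_v
            back.setdefault(v, set()).add(u)
    return back

# Hypothetical web: A -> B, A -> C, B -> C, C -> A
F = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
B = back_links(F)
N = {u: len(targets) for u, targets in F.items()}  # N_u = |F_u|
```

In this toy graph `B["C"]` contains both `A` and `B`, while `N["A"]` is 2 because `A` has two forward links.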
In the simplified version of PageRank, $R$ denotes the rank (importance score) of a page:

$$
R(u)=c \sum_{v \in B_{u}} \frac{R(v)}{N_{v}}
$$
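Put simply, the rank of page $u$ is determined by the ranks of the pages that point to it: each page $v$ splits its own rank evenly across its $N_{v}$ forward links, and $u$ sums the shares it receives. The figure below illustrates this.
Every page's rank is initialized, and the update is then iterated until it converges (the paper has a section on convergence).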
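The iteration can be sketched in a few lines of Python. This is only my reading of the simplified formula: the uniform starting vector, the fixed number of rounds, and the toy graph are assumptions, and $c$ is applied by rescaling the ranks so that they sum to 1 after each round.

```python
def simplified_pagerank(forward_links, iterations=50):
    """Iterate R(u) = c * sum over v in B_u of R(v) / N_v on a small graph.

    forward_links maps each page u to the list F_u of pages it points to.
    """
    pages = list(forward_links)
    rank = {u: 1.0 / len(pages) for u in pages}  # assumed uniform start
    for _ in range(iterations):
        new_rank = {u: 0.0 for u in pages}
        for v, targets in forward_links.items():
            for u in targets:
                # v splits its rank evenly over its N_v forward links
                new_rank[u] += rank[v] / len(targets)
        total = sum(new_rank.values())
        # c: rescale so the total rank stays equal to 1
        rank = {u: r / total for u, r in new_rank.items()}
    return rank

# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simplified_pagerank(toy_web)
```

For this graph the fixed point can be checked by hand: $R(B)=R(A)/2$ and $R(C)=R(A)/2+R(B)=R(A)$, so the ranks settle at 0.4, 0.2 and 0.4 for `A`, `B` and `C`.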
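However, this model runs into two special cases:
- A page with forward links but no back links: nobody passes rank to it, so its own rank drops to 0.
- A page, or a closed loop of pages, with back links but no forward links leading out: it keeps absorbing rank and never passes any on, so the other pages drain toward 0. The paper calls such a loop a rank sink; a single page with no forward links is often described as a rank leak.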
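A small experiment shows both effects at once. This is a sketch on a made-up four-page graph (the names `S`, `M`, `L1`, `L2` and the helper `iterate_simplified` are mine): `S` has no back links, and `L1` and `L2` link only to each other.

```python
def iterate_simplified(forward_links, rounds):
    """Run the simplified update for a fixed number of rounds (no damping)."""
    pages = list(forward_links)
    rank = {u: 1.0 / len(pages) for u in pages}  # assumed uniform start
    for _ in range(rounds):
        new_rank = dict.fromkeys(pages, 0.0)
        for v, targets in forward_links.items():
            for u in targets:
                new_rank[u] += rank[v] / len(targets)
        rank = new_rank
    return rank

# Hypothetical web: S -> M -> L1, and L1 <-> L2 form a closed loop
web = {"S": ["M"], "M": ["L1"], "L1": ["L2"], "L2": ["L1"]}
after = iterate_simplified(web, rounds=10)
```

After a couple of rounds `S` and `M` hold no rank at all, and the loop formed by `L1` and `L2` holds all of it.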
Both cases drive page ranks toward 0 as the iteration proceeds. The improved model that addresses this is
$$
R^{\prime}(u)=c \sum_{v \in B_{u}} \frac{R^{\prime}(v)}{N_{v}}+c E(u)
$$

“Let $E(u)$ be some vector over the Web pages that corresponds to a source of rank. Then, the PageRank of a set of Web pages is an assignment, $R^{\prime}$, to the Web pages which satisfies”
“where $E(u)$ is some vector over the web pages that corresponds to a source of rank (see Section 6). Note that if $E$ is all positive, $c$ must be reduced to balance the equation. Therefore, this technique corresponds to a decay factor. In matrix notation we have $R^{\prime}=c\left(A R^{\prime}+E\right)$. Since $\left\|R^{\prime}\right\|_{1}=1$, we can rewrite this as $R^{\prime}=c(A+E \times \mathbf{1}) R^{\prime}$ where $\mathbf{1}$ is the vector consisting of all ones. So, $R^{\prime}$ is an eigenvector of $(A+E \times \mathbf{1})$.”
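The rewrite in the quote skips a step, so here is how I read it, as a sketch. It assumes that $R^{\prime}$ is nonnegative (so $\left\|R^{\prime}\right\|_{1}$ is simply the sum of its entries), that $A$ is oriented so that $(A R^{\prime})(u)=\sum_{v \in B_{u}} R^{\prime}(v) / N_{v}$, and that $E \times \mathbf{1}$ stands for the outer product $E \mathbf{1}^{T}$, the matrix whose every column is $E$.

$$
\begin{aligned}
\left\|R^{\prime}\right\|_{1}=1 \;&\Rightarrow\; \mathbf{1}^{T} R^{\prime}=1 \\
E &= E\left(\mathbf{1}^{T} R^{\prime}\right)=\left(E \mathbf{1}^{T}\right) R^{\prime} \\
R^{\prime}=c\left(A R^{\prime}+E\right) &= c\left(A+E \mathbf{1}^{T}\right) R^{\prime}
\end{aligned}
$$

Under these assumptions $R^{\prime}$ is an eigenvector of $A+E \mathbf{1}^{T}$ with eigenvalue $1/c$.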
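In other words, adding a positive source term $E$, which acts as a decay factor, keeps page ranks from dropping to 0. The term also has an intuitive interpretation (though I do not fully understand it):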
“The additional factor $E$ can be viewed as a way of modeling this behavior: the surfer periodically “gets bored” and jumps to a random page chosen based on the distribution in $E$. So far we have left $E$ as a user defined parameter. In most tests we let $E$ be uniform over all web pages with value $\alpha$.”
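The random surfer picture can be simulated directly. The sketch below is my own illustration, not the paper's procedure: `jump_prob` is a hypothetical per-step probability of getting bored, $E$ is taken to be uniform, and a surfer on a page with no forward links is assumed to always jump.

```python
import random

def random_surfer(forward_links, jump_prob, steps, seed=0):
    """Estimate page importance as the fraction of steps spent on each page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    pages = list(forward_links)
    visits = dict.fromkeys(pages, 0)
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        targets = forward_links[current]
        if not targets or rng.random() < jump_prob:
            # the surfer "gets bored" and jumps to a uniformly random page
            current = rng.choice(pages)
        else:
            # otherwise follow one of the forward links at random
            current = rng.choice(targets)
    return {u: n / steps for u, n in visits.items()}

# Hypothetical web: A, B and C all point to H; H points back to A and B
web = {"A": ["H"], "B": ["H"], "C": ["H"], "H": ["A", "B"]}
freq = random_surfer(web, jump_prob=0.15, steps=20000)
```

Page `H` collects links from every other page, so the surfer spends the largest share of its time there, while `C`, which has no back links, is reached only through random jumps.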
computing PageRank
Algorithm:
$$
\begin{array}{l}
R_{0} \leftarrow S \\
\text{loop:} \\
\quad R_{i+1} \leftarrow A R_{i} \\
\quad d \leftarrow \left\|R_{i}\right\|_{1}-\left\|R_{i+1}\right\|_{1} \\
\quad R_{i+1} \leftarrow R_{i+1}+d E \\
\quad \delta \leftarrow \left\|R_{i+1}-R_{i}\right\|_{1} \\
\text{while } \delta>\epsilon
\end{array}
$$
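The loop translates almost line by line into Python. This is a sketch under my own assumptions: $E$ is uniform and sums to 1, the start vector $S$ is set equal to $E$, ranks are nonnegative so the L1 norm is a plain sum, and the only rank lost in $A R_{i}$ comes from pages without forward links. The names `pagerank`, `source` and `max_rounds` are hypothetical.

```python
def pagerank(forward_links, source, epsilon=1e-10, max_rounds=1000):
    """Iterate R <- A R + d E until the L1 change is at most epsilon."""
    pages = list(forward_links)
    rank = dict(source)  # R_0 <- S (assumed here: S = E)
    for _ in range(max_rounds):
        # R_{i+1} <- A R_i: each page passes its rank along its forward links
        new_rank = dict.fromkeys(pages, 0.0)
        for v, targets in forward_links.items():
            for u in targets:
                new_rank[u] += rank[v] / len(targets)
        # d <- ||R_i||_1 - ||R_{i+1}||_1: the rank lost in this round
        d = sum(rank.values()) - sum(new_rank.values())
        # R_{i+1} <- R_{i+1} + d E: hand the lost rank back according to E
        for u in pages:
            new_rank[u] += d * source[u]
        # delta <- ||R_{i+1} - R_i||_1
        delta = sum(abs(new_rank[u] - rank[u]) for u in pages)
        rank = new_rank
        if delta <= epsilon:  # loop while delta > epsilon
            break
    return rank

# Hypothetical web in which C has no forward links: A -> B, A -> C, B -> C
web = {"A": ["B", "C"], "B": ["C"], "C": []}
E = {u: 1.0 / len(web) for u in web}  # assumed uniform E
ranks = pagerank(web, E)
```

On this graph the rank that `C` would otherwise swallow is redistributed through $E$ each round, and the iteration settles at 2/11, 3/11 and 6/11 for `A`, `B` and `C`, with the total rank staying at 1.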