Introduction to PageRank

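PageRank is an important algorithm that Google uses to rank web pages; it determines the importance of a page from the link structure of the Web. Each page's PageRank is a weighted sum of the PageRank values of the pages that link to it. To avoid the "rank sink" and "rank leak" problems during iteration, a decay factor is introduced. The algorithm iterates until it converges, keeping the total rank of all pages constant. PageRank plays a key role in helping search engines quickly make sense of and rank web content.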


PageRank

《The PageRank Citation Ranking: Bringing Order to the Web》
Link: link.

Motivation

In this paper, we take advantage of the link structure of the Web to produce a global “importance” ranking of every web page. This ranking, called PageRank, helps search engines and users quickly make sense of the vast heterogeneity of the World Wide Web.
In short: ranking web pages by importance.

Model

link structure of the web

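Every web page has forward links and back links, i.e. out-edges and in-edges. Once a page has been downloaded we know all of its forward links, but we can never be sure that we have found all of its back links. (Figure source: the paper.)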

PageRank

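$u$ denotes a web page; $F_{u}$ (forward) is the set of pages that $u$ points to, and $B_{u}$ (back) is the set of pages that point to $u$.
$N_{u}=\left|F_{u}\right|$ is the number of pages in $F_{u}$, i.e. the number of forward links of $u$.
$c$ is a normalization factor (it keeps the total rank of all pages constant).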

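Simplified PageRank, where $R$ denotes a page's rank (its importance score):
$$R(u)=c \sum_{v \in B_{u}} \frac{R(v)}{N_{v}}$$
Put simply, the rank of page $u$ is determined by the ranks of the pages that point to it: each such page $v$ splits its own rank evenly among its $N_{v}$ forward links. The figure below gives an intuitive picture.
[Figure]
After every page is initialized, the ranks are updated iteratively until they converge (the paper has a section on convergence).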
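
To make the iteration concrete, here is a minimal sketch in Python. The three-page graph `toy_web`, the function name, and the fixed iteration count are my own assumptions for illustration, not something from the paper. Every page in this toy graph has forward links, so total rank is preserved and $c = 1$ is enough.

```python
def simplified_pagerank(forward_links, iterations=50):
    """Iterate R(u) = sum over v in B_u of R(v) / N_v (taking c = 1)."""
    pages = list(forward_links)
    # Start from a uniform rank that sums to 1.
    rank = {u: 1.0 / len(pages) for u in pages}
    for _ in range(iterations):
        new_rank = {u: 0.0 for u in pages}
        for v, targets in forward_links.items():
            # Page v splits its rank evenly among its N_v forward links.
            for u in targets:
                new_rank[u] += rank[v] / len(targets)
        rank = new_rank
    return rank


# Hypothetical 3-page web: A -> B, A -> C, B -> C, C -> A
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simplified_pagerank(toy_web)
print(ranks)  # approximately A = 0.4, B = 0.2, C = 0.4
```

B receives only half of A's rank, while C receives the other half plus all of B's, which is why C ends up as important as A.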

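However, this model runs into two special cases:

  • A page with only out-links and no in-links: its own rank drops to 0 (rank sink)
  • A page with only in-links and no out-links: it keeps absorbing rank, and the other pages drop to 0 (rank leak)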
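
Both cases show up on a tiny example. The sketch below is my own illustration (the graph `dangling_web` and the helper `iterate_once` are hypothetical names), and it deliberately skips renormalization, i.e. it takes $c = 1$, so that the lost rank is visible.

```python
def iterate_once(rank, forward_links):
    """One step of the simplified update with c = 1 (no renormalization)."""
    new_rank = {u: 0.0 for u in rank}
    for v, targets in forward_links.items():
        for u in targets:
            new_rank[u] += rank[v] / len(targets)
    return new_rank


# Hypothetical web: A -> B, A -> C, B -> C; nothing links to A, C links nowhere.
dangling_web = {"A": ["B", "C"], "B": ["C"], "C": []}
rank = {u: 1.0 / 3 for u in dangling_web}
totals = []
for step in range(4):
    rank = iterate_once(rank, dangling_web)
    totals.append(sum(rank.values()))
    print(step + 1, rank, totals[-1])
```

A has no in-links, so its rank is 0 after the first step. C absorbs rank from A and B but never passes it on, so the total shrinks step by step (2/3, then 1/6, then 0) until every page has rank 0.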

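Either kind of page causes ranks to collapse to 0 as the iteration proceeds. (The paper itself uses "rank sink" for a loop of pages that accumulates rank without distributing it, which is closer to the second case.) The model revised to address this problem is: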

$$R^{\prime}(u)=c \sum_{v \in B_{u}} \frac{R^{\prime}(v)}{N_{v}}+c E(u)$$

“Let $E(u)$ be some vector over the Web pages that corresponds to a source of rank. Then, the PageRank of a set of Web pages is an assignment, $R^{\prime}$, to the Web pages which satisfies”

“where $E(u)$ is some vector over the web pages that corresponds to a source of rank (see Section 6). Note that if $E$ is all positive, $c$ must be reduced to balance the equation. Therefore, this technique corresponds to a decay factor. In matrix notation we have $R^{\prime}=c\left(A R^{\prime}+E\right)$. Since $\left\|R^{\prime}\right\|_{1}=1$, we can rewrite this as $R^{\prime}=c(A+E \times \mathbf{1}) R^{\prime}$ where $\mathbf{1}$ is the vector consisting of all ones. So, $R^{\prime}$ is an eigenvector of $(A+E \times \mathbf{1})$.”
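
The rewriting step in the quote is terse, so here it is spelled out. This is my own reading, under two assumptions: $R^{\prime}$ is non-negative, so its $L_1$ norm is just the sum of its entries, and $E \times \mathbf{1}$ means the outer product $E\,\mathbf{1}^{T}$.

$$
\left\|R^{\prime}\right\|_{1}=\mathbf{1}^{T} R^{\prime}=1
\;\Rightarrow\;
E=E\left(\mathbf{1}^{T} R^{\prime}\right)=\left(E\,\mathbf{1}^{T}\right) R^{\prime}
\;\Rightarrow\;
R^{\prime}=c\left(A R^{\prime}+E\right)=c\left(A+E\,\mathbf{1}^{T}\right) R^{\prime}
$$

Under these assumptions $R^{\prime}$ is an eigenvector of $A+E\,\mathbf{1}^{T}$ with eigenvalue $1/c$.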

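In other words, adding a positive decay term prevents any page's rank from collapsing to 0. The term also has a physical interpretation (though I do not fully understand it):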
“The additional factor $E$ can be viewed as a way of modeling this behavior: the surfer periodically “gets bored” and jumps to a random page chosen based on the distribution in $E$. So far we have left $E$ as a user defined parameter. In most tests we let $E$ be uniform over all web pages with value $\alpha$.”
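
One way to get a feel for the "bored surfer" in the quote is to simulate it. The sketch below is only an illustration under my own assumptions: the toy graph, the function name `random_surfer`, the jump probability of 0.15, a uniform $E$, and the rule that a page with no forward links always triggers a jump are all choices I made for the example.

```python
import random


def random_surfer(forward_links, jump_prob=0.15, steps=100_000, seed=0):
    """Estimate how often a bored random surfer visits each page."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    pages = list(forward_links)
    visits = {u: 0 for u in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        targets = forward_links[current]
        if not targets or rng.random() < jump_prob:
            # "Gets bored": jump to a page drawn from E (uniform here).
            current = rng.choice(pages)
        else:
            # Otherwise follow one of the forward links at random.
            current = rng.choice(targets)
    return {u: n / steps for u, n in visits.items()}


# Hypothetical web: A -> B, A -> C, B -> C, C -> A, D -> A (nothing links to D)
surfer_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A"]}
freq = random_surfer(surfer_web)
print(freq)
```

In the simplified model D would end up with rank 0 because nothing links to it. Here it still receives a small share of visits through the random jumps, which is the role $E$ plays as a "source of rank".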

computing PageRank

Algorithm:
$$
\begin{array}{l}
R_{0} \leftarrow S \\
\text{loop:} \\
\quad R_{i+1} \leftarrow A R_{i} \\
\quad d \leftarrow \left\|R_{i}\right\|_{1}-\left\|R_{i+1}\right\|_{1} \\
\quad R_{i+1} \leftarrow R_{i+1}+d E \\
\quad \delta \leftarrow \left\|R_{i+1}-R_{i}\right\|_{1} \\
\text{while } \delta>\epsilon
\end{array}
$$
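
The loop above can be sketched in a few lines of Python. This is a minimal sketch rather than the paper's implementation, and it rests on several assumptions of mine: applying $A$ means "sum $R(v)/N_{v}$ over the back links of each page", $E$ is uniform and sums to 1 so that the total rank stays constant, and the toy graph, the function name `pagerank`, and the `max_iter` safety cap are hypothetical.

```python
def pagerank(forward_links, E, S, eps=1e-10, max_iter=1000):
    """Iterate R <- A R + d E until the L1 change is at most eps."""
    pages = list(forward_links)
    R = dict(S)  # R_0 <- S
    for _ in range(max_iter):
        # R_{i+1} <- A R_i: each page passes rank along its forward links.
        R_next = {u: 0.0 for u in pages}
        for v, targets in forward_links.items():
            for u in targets:
                R_next[u] += R[v] / len(targets)
        # d <- ||R_i||_1 - ||R_{i+1}||_1: the rank lost in this step.
        d = sum(abs(x) for x in R.values()) - sum(abs(x) for x in R_next.values())
        # R_{i+1} <- R_{i+1} + d E: hand the lost rank back according to E.
        for u in pages:
            R_next[u] += d * E[u]
        # delta <- ||R_{i+1} - R_i||_1
        delta = sum(abs(R_next[u] - R[u]) for u in pages)
        R = R_next
        if delta <= eps:  # i.e. keep looping while delta > eps
            break
    return R


# Hypothetical web where C has no forward links, so rank leaks in every step.
web = {"A": ["B", "C"], "B": ["C"], "C": []}
uniform = {u: 1.0 / len(web) for u in web}
R = pagerank(web, E=uniform, S=uniform)
print(R)  # approximately A = 2/11, B = 3/11, C = 6/11
```

On this graph the plain simplified update would drain every rank to 0, whereas here the lost rank $d$ is redistributed through $E$ and the total stays at 1.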
