An Introduction to PageRank

Latest recommended article published on 2024-04-10 22:56:59

cicipupu

License: CC 4.0 BY-SA

Category: Classic Papers. Tags: natural language processing, machine learning

Article link: https://blog.youkuaiyun.com/cddddduck/article/details/120026916

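PageRank is an important algorithm that Google uses to rank web pages; it determines a page's importance from the link structure of the web. It takes the links between pages into account: the PageRank of a page is a weighted sum of the PageRank values of the pages that link to it. To avoid the "rank sink" and "rank leak" problems during iteration, a decay factor is introduced. The algorithm iterates until convergence while keeping the total rank over all pages constant. PageRank helps search engines quickly make sense of and rank web content.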


PageRank

"The PageRank Citation Ranking: Bringing Order to the Web"
Link: link.

Motivation

In this paper, we take advantage of the link structure of the Web to produce a global “importance” ranking of every web page. This ranking, called PageRank, helps search engines and users quickly make sense of the vast heterogeneity of the World Wide Web.
In short, the goal is to rank web pages by importance.

Model

link structure of the web

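Every web page has forward links and back links, that is, out-edges and in-edges. Once a page has been downloaded we know all of its forward links, but we can never be sure we have found all of its back links. (Figure from the paper.)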

PageRank

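Let $u$ be a web page. $F_{u}$ (forward) is the set of pages that $u$ points to, and $B_{u}$ (back) is the set of pages that point to $u$.
$N_{u}=\left|F_{u}\right|$ is the number of forward links of $u$.
$c$ is a normalization factor, chosen so that the total rank of all web pages stays constant.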
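The notation above can be made concrete on a tiny graph. This is a minimal sketch, assuming the web is given as a dictionary of forward links; the three-page web and the helper name `back_links` are made up for illustration and do not come from the paper.

```python
# Hypothetical toy web: each key is a page u, each value is F_u (the pages u links to).
forward_links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}


def back_links(forward):
    """Invert the forward-link map so that B[u] lists the pages that link to u."""
    back = {page: [] for page in forward}
    for v, targets in forward.items():
        for u in targets:
            back.setdefault(u, []).append(v)
    return back


F = forward_links
B = back_links(F)
N = {u: len(F[u]) for u in F}  # N_u = |F_u|

print(B)
print(N)
```

Note that building `B` requires scanning every page's forward links, which mirrors the observation that back links are only known once the whole graph has been crawled.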

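In the simplified version of PageRank, $R$ denotes the rank (importance score) of a page:
$R(u)=c \sum_{v \in B_{u}} \frac{R(v)}{N_{v}}$
Put simply, the rank of page $u$ is determined by the ranks of the pages that point to it: each page $v$ splits its own rank evenly among its $N_{v}$ forward links.
[Figure: illustration of the simplified PageRank calculation]
Every page is given an initial rank, and the update is iterated until convergence (the paper has a section on convergence).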
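The simplified update can be sketched in a few lines. This is an illustrative sketch rather than the paper's implementation: the toy graph, the function name `simple_pagerank`, the uniform initialization, and the fixed iteration count are all assumptions, and the factor $c$ is realized by rescaling the ranks so they sum to 1 after each step.

```python
def simple_pagerank(forward, iterations=100):
    """Iterate R(u) = c * sum over v in B_u of R(v) / N_v on a forward-link dict."""
    pages = list(forward)
    rank = {u: 1.0 / len(pages) for u in pages}  # assumed uniform starting ranks
    for _ in range(iterations):
        new_rank = {u: 0.0 for u in pages}
        for v, targets in forward.items():
            for u in targets:
                # Page v hands an equal share of its rank to each page it links to.
                new_rank[u] += rank[v] / len(targets)
        total = sum(new_rank.values())
        # c is chosen so that the total rank stays 1 (assumes total > 0).
        rank = {u: r / total for u, r in new_rank.items()}
    return rank


# Hypothetical toy web: a -> b, a -> c, b -> c, c -> a
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
toy_rank = simple_pagerank(toy_web)
print(toy_rank)
```

On this graph the fixed point can be checked by hand: $R(b)=R(a)/2$ and $R(a)=R(c)$, so the ranks settle at 0.4, 0.2, and 0.4 for a, b, and c.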

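However, this model breaks down in two special cases:

A page with forward links but no back links: nothing passes rank to it, so its own rank drops to 0.
A page, or a closed loop of pages, with back links but no forward links to the rest of the web: it keeps absorbing rank and never passes any on, so the ranks of the other pages drop to 0. The paper calls such a loop a rank sink; a single page with no forward links is also described as causing a rank leak, because the rank it receives leaves the system.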
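Both failure modes listed above are easy to reproduce with the unnormalized update ($c=1$). The two small graphs and the function names below are hypothetical, chosen only to isolate each effect.

```python
def propagate(forward, rank):
    """One step of R(u) <- sum over v in B_u of R(v) / N_v, without normalization."""
    new_rank = {u: 0.0 for u in forward}
    for v, targets in forward.items():
        for u in targets:
            new_rank[u] += rank[v] / len(targets)
    return new_rank


def iterate(forward, steps):
    """Start from uniform ranks (an assumption) and apply the update `steps` times."""
    rank = {u: 1.0 / len(forward) for u in forward}
    for _ in range(steps):
        rank = propagate(forward, rank)
    return rank


# Loop c <-> d is fed by a -> b -> c but never links back out: it absorbs all rank.
sink_web = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["c"]}
sink_rank = iterate(sink_web, 10)

# Page b has no forward links: the rank it receives is lost from the system.
leak_web = {"a": ["b"], "b": []}
leak_rank = iterate(leak_web, 10)

print(sink_rank)
print(sum(leak_rank.values()))
```

In the first graph pages a and b end with rank 0 while the loop holds everything; in the second the total rank itself shrinks to 0.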

Both cases drive page ranks to 0 as the iteration proceeds. To address this problem, the model is refined to

$R^{\prime}(u)=c \sum_{v \in B_{u}} \frac{R^{\prime}(v)}{N_{v}}+c E(u)$

“Let $E (u)$ be some vector over the Web pages that corresponds to a source of rank. Then, the PageRank of a set of Web pages is an assignment, $R^{\prime}$ , to the Web pages which satisfies”

“where $E (u)$ is some vector over the web pages that corresponds to a source of rank (see Section 6 ). Note that if $E$ is all positive, $c$ must be reduced to balance the equation. Therefore, this technique corresponds to a decay factor. In matrix notation we have $R^{\prime}=c\left(A R^{\prime}+E\right)$. Since $\left\|R^{\prime}\right\|_{1}=1$, we can rewrite this as $R^{\prime}=c(A+E \times \mathbf{1}) R^{\prime}$ where $\mathbf{1}$ is the vector consisting of all ones. So, $R^{\prime}$ is an eigenvector of $(A+E \times \mathbf{1})$.”
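The rewrite in the quoted passage is terse, so here is a short derivation of it. It assumes that $R^{\prime}$ has non-negative entries, that $\mathbf{1}$ is read as a row vector of ones, and that $E \times \mathbf{1}$ is the outer product, i.e. a matrix whose every column equals $E$; these readings are my interpretation of the notation.

```latex
\begin{aligned}
\mathbf{1} R^{\prime} &= \sum_{u} R^{\prime}(u) = \left\|R^{\prime}\right\|_{1} = 1 \\
E &= E\left(\mathbf{1} R^{\prime}\right) = (E \times \mathbf{1}) R^{\prime} \\
R^{\prime} &= c\left(A R^{\prime}+E\right)
            = c\left(A R^{\prime}+(E \times \mathbf{1}) R^{\prime}\right)
            = c(A+E \times \mathbf{1}) R^{\prime}
\end{aligned}
```

Under these assumptions $R^{\prime}$ is an eigenvector of $(A+E \times \mathbf{1})$ with eigenvalue $1/c$.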

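In other words, adding a positive source term $E$, which acts as a decay factor, keeps page ranks from collapsing to 0. The term also has a physical interpretation (though I do not fully understand it):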
“The additional factor $E$ can be viewed as a way of modeling this behavior: the surfer periodically “gets bored” and jumps to a random page chosen based on the distribution in $E$ . So far we have left $E$ as a user defined parameter. In most tests we let $E$ be uniform over all web pages with value $\alpha$ . ”
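One way to build intuition for this random-surfer reading is to simulate it. The sketch below is only an illustration under stated assumptions: the jump probability `jump_prob=0.15`, the toy graph, the fixed seed, and the rule that the surfer also jumps when a page has no forward links are all my choices, not values given in the text above.

```python
import random


def random_surfer(forward, jump_prob=0.15, steps=100_000, seed=0):
    """Estimate page importance as the fraction of time a random surfer spends on each page."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    pages = list(forward)
    visits = {u: 0 for u in pages}
    current = pages[0]
    for _ in range(steps):
        visits[current] += 1
        targets = forward[current]
        if not targets or rng.random() < jump_prob:
            # The surfer "gets bored" and jumps; E is assumed uniform over all pages.
            current = rng.choice(pages)
        else:
            # Otherwise follow one of the current page's forward links at random.
            current = rng.choice(targets)
    return {u: count / steps for u, count in visits.items()}


# Hypothetical toy web: a -> b, a -> c, b -> c, c -> a
surfer_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
visit_freq = random_surfer(surfer_web)
print(visit_freq)
```

Page b, which receives only half of page a's links, is visited least often, matching the ordering the rank equation gives for this graph.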

computing PageRank

Algorithm:
$\begin{array}{l} R_{0} \leftarrow S\\ \text { loop: }\\ \begin{aligned} R_{i+1} & \leftarrow A R_{i} \\ d & \leftarrow\left\|R_{i}\right\|_{1}-\left\|R_{i+1}\right\|_{1} \\ R_{i+1} & \leftarrow R_{i+1}+d E \\ \delta & \leftarrow\left\|R_{i+1}-R_{i}\right\|_{1} \\ \text { while } \delta>\epsilon & \end{aligned} \end{array}$
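The loop above translates almost line by line into code. This is a sketch under assumptions, not the authors' implementation: $S$ and $E$ are both taken to be uniform with entries summing to 1, the graph is a plain dictionary of forward links, and `max_iter` is an extra safety cap that the pseudocode does not have.

```python
def pagerank(forward, source=None, eps=1e-10, max_iter=1000):
    """Follow the loop: R <- A R, d <- lost rank, R <- R + d E, repeat while delta > eps."""
    pages = list(forward)
    n = len(pages)
    if source is None:
        source = {u: 1.0 / n for u in pages}  # E: assumed uniform, sums to 1
    rank = {u: 1.0 / n for u in pages}  # R_0 <- S, assumed uniform
    for _ in range(max_iter):
        # R_{i+1} <- A R_i: each page splits its rank among its forward links.
        new_rank = {u: 0.0 for u in pages}
        for v, targets in forward.items():
            for u in targets:
                new_rank[u] += rank[v] / len(targets)
        # d <- ||R_i||_1 - ||R_{i+1}||_1 (ranks are non-negative, so the norm is a sum).
        d = sum(rank.values()) - sum(new_rank.values())
        # R_{i+1} <- R_{i+1} + d E: re-inject the lost rank according to E.
        for u in pages:
            new_rank[u] += d * source[u]
        # delta <- ||R_{i+1} - R_i||_1
        delta = sum(abs(new_rank[u] - rank[u]) for u in pages)
        rank = new_rank
        if delta <= eps:
            break
    return rank


# Hypothetical toy web in which page c has no forward links.
dangling_web = {"a": ["b", "c"], "b": ["c"], "c": []}
final_rank = pagerank(dangling_web)
print(final_rank)
```

On this graph the rank that page c would otherwise lose is spread back over all pages in each step, so the total stays at 1; solving the fixed point by hand gives 2/11, 3/11, and 6/11 for a, b, and c.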