PageRank Spark implementation

This post walks through implementing the PageRank algorithm with Apache Spark, covering the idea behind the algorithm and its implementation details. Scala and Python example code shows how to compute page ranks from a link matrix.

As you know, PageRank is a very famous algorithm. For details of the PageRank definition and implementation, see https://en.wikipedia.org/wiki/PageRank

There are many implementations, and I have written programs implementing it before. This time I try Spark.


The basic idea is as follows: give pages ranks (or scores) based on the links pointing to them.

- Links from many pages -> high rank
- Links from high-rank pages -> high rank


Algorithm:

1. Start each page at a rank of 1.
2. On each iteration, have each page p contribute rank_of_p / |neighbors_of_p| to each of its link targets.
3. Set each page's rank to 0.15 + 0.85 * (the sum of contributions it received).
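To make the loop concrete before the Spark versions, here is a minimal pure-Python sketch of these three steps on a small hypothetical graph (the toy graph and names are illustrative, not from the original post):

# Minimal, non-distributed sketch of the algorithm above (toy graph, illustrative only).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

# step 1: start each page at rank 1.0
ranks = {page: 1.0 for page in links}

for _ in range(10):
    # step 2: each page splits its rank evenly among its link targets
    contribs = {page: 0.0 for page in links}
    for page, neighbors in links.items():
        for dest in neighbors:
            contribs[dest] += ranks[page] / len(neighbors)
    # step 3: damping, using this post's variant 0.15 + 0.85 * contribution
    ranks = {page: 0.15 + 0.85 * c for page, c in contribs.items()}

print(ranks)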


The Spark program turns out to be short and simple.



Scala code:

import org.apache.spark.{SparkConf, SparkContext}

object PageRank {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PageRank")
    val sc = new SparkContext(conf)

    // load the edge list ("link matrix"): one "src dst" pair per line
    val mat = sc.textFile("/home/will/myspace/mydev/mytest/sparktest/scalaspark/data/mat2.txt")

    // build the adjacency list of every page node: (page, its link targets)
    val links = mat.map { line =>
      val parts = line.split("\\s+")
      (parts(0), parts(1))
    }.distinct().groupByKey().cache()

    // initialize each page node's rank to 1.0
    var ranks = links.mapValues(_ => 1.0)

    // set the iteration count to 10
    val ITERATIONS = 10

    // compute the page rank of each page node
    for (_ <- 0 until ITERATIONS) {
      val contributions = links.join(ranks).flatMap {
        case (_, (pageLinks, rank)) =>
          pageLinks.map(dest => (dest, rank / pageLinks.size))
      }
      ranks = contributions.reduceByKey(_ + _).mapValues(v => 0.15 + 0.85 * v)
    }

    // print the first 10 ranks
    ranks.take(10).foreach(println)

    sc.stop()
  }
}

Python code:


# In[1]:

filename1 = "/home/will/myspace/mydev/mytest/sparktest/scalaspark/data/mat2.txt"
# load the edge list ("link matrix") with 4 partitions
mat = sc.textFile(filename1, 4)


# In[3]:

# the matrix is small, so it is safe to collect it to the driver
mat.collect()  # returns the raw lines, e.g. ['0 1', '1 2', '1 2', ...]


# In[4]:

import re

def parseNeighbors(urls):
    """Parse a 'src dst' line into a (src, dst) tuple."""
    parts = re.split(r'\s+', urls)
    return parts[0], parts[1]
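A quick sanity check of the parser (the sample line is taken from the input file at the end of this post):

print(parseNeighbors("0 1"))   # ('0', '1')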


# In[5]:

def computeContribs(urls, rank):
    """Calculates URL contributions to the rank of other URLs."""
    num_urls = len(urls)
    for url in urls:
        yield (url, rank / num_urls)
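For instance, a page with rank 1.0 and two link targets contributes half of its rank to each:

print(list(computeContribs(["2", "3"], 1.0)))   # [('2', 0.5), ('3', 0.5)]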


# In[6]:

# get the links of every page node
links = mat.map(lambda urls: parseNeighbors(urls)).distinct().groupByKey().cache()

# the alternative ways below to build links led to an out-of-memory error for me:

#links = mat.map(lambda line: line.split(" ")).map(lambda l: (l[0], l[1])).distinct().groupByKey().cache()#.mapValues(list).collect()

#links = mat.map(lambda line: (line.split(" ")[0], line.split(" ")[1])).distinct().groupByKey().cache()
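If you want to eyeball a few adjacency entries without collecting everything, converting the grouped values (PySpark ResultIterable objects) to lists first keeps the output readable:

for page, neighbors in links.mapValues(list).take(3):
    print(page, neighbors)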


# In[7]:

links.count()


# In[8]:

# initialize the rank of every URL to 1.0
ranks = links.mapValues(lambda neighbors: 1.0)


# In[12]:

# compute pagerank
from operator import add

ITERATIONS = 10

# Calculate and update URL ranks iteratively using the PageRank algorithm.
for iteration in range(ITERATIONS):
    # Calculate URL contributions to the ranks of other URLs.
    contribs = links.join(ranks).flatMap(
        lambda url_urls_rank: computeContribs(url_urls_rank[1][0], url_urls_rank[1][1]))
    # Re-calculate URL ranks based on neighbor contributions.
    ranks = contribs.reduceByKey(add).mapValues(lambda rank: rank * 0.85 + 0.15)


# In[13]:

# count() is an action: it triggers the ten lazily built iterations above
ranks.count()


# In[14]:

# print the result
for (link, rank) in ranks.collect():
    print("%s has rank: %s." % (link, rank))
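The collected pairs come back in no particular order; to list the highest-ranked pages first, one could sort by rank (a small optional variant):

for link, rank in ranks.sortBy(lambda kv: -kv[1]).collect():
    print("%s has rank: %s." % (link, rank))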


The output of a run is below:

0 has rank: 0.772702281464.
6 has rank: 0.56251510134.
1 has rank: 1.72864431597.
7 has rank: 0.56251510134.
2 has rank: 1.14027517155.
8 has rank: 0.59949206817.
3 has rank: 0.970068542695.
9 has rank: 1.45593564966.
4 has rank: 1.23778322511.
5 has rank: 0.970068542695.
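A quick sanity check on these numbers: every node in this graph has at least one incoming and one outgoing link, so each iteration preserves the total rank mass, and the ten ranks sum to 10.0:

print(sum(rank for _, rank in ranks.collect()))   # 10.0 for this graph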

The input file of graph edges looks like below (one "src dst" pair per line; note that duplicate edges such as "1 2" appear and are removed by the distinct() call):

0 1
1 2
1 2
1 3
1 3
1 4
2 3
3 0
4 0
4 2
5 1
1 5
6 4
4 5
4 3
2 4
2 5
7 8
8 1
4 8
9 2
2 9
3 9
5 9
7 9
9 6
9 7



