Spark User-guide Summary - Basic Programming

This note summarizes how to use PySpark for big data processing, covering installation and configuration, basic operations, the DataFrame data abstraction, a MapReduce implementation, and running on a cluster.

Use the pip command to install PySpark: pip install pyspark

Simple operations in the Spark shell
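
The interactive shell is started with the pyspark script that ships with Spark (a pyspark launcher is also installed by pip); the path below uses the same placeholder as the rest of these notes:

$ YOUR_SPARK_HOME/bin/pyspark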

Data abstraction: read a file into a Spark DataFrame

textFile = spark.read.text("README.md")
textFile.count()  # Number of rows in this DataFrame
textFile.first()  # First row in this DataFrame

linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
# filter transforms the DataFrame into a new one containing only the lines with "Spark"
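
Transformations and actions can be chained; for example, to count how many lines contain "Spark":

textFile.filter(textFile.value.contains("Spark")).count()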

from pyspark.sql.functions import size, split, max, col

textFile.select(size(split(textFile.value, "\s+")).name("numWords")) \
    .agg(max(col("numWords"))).collect()
# select takes a Column expression and creates a new column "numWords" (words per line)
# agg(max(...)) finds the largest value of "numWords", i.e. the line with the most words

Implementing MapReduce:

from pyspark.sql.functions import explode, split

wordCounts = textFile.select(explode(split(textFile.value, "\s+")).alias("word")) \
    .groupBy("word").count()
# explode turns each line into one row per word
# groupBy("word").count() counts the occurrences of each word
wordCounts.collect()

Hold the data in memory by using cache():

linesWithSpark.cache()
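
Later actions on the cached DataFrame are then served from memory; for example:

linesWithSpark.count()  # the first action materializes the cache
linesWithSpark.count()  # repeated actions reuse the in-memory data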

Use a Python file directly

To build a packaged PySpark application or library, first add pyspark as a dependency (install_requires) in setup.py, as in the sketch below.

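A minimal setup.py sketch, assuming the application is packaged as a single module; the project name and pinned version are illustrative:

from setuptools import setup

setup(
    name="simple-app",                    # hypothetical project name
    py_modules=["SimpleApp"],
    install_requires=["pyspark==3.5.0"],  # assumption: pin to the Spark version you run against
)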

Write an application with PySpark, named SimpleApp.py:

"""SimpleApp.py"""
from pyspark.sql import SparkSession
logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
spark = SparkSession.builder().appName(appName).master(master).getOrCreate()
logData = spark.read.text(logFile).cache()
spark.stop()

To run this app, use $ YOUR_SPARK_HOME/bin/spark-submit --master local[4] SimpleApp.py,
or, if PySpark is installed as a pip package in your environment, simply run python SimpleApp.py.

Run Spark on a Cluster

To access a cluster, create a SparkContext object, which is initialized from a SparkConf:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName(appName).setMaster(master)  # appName and master are placeholders you supply
sc = SparkContext(conf=conf)

In the PySpark shell, a SparkContext object named sc already exists.
Example: create a parallelized collection holding the numbers 1 to 5:

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)

Then we can call distData.reduce(lambda a, b: a + b) as a reduce operation to sum the elements.

To load an external dataset and apply MapReduce:

distFile = sc.textFile("data.txt")
distFile.map(lambda s: len(s)).reduce(lambda a, b: a + b)

or pass a named function to map:

def func(s):
    return len(s)

distFile.map(func).reduce(lambda a, b: a + b)

To save data in other file formats, see http://spark.apache.org/docs/...
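
As a minimal sketch (the output directory name is illustrative), an RDD can also be written back out as plain text files with saveAsTextFile:

distFile.map(lambda s: len(s)).saveAsTextFile("line_lengths_out")  # writes one text part-file per partition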

To print the data on a single machine: rdd.foreach(print)
On a cluster, rdd.foreach(print) prints to each executor's stdout, not to the driver, so it cannot be used to inspect the data from the driver. rdd.collect() followed by printing brings all the data back to the driver and can run out of memory. The safe way is to print only a sample:

for line in rdd.take(100):
    print(line)

Key-value pairs

To count how many times each line occurs:

lines = sc.textFile("data.txt")
pairs = lines.map(lambda s: (s, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

Operations like reduceByKey and sortByKey only work on RDDs of key-value pairs.
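
For example, the line counts above can be sorted by key and collected to the driver:

counts.sortByKey().collect()  # (line, count) pairs sorted by line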

Shared variables

Create a broadcast variable:

broadcastVar = sc.broadcast([1, 2, 3])
broadcastVar.value
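
Tasks read a broadcast variable through its .value attribute; a minimal usage sketch:

sc.parallelize([0, 1, 2]).map(lambda i: broadcastVar.value[i]).collect()  # returns [1, 2, 3]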

Create an accumulator variable:

accum = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
# accum.value is 10

Note: sc.parallelize() and sc.textFile() serve the same purpose, creating an RDD from in-memory Python data and from an external dataset, respectively.
