PySpark and HBase

This article explains how to launch PySpark via the Spark shell and Jupyter, how to set default configuration, and how to read and write HBase data from PySpark, covering configuration parameters, data conversion, and RDD processing.


Getting Started with PySpark

1 Launching PySpark

There are two ways to use PySpark with Jupyter: launching the Spark shell with Jupyter as the driver, or creating a SparkContext from within Jupyter/Python itself.

1.1. Spark shell

Edit /etc/profile and add the following lines, which tell Spark to use Jupyter as the driver when launching the PySpark shell:

export PYTHONPATH=${SPARK_HOME}/python/:$PYTHONPATH
export PYTHONPATH=${SPARK_HOME}/python/lib/py4j-0.10.7-src.zip:${PYTHONPATH}
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=0.0.0.0 --allow-root"

Start in YARN mode with the following command:

pyspark --master yarn --deploy-mode client --name <app-name>
Example: pyspark --master yarn --deploy-mode client --name test --executor-memory 7g --num-executors 4

1.2. Jupyter

Inside Jupyter or a plain Python shell, you can import the pyspark module directly and create the SparkContext yourself.
Local mode:

import pyspark
conf = pyspark.SparkConf()
conf.set('spark.app.name', 'test')
conf.set('spark.master', 'local')
conf.set('spark.driver.memory','4g')
conf.set('spark.files','/bigdata/usr/gdal-2.3.3/.libs/libgdal.so.20')
sc = pyspark.SparkContext(conf = conf)
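
A quick way to confirm the context works is a small smoke test (a minimal sketch, not part of the original example):

# Smoke test: distribute a small range and count it on the local executor.
rdd = sc.parallelize(range(100))
print(rdd.count())   # expected: 100

# Stop the context when finished so a new one can be created later
# with a different configuration.
sc.stop()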

yarn-client mode:

import pyspark
conf = pyspark.SparkConf()
conf.set('spark.app.name', 'test')
conf.set('spark.master','yarn')
conf.set('spark.driver.memory','4g')
conf.set('spark.submit.deployMode','client')
conf.set('spark.files','/bigdata/usr/gdal-2.3.3/.libs/libgdal.so.20')
# conf.set('spark.executor.extraClassPath', '/path/to/jars')
conf.set('spark.executor.memory','6g')
conf.set('spark.executor.cores','3')
conf.set('spark.executor.instances','4')
sc = pyspark.SparkContext(conf = conf)
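
Only one SparkContext can be active at a time, so stop any existing context before creating a new one, or use getOrCreate (a small sketch, reusing the conf object above):

# Reuse an existing context if there is one, otherwise create it from conf.
sc = pyspark.SparkContext.getOrCreate(conf=conf)

# To switch configurations (e.g. from local to yarn), stop the old context first:
# sc.stop()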

1.3. Setting default configuration

Alternatively, set the defaults in spark-defaults.conf, for example:

spark.master                    yarn
spark.driver.memory             4g
spark.submit.deployMode         client
spark.files                     /bigdata/usr/gdal-2.3.3/.libs/libgdal.so.20
spark.executor.memory           6g
spark.executor.cores            3
spark.executor.instances        4

With these defaults in place, starting Spark simplifies to:

import pyspark
sc = pyspark.SparkContext(appName='hbase')
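
To verify that the defaults from spark-defaults.conf were actually picked up, you can inspect the effective configuration of the running context:

# Print the effective configuration of the running SparkContext.
for key, value in sorted(sc.getConf().getAll()):
    print(key, '=', value)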

One line in the SparkConf examples above,

conf.set('spark.files','/bigdata/usr/gdal-2.3.3/.libs/libgdal.so.20')

is there to ship the GDAL shared library to the executors.
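
Files distributed via spark.files are copied to each executor's working directory and can be located with SparkFiles; the sketch below is only an illustration of using the shipped library (the ctypes load assumes GDAL's other dependencies are already present on the executors):

from pyspark import SparkFiles
import ctypes

def uses_gdal(_):
    # Resolve the local copy of the distributed file on this executor
    # and try to load it; returns True if the shared library loads.
    path = SparkFiles.get('libgdal.so.20')
    ctypes.CDLL(path)
    return True

# Run the check on a few partitions.
print(sc.parallelize(range(4), 4).map(uses_gdal).collect())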

2 HBase and PySpark

Reading data

host = '10.2.3.58'
table = 'student'
conf1 = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table}
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=conf1)
count = hbase_rdd.count()
hbase_rdd.cache()
output = hbase_rdd.collect()
for (k, v) in output:
    print(k, v)
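
The value returned for each row key is the string produced by HBaseResultToStringConverter; in the stock Spark hbase_inputformat.py example it contains one JSON document per cell, separated by newlines. A minimal parsing sketch under that assumption:

import json

# Split each row's value into individual cells and parse the JSON
# emitted by HBaseResultToStringConverter (one JSON object per cell).
cells_rdd = (hbase_rdd
             .flatMapValues(lambda v: v.split("\n"))
             .mapValues(json.loads))

for rowkey, cell in cells_rdd.collect():
    # cell is expected to be a dict with keys such as 'columnFamily',
    # 'qualifier' and 'value' (assumption based on the Spark example converter).
    print(rowkey, cell)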

Writing data

host = '10.2.3.58'
table = 'student'
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
conf={"hbase.zookeeper.quorum": host, "hbase.mapred.outputtable": table,
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable",
            "mapreduce.output.fileoutputformat.outputdir": "/tmp"}

rawData = ['3,info,name,Rongcheng','4,info,name,Guanhua']
# (rowkey, [row key, column family, column name, value])
print('Preparing to write data')

sc.parallelize(rawData) \
    .map(lambda x: (x[0], x.split(','))) \
    .saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)
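
Each element passed to saveAsNewAPIHadoopDataset must be a (rowkey, [rowkey, column family, qualifier, value]) pair, which StringListToPutConverter turns into an HBase Put. A small sketch for building those pairs from structured Python data (the students dict and its columns are made up for illustration; the 'info' column family comes from the example above):

# Hypothetical source data: rowkey -> {qualifier: value}
students = {
    '5': {'name': 'Xueqian'},
    '6': {'name': 'Weiliang'},
}

def to_puts(item):
    # Expand one row into (rowkey, [rowkey, family, qualifier, value]) pairs.
    rowkey, columns = item
    return [(rowkey, [rowkey, 'info', qualifier, value])
            for qualifier, value in columns.items()]

write_rdd = sc.parallelize(list(students.items())).flatMap(to_puts)
write_rdd.saveAsNewAPIHadoopDataset(conf=conf,
                                    keyConverter=keyConv,
                                    valueConverter=valueConv)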

Parts of the HBase code are adapted from examples found online.
