PysparkNote005---Batch Writing to Redis

Intro

  For batch writes to Redis, there are two main points to consider:

  • splitting the data into chunks: Spark's foreachPartition
  • batch-inserting each chunk into Redis: committing through a Redis pipeline (a standalone sketch of the pipeline idea follows this list)
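The pipeline is what makes the per-partition writes cheap: commands are buffered client-side and flushed to the server in a single round trip. A minimal sketch of just that idea, assuming a local Redis at 127.0.0.1:6379 with no password (the demo_ keys are only for illustration):

import redis

r = redis.Redis(host="127.0.0.1", port=6379, db=0, decode_responses=True)
pipe = r.pipeline()
for i in range(10):
    pipe.set(f"demo_{i}", i)  # buffered client-side, nothing sent yet
pipe.execute()                # one round trip flushes all 10 commands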

Code

Straight to the code.

import redis
import json
import random
import pandas as pd
import functools
# test data: 100 keys test_000 ... test_099, each mapped to a small JSON string
key_list = [f"test_{str(i).zfill(3)}" for i in range(100)]
value_list = [json.dumps({"name": "jack", "sex": "male"})] * 100
df = pd.DataFrame({"key": key_list, "value": value_list})
df.head()
        key                            value
0  test_000  {"name": "jack", "sex": "male"}
1  test_001  {"name": "jack", "sex": "male"}
2  test_002  {"name": "jack", "sex": "male"}
3  test_003  {"name": "jack", "sex": "male"}
4  test_004  {"name": "jack", "sex": "male"}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(df)
spark_df.show()
+--------+--------------------+
|     key|               value|
+--------+--------------------+
|test_000|{"name": "jack", ...|
|test_001|{"name": "jack", ...|
|test_002|{"name": "jack", ...|
|test_003|{"name": "jack", ...|
|test_004|{"name": "jack", ...|
|test_005|{"name": "jack", ...|
|test_006|{"name": "jack", ...|
|test_007|{"name": "jack", ...|
|test_008|{"name": "jack", ...|
|test_009|{"name": "jack", ...|
|test_010|{"name": "jack", ...|
|test_011|{"name": "jack", ...|
|test_012|{"name": "jack", ...|
|test_013|{"name": "jack", ...|
|test_014|{"name": "jack", ...|
|test_015|{"name": "jack", ...|
|test_016|{"name": "jack", ...|
|test_017|{"name": "jack", ...|
|test_018|{"name": "jack", ...|
|test_019|{"name": "jack", ...|
+--------+--------------------+
only showing top 20 rows
def insert2redis(part, batch=50, expire_time=60):
    """
    @param part: an RDD partition iterator; each row has two fields, key and value
    @param batch: number of rows per pipeline commit
    @param expire_time: key TTL in seconds
    @return:
    """
    db_param = {"host": '127.0.0.1', "port": 6379, "password": '12345', "db": 0}

    db = redis.Redis(host=db_param["host"],
                     port=db_param["port"],
                     password=db_param["password"],
                     db=db_param["db"],
                     encoding='utf-8',
                     decode_responses=True)
    pipe = db.pipeline()
    cnt = 0
    for row in part:
        pipe.hset(name=row["key"], mapping=json.loads(row["value"]))  # write the value as a hash (dict)
        pipe.expire(name=row["key"], time=expire_time + random.randint(0, 5))  # randomize the TTL slightly so keys don't all expire at once
        cnt = cnt + 1
        if cnt > 0 and cnt % batch == 0:
            pipe.execute()
            print(f"rows {cnt - batch}-{cnt} inserted into redis!")
    # if the last chunk is smaller than a full batch, push it too
    if cnt % batch != 0:
        pipe.execute()
        print(f"rows {cnt - cnt % batch}-{cnt} inserted into redis!")
    pipe.close()
    db.close()

# three partitions: each executor task opens its own Redis connection and writes its chunk
spark_df.repartition(3).rdd.foreachPartition(
        functools.partial(insert2redis, batch=100, expire_time=60))
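
If the job ran cleanly, the keys should be readable back from Redis. A quick sanity check, reusing the same connection parameters as insert2redis (adjust to your environment):

db = redis.Redis(host="127.0.0.1", port=6379, password="12345",
                 db=0, decode_responses=True)
print(db.hgetall("test_000"))  # expected: {'name': 'jack', 'sex': 'male'}
print(db.ttl("test_000"))      # remaining TTL in seconds, roughly 60-65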


                                2022-04-24, Jiulonghu, Jiangning District, Nanjing
