1. Project Introduction
As the retail industry's digital transformation deepens, supermarkets have accumulated massive volumes of sales transaction data. According to the 2022 China Retail Digitalization Development Report released by the China Chain Store & Franchise Association, more than 78% of large chain supermarkets have deployed complete sales data collection systems and generate transaction data on the order of terabytes per day, yet only 23% of these companies make effective use of the data for in-depth analysis and decision support. Traditional data processing methods struggle at this scale: processing is slow and the available analysis dimensions are limited. At the same time, consumer shopping behavior is growing more complex and market competition keeps intensifying, so supermarket operators urgently need a system that can process large-scale data and provide multi-dimensional analysis, allowing them to adjust business strategy in time. Against this background, this Spark-based supermarket sales data statistical analysis system was built. It leverages the distributed computing power of Hadoop and Spark, combined with Python and the Django framework, to address the efficient processing and in-depth analysis of large-scale sales data.
Building a Spark-based supermarket sales data statistical analysis system has clear practical value. For supermarket operators, the system analyzes product sales, time dimensions, promotion effectiveness, customer behavior, and product associations, helping management grasp sales patterns precisely, optimize inventory, improve shelf-space utilization, and reduce the capital tied up in slow-moving goods, ultimately raising overall operating efficiency and profitability. Industry practice suggests that retailers applying big data analytics raise gross margins by roughly 2.5% to 4% on average. On the technical side, the system combines big data technologies such as Spark and Hadoop with modern web technologies such as Django, Vue, and Echarts. It not only processes massive sales data efficiently but also presents the results through intuitive visualizations, making the analysis easier to read and act on. This combination offers a reusable implementation path for similar big data analysis problems and promotes the innovative application of big data technology in retail.
2. Video Demo
Popular big-data graduation project topic: a Spark-based supermarket sales data statistical analysis system [Hadoop, Spark, Python, dashboard visualization]
3. Development Environment
- Big data stack: Hadoop, Spark, Hive (a minimal SparkSession bootstrap sketch follows this list)
- Development stack: Python, Django framework, Vue, Echarts
- Tools: PyCharm, DataGrip, Anaconda
- Visualization: Echarts
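The analysis functions in the code section below assume a running SparkSession. As a minimal sketch (not shown in the original post), and assuming Spark runs alongside Hadoop with Hive as the table store, the entry point could look roughly like this; the application name and warehouse path are illustrative placeholders:

```python
# Minimal SparkSession bootstrap sketch; the app name and warehouse path are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("SupermarketSalesAnalysis")                        # hypothetical application name
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")  # assumed Hive warehouse location
    .enableHiveSupport()                                        # lets Spark SQL read Hive tables
    .getOrCreate()
)
```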
4. System Screenshots
Login module:
Visualization and analysis module:
5. Core Code
# Core feature 1: top-20 best-selling product analysis
def analyze_top_selling_products(spark_session, sales_data_path, top_n=20):
    # Load the sales data
    sales_df = spark_session.read.parquet(sales_data_path)
    # Register a temporary view so the analysis can be written in Spark SQL
    sales_df.createOrReplaceTempView("sales")
    # Aggregate per product and keep the top_n products by total revenue
    top_products = spark_session.sql("""
        SELECT
            product_code,
            product_name,
            SUM(quantity) as total_quantity,
            SUM(sale_amount) as total_sales,
            AVG(unit_price) as avg_price,
            COUNT(DISTINCT customer_id) as customer_count
        FROM sales
        GROUP BY product_code, product_name
        ORDER BY total_sales DESC
        LIMIT {}
    """.format(top_n))
    # Collect the top products once and reuse the rows for both the result and the code list
    top_rows = top_products.collect()
    top_product_codes = [row.product_code for row in top_rows]
    code_list = ",".join(f"'{code}'" for code in top_product_codes)
    # Monthly sales trend of the top products
    trend_df = spark_session.sql("""
        SELECT
            product_code,
            product_name,
            DATE_FORMAT(sale_date, 'yyyy-MM') as month,
            SUM(quantity) as monthly_quantity,
            SUM(sale_amount) as monthly_sales
        FROM sales
        WHERE product_code IN ({})
        GROUP BY product_code, product_name, DATE_FORMAT(sale_date, 'yyyy-MM')
        ORDER BY product_code, month
    """.format(code_list))
    # Promotion impact on the same products (promotion_flag marks promoted transactions)
    promo_impact = spark_session.sql("""
        SELECT
            product_code,
            product_name,
            promotion_flag,
            COUNT(*) as transaction_count,
            SUM(quantity) as total_quantity,
            SUM(sale_amount) as total_sales,
            AVG(quantity) as avg_quantity_per_transaction
        FROM sales
        WHERE product_code IN ({})
        GROUP BY product_code, product_name, promotion_flag
    """.format(code_list))
    # Convert to plain Python dictionaries for the web frontend
    result = {
        "top_products": [row.asDict() for row in top_rows],
        "sales_trend": [row.asDict() for row in trend_df.collect()],
        "promotion_impact": [row.asDict() for row in promo_impact.collect()]
    }
    return result
# Core feature 2: market-basket association analysis of customer purchases
def analyze_customer_basket(spark_session, sales_data_path, min_support=0.01, min_confidence=0.5):
    from pyspark.ml.fpm import FPGrowth
    from pyspark.sql.functions import collect_set
    # Load the sales data
    sales_df = spark_session.read.parquet(sales_data_path)
    # Group by customer and sale date so each row is one shopping basket (a set of product codes)
    basket_df = sales_df.groupBy("customer_id", "sale_date").agg(
        collect_set("product_code").alias("items")
    )
    # Mine frequent itemsets and association rules with FP-Growth
    fpgrowth = FPGrowth(itemsCol="items", minSupport=min_support, minConfidence=min_confidence)
    model = fpgrowth.fit(basket_df)
    # Frequent itemsets and the rules derived from them
    frequent_itemsets = model.freqItemsets
    association_rules = model.associationRules
    # Build a support lookup {itemset -> support} from the frequent itemsets,
    # collected once instead of filtering the DataFrame per rule
    basket_count = basket_df.count()
    support_map = {
        frozenset(row["items"]): row["freq"] / basket_count
        for row in frequent_itemsets.collect()
    }
    # Compute the lift of each rule: lift = confidence / support(consequent)
    # (Spark 2.4+ also exposes a "lift" column on associationRules directly)
    def calculate_lift(rule_row):
        # The single-item consequent of a generated rule is itself frequent,
        # so it is guaranteed to appear in the support map
        consequent_support = support_map[frozenset(rule_row.consequent)]
        return {
            "antecedent": rule_row.antecedent,
            "consequent": rule_row.consequent,
            "confidence": rule_row.confidence,
            "lift": rule_row.confidence / consequent_support
        }
    rules_with_lift = [calculate_lift(rule) for rule in association_rules.collect()]
    # Map product codes to product names for readable output
    product_names = sales_df.select("product_code", "product_name").distinct().collect()
    product_name_map = {row.product_code: row.product_name for row in product_names}
    for rule in rules_with_lift:
        rule["antecedent_names"] = [product_name_map.get(code, code) for code in rule["antecedent"]]
        rule["consequent_names"] = [product_name_map.get(code, code) for code in rule["consequent"]]
    # Sort by lift, strongest associations first
    sorted_rules = sorted(rules_with_lift, key=lambda x: x["lift"], reverse=True)
    return {
        "association_rules": sorted_rules[:100],  # the 100 rules with the highest lift
        "frequent_itemsets": [row.asDict() for row in frequent_itemsets.collect()[:100]]
    }
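# --- Illustrative note (not from the original post): how to read a mined rule ---
# With made-up numbers: out of 1,000 baskets, 100 contain beer, 80 contain diapers,
# and 40 contain both. For the rule {beer} -> {diapers}:
#   support(beer, diapers) = 40 / 1000 = 0.04
#   confidence             = 0.04 / (100 / 1000) = 0.40
#   lift                   = 0.40 / (80 / 1000)  = 5.0
# A lift above 1 means the items co-occur more often than independence would
# predict, which is why the rules above are ranked by lift.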
# Core feature 3: customer-value RFM analysis
def perform_rfm_analysis(spark_session, sales_data_path, reference_date=None):
    from pyspark.sql.functions import (
        datediff, to_date, max, count, sum, avg, round, lit, when, col, expr
    )
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans
    import datetime
    # Default the reference date to today if none is supplied
    if reference_date is None:
        reference_date = datetime.datetime.now().strftime("%Y-%m-%d")
    # Load the sales data
    sales_df = spark_session.read.parquet(sales_data_path)
    # Compute the raw RFM metrics per customer
    rfm_df = sales_df.groupBy("customer_id").agg(
        datediff(to_date(lit(reference_date)), max("sale_date")).alias("recency"),
        count("*").alias("frequency"),
        sum("sale_amount").alias("monetary")
    )
    # Score each metric from 1 to 5
    # (lower recency is better; higher frequency and monetary are better)
    rfm_scored = rfm_df.withColumn(
        "r_score",
        when(col("recency") <= 30, 5)
        .when(col("recency") <= 60, 4)
        .when(col("recency") <= 90, 3)
        .when(col("recency") <= 180, 2)
        .otherwise(1)
    ).withColumn(
        "f_score",
        when(col("frequency") >= 20, 5)
        .when(col("frequency") >= 10, 4)
        .when(col("frequency") >= 5, 3)
        .when(col("frequency") >= 2, 2)
        .otherwise(1)
    ).withColumn(
        "m_score",
        when(col("monetary") >= 5000, 5)
        .when(col("monetary") >= 2000, 4)
        .when(col("monetary") >= 1000, 3)
        .when(col("monetary") >= 500, 2)
        .otherwise(1)
    ).withColumn(
        "rfm_score",
        expr("r_score * 100 + f_score * 10 + m_score")
    )
    # Assemble the scores into a feature vector for clustering
    assembler = VectorAssembler(
        inputCols=["r_score", "f_score", "m_score"],
        outputCol="features"
    )
    rfm_features = assembler.transform(rfm_scored)
    # Elbow method: record the within-cluster cost (WSSSE) for k = 2..10
    # (model.summary.trainingCost is used because computeCost was removed in Spark 3.0)
    cost_list = []
    k_values = range(2, 11)
    for k in k_values:
        kmeans = KMeans(k=k, seed=42)
        model = kmeans.fit(rfm_features)
        cost_list.append(model.summary.trainingCost)
    # Choose the final k (simplified here to a fixed k = 5)
    kmeans = KMeans(k=5, seed=42)
    model = kmeans.fit(rfm_features)
    # Assign each customer to a cluster
    clustered_df = model.transform(rfm_features)
    # Summarise each cluster
    cluster_summary = clustered_df.groupBy("prediction").agg(
        count("*").alias("customer_count"),
        round(avg("recency"), 2).alias("avg_recency"),
        round(avg("frequency"), 2).alias("avg_frequency"),
        round(avg("monetary"), 2).alias("avg_monetary"),
        round(avg("r_score"), 2).alias("avg_r_score"),
        round(avg("f_score"), 2).alias("avg_f_score"),
        round(avg("m_score"), 2).alias("avg_m_score")
    ).orderBy("prediction")
    # Attach a business label to each cluster based on its average scores
    def assign_cluster_label(cluster_stats):
        labels = []
        for row in cluster_stats:
            if row.avg_r_score >= 4 and row.avg_f_score >= 4 and row.avg_m_score >= 4:
                label = "High-value customers"
            elif row.avg_r_score >= 3 and row.avg_f_score >= 3 and row.avg_m_score >= 3:
                label = "Potential customers"
            elif row.avg_r_score <= 2 and row.avg_f_score >= 3:
                label = "Churn-risk customers"
            elif row.avg_r_score >= 4 and row.avg_f_score <= 2:
                label = "New customers"
            elif row.avg_r_score <= 2 and row.avg_f_score <= 2:
                label = "Dormant customers"
            else:
                label = "Regular customers"
            labels.append({
                "cluster_id": row.prediction,
                "label": label,
                "stats": row.asDict()
            })
        return labels
    cluster_labels = assign_cluster_label(cluster_summary.collect())
    # Return the scored customers, the labelled clusters and the elbow-method costs
    return {
        "rfm_data": [row.asDict() for row in rfm_scored.collect()],
        "cluster_summary": cluster_labels,
        "elbow_method": dict(zip(k_values, cost_list))
    }
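The original post does not show the web layer, but as a rough sketch of how these analysis functions could be exposed to the Vue/Echarts frontend, a Django view might wrap one of them and return JSON. The module path, data location, and in-process SparkSession below are assumptions for illustration only:

```python
# Hypothetical Django view gluing the Spark analysis to the frontend.
# The import path, SALES_DATA_PATH and the in-process SparkSession are assumptions.
from django.http import JsonResponse
from pyspark.sql import SparkSession

from analysis.spark_jobs import analyze_top_selling_products  # hypothetical module path

SALES_DATA_PATH = "/data/supermarket/sales.parquet"  # assumed location of the Parquet sales data

def top_products_view(request):
    # Reuse (or lazily create) one SparkSession for the Django process
    spark = SparkSession.builder.appName("SupermarketSalesAnalysis").getOrCreate()
    top_n = int(request.GET.get("top_n", 20))
    result = analyze_top_selling_products(spark, SALES_DATA_PATH, top_n=top_n)
    # Echarts on the Vue side can consume this JSON directly for its charts
    return JsonResponse(result)
```

In practice the collected rows may contain dates or decimals that need converting before JSON serialization, and running Spark jobs inside the web process is a simplification; a production deployment would more likely precompute the results on a schedule and serve them from a database.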
6. Project Documentation
7. Project Summary
This article designs and implements a Spark-based supermarket sales data statistical analysis system. The system makes full use of the distributed computing power of Hadoop and Spark and combines Python, the Django framework, Vue on the frontend, and Echarts for visualization into a complete sales analysis solution. It is organized around five core analysis dimensions: product sales analysis, time-dimension sales analysis, promotion effectiveness analysis, customer purchasing behavior analysis, and product association analysis, providing multi-angle, multi-level data mining to support supermarket business decisions. During implementation, the work focused on the core features shown above: top-20 best-seller analysis, market-basket association analysis, and customer-value RFM analysis. Spark's computing power handles the large sales data sets, the FP-Growth algorithm mines product association rules, and KMeans clustering produces meaningful customer segments. The system extracts value from supermarket sales data and gives managers concrete support for optimizing the product mix, designing targeted marketing campaigns, and improving customer satisfaction. With the system in place, a supermarket can better understand its sales patterns and anticipate demand, supporting growth in both revenue and profit and demonstrating the practical value of big data technology in retail.