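Why Do 70% of Big Data Graduation Projects Get Rejected the First Time? A Hadoop-Based Urban Residents' Food Consumption Data Analysis and Visualization System Gives You the Answer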

IT跃迁谷毕设展

Published 2025-08-05 16:04:55

License: CC 4.0 BY-SA

Tags: big data course project, hadoop, data analysis, spark, python, numpy

Article link: https://blog.youkuaiyun.com/weixin_53783806/article/details/149939627

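💖💖Author: IT跃迁谷毕设展
💙💙About me: I spent many years teaching in computer science training programs and enjoy classroom teaching. I work mainly in Java, WeChat Mini Programs, Python, Golang, and Android, and my projects cover big data, deep learning, websites, mini programs, Android apps, and algorithms. I also take on custom project development, code walkthroughs, thesis-defense coaching, and documentation writing, and I know some techniques for lowering similarity-check scores. I like sharing solutions to problems I run into during development and discussing technology, so feel free to ask me about anything related to technology or code!
💛💛A note to readers: thank you for following and supporting my work!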

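Hadoop-Based Urban Residents' Food Consumption Data Analysis and Visualization System - Feature Overview

The Hadoop-Based Urban Residents' Food Consumption Data Analysis and Visualization System is an analysis platform that combines several big data technologies. Its core uses Hadoop and Spark as the distributed computing framework, with HDFS storing and managing large volumes of data. The system is available in both Python and Java: the backend uses Django or Spring Boot (Spring + SpringMVC + MyBatis), and the front end builds its visual interface on Vue, ElementUI, ECharts, HTML, CSS, JavaScript, and jQuery. Functionally, the system analyzes urban residents' food consumption data along four dimensions: time, space, food category, and consumer behavior. Its 25 functional modules include the trend in total annual food consumption, the annual rate of change in consumption of each food category, the evolution of the food consumption structure over time, the provincial distribution of food consumption, regional cluster analysis, differences between northern and southern China, the ratio of staple to non-staple food consumption, and the comparison of animal-based and plant-based food consumption. Using Spark SQL, Pandas, and NumPy, the system processes and analyzes food consumption data for 31 provinces from 2015 to 2021, producing multidimensional statistics and charts that can support government food-security policy, corporate supply-chain optimization, and research on changes in residents' diets.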
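
Several of the modules listed above, such as the staple versus non-staple ratio and the animal-based versus plant-based comparison, come down to computing each category's share of total consumption. Here is a minimal sketch of that calculation. The category names, the animal/plant grouping, and the per-capita figures are all invented for illustration and do not come from the system's dataset.

```python
# Hypothetical per-capita consumption (kg) for one province-year; values are invented.
consumption = {
    "grain": 120.0,
    "vegetables": 100.0,
    "meat": 30.0,
    "aquatic_products": 15.0,
    "dairy": 15.0,
    "fruit": 20.0,
}

# Assumed grouping for illustration; the real system may classify categories differently.
ANIMAL_BASED = {"meat", "aquatic_products", "dairy"}


def category_shares(values):
    """Return each category's share of total consumption, as a fraction of 1."""
    total = sum(values.values())
    if total == 0:
        return {name: 0.0 for name in values}
    return {name: amount / total for name, amount in values.items()}


def animal_plant_ratio(values, animal_categories):
    """Return animal-based consumption divided by plant-based consumption."""
    animal = sum(v for k, v in values.items() if k in animal_categories)
    plant = sum(v for k, v in values.items() if k not in animal_categories)
    return animal / plant if plant else float("inf")


shares = category_shares(consumption)
ratio = animal_plant_ratio(consumption, ANIMAL_BASED)
print({name: round(share, 3) for name, share in shares.items()})
print(round(ratio, 3))
```

Running the same two functions per year, or per province, would give the time and space views of the consumption structure.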

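Hadoop-Based Urban Residents' Food Consumption Data Analysis and Visualization System - Background and Significance

In recent years, the food consumption structure of China's urban residents has been changing substantially. According to National Bureau of Statistics data, per-capita food consumption expenditure of urban residents grew at an average annual rate of 5.8% from 2015 to 2021, well above the growth of the food component of the CPI over the same period. Consumption of higher-value foods such as meat, aquatic products, and dairy rose markedly: in 2021, urban residents' per-capita meat consumption was 23.7% higher than in 2015, and dairy consumption was 31.2% higher. Meanwhile, differences across regions and population groups have become more pronounced, with the coefficient of difference in food consumption structure between eastern coastal provinces and western inland provinces reaching 0.42. Such complex, shifting data is difficult to handle with traditional analysis methods and calls for big data techniques.
Building a Hadoop-based system for analyzing and visualizing urban residents' food consumption is valuable in two ways. In theory, it combines big data technology with consumer behavior analysis, which broadens the methods available for food consumption research and offers a new perspective on how consumption structures evolve. In practice, its multidimensional analysis of the spatial and temporal distribution of food consumption, structural change, and consumer behavior can provide data support for government food-security and nutrition policy, help food producers track shifts in market demand and adjust their product mix and supply-chain management, and give research institutions an evidence base for studying how residents' eating habits change during urbanization.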

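Hadoop-Based Urban Residents' Food Consumption Data Analysis and Visualization System - Technology Stack

Big data framework: Hadoop + Spark (Hive is not used in this build; customization is supported)
Languages: Python + Java (both versions are supported)
Backend framework: Django + Spring Boot (Spring + SpringMVC + MyBatis) (both versions are supported)
Front end: Vue + ElementUI + ECharts + HTML + CSS + JavaScript + jQuery
Key technologies: Hadoop, HDFS, Spark, Spark SQL, Pandas, NumPy
Database: MySQL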
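
In this stack, Spark produces the aggregates, and the code later in this post writes the results to MySQL. The sketch below shows that hand-off as one batched, parameterized insert. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs without a database server. The table name, columns, and values are illustrative assumptions. With mysql.connector the placeholders would be %s rather than ?.

```python
import sqlite3

# Illustrative rows: (year, total_consumption, growth_rate). Values are invented.
rows = [
    (2019, 1000.0, None),   # the first year has no growth rate
    (2020, 1050.0, 5.0),
    (2021, 1102.5, 5.0),
]

# In-memory SQLite database used as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
try:
    conn.execute(
        "CREATE TABLE annual_trend ("
        "year INTEGER PRIMARY KEY, total_consumption REAL, growth_rate REAL)"
    )
    # executemany runs one parameterized statement once per row,
    # so no SQL string is ever built from the data itself.
    conn.executemany(
        "INSERT INTO annual_trend (year, total_consumption, growth_rate) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    stored = conn.execute(
        "SELECT year, total_consumption, growth_rate FROM annual_trend ORDER BY year"
    ).fetchall()
finally:
    # Close the connection even if one of the statements above fails.
    conn.close()

print(stored)
```

Wrapping the work in try/finally is a design choice of this sketch: it keeps the connection from leaking when an insert raises.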

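Hadoop-Based Urban Residents' Food Consumption Data Analysis and Visualization System - Video Demo

[Video: Why Do 70% of Big Data Graduation Projects Get Rejected the First Time? A Hadoop-Based Urban Residents' Food Consumption Data Analysis and Visualization System Gives You the Answer]

Hadoop-Based Urban Residents' Food Consumption Data Analysis and Visualization System - Screenshots

[Image]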

Hadoop-Based Urban Residents' Food Consumption Data Analysis and Visualization System - Code Walkthrough

# Big data code excerpt

import mysql.connector
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


# Core feature 1: annual trend in total food consumption
def analyze_annual_consumption_trend(spark_session, data_path):
    # Read the food consumption data from HDFS
    df = spark_session.read.csv(data_path, header=True, inferSchema=True)
    # Register a temporary view so the data can be queried with Spark SQL
    df.createOrReplaceTempView("food_consumption")
    # Use Spark SQL to compute total and average per-capita consumption by year
    annual_consumption = spark_session.sql("""
        SELECT year,
               SUM(consumption_per_capita * population) AS total_consumption,
               AVG(consumption_per_capita) AS avg_consumption_per_capita
        FROM food_consumption
        GROUP BY year
        ORDER BY year
    """)
    # Compute the year-over-year growth rate in percent (NaN for the first year)
    annual_consumption_pd = annual_consumption.toPandas()
    annual_consumption_pd['growth_rate'] = annual_consumption_pd['total_consumption'].pct_change() * 100
    # Use NumPy to extrapolate the trend
    years = annual_consumption_pd['year'].values
    consumption = annual_consumption_pd['total_consumption'].values
    # Fit a second-degree polynomial
    z = np.polyfit(years, consumption, 2)
    p = np.poly1d(z)
    # Predict consumption for the next two years
    future_years = np.array([max(years) + 1, max(years) + 2])
    predicted_consumption = p(future_years)
    # Save the results to the MySQL database
    conn = mysql.connector.connect(user='root', password='password', host='localhost', database='food_consumption')
    cursor = conn.cursor()
    # Insert the historical data; a NaN growth rate is stored as NULL
    for i, row in annual_consumption_pd.iterrows():
        cursor.execute("""
            INSERT INTO annual_consumption_trends (year, total_consumption, avg_consumption, growth_rate)
            VALUES (%s, %s, %s, %s)
        """, (int(row['year']), float(row['total_consumption']), float(row['avg_consumption_per_capita']), float(row['growth_rate']) if not np.isnan(row['growth_rate']) else None))
    # Insert the predicted data
    for i, year in enumerate(future_years):
        cursor.execute("""
            INSERT INTO annual_consumption_predictions (year, predicted_consumption)
            VALUES (%s, %s)
        """, (int(year), float(predicted_consumption[i])))
    conn.commit()
    cursor.close()
    conn.close()
    return annual_consumption_pd, predicted_consumption
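
The trend function above fits a quadratic to raw calendar years. Because values such as 2015 to 2021 are large and close together, the fit can lose numerical precision. Subtracting the mean year before fitting is a common remedy. The sketch below illustrates the idea on synthetic data and is not the system's own implementation; `fit_centered_quadratic` is a hypothetical helper name.

```python
import numpy as np


def fit_centered_quadratic(years, values):
    """Fit values ~ a*t^2 + b*t + c with t = year - mean(year).

    Returns a function that predicts values for calendar years.
    """
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    center = years.mean()
    coeffs = np.polyfit(years - center, values, 2)
    poly = np.poly1d(coeffs)

    def predict(target_years):
        # Apply the same shift that was used during fitting.
        target = np.asarray(target_years, dtype=float)
        return poly(target - center)

    return predict


# Synthetic series that follows an exact quadratic, so the fit is easy to check.
years = np.arange(2015, 2022)
values = 2.0 * (years - 2018) ** 2 + 10.0 * (years - 2018) + 500.0

predict = fit_centered_quadratic(years, values)
forecast = predict([2022, 2023])
print(np.round(forecast, 2))
```

With only seven annual data points, any two-year extrapolation from a quadratic should be read as a rough indication rather than a forecast.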

# Core feature 2: regional cluster analysis
def perform_regional_clustering(spark_session, data_path):
    # Read the provincial food consumption data from HDFS
    df = spark_session.read.csv(data_path, header=True, inferSchema=True)
    # Convert to a Pandas DataFrame for processing
    province_data = df.toPandas()
    # Preprocessing: one feature row per province, one column per food category
    features = province_data.pivot_table(
        index='province',
        columns='food_category',
        values='consumption_per_capita',
        aggfunc='mean'
    ).fillna(0)
    # Standardize the features
    scaler = StandardScaler()
    scaled_features = scaler.fit_transform(features)
    # Cluster with the K-means algorithm
    # First, record the inertia for k = 2..9 to choose the number of clusters
    inertia = []
    for k in range(2, 10):
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(scaled_features)
        inertia.append(kmeans.inertia_)
    # Choose the number of clusters with the elbow method
    optimal_k = 2
    for i in range(1, len(inertia)):
        # inertia[i-1] belongs to k = i + 1 and inertia[i] to k = i + 2
        if inertia[i-1] - inertia[i] < inertia[i-1] * 0.2:  # inertia drops by less than 20%
            optimal_k = i + 1
            break
    # Run the final clustering with the chosen number of clusters
    final_kmeans = KMeans(n_clusters=optimal_k, random_state=42)
    clusters = final_kmeans.fit_predict(scaled_features)
    # Attach the cluster labels to the provinces
    cluster_results = pd.DataFrame({
        'province': features.index,
        'cluster': clusters
    })
    # Profile each cluster
    cluster_profiles = []
    for cluster_id in range(optimal_k):
        # Provinces that belong to this cluster
        provinces_in_cluster = cluster_results[cluster_results['cluster'] == cluster_id]['province'].tolist()
        # Average consumption profile of this cluster
        cluster_mean = features.loc[provinces_in_cluster].mean()
        # Find the cluster's distinctive foods (consumption well above the overall mean)
        overall_mean = features.mean()
        distinctive_foods = []
        for food in features.columns:
            if cluster_mean[food] > overall_mean[food] * 1.2:  # more than 20% above the mean
                distinctive_foods.append(food)
        cluster_profiles.append({
            'cluster_id': cluster_id,
            'provinces': provinces_in_cluster,
            'distinctive_foods': distinctive_foods,
            'avg_consumption': dict(cluster_mean)
        })
    # Save the results to the MySQL database
    conn = mysql.connector.connect(user='root', password='password', host='localhost', database='food_consumption')
    cursor = conn.cursor()
    # Save the cluster assignments
    for _, row in cluster_results.iterrows():
        cursor.execute("""
            INSERT INTO province_clusters (province, cluster_id)
            VALUES (%s, %s)
        """, (row['province'], int(row['cluster'])))
    # Save the cluster profiles
    for profile in cluster_profiles:
        cluster_id = profile['cluster_id']
        provinces = ','.join(profile['provinces'])
        distinctive_foods = ','.join(profile['distinctive_foods'])
        cursor.execute("""
            INSERT INTO cluster_profiles (cluster_id, provinces, distinctive_foods)
            VALUES (%s, %s, %s)
        """, (int(cluster_id), provinces, distinctive_foods))
        # Save the detailed consumption features
        for food, consumption in profile['avg_consumption'].items():
            cursor.execute("""
                INSERT INTO cluster_consumption_features (cluster_id, food_category, avg_consumption)
                VALUES (%s, %s, %s)
            """, (int(cluster_id), food, float(consumption)))
    conn.commit()
    cursor.close()
    conn.close()
    return cluster_results, cluster_profiles
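
The clustering function picks the number of clusters with a simple elbow rule: it stops at the first k where adding one more cluster reduces inertia by less than 20%. The sketch below isolates that rule so it can be checked on a plain list. `pick_k_by_elbow` is a hypothetical helper and the inertia values are invented. Falling back to the largest k tried when no elbow appears is an assumption of this sketch; the original code keeps k = 2 in that case.

```python
def pick_k_by_elbow(inertias, k_start=2, threshold=0.2):
    """Return the first k after which inertia improves by less than `threshold`.

    inertias[i] is the K-means inertia for k = k_start + i.
    """
    if not inertias:
        raise ValueError("inertias must not be empty")
    for i in range(1, len(inertias)):
        previous, current = inertias[i - 1], inertias[i]
        relative_drop = (previous - current) / previous if previous else 0.0
        if relative_drop < threshold:
            # Going from k_start + i - 1 to k_start + i clusters barely helped,
            # so keep the smaller value.
            return k_start + i - 1
    # No elbow found: fall back to the largest k that was tried (assumption).
    return k_start + len(inertias) - 1


# Invented inertia values for k = 2, 3, 4, 5, 6.
example_inertias = [100.0, 60.0, 40.0, 36.0, 34.0]
best_k = pick_k_by_elbow(example_inertias)
print(best_k)
```

The 20% threshold is a heuristic, so it is worth plotting inertia against k as a check before trusting the chosen value.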

# Core feature 3: building a healthy food consumption index
def build_health_food_consumption_index(spark_session, data_path, nutrition_data_path):
    # Read the food consumption data and the nutrition data
    consumption_df = spark_session.read.csv(data_path, header=True, inferSchema=True)
    nutrition_df = spark_session.read.csv(nutrition_data_path, header=True, inferSchema=True)
    # Convert both to Pandas DataFrames
    consumption_data = consumption_df.toPandas()
    nutrition_data = nutrition_df.toPandas()
    # Join the consumption data with the nutrition data
    merged_data = pd.merge(
        consumption_data,
        nutrition_data,
        on='food_category',
        how='left'
    )
    # Health weight of each nutrient (negative weights penalize the score)
    nutrition_weights = {
        'protein': 0.25,
        'fiber': 0.20,
        'vitamin': 0.15,
        'mineral': 0.15,
        'unsaturated_fat': 0.10,
        'antioxidant': 0.10,
        'saturated_fat': -0.15,
        'sugar': -0.10,
        'sodium': -0.10
    }
    # Compute the weighted score of each nutrient for every food
    for nutrient, weight in nutrition_weights.items():
        if nutrient in merged_data.columns:
            merged_data[f'{nutrient}_score'] = merged_data[nutrient] * weight
    # Sum the nutrient scores into one health score per food
    health_score_columns = [col for col in merged_data.columns if col.endswith('_score')]
    merged_data['health_score'] = merged_data[health_score_columns].sum(axis=1)
    # Compute the index per province and year as a consumption-weighted mean of health scores
    health_index = merged_data.groupby(['province', 'year']).apply(
        lambda x: np.sum(x['consumption_per_capita'] * x['health_score']) / np.sum(x['consumption_per_capita'])
    ).reset_index(name='health_index')
    # Normalize the index to a 0-100 scale
    min_index = health_index['health_index'].min()
    max_index = health_index['health_index'].max()
    health_index['health_index_normalized'] = 100 * (health_index['health_index'] - min_index) / (max_index - min_index)
    # Compute the national average trend of the index
    national_trend = health_index.groupby('year')['health_index_normalized'].mean().reset_index()
    # Identify the provinces with the highest and lowest index in the latest year
    latest_year = health_index['year'].max()
    latest_data = health_index[health_index['year'] == latest_year]
    top_provinces = latest_data.nlargest(5, 'health_index_normalized')
    bottom_provinces = latest_data.nsmallest(5, 'health_index_normalized')
    # Analyze the correlation between the index and economic development
    # Assume GDP data is available. The rows below are placeholders and must be
    # replaced with real numeric data before the correlation is meaningful.
    gdp_data = pd.DataFrame({
        'province': ['Province 1', 'Province 2', '...'],
        'gdp_per_capita': [50000, 48000, '...']
    })
    health_gdp = pd.merge(
        latest_data,
        gdp_data,
        on='province',
        how='inner'
    )
    correlation = health_gdp['health_index_normalized'].corr(health_gdp['gdp_per_capita'])
    # Save the results to the MySQL database
    conn = mysql.connector.connect(user='root', password='password', host='localhost', database='food_consumption')
    cursor = conn.cursor()
    # Save the provincial health index
    for _, row in health_index.iterrows():
        cursor.execute("""
            INSERT INTO health_food_index (province, year, health_index, health_index_normalized)
            VALUES (%s, %s, %s, %s)
        """, (row['province'], int(row['year']), float(row['health_index']), float(row['health_index_normalized'])))
    # Save the national trend
    for _, row in national_trend.iterrows():
        cursor.execute("""
            INSERT INTO national_health_index_trend (year, avg_health_index)
            VALUES (%s, %s)
        """, (int(row['year']), float(row['health_index_normalized'])))
    # Save the correlation between the health index and GDP
    cursor.execute("""
        INSERT INTO health_economic_correlation (year, correlation_coefficient)
        VALUES (%s, %s)
    """, (int(latest_year), float(correlation)))
    conn.commit()
    cursor.close()
    conn.close()
    return health_index, national_trend, correlation
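
The health index above is a consumption-weighted average of per-food health scores, rescaled to 0-100 with min-max normalization. The sketch below reproduces both steps in plain Python on invented numbers. The food names, scores, and province labels are hypothetical. Returning 0 for every province when all indices are equal is an assumption added here to avoid the division by zero that the min-max formula would otherwise hit.

```python
def weighted_health_index(consumption, health_scores):
    """Consumption-weighted mean of health scores for one province-year."""
    total = sum(consumption.values())
    if total == 0:
        raise ValueError("total consumption must be positive")
    return sum(consumption[f] * health_scores[f] for f in consumption) / total


def min_max_scale(values, upper=100.0):
    """Rescale a dict of values to the range 0..upper."""
    low, high = min(values.values()), max(values.values())
    if high == low:
        # Assumption: with no spread, report 0 for everyone instead of dividing by zero.
        return {key: 0.0 for key in values}
    return {key: upper * (v - low) / (high - low) for key, v in values.items()}


# Invented health scores per food category.
scores = {"vegetables": 0.8, "meat": 0.2, "grain": 0.5}

# Invented per-capita consumption (kg) for three hypothetical provinces.
provinces = {
    "A": {"vegetables": 100.0, "meat": 50.0, "grain": 50.0},
    "B": {"vegetables": 50.0, "meat": 100.0, "grain": 50.0},
    "C": {"vegetables": 50.0, "meat": 50.0, "grain": 100.0},
}

raw_index = {name: weighted_health_index(c, scores) for name, c in provinces.items()}
normalized = min_max_scale(raw_index)
print({name: round(value, 1) for name, value in normalized.items()})
```

Because min-max scaling depends on the extremes in the data, the normalized values are only comparable within one run: adding a province or a year can shift every score.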