A Small Big-Data Analysis Project Based on PySpark

Original · Published 2024-12-19 11:49:45

License: CC 4.0 BY-SA

Tags: #python #pandas #spark

Used-Car Transaction Data Analysis

This is a small data analysis project done as coursework. It is built mainly on the pyspark and pandas libraries and uses Python 3.11.

The data comes from a 2,000-row dataset hosted on datafoundation.

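To explore the relationship between brand and price, I first wrote a short program that extracts the brand name from the specific model name in the "leixing" column and stores it in a new column.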

import pandas as pd
import re

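def extract_brand(car_name):
    # Missing cells arrive as float NaN, which re.match cannot handle
    if not isinstance(car_name, str):
        return ''
    # Take the leading run of Chinese characters, or else the leading run of word characters
    match = re.match(r'^([\u4e00-\u9fff]+|\w+)', car_name)
    if match:
        return match.group(1)
    return ''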

# Read the CSV file
df = pd.read_csv('guazi.csv', encoding='utf-8')

# Create a new column 'brand' by extracting the brand from 'leixing'
df['brand'] = df['leixing'].apply(extract_brand)

# Save the updated DataFrame to a new CSV file
df.to_csv('guolv.csv', index=False, encoding='utf-8')

print("Brand extraction complete. New file created.")
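The same extraction can also be written without a row-by-row `apply`, using the pandas string accessor. The following is only a minimal sketch: the helper name `add_brand_column` and the three sample rows are made up for illustration and are not taken from the dataset. It reuses the same regular expression as the script above.

```python
import pandas as pd

# Same pattern as in the script above: a leading run of Chinese characters,
# or else a leading run of word characters.
BRAND_PATTERN = r'^([\u4e00-\u9fff]+|\w+)'


def add_brand_column(df, source_col='leixing'):
    """Return a copy of df with a 'brand' column extracted from source_col.

    Hypothetical helper for illustration; not part of the original script.
    """
    out = df.copy()
    # str.extract yields NaN for missing or non-matching cells;
    # replace those with an empty string.
    out['brand'] = out[source_col].str.extract(BRAND_PATTERN, expand=False).fillna('')
    return out


# Made-up sample rows (assumption: model names start with the brand,
# followed by a space).
sample = pd.DataFrame({'leixing': ['大众 朗逸 2015款', 'Jeep 指南者 2017款', None]})
result = add_brand_column(sample)
print(result)
```

Note that the pattern only isolates the brand when it is separated from the rest of the model name by a space or another non-word character. A name written as one unbroken run of Chinese characters would be captured whole.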

Running the script produces a new data table, saved as guolv.csv, as shown below: