CIS 5450 Homework 1: Data Wrangling and Cleaning (Fall 2024)


Hello future data scientists, and welcome to CIS 5450! In this homework, you will familiarize yourself with Pandas and Polars! Both are cute animals and essential libraries for Data Science. This homework focuses on one of the most important tasks in Data Science: preparing datasets so that they can be analyzed, plotted, used for machine learning models, etc.

This homework is broken into four sections that analyze several datasets:

1. Working with Amazon Prime Video Data to understand the details behind its movies

2. Working on merged/joined versions of the datasets (more on this later though).

3. Regex

4. Working with the Used Cars dataset and Polars to compare performance between Pandas, eager execution in Polars, and lazy execution in Polars.

IMPORTANT NOTE: Before starting, you must click on the "Copy To Drive" option in the top bar. This is the master notebook, so you will not be able to save your changes without copying it! Once you click on that, make sure you are working on the copied version of the notebook so that your work is saved.

Run the following 4 cells to set up the notebook.

%set_env HW_ID=cis5450_fall24_HW1

%%capture

!pip install penngrader-client

from penngrader.grader import *

import pandas as pd

import numpy as np

import seaborn as sns

from string import ascii_letters

import matplotlib.pyplot as plt

import datetime as dt

import requests

from lxml import html

import math

import re

import json

import os

!wget -nc https://storage.googleapis.com/penn-cis5450/credits.csv

!wget -nc https://storage.googleapis.com/penn-cis5450/titles.csv

What is Pandas?

Apart from being animals, Pandas is a Python library for data manipulation and analysis. It is built on top of NumPy, another Python library that provides efficient calculations for matrices and other math problems.
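To make the relationship between Pandas and NumPy concrete, here is a minimal sketch. The toy DataFrame and its column names (`runtime_min`, `score`) are hypothetical and are not taken from the homework datasets; the point is only that numeric Pandas columns can be handed to NumPy and back.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({"runtime_min": [90, 120, 45], "score": [7.1, 8.3, 6.0]})

# A numeric column can be exported as a NumPy array and used in vectorized math.
runtime_hours = df["runtime_min"].to_numpy() / 60

# The resulting array can be assigned straight back as a new column.
df["runtime_hours"] = runtime_hours

# NumPy reductions also accept a Pandas column directly.
mean_score = np.mean(df["score"])
```

Operating on whole columns like this, rather than looping over rows in Python, is usually what makes Pandas code fast.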

Let's also get familiar with the PennGrader. It was developed specifically for 545 by a previous TA, Leonardo Murri.

PennGrader was developed to provide students with instant feedback on their answers. You can submit an answer and know instantly whether it's right or wrong. We then record your most recent answer in our backend database. Let's tr

### Data Analysis with Python: A Pandas, NumPy, and Jupyter Tutorial

The following is a complete learning path centered on data wrangling, combining tool installation, core features, and practical code examples:

---

#### **1. Environment Setup and Tool Installation**

- **Install Python and Jupyter**: the Anaconda distribution is recommended (it includes commonly used libraries and Jupyter Notebook)[^3].
- **Verify the installation**:

  ```python
  import numpy as np
  import pandas as pd
  import matplotlib.pyplot as plt

  print("NumPy version:", np.__version__)
  print("Pandas version:", pd.__version__)
  ```

---

#### **2. NumPy Basics: The Core of Numerical Computing**

- **Array operations**:

  ```python
  arr = np.array([1, 2, 3])            # one-dimensional array
  matrix = np.array([[1, 2], [3, 4]])  # two-dimensional array
  ```

- **Mathematical operations**:
  - Formula for the mean: $\mu = \frac{1}{N}\sum_{i=1}^N x_i$

  ```python
  mean_value = np.mean(arr)  # compute the mean
  ```

---

#### **3. Pandas Core: Data Wrangling**

- **Data structures**:
  - **Series** (labeled one-dimensional data):

    ```python
    s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
    ```

  - **DataFrame** (two-dimensional table):

    ```python
    df = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
    ```

- **Data cleaning**:
  - Handling missing values:

    ```python
    df.dropna()   # drop rows with missing values (returns a new DataFrame)
    df.fillna(0)  # fill missing values with 0 (returns a new DataFrame)
    ```

---

#### **4. Interactive Analysis with Jupyter**

- **Cell operations**:
  - Press `Shift+Enter` to run the current cell and move to the next one (a new cell is created if it was the last).
  - Use Markdown cells to write documentation.
- **Integrated data visualization**:

  ```python
  %matplotlib inline
  df.plot(kind='bar')  # draw a bar chart
  ```

---

#### **5. Worked Example: A Data-Wrangling Workflow**

```python
# Load the data ("data.csv" is a placeholder file name)
data = pd.read_csv("data.csv")

# Select and filter rows
filtered = data[data['age'] > 30]

# Group and aggregate
grouped = data.groupby('city')['income'].mean()

# Merge datasets (another_df is assumed to be a second DataFrame with an 'id' column)
merged = pd.merge(data, another_df, on='id')
```

---

#### **6. Recommended Learning Resources**

- ***Python for Data Analysis*, 3rd Edition** (Wes McKinney): covers practical techniques for Pandas, NumPy, and Jupyter in detail[^1].
- **Official documentation**: the Pandas and NumPy documentation provide detailed API references.

---
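The workflow in section 5 depends on a `data.csv` file and an `another_df` DataFrame that are not defined in this tutorial. The sketch below substitutes small in-memory stand-ins so the same steps can be run as-is; every value, and the `department` column, is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for "data.csv"; the values are invented for this example.
csv_text = """id,city,age,income
1,Philadelphia,34,72000
2,Boston,28,65000
3,Philadelphia,45,
4,Boston,51,90000
"""
data = pd.read_csv(io.StringIO(csv_text))

# Hypothetical stand-in for `another_df`.
another_df = pd.DataFrame({"id": [1, 2, 3], "department": ["ops", "eng", "eng"]})

# Clean: replace the missing income with the column median.
data["income"] = data["income"].fillna(data["income"].median())

# Select and filter rows.
filtered = data[data["age"] > 30]

# Group and aggregate: mean income per city.
grouped = data.groupby("city")["income"].mean()

# Merge: pd.merge defaults to an inner join, so id 4 (absent from another_df) is dropped.
merged = pd.merge(data, another_df, on="id")
```

Filling with the median is only one possible cleaning choice; dropping the row with `dropna()` would be equally valid, depending on the analysis.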