Java Python CIS 5450 Homework 1: Data Wrangling and Cleaning (Fall 2024)
Hello future data scientists and welcome to CIS 5450! In this homework, you will familiarize yourself with Pandas and Polars! Both are cute animals and essential libraries for Data Science. This homework is focused on one of the most important tasks in Data Science, preparing datasets so that they can be analyzed, plotted, used for machine learning models, etc...
This homework will be broken into analyzing several datasets across four sections!
1. Working with Amazon Prime Video Data to understand the details behind its movies
2. Working on merged/joined versions of the datasets (more on this later though).
3. Regex
4. Working with Used Cars Dataset and Polars to see performance between Pandas, eager execution in Polars, and lazy execution in Polars.
IMPORTANT NOTE: Before starting, you must click on the "Copy To Drive" option in the top bar. This is the master notebook so you will not be able to save your changes without copying it ! Once you click on that, make sure you are working on that version of the notebook so that your work is saved
Run the following 4 cells to setup the notebook
%set_env HW_ID=cis5450_fall24_HW1
%%capture
!pip install penngrader-client
from penngrader.grader import *
import pandas as pd
import numpy as np
import seaborn as sns
from string import ascii_letters
import matplotlib.pyplot as plt
import datetime as dt
import requests
from lxml import html
import math
import re
import json
import os
!wget -nc https://storage.googleapis.com/penn-cis5450/credits.csv
!wget -nc https://storage.googleapis.com/penn-cis5450/titles.csv
What is Pandas?
Apart from animals, Pandas is a Python library to aid with data manipulation/analysis. It is built with support from Numpy. Numpy is another Python package/library that provides effi cient calculations for matrices and other math problems.
Let's also get familiarized with the PennGrader. It was developed specifi cally for 545 by a previous TA, Leonardo Murri.
PennGrader was developed to provide students with instant feedback on their answer. You can submit your answer and know whether it's right or wrong instantly. We then record your most recent answer in our backend database. Let's tr