1. venn diagram
full outer join (union)
inner join (intersection)
2. merge (join horizontally)
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'HR'}, {'Name': 'Sally', 'Role': 'TA'}, {'Name': 'James', 'Role': 'GD'}])
staff_df = staff_df.set_index('Name')
stu_df = pd.DataFrame([{'Name': 'James', 'School': 'Com'}, {'Name': 'Mike', 'School': 'Bus'}, {'Name': 'Sally', 'School': 'Eng'}])
stu_df = stu_df.set_index('Name')
merge
- use indices
union
pd.merge(staff_df, stu_df, how='outer', left_index=True, right_index=True)
intersection
pd.merge(staff_df, stu_df, how='inner', left_index=True, right_index=True)
left join
pd.merge(staff_df, stu_df, how='left', left_index=True, right_index=True)
Name | Role | School |
---|---|---|
Kelly | HR | NaN |
Sally | TA | Eng |
James | GD | Com |
right join (omitted)
- use columns
pd.merge(staff_df, stu_df, how='right', on='Name')
- when 2 dataframes have conflicting columns
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'HR', 'Location': 'KKK'},
{'Name': 'Sally', 'Role': 'TA', 'Location': 'SSS'},
{'Name': 'James', 'Role': 'GD', 'Location': 'JJJ'}])
stu_df = pd.DataFrame([{'Name': 'James', 'School': 'Bus','Location': '1024jjj'},
{'Name': 'Mike', 'School': 'Law', 'Location': 'mmm22'},
{'Name': 'Sally', 'School': 'Eng', 'Location': '512sss'}])
pd.merge(staff_df, stu_df, how='left', on='Name')
Name | Role | Location_x | School | Location_y
#the conflicting 'Location' column gets suffixed with _x (left frame) and _y (right frame)
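The default _x/_y suffixes can be overridden with merge's suffixes parameter; a minimal sketch (the suffix names are our choice):
pd.merge(staff_df, stu_df, how='left', on='Name', suffixes=('_staff', '_student'))
#produces Location_staff and Location_student instead of Location_x and Location_y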
multi-indexing
pd.merge(staff_df, stu_df, how='inner', on=['FName','LName'])
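The FName/LName columns do not exist in the earlier frames; a minimal sketch, assuming both frames carry first and last names (the names here are made up):
staff_df = pd.DataFrame([{'FName': 'Sally', 'LName': 'Brooks', 'Role': 'TA'},
                         {'FName': 'James', 'LName': 'Wilde', 'Role': 'GD'}])
stu_df = pd.DataFrame([{'FName': 'Sally', 'LName': 'Brooks', 'School': 'Eng'},
                       {'FName': 'James', 'LName': 'Hammond', 'School': 'Bus'}])
pd.merge(staff_df, stu_df, how='inner', on=['FName', 'LName'])
#only 'Sally Brooks' matches on both keys; 'James Wilde' vs 'James Hammond' does not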
3. concat (join vertically)
3.1 cell magic “%%capture”
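%%capture suppresses a Jupyter cell's printed output and warnings, handy when loading dataframes whose display we don't need. A minimal sketch (file names are assumptions):
%%capture
df_2011 = pd.read_csv('datasets/2011.csv')
df_2011.head() #nothing is displayed; the cell's output is captured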
3.2 concat
frames = [df_2011, df_2012, df_2013]
pd.concat(frames, keys=['2011', '2012', '2013'])
#set an extra level of indices with 'keys'
Concatenation also supports inner and outer joins. If the two dataframes do not have identical columns, the outer method keeps all columns and fills the missing cells with NaN, while the inner method keeps only the columns the dataframes share.
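A minimal sketch of the two behaviors on throwaway frames:
df_a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
df_b = pd.DataFrame({'x': [5, 6], 'z': [7, 8]})
pd.concat([df_a, df_b]) #outer (default): columns x, y, z, with NaN where a frame lacks a column
pd.concat([df_a, df_b], join='inner') #inner: only the shared column x is kept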
4. pandas idioms - pandorable
4.1 method chaining
def first_approach():
    global df
    return (df.where(df['SUMLEV'] == 50)
              .dropna()
              .set_index(['STNAME', 'CTYNAME'])
              .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))
df = pd.read_csv('datasets/census.csv')
timeit.timeit(first_approach, number=10)
0.4458 s for 10 runs
def second_approach():
global df
new_df = df[df['SUMLEV']==50]
new_df.set_index(['STNAME','CTYNAME'], inplace=True)
return new_df.rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'})
df = pd.read_csv('datasets/census.csv')
timeit.timeit(second_approach, number=10)
0.0986 s, much faster than first_approach
4.2 applymap or apply
def min_max(row):
data = row[['POPESTIMATE2013',
'POPESTIMATE2014',
'POPESTIMATE2015']]
return pd.Series({'min': np.min(data), 'max': np.max(data)})
df.apply(min_max, axis='columns')
a revised version
def min_max(row):
data = row[['POPESTIMATE2013',
'POPESTIMATE2014',
'POPESTIMATE2015']]
row['max'] = np.max(data)
row['min'] = np.min(data)
return row
df.apply(min_max, axis='columns')
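As for the applymap half of this section's title: apply works row- or column-wise, while applymap applies a function to every single cell (in pandas 2.1+ it is renamed DataFrame.map). A minimal sketch:
df[['POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']].applymap(lambda x: x/1000)
#every cell divided by 1000, e.g. to report populations in thousands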
lambda
rows = ['POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
df.apply(lambda x: np.max(x[rows]), axis=1)
customized manipulation
def get_state_region(x):
pass
df['state_region'] = df['STNAME'].apply(lambda x: get_state_region(x))
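get_state_region is only stubbed out above; a plausible sketch, assuming we bucket states into the four US census regions (the region lists are illustrative, not complete):
def get_state_region(x):
    northeast = ['Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'New York']
    midwest = ['Illinois', 'Indiana', 'Michigan', 'Ohio', 'Wisconsin']
    south = ['Florida', 'Georgia', 'Texas', 'Virginia']
    #anything not listed falls through to 'West' in this simplified sketch
    if x in northeast:
        return 'Northeast'
    if x in midwest:
        return 'Midwest'
    if x in south:
        return 'South'
    return 'West'
Note the lambda wrapper is unnecessary: df['STNAME'].apply(get_state_region) is equivalent.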
5. group by
It is slow to iterate over each row of a dataframe; groupby can speed this up. The idea is to split the dataframe into chunks based on some key values, apply computation on those chunks, then combine the results back together into another dataframe. This is referred to as the split-apply-combine pattern.
5.1 splitting
df = pd.read_csv('datasets/census.csv')
%%timeit -n 3
for state in df['STNAME'].unique():
    avg = np.average(df.where(df['STNAME']==state).dropna()['CENSUS2010POP'])
    print('Counties in state ' + state +
          ' have an average population of ' + str(avg))
1.87 s per loop
%%timeit -n 3
for group, frame in df.groupby('STNAME'):
    avg = np.average(frame['CENSUS2010POP'])
    print('Counties in state ' + group +
          ' have an average population of ' + str(avg))
39.2 ms per loop with groupby, a huge difference in speed!
We can also use groupby if we want to work on only a fraction of the states at a time. We create a function which returns 0, 1, or 2 based on the first character of STNAME, then group by that function to split the dataframe into three batches. It's important to set 'STNAME' as the index first, because when groupby is given a function, the function is applied to the index values.
df = df.set_index('STNAME')
def set_batch_number(item):
    if item[0] < 'M':
        return 0
    if item[0] < 'Q':
        return 1
    return 2
for group, frame in df.groupby(set_batch_number):
...
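For example, the loop body could simply report each batch's size:
for group, frame in df.groupby(set_batch_number):
    print('There are ' + str(len(frame)) + ' records in group ' + str(group))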
another example to group data
df = pd.read_csv('datasets/listings.csv')
# set 2 columns of interest to index
df = df.set_index(["cancellation_policy","review_scores_value"])
for group, frame in df.groupby(level = (0,1)):
print(group)
('flexible', 2.0), ('flexible', 4.0), ...
('moderate', 2.0), ...
if we want to separate review scores of 10 from those under 10, we can pass a function to groupby
def group_func(item):
if item[1] == 10:
return (item[0], '10')
else:
return (item[0], 'not 10')
for group, frame in df.groupby(by = group_func):
print(group)
('flexible', '10'), ('flexible', 'not 10')
('moderate', '10'), ('moderate', 'not 10')
5.2 Applying
3 categories: Aggregation, Transformation, Filtration
5.2.1 Aggregation
df.groupby('cancellation_policy').agg({'review_scores_value': np.average})
Note: passing NumPy callables like np.average directly to .agg() is deprecated in recent pandas versions; string aliases such as 'mean', or our own functions, are preferred.
Also, np.average does not ignore NaNs; switching to np.nanmean fixes this.
We can extend this dictionary to aggregate by multiple functions or multiple columns.
df.groupby("cancellation_policy").agg({"review_scores_value":(np.nanmean,np.nanstd), "reviews_per_month":np.nanmean})
5.2.2 transformation
transform() returns an object that is the same size as the original dataframe, not one row per group: it broadcasts the function over the grouped dataframe and returns a new dataframe. This makes merging back into the original dataframe easy.
cols=['cancellation_policy','review_scores_value']
transform_df = df[cols].groupby('cancellation_policy').transform(np.nanmean)
transform_df.rename({'review_scores_value':'mean_review_scores'}, axis='columns', inplace = True)
#transform_df has the same number of rows as df, with review_scores_value replaced by its group's mean
df = df.merge(transform_df, left_index=True, right_index=True)
#append the last column 'mean_review_scores'
df['mean_diff']=np.absolute(df['review_scores_value']-df['mean_review_scores'])
#append the last column 'mean_diff'
5.2.3 filtering
df.groupby('cancellation_policy').filter(lambda x: np.nanmean(x['review_scores_value'])>9.2)
#keep the groups whose mean rating is above 9.2; the condition is on the group mean, not on a single listing
6. scales
6.1 four scales
- ratio scale: units are equally spaced, + - * / are valid, e.g. height, weight
- interval scale: equally spaced, but no true zero, e.g. temperature
- ordinal scale: order is important, but units are not evenly spaced, e.g. A+ A A-
- nominal scale: categories with no order, e.g. teams of a sport
6.2 examples
change type to category
df["Grades"].astype("category")
tell pandas that the data is ordered by first creating a new categorical data type with the list of the categories (in order) and the ordered=True flag
my_categories = pd.CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'], ordered=True)
grades=df["Grades"].astype(my_categories)
compare the columns
df[df["Grades"]>"C"] #not expected
grades[grades>"C"] #operator works as expected
cut(): bins values into intervals; the example is less clear because the 10 bins are equal-width in value, not equal-frequency
pd.cut(df['avg'], 10) #bin a numeric column (here a hypothetical 'avg') into 10 equal-width intervals
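A self-contained sketch of the equal-width behavior on made-up data:
s = pd.Series([168, 180, 174, 190, 170, 185, 179, 181, 175, 166])
pd.cut(s, 3)
#three equal-width bins: (165.976, 174.0], (174.0, 182.0], (182.0, 190.0]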
7. pivot table
pivot_table(values, index, columns, aggfunc, margins)
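The create_category helper applied below is never defined in these notes; a plausible sketch, assuming tiers are cut from world_rank (the thresholds are assumptions):
def create_category(rank):
    if rank <= 100:
        return 'First Tier'
    if rank <= 200:
        return 'Second Tier'
    if rank <= 300:
        return 'Third Tier'
    return 'Other'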
df['Rank_Level'] = df['world_rank'].apply(lambda x: create_category(x))
new_df = df.pivot_table(values='score', index='country', columns='Rank_Level', aggfunc=[np.mean, np.max], margins=True)
print(new_df.columns) #multi index
new_df['mean']['First Tier'] #average scores of First Tier
new_df['mean']['First Tier'].idxmax() #index of max, find the country that has the maximum average score on First Tier
stack() and unstack(): achieve a different shape of pivot table. stack() pivots the lowermost column index to become the innermost row index; unstack() is the inverse of stack().
new_df.stack()
8. date/time
Pandas has 4 main time classes: Timestamp, DatetimeIndex, Period, and PeriodIndex
8.1 Timestamp
pd.Timestamp('2021-7-1 10:22').isoweekday() #1 monday, 7 sunday
pd.Timestamp('2021-7-1 10:23').minute #23; year, month, ... also apply
8.2 Period
pd.Period('1/10/2016 10:23:50') # freq 'S'
pd.Period('1/10/2016 10:23') # 'T'
pd.Period('1/10/2016 10') # 'H'
pd.Period('1/10/2016') # 'D'
pd.Period('1/2016') # 'M'
pd.Period('2/2021') + 5 # Period('2021-07', 'M'), 5 months later
8.3 DatetimeIndex and PeriodIndex
#make Timestamp as index of a series
t1 = pd.Series(list('abc'),[pd.Timestamp('2021-7-1'),pd.Timestamp('2021-7-2'),pd.Timestamp('2021-7-3')])
2021-07-01 a
2021-07-02 b
2021-07-03 c
type of t1.index is DatetimeIndex.
similarly PeriodIndex as follows
t2 = pd.Series(list('def'), [pd.Period('2021-05'), pd.Period('2021-06'), pd.Period('2021-07')])
8.4 converting to Datetime
to_datetime()
d1 = ['2 June 2013', 'Aug 29, 2014', '2015-06-26', '7/12/16']
ts3 = pd.DataFrame(np.random.randint(10, 100, (4,2)), index=pd.to_datetime(d1), columns=list('ab'))
pd.to_datetime('4.7.12', dayfirst=True) #Timestamp('2012-07-04')
8.5 Timedelta
pd.Timestamp('9/3/2016') - pd.Timestamp('9/1/2016')
Timedelta('2 days 00:00:00')
pd.Timestamp('9/2/2016 8:10AM') + pd.Timedelta('12D 3H')
Timestamp('2016-09-14 11:10:00')
8.6 offset
Offset is similar to Timedelta, but it follows specific calendar rules. Offsets allow flexible time intervals: besides hour, day, week, month, etc., there are business day, month end, semi-month begin, and more.
pd.Timestamp('7/1/2021') + pd.offsets.Week()
#move one week ahead
pd.Timestamp('7/1/2021') + pd.offsets.MonthEnd()
#move to the end of the month
8.7 working with Dates in a dataframe
dates = pd.date_range('7-01-2021', periods=9, freq='3W-Fri')
#freq can also be '2W-Sun', 'B' for business day, 'QS-JUN' for quarter start in June
df = pd.DataFrame({'Count 1': 100+np.random.randint(-5,10,9).cumsum(), 'Count 2': 120+np.random.randint(-5,10,9)}, index=dates)
#9 random values to match the 9 dates
df.index.day_name() #weekday name of each date (weekday_name was removed in newer pandas)
df.diff() #difference between each row and the previous row
df.resample('M').mean() #mean for each month
df['2021-12':] #partial string (datetime) indexing: all rows from December 2021 on
9. assignment 3
9.1 clean and join 3 dataframes
- dataframe Energy
Energy = pd.read_excel("assets/Energy Indicators.xls",skiprows=16,nrows=228,usecols=[2,3,4,5])
Energy = Energy.rename(columns={'Unnamed: 2':'Country','Renewable Electricity Production':'% Renewable'})
Energy = Energy[1:228] #skip line below column names
Energy=Energy.replace('...',np.NaN)
Energy['Energy Supply']=Energy['Energy Supply']*1000000
Energy['Country'] = Energy['Country'].str.rstrip('0123456789') #strip those number at the end
import re
for i in Energy.index:
    if '(' in Energy['Country'][i]:
        str2 = re.findall(r'([\w\s]+)(?=\s\([\w\s]+\))', Energy['Country'][i])
        Energy.loc[i, 'Country'] = str2[0]
#clean those country names with (***)
Energy = Energy.replace({'Country':{"Republic of Korea": "South Korea","United States of America": "United States","United Kingdom of Great Britain and Northern Ireland": "United Kingdom","China, Hong Kong Special Administrative Region": "Hong Kong"}})
#to_check = Energy['Country'].to_dict()
- dataframe GDP
GDP = pd.read_csv('assets/world_bank.csv',skiprows=4)
GDP.columns = GDP.columns.str.strip()
GDP = GDP.replace({'Country Name':{"Korea, Rep.": "South Korea","Iran, Islamic Rep.": "Iran", "Hong Kong SAR, China": "Hong Kong"}})
#to_check = GDP['Country Name'].to_dict()
- dataframe ScimEn
ScimEn = pd.read_excel('assets/scimagojr-3.xlsx')
ScimEn.columns = ScimEn.columns.str.strip()
ScimEn = ScimEn[0:15]
ScimEn = ScimEn.set_index('Country')
- join together
df = ScimEn.join(Energy.set_index('Country'), on='Country')
GDP_sub = GDP[['Country Name','2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']]
df = df.join(GDP_sub.set_index('Country Name'), on='Country') #not on Country Name of GDP_sub
9.2 average GDP
the 15 countries and their 10 years' average GDP, sorted in descending order
df['avgGDP']=df.iloc[:,10:20].mean(axis=1)
df.sort_values('avgGDP',ascending = False)
9.3 which countries' % Renewable is at or above the median
Create a new column with a 1 if the country’s % Renewable value is at or above the median, 0 below the median
med = df.loc[:,'% Renewable'].median()
df['renew above']=df.loc[:,'% Renewable']>med
df['renew above'] = df['renew above'].replace([True,False],[1,0])
9.4 set_index vs reset_index
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
df['Continent'] = pd.Series(ContinentDict)
df = df.reset_index()
df = df.set_index('Continent') #without the reset in the previous line, set_index would discard the original 'Country' index