1. venn diagram
full outer join (union)
inner join (intersection)
2. merge (join horizontally)
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'HR'}, {'Name': 'Sally', 'Role': 'TA'}, {'Name': 'James', 'Role': 'GD'}])
staff_df = staff_df.set_index('Name')
stu_df = pd.DataFrame([{'Name': 'James', 'School': 'Com'}, {'Name': 'Mike', 'School': 'Bus'}, {'Name': 'Sally', 'School': 'Eng'}])
stu_df = stu_df.set_index('Name')
merge
- use indices
union
pd.merge(staff_df, stu_df, how='outer', left_index=True, right_index=True)
intersection
pd.merge(staff_df, stu_df, how='inner', left_index=True, right_index=True)
left join
pd.merge(staff_df, stu_df, how='left', left_index=True, right_index=True)
Name | Role | School |
---|---|---|
Kelly | HR | NaN |
Sally | TA | Eng |
James | GD | Com |
right join (omitted)
- use columns
pd.merge(staff_df, stu_df, how='right', on='Name')
- when 2 dataframes have conflicting columns
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'HR', 'Location': 'KKK'},
{'Name': 'Sally', 'Role': 'TA', 'Location': 'SSS'},
{'Name': 'James', 'Role': 'GD', 'Location': 'JJJ'}])
stu_df = pd.DataFrame([{'Name': 'James', 'School': 'Bus','Location': '1024jjj'},
{'Name': 'Mike', 'School': 'Law', 'Location': 'mmm22'},
{'Name': 'Sally', 'School': 'Eng', 'Location': '512sss'}])
pd.merge(staff_df, stu_df, how='left', on='Name')
Name | Role | Location_x | School | Location_y
#the conflicting 'Location' column gets suffixed with _x (left frame) and _y (right frame)
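The default _x/_y suffixes can be overridden with merge's suffixes parameter; a minimal sketch (the suffix names are our choice):
pd.merge(staff_df, stu_df, how='left', on='Name', suffixes=('_staff', '_student'))
#produces Location_staff and Location_student instead of Location_x and Location_y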
multi-indexing
pd.merge(staff_df, stu_df, how='inner', on=['FName','LName'])
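The FName/LName columns do not exist in the earlier frames; a minimal sketch, assuming both frames carry first and last names (the names here are made up):
staff_df = pd.DataFrame([{'FName': 'Sally', 'LName': 'Brooks', 'Role': 'TA'},
                         {'FName': 'James', 'LName': 'Wilde', 'Role': 'GD'}])
stu_df = pd.DataFrame([{'FName': 'Sally', 'LName': 'Brooks', 'School': 'Eng'},
                       {'FName': 'James', 'LName': 'Hammond', 'School': 'Bus'}])
pd.merge(staff_df, stu_df, how='inner', on=['FName', 'LName'])
#only 'Sally Brooks' matches on both keys; 'James Wilde' vs 'James Hammond' does not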
3. concat (join vertically)
3.1 cell magic “%%capture”
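%%capture suppresses a Jupyter cell's printed output and warnings, handy when loading dataframes whose display we don't need. A minimal sketch (file names are assumptions):
%%capture
df_2011 = pd.read_csv('datasets/2011.csv')
df_2011.head() #nothing is displayed; the cell's output is captured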
3.2 concat
frames = [df_2011, df_2012, df_2013]
pd.concat(frames, keys=['2011', '2012', '2013'])
#set an extra level of indices with 'keys'
Concatenation also supports inner and outer joins. If the two dataframes do not have identical columns, the outer method keeps all columns and fills the missing cells with NaN, while the inner method keeps only the columns the dataframes share.
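A minimal sketch of the two behaviors on throwaway frames:
df_a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
df_b = pd.DataFrame({'x': [5, 6], 'z': [7, 8]})
pd.concat([df_a, df_b]) #outer (default): columns x, y, z, with NaN where a frame lacks a column
pd.concat([df_a, df_b], join='inner') #inner: only the shared column x is kept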
4. pandas idioms - pandorable
4.1 method chaining
def first_approach():
    global df
    return (df.where(df['SUMLEV'] == 50)
              .dropna()
              .set_index(['STNAME', 'CTYNAME'])
              .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))
df = pd.read_csv('datasets/census.csv')
timeit.timeit(first_approach, number=10)
0.4458 s for 10 runs
def second_approach():
global df
new_df = df[df['SUMLEV']==50]
new_df.set_index(['STNAME','CTYNAME'], inplace=True)
return new_df.rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'})
df = pd.read_csv('datasets/census.csv')
timeit.timeit(second_approach, number=10)
0.0986 s, much faster than first_approach
4.2 applymap or apply
def min_max(row):
data = row[['POPESTIMATE2013',
'POPESTIMATE2014',
'POPESTIMATE2015']]
return pd.Series({'min': np.min(data), 'max': np.max(data)})
df.apply(min_max, axis='columns')
a revised version
def min_max(row):
data = row[['POPESTIMATE2013',
'POPESTIMATE2014',
'POPESTIMATE2015']]
row['max'] = np.max(data)
row['min'] = np.min(data)
return row
df.apply(min_max, axis='columns')
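As for the applymap half of this section's title: apply works row- or column-wise, while applymap applies a function to every single cell (in pandas 2.1+ it is renamed DataFrame.map). A minimal sketch:
df[['POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']].applymap(lambda x: x/1000)
#every cell divided by 1000, e.g. to report populations in thousands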
lambda
rows = ['POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
df.apply(lambda x: np.max(x[rows]), axis=1)
customized manipulation
def get_state_region(x):
pass
df['state_region'] = df['STNAME'].apply(lambda x: get_state_region(x))
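get_state_region is only stubbed out above; a plausible sketch, assuming we bucket states into the four US census regions (the region lists are illustrative, not complete):
def get_state_region(x):
    northeast = ['Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'New York']
    midwest = ['Illinois', 'Indiana', 'Michigan', 'Ohio', 'Wisconsin']
    south = ['Florida', 'Georgia', 'Texas', 'Virginia']
    #anything not listed falls through to 'West' in this simplified sketch
    if x in northeast:
        return 'Northeast'
    if x in midwest:
        return 'Midwest'
    if x in south:
        return 'South'
    return 'West'
Note the lambda wrapper is unnecessary: df['STNAME'].apply(get_state_region) is equivalent.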
5. group by
It is slow to iterate over each row of a dataframe; groupby can speed this up. The idea is to split the dataframe into chunks based on some key values, apply computation on those chunks, then combine the results back together into another dataframe. This is referred to as the split-apply-combine pattern.
5.1 splitting
df = pd.read_csv('datasets/census.csv')
%%timeit -n 3
for state in df['STNAME'].unique():
    avg = np.average(df.where(df['STNAME']==state).dropna()['CENSUS2010POP'])
    print('Counties in state ' + state +
          ' have an average population of ' + str(avg))
1.87 s per loop
%%timeit -n 3
for group, frame in df.groupby('STNAME'):
    avg = np.average(frame['CENSUS2010POP'])
    print('Counties in state ' + group +
          ' have an average population of ' + str(avg))
39.2 ms per loop with groupby, a huge difference in speed!
We can also use groupby if we want to work on only a fraction of the states at a time. We create a function which returns 0, 1, or 2 based on the first character of STNAME, then group by that function to split the dataframe into three batches. It's important to set 'STNAME' as the index first, because when groupby is given a function, the function is applied to the index values.
df = df.set_index('STNAME')
def set_batch_number(item):
    if item[0] < 'M':
        return 0
    if item[0] < 'Q':
        return 1
    return 2
for group, frame in df.groupby(set_batch_number):
...
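For example, the loop body could simply report each batch's size:
for group, frame in df.groupby(set_batch_number):
    print('There are ' + str(len(frame)) + ' records in group ' + str(group))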
another example to group data
df = pd.read_csv('datasets/listings.csv')
# set 2 columns of interest to index
df = df.set_index(["cancellation_policy","review_scores_value"])
for group, frame in df.groupby(level = (0,1)):
print(group)
('flexible', 2.0), ('flexible', 4.0), ...
('moderate', 2.0), ...
if we want to separate review scores of 10 from those under 10, we can pass a function to groupby
def group_func(item):
if item[1] == 10:
return (item[0], '10')
else:
return (item[0], 'not 10')
for group, frame in df.groupby(by = group_func):
print(group)
('flexible', '10'), ('flexible', 'not 10')
('moderate', '10'), ('moderate', 'not 10')
5.2 Applying
3 categories: Aggregation, Transformation, Filtration
5.2.1 Aggregation
df.groupby('cancellation_policy').agg({'review_scores_value': np.average})
Note: passing NumPy callables like np.average directly to .agg() is deprecated in recent pandas versions; string aliases such as 'mean', or our own functions, are preferred.
Also, np.average does not ignore NaNs; switching to np.nanmean fixes this.
We can extend this dictionary to aggregate by multiple functions or multiple columns.
df.groupby("cancellation_policy").agg({"review_scores_value":(np.nanmean,np.nanstd), "reviews_per_month":np.nanmean})
5.2.2 transformation
transform() returns an object that is the same size as the original dataframe, not one row per group: it broadcasts the function over the grouped dataframe and returns a new dataframe. This makes merging back into the original dataframe easy.
cols=['cancellation_policy','review_scores_value']
transform_df = df[cols].groupby('cancellation_policy').transform(np.nanmean)
transform_df.rename({'review_scores_value':'mean_review_scores'}, axis='columns', inplace = True)
#transform_df has the same number of rows as df, with review_scores_value replaced by its group's mean
df = df.merge(transform_df, left_index=True, right_index=True)
#append the last column 'mean_review_scores'
df['mean_diff']=np.absolute(df['review_scores_value']-df['mean_review_scores'])
#append the last column 'mean_diff'
5.2.3 filtering
df.groupby('cancellation_policy').filter(lambda x: np.nanmean(x['review_scores_value'])>9.2)
#keep the groups whose mean rating is above 9.2; the condition is on the group mean, not on a single listing
6. scales
6.1 four scales
- ratio scale: units are equally spaced, + - * / are valid, e.g. height, weight
- interval scale: equally spaced, but no true zero, e.g. temperature
- ordinal scale: order is important, but units are not evenly spaced, e.g. A+ A A-
- nominal scale: categories with no order, e.g. teams of a sport
6.2 examples
change type to category
df["Grades"].astype("category")
tell pandas that the data is ordered by first creating a new categorical data type with the list of the categories (in order) and the ordered=True flag
my_categories = pd.CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'], ordered=True)
grades=df["Grades"].astype(my_categories)
compare the columns
df[df["Grades"]>"C"] #not expected
grades[grades>"C"] #operator works as expected
cut(): bins values into intervals; the example is less clear because the 10 bins are equal-width in value, not equal-frequency
pd.cut(df['avg'], 10) #bin a numeric column (here a hypothetical 'avg') into 10 equal-width intervals
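A self-contained sketch of the equal-width behavior on made-up data:
s = pd.Series([168, 180, 174, 190, 170, 185, 179, 181, 175, 166])
pd.cut(s, 3)
#three equal-width bins: (165.976, 174.0], (174.0, 182.0], (182.0, 190.0]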
7. pivot table
pivot_table(values, index, columns, aggfunc, margins)
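The create_category helper applied below is never defined in these notes; a plausible sketch, assuming tiers are cut from world_rank (the thresholds are assumptions):
def create_category(rank):
    if rank <= 100:
        return 'First Tier'
    if rank <= 200:
        return 'Second Tier'
    if rank <= 300:
        return 'Third Tier'
    return 'Other'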
df['Rank_Level'] = df['world_rank'].apply(lambda x: create_category(x))
new_df = df.pivot_table(values='score', index='country', columns='Rank_Level', aggfunc=[np.mean, np.max], margins=True)
print(new_df.columns) #multi index
new_df['mean']['First Tier'] #average scores of First Tier
new_df['mean']['First Tier'].idxmax() #index of max, find the country that has the maximum average score on First Tier
stack() and unstack(): achieve a different shape of pivot table. stack() pivots the lowermost column index to become the innermost row index; unstack() is the inverse of stack().
new_df.stack()
8. date/time
Pandas has 4 main time classes: Timestamp, DatetimeIndex, Period, and PeriodIndex
8.1 Timestamp
pd.Timestamp('2021-7-1 10:22').isoweekday() #1 monday, 7 sunday
pd.Timestamp('2021-7-1 10:23').minute #23; year, month, ... also apply
8.2 Period
pd.Period('1/10/2016 10:23:50') # freq 'S'
pd.Period('1/10/2016 10:23') # 'T'
pd.Period('1/10/2016 10') # 'H'
pd.Period('1/10/2016') # 'D'
pd.Period('1/2016') # 'M'
pd.Period('2/2021') + 5 # Period('2021-07', 'M'), 5 months later
8.3 DatetimeIndex and PeriodIndex
#make Timestamp as index of a series
t1 = pd.Series(list('abc'),[pd.Timestamp('2021-7-1'),pd.Timestamp('2021-7-2'),pd.Timestamp('2021-7-3')])
2021-07-01 a
2021-07-02 b
2021-07-03 c
type of t1.index is DatetimeIndex.
similarly PeriodIndex as follows
t2 = pd.Series(list('def'), [pd.Period('2021-05'), pd.Period('2021-06'), pd.Period('2021-07')])
8.4 converting to Datetime
to_datetime()
d1 = ['2 June 2013', 'Aug 29, 2014', '2015-06-26', '7/12/16']
ts3 = pd.DataFrame(np.random.randint(10, 100, (4,2)), index=pd.to_datetime(d1), columns=list('ab'))
pd.to_datetime('4.7.12', dayfirst=True) #Timestamp('2012-07-04')
8.5 Timedelta
pd.Timestamp('9/3/2016') - pd.Timestamp('9/1/2016')
Timedelta('2 days 00:00:00')
pd.Timestamp('9/2/2016 8:10AM') + pd.Timedelta('12D 3H')
Timestamp('2016-09-14 11:10:00')
8.6 offset
Offset is similar to Timedelta, but it follows specific calendar rules. Offsets allow flexible time intervals: besides hour, day, week, month, etc., there are business day, month end, semi-month begin, and more.
pd.Timestamp('7/1/2021') + pd.offsets.Week()
#move one week ahead
pd.Timestamp('7/1/2021') + pd.offsets.MonthEnd()
#move to the end of the month
8.7 working with Dates in a dataframe
dates = pd.date_range('7-01-2021', periods=9, freq='3W-Fri')
#freq can also be '2W-Sun', 'B' for business day, 'QS-JUN' for quarter start in June
df = pd.DataFrame({'Count 1': 100+np.random.randint(-5,10,9).cumsum(), 'Count 2': 120+np.random.randint(-5,10,9)}, index=dates)
#9 random values to match the 9 dates
df.index.day_name() #weekday name of each date (weekday_name was removed in newer pandas)
df.diff() #difference between each row and the previous row
df.resample('M').mean() #mean for each month
df['2021-12':] #partial string (datetime) indexing: all rows from December 2021 on
9. assignment 3
9.1 clean and join 3 dataframes
- dataframe Energy
Energy = pd.read_excel("assets/Energy Indicators.xls",skiprows=16,nrows=228,usecols=[2,3,4,5])
Energy = Energy.rename(columns={'Unnamed: 2':'Country','Renewable Electricity Production':'% Renewable'})
Energy = Energy[1:228] #skip line below column names
Energy=Energy.replace('...',np.NaN)
Energy['Energy Supply']=Energy['Energy Supply']*1000000
Energy['Country'] = Energy['Country'].str.rstrip('0123456789') #strip those number at the end
import re
for i in Energy.index:
    if '(' in Energy['Country'][i]:
        str2 = re.findall(r'([\w\s]+)(?=\s\([\w\s]+\))', Energy['Country'][i])
        Energy.loc[i, 'Country'] = str2[0]
#clean those country names with (***)
Energy = Energy.replace({'Country':{"Republic of Korea": "South Korea","United States of America": "United States","United Kingdom of Great Britain and Northern Ireland": "United Kingdom","China, Hong Kong Special Administrative Region": "Hong Kong"}})
#to_check = Energy['Country'].to_dict()
- dataframe GDP
GDP = pd.read_csv('assets/world_bank.csv',skiprows=4)
GDP.columns = GDP.columns.str.strip()
GDP = GDP.replace({'Country Name':{"Korea, Rep.": "South Korea","Iran, Islamic Rep.": "Iran", "Hong Kong SAR, China": "Hong Kong"}})
#to_check = GDP['Country Name'].to_dict()
- dataframe ScimEn
ScimEn = pd.read_excel('assets/scimagojr-3.xlsx')
ScimEn.columns = ScimEn.columns.str.strip()
ScimEn = ScimEn[0:15]
ScimEn = ScimEn.set_index('Country')
- join together
df = ScimEn.join(Energy.set_index('Country'), on='Country')
GDP_sub = GDP[['Country Name','2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']]
df = df.join(GDP_sub.set_index('Country Name'), on='Country') #not on Country Name of GDP_sub
9.2 average GDP
the 15 countries and their 10 years' average GDP, sorted in descending order
df['avgGDP']=df.iloc[:,10:20].mean(axis=1)
df.sort_values('avgGDP',ascending = False)
9.3 which countries' % Renewable is at or above the median
Create a new column with a 1 if the country’s % Renewable value is at or above the median, 0 below the median
med = df.loc[:,'% Renewable'].median()
df['renew above']=df.loc[:,'% Renewable']>med
df['renew above'] = df['renew above'].replace([True,False],[1,0])
9.4 set_index vs reset_index
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
df['Continent'] = pd.Series(ContinentDict)
df = df.reset_index()
df = df.set_index('Continent') #without the reset in the previous line, set_index would discard the original 'Country' index