Pandas Exercises 21-50: pandas Question Bank and Answers

Article link: https://blog.youkuaiyun.com/weixin_44791551/article/details/125449419

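This document shows how to use Python's pandas library to read Excel data and then perform data conversion, grouped calculations, time formatting, numeric statistics, data binning, and data merging. The skills covered include data cleaning, data type conversion, data visualization, and grouped analysis.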


# 21. Read local Excel data
import numpy as np
import pandas as pd
df = pd.read_excel('pandas120.xlsx')
df

	createTime	education	salary
0	2020-03-16 11:30:18	本科	20k-35k
1	2020-03-16 10:58:48	本科	20k-40k
2	2020-03-16 10:46:39	不限	20k-35k
3	2020-03-16 10:45:44	本科	13k-20k
4	2020-03-16 10:20:41	本科	10k-20k
...	...	...	...
130	2020-03-16 11:36:07	本科	10k-18k
131	2020-03-16 09:54:47	硕士	25k-50k
132	2020-03-16 10:48:32	本科	20k-40k
133	2020-03-16 10:46:31	本科	15k-23k
134	2020-03-16 11:19:38	本科	20k-40k

135 rows × 3 columns

# 22. View the first 5 rows of df
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135 entries, 0 to 134
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   createTime  135 non-null    datetime64[ns]
 1   education   135 non-null    object        
 2   salary      135 non-null    object        
dtypes: datetime64[ns](1), object(2)
memory usage: 3.3+ KB

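# 23. Convert the salary column to the average of its minimum and maximum values

# Method 1: apply + a custom function
def func(row):
    # Each salary looks like '20k-35k': split it into two bounds and strip the 'k'
    lst = row['salary'].split('-')
    smin = int(lst[0].strip('k'))
    smax = int(lst[1].strip('k'))
    # Midpoint of the range, scaled from thousands to units
    row['salary'] = int((smin + smax) / 2 * 1000)
    return row

# axis=1 passes each row to func as a Series
df = df.apply(func, axis=1)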
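
The row-wise apply above can also be written as a vectorized operation. The following is a minimal sketch, not the author's method; the `sample` frame and the `salary_avg` column name are made up for illustration, and it assumes every value matches the 'NNk-NNk' pattern.

```python
import pandas as pd

# Hypothetical sample in the same 'NNk-NNk' format as the salary column
sample = pd.DataFrame({'salary': ['20k-35k', '13k-20k', '10k-18k']})

# Capture the lower and upper bounds with a regex; the result has columns 0 and 1
bounds = sample['salary'].str.extract(r'(\d+)k-(\d+)k').astype(int)

# Midpoint of each range, scaled from thousands to units
sample['salary_avg'] = ((bounds[0] + bounds[1]) / 2 * 1000).astype(int)
print(sample)
```

Working on whole columns avoids calling a Python function once per row, which tends to matter on larger tables.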

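# 24. Group the data by education level and compute the average salary
# Method 1 (numeric_only=True restricts the mean to numeric columns)
print(df.groupby('education').mean(numeric_only=True))

# Method 2
df.groupby('education').agg({'salary': 'mean'})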

                 salary
education              
不限         19600.000000
大专         10000.000000
本科         19361.344538
硕士         20642.857143

	salary
education
不限	19600.000000
大专	10000.000000
本科	19361.344538
硕士	20642.857143
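
As a small illustration of the grouping step, the sketch below selects the column before aggregating and also uses named aggregation to get the mean and the group size together. The `sample` data and the output column names `mean_salary` and `n` are assumptions for the example, not values from the exercise file.

```python
import pandas as pd

# Hypothetical miniature version of the table
sample = pd.DataFrame({
    'education': ['本科', '本科', '硕士', '大专'],
    'salary': [27500, 16500, 37500, 10000],
})

# Selecting the column first returns a Series indexed by education level
mean_by_education = sample.groupby('education')['salary'].mean()

# Named aggregation: one output column per (source column, function) pair
summary = sample.groupby('education').agg(
    mean_salary=('salary', 'mean'),
    n=('salary', 'size'),
)
print(summary)
```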

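# 25. Convert the createTime column to month-day format

## Method 1: loop over the rows
# for i in range(len(df)):
#         df.iloc[i,0] = df.iloc[i,0].to_pydatetime().strftime("%m-%d")        
# df.head()

## Method 2: cast to string and slice (the column name '时间' means 'time')
# df['时间'] = df['createTime'].astype("string")
# df['时间'] = df['时间'].str[5:10]
# df.head()

# Method 3: use the .dt accessor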
df['createTime'] = df['createTime'].dt.strftime("%m-%d")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135 entries, 0 to 134
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   createTime  135 non-null    object
 1   education   135 non-null    object
 2   salary      135 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 3.3+ KB
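
The info output above shows that createTime no longer has a datetime dtype after formatting. A minimal sketch of the same behavior on a made-up Series (the timestamps below are illustrative only):

```python
import pandas as pd

# Hypothetical timestamps
ts = pd.Series(pd.to_datetime(['2020-03-16 11:30:18', '2020-12-01 09:00:00']))

# strftime returns text, so the result is a column of strings, not datetimes
month_day = ts.dt.strftime('%m-%d')
print(month_day.tolist())

# The original Series keeps its datetime dtype
print(ts.dtype)
```

Because the result is plain text, any later date arithmetic would need the original datetime column, so keeping a copy of it may be worthwhile.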

# 26. View index, data type, and memory information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135 entries, 0 to 134
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   createTime  135 non-null    object
 1   education   135 non-null    object
 2   salary      135 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 3.3+ KB

# 27. View summary statistics for the numeric columns
df.describe()

	salary
count	135.000000
mean	19159.259259
std	8661.686922
min	3500.000000
25%	14000.000000
50%	17500.000000
75%	25000.000000
max	45000.000000

# 28. Add a new column that splits the data into three groups by salary
# Data binning

bins = [0, 5000, 20000, 50000]
group_names = ['低', '中', '高']  # low, medium, high
df['type'] = pd.cut(df['salary'], bins, labels=group_names)
df

	createTime	education	salary	type
0	03-16	本科	27500	高
1	03-16	本科	30000	高
2	03-16	不限	27500	高
3	03-16	本科	16500	中
4	03-16	本科	15000	中
...	...	...	...	...
130	03-16	本科	14000	中
131	03-16	硕士	37500	高
132	03-16	本科	30000	高
133	03-16	本科	19000	中
134	03-16	本科	30000	高

135 rows × 4 columns
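
By default pd.cut builds intervals that are open on the left and closed on the right, so a value equal to a bin edge falls into the lower group. The sketch below shows this with the same bin edges; the sample salaries and the English labels are assumptions for illustration.

```python
import pandas as pd

# Hypothetical salaries chosen to sit on and just above the bin edges
salaries = pd.Series([3500, 5000, 5001, 20000, 20001, 45000])

bins = [0, 5000, 20000, 50000]
labels = ['low', 'mid', 'high']  # illustrative labels

# Intervals are (0, 5000], (5000, 20000], (20000, 50000]
groups = pd.cut(salaries, bins, labels=labels)
print(groups.tolist())
```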

# 29. Sort the data by the salary column in descending order
df.sort_values('salary', ascending=False)

# Sort salary in descending order and convert it to a list
a = df['salary'].sort_values(ascending=False).to_list()

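# 30. Get row 33

# Difference between loc and iloc
# (both are pandas-optimized selection methods):
# df.loc selects rows by index label
# df.iloc selects rows by integer position
# With the default RangeIndex used here, label 32 and position 32 are the same row.


# df.iloc[32]
df.loc[32]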

createTime    03-16
education        硕士
salary        22500
type              高
Name: 32, dtype: object
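
The two accessors only give the same row when the index labels happen to match the positions. A minimal sketch with a made-up frame whose labels differ from the positions:

```python
import pandas as pd

# Hypothetical frame with index labels 10, 20, 30
sample = pd.DataFrame({'salary': [100, 200, 300]}, index=[10, 20, 30])

# loc looks up the index label 20 (the second row)
by_label = sample.loc[20]

# iloc looks up integer position 2 (the third row)
by_position = sample.iloc[2]

print(by_label['salary'], by_position['salary'])
```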

# 31. Compute the median of the salary column
df['salary'].median()

# Alternatively, as suggested online, call NumPy's median function
np.median(df['salary'])

17500.0

# 32. Plot a frequency histogram of salary levels
df['salary'].plot(kind='hist')

[Figure: frequency histogram of salary]

# 33. Plot a density curve of salary levels
# The xlim parameter sets the range of the horizontal axis.

df['salary'].plot(kind = 'kde',xlim=(0,80000))

[Figure: density curve of salary]

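# 34. Delete the last column (type)

# Method 1: delete it directly
# del df['type']

# Method 2: use drop
# 1. df = df.drop(columns='column_name')
df = df.drop(columns='type')  # drop returns a new DataFrame, so assign the result back

# 2. df.drop(columns='column_name', inplace=True)  # modifies df in place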
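
To make the difference between the two drop styles concrete, here is a minimal sketch on a made-up two-column frame (the data is illustrative only):

```python
import pandas as pd

# Hypothetical frame with a 'type' column to remove
sample = pd.DataFrame({'salary': [27500, 16500], 'type': ['高', '中']})

# Without inplace, drop returns a new frame and leaves the original untouched
dropped = sample.drop(columns='type')

print(list(dropped.columns))  # the copy has lost the column
print(list(sample.columns))   # the original still has it
```

Passing the column through the `columns` keyword avoids having to specify an axis at all.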

# 35. Merge the first and second columns of df into a new column

# Method 1: add the two string columns directly
df['new_columns'] = df['createTime'] + df['education']

# Method 2: concatenate with str.cat
df['new_columns'] = df['createTime'].str.cat(df['education'])

# 36. Merge the education and salary columns into a new column

# Method 1: convert the numeric column with astype()
df['test'] = df['education'] + df['salary'].astype('str')

# Method 2: convert each value with map
df['test'] = df['education'] + df['salary'].map(str)
df

	createTime	education	salary	new_columns	test
0	03-16	本科	27500	03-16本科	本科27500
1	03-16	本科	30000	03-16本科	本科30000
2	03-16	不限	27500	03-16不限	不限27500
3	03-16	本科	16500	03-16本科	本科16500
4	03-16	本科	15000	03-16本科	本科15000
...	...	...	...	...	...
130	03-16	本科	14000	03-16本科	本科14000
131	03-16	硕士	37500	03-16硕士	硕士37500
132	03-16	本科	30000	03-16本科	本科30000
133	03-16	本科	19000	03-16本科	本科19000
134	03-16	本科	30000	03-16本科	本科30000

135 rows × 5 columns
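
When a separator between the merged values is wanted, str.cat accepts a `sep` argument. The sketch below is an illustration on made-up data; the separators and variable names are assumptions.

```python
import pandas as pd

# Hypothetical rows in the same shape as df
sample = pd.DataFrame({
    'createTime': ['03-16', '03-17'],
    'education': ['本科', '硕士'],
    'salary': [27500, 37500],
})

# Two string columns joined with an underscore
joined = sample['createTime'].str.cat(sample['education'], sep='_')

# Numeric columns must be converted to strings before concatenation
with_number = sample['education'].str.cat(sample['salary'].astype(str), sep=':')

print(joined.tolist())
print(with_number.tolist())
```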

# 37. Compute the difference between the maximum and minimum salary
salary_range = df['salary'].max() - df['salary'].min()
salary_range

# 38. Concatenate the first row and the last row
pd.concat([df[:1], df[-1:]])

	createTime	education	salary	new_columns	test
0	03-16	本科	27500	03-16本科	本科27500
134	03-16	本科	30000	03-16本科	本科30000

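# 39. Append row 8 to the end
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
# The double brackets keep the selected row as a one-row DataFrame.
pd.concat([df, df.iloc[[7]]])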

	createTime	education	salary	new_columns	test
0	03-16	本科	27500	03-16本科	本科27500
1	03-16	本科	30000	03-16本科	本科30000
2	03-16	不限	27500	03-16不限	不限27500
3	03-16	本科	16500	03-16本科	本科16500
4	03-16	本科	15000	03-16本科	本科15000
...	...	...	...	...	...
131	03-16	硕士	37500	03-16硕士	硕士37500
132	03-16	本科	30000	03-16本科	本科30000
133	03-16	本科	19000	03-16本科	本科19000
134	03-16	本科	30000	03-16本科	本科30000
7	03-16	本科	12500	03-16本科	本科12500

136 rows × 5 columns
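
As the output above shows, the appended row keeps its original index label 7. A minimal sketch of the same pattern on made-up data, including the `ignore_index` option for cases where a fresh 0..n-1 index is preferred:

```python
import pandas as pd

# Hypothetical three-row frame
sample = pd.DataFrame({'salary': [1000, 2000, 3000]})

# iloc[[1]] (double brackets) keeps the row as a one-row DataFrame
appended = pd.concat([sample, sample.iloc[[1]]])

# ignore_index=True renumbers the rows instead of repeating label 1
renumbered = pd.concat([sample, sample.iloc[[1]]], ignore_index=True)

print(appended.index.tolist())
print(renumbered.index.tolist())
```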

# 40. View the data type of each column
df.dtypes

createTime     object
education      object
salary          int64
new_columns    object
test           object
dtype: object

# 41. Set the createTime column as the index
df.set_index("createTime")

	education	salary	new_columns	test
createTime
03-16	本科	27500	03-16本科	本科27500
03-16	本科	30000	03-16本科	本科30000
03-16	不限	27500	03-16不限	不限27500
03-16	本科	16500	03-16本科	本科16500
03-16	本科	15000	03-16本科	本科15000
...	...	...	...	...
03-16	本科	14000	03-16本科	本科14000
03-16	硕士	37500	03-16硕士	硕士37500
03-16	本科	30000	03-16本科	本科30000
03-16	本科	19000	03-16本科	本科19000
03-16	本科	30000	03-16本科	本科30000

135 rows × 4 columns
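
Note that set_index returns a new frame rather than changing df, which is why createTime is still an ordinary column in the later exercises. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical two-row frame
sample = pd.DataFrame({'createTime': ['03-16', '03-17'], 'salary': [27500, 16500]})

# The result must be assigned to keep the new index
indexed = sample.set_index('createTime')

print(indexed.index.name)      # createTime is now the index of the copy
print(list(sample.columns))    # the original frame is unchanged
print(indexed.loc['03-17', 'salary'])
```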

# 42. Generate a DataFrame of random numbers with the same length as df
df1 = pd.DataFrame(pd.Series(np.random.randint(1, 10, len(df))))
df1

	0
0	3
1	9
2	7
3	5
4	3
...	...
130	2
131	3
132	6
133	3
134	7

135 rows × 1 columns

# 43. Merge the DataFrame generated in the previous exercise with df
df = pd.concat([df, df1], axis=1)
df

	createTime	education	salary	new_columns	test	0
0	03-16	本科	27500	03-16本科	本科27500	3
1	03-16	本科	30000	03-16本科	本科30000	9
2	03-16	不限	27500	03-16不限	不限27500	7
3	03-16	本科	16500	03-16本科	本科16500	5
4	03-16	本科	15000	03-16本科	本科15000	3
...	...	...	...	...	...	...
130	03-16	本科	14000	03-16本科	本科14000	2
131	03-16	硕士	37500	03-16硕士	硕士37500	3
132	03-16	本科	30000	03-16本科	本科30000	6
133	03-16	本科	19000	03-16本科	本科19000	3
134	03-16	本科	30000	03-16本科	本科30000	7

135 rows × 6 columns

# 44. Create a new column 'new' equal to the salary column minus the random-number column generated earlier
df["new"] = df["salary"] - df[0]
df

	createTime	education	salary	new_columns	test	0	new
0	03-16	本科	27500	03-16本科	本科27500	3	27497
1	03-16	本科	30000	03-16本科	本科30000	9	29991
2	03-16	不限	27500	03-16不限	不限27500	7	27493
3	03-16	本科	16500	03-16本科	本科16500	5	16495
4	03-16	本科	15000	03-16本科	本科15000	3	14997
...	...	...	...	...	...	...	...
130	03-16	本科	14000	03-16本科	本科14000	2	13998
131	03-16	硕士	37500	03-16硕士	硕士37500	3	37497
132	03-16	本科	30000	03-16本科	本科30000	6	29994
133	03-16	本科	19000	03-16本科	本科19000	3	18997
134	03-16	本科	30000	03-16本科	本科30000	7	29993

135 rows × 7 columns

# 45. Check whether the data contains any missing values
df.isnull().values.any()

False
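
The check above answers only yes or no for the whole table. When the answer is True, a per-column count is often the next step. A minimal sketch on made-up data that does contain gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
sample = pd.DataFrame({
    'education': ['本科', None, '硕士'],
    'salary': [27500.0, 16500.0, np.nan],
})

# Single True/False for the whole frame
has_missing = sample.isnull().values.any()

# Number of missing values per column
missing_per_column = sample.isnull().sum()

print(has_missing)
print(missing_per_column)
```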

# 46. Convert the salary column to float
df['salary'] = df['salary'].astype('float')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135 entries, 0 to 134
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   createTime   135 non-null    object 
 1   education    135 non-null    object 
 2   salary       135 non-null    float64
 3   new_columns  135 non-null    object 
 4   test         135 non-null    object 
 5   0            135 non-null    int32  
 6   new          135 non-null    int64  
dtypes: float64(1), int32(1), int64(1), object(4)
memory usage: 7.0+ KB

# 47. Count how many salaries are greater than 10000
aa = df[df['salary'] > 10000]

# Method 1
aa.shape[0]

# Method 2
len(aa)

# Method 3 (number of non-null values in the first column)
aa.count().iloc[0]
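
Another option, not in the original post, is to sum the boolean mask directly, since True counts as 1. A minimal sketch with made-up salaries:

```python
import pandas as pd

# Hypothetical salaries; 10000 itself is not greater than 10000
salary = pd.Series([3500.0, 10000.0, 14000.0, 27500.0])

# Each True in the mask adds 1 to the sum
count = int((salary > 10000).sum())
print(count)
```

This skips building the filtered frame when only the count is needed.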

# 48. Count how many times each education level appears
df['education'].value_counts()

本科    119
硕士      7
不限      5
大专      4
Name: education, dtype: int64

# 49. Count how many distinct education levels the education column contains
# Method 1
len(df['education'].value_counts())

# Method 2
df['education'].nunique()

# Method 3
len(df['education'].unique())

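# 50. Extract the last 3 rows where the sum of the salary and new columns is greater than 60000
# My own, simpler approach
df[df['salary'] + df['new'] > 60000].iloc[-3:]

# An answer found online; a more roundabout approach
df1 = df[['salary','new']]
rowsums = df1.apply(np.sum, axis=1)
res = df.iloc[np.where(rowsums > 60000)[0][-3:], :]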
res
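
The same filter can be expressed with a vectorized row sum and tail. This is a sketch on made-up numbers, not the author's data; the values in `sample` are assumptions chosen so that some rows fall below the threshold.

```python
import pandas as pd

# Hypothetical salary / new pairs
sample = pd.DataFrame({
    'salary': [30000.0, 37500.0, 15000.0, 45000.0, 32000.0, 31000.0],
    'new': [29991, 37497, 14997, 44995, 31994, 30993],
})

# Row-wise sum of the two columns without apply
row_sums = sample[['salary', 'new']].sum(axis=1)

# Keep rows whose sum exceeds 60000, then take the last three of them
res = sample[row_sums > 60000].tail(3)
print(res)
```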