Datawhale Team Study (Pandas) Task 2: pandas Basics

License: CC 4.0 BY-SA

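Foreword
After reading many teammates' notes for task 1, I was both impressed and inspired. Learning is not just understanding and repeating textbooks and other material; it should be about thinking for yourself, connecting ideas, asking questions, and experimenting, because that is how things really sink in~ The former is easier, though, so I often fall into the trap of working hard without being very efficient. From now on my check-in notes will stop being a copy-and-supplement job and will record more of my own thinking and experiments.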

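1. Reading and Writing Files

1.1 Reading Files

1.1.1 Reading the `my_csv`, `my_table`, and `my_excel` files

import numpy as np   # used in the later exercises
import pandas as pd

df_csv = pd.read_csv('data/my_csv.csv')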
df_csv

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020/1/1
1	3	b	3.4	banana	2020/1/2
2	6	c	2.5	orange	2020/1/5
3	5	d	3.2	lemon	2020/1/7

df_txt = pd.read_table('data/my_table.txt')
df_txt

	col1	col2	col3	col4
0	2	a	1.4	apple 2020/1/1
1	3	b	3.4	banana 2020/1/2
2	6	c	2.5	orange 2020/1/5
3	5	d	3.2	lemon 2020/1/7

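Takeaway: just two days ago a classmate ran into a related problem. The data looked complete when read in, yet isnull().any() reported nulls in the last two columns. It turned out that all three columns had been read as a single column, and adding delimiter=' ' fixed it. The tutorial specifies the separator with sep instead, so I looked up the difference: sep is the separator (default ',' for read_csv and '\t' for read_table), and delimiter is simply an alias for sep, so only one of the two should be passed.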
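
The separator problem above can be reproduced with a minimal sketch. The inline text below is made-up data, not the classmate's actual file; it only shows how a wrong separator collapses every column into one.

```python
import io
import pandas as pd

# Hypothetical space-separated contents (not the real file from the anecdote).
raw = "a b c\n1 2 3\n4 5 6\n"

# read_csv splits on ',' by default, so each line becomes one single column.
wrong = pd.read_csv(io.StringIO(raw))

# With the matching separator the three columns are recovered.
right = pd.read_csv(io.StringIO(raw), sep=' ')

print(wrong.shape)   # a single column
print(right.shape)   # three columns
```

Checking `df.shape` or `df.columns` right after reading is a cheap way to catch this kind of mistake early.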

# Tutorial example
# Read the file as-is first to see what it looks like
pd.read_table('data/my_table_special_sep.txt')

	col1 |||| col2
0	TS |||| This is an apple.
1	GQ |||| My name is Bob.
2	WT |||| Well done!
3	PT |||| May I help you?

pd.read_table('data/my_table_special_sep.txt',sep=' |||| ')

F:\Anaconda3\lib\site-packages\pandas\io\parsers.py:767: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  return read_csv(**locals())
F:\Anaconda3\lib\site-packages\pandas\io\parsers.py:2547: FutureWarning: split() requires a non-empty pattern match.
  yield pat.split(line.strip())
F:\Anaconda3\lib\site-packages\pandas\io\parsers.py:2550: FutureWarning: split() requires a non-empty pattern match.
  yield pat.split(line.strip())

			col1	||||	col2
TS	||||	This	is	an	apple.
GQ	||||	My	name	is	Bob.
WT	||||	Well	done!	None	None
PT	||||	May	I	help	you?

Takeaway: a multi-character sep is interpreted as a regular expression, so each | has to be escaped as \|.

# Tutorial example (a raw string keeps the backslashes intact for the regex)
pd.read_table('data/my_table_special_sep.txt', sep=r' \|\|\|\| ', engine='python')

	col1	col2
0	TS	This is an apple.
1	GQ	My name is Bob.
2	WT	Well done!
3	PT	May I help you?
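
The same idea can be tried without the data file. This is a small sketch using inline text that mimics the tutorial file; the variable names are made up for illustration.

```python
import io
import pandas as pd

# Inline stand-in for data/my_table_special_sep.txt.
raw = (
    "col1 |||| col2\n"
    "TS |||| This is an apple.\n"
    "GQ |||| My name is Bob.\n"
)

# Each '|' is escaped because a multi-character sep is treated as a regex;
# engine='python' is required for regex separators.
df_sep = pd.read_csv(io.StringIO(raw), sep=r" \|\|\|\| ", engine="python")
print(df_sep)
```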

1.1.2 Reading `my_csv` without treating the first row as the header, with `'col1', 'col2'` as the index

pd.read_csv('data/my_csv.csv', header=None, index_col=['col1', 'col2'])

---------------------------------------------------------------------------
ValueError: Index col1 invalid

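Oops, an error. On a closer look I was asking for something contradictory: with header=None the first row no longer supplies column names, so the index cannot be selected by name.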

pd.read_csv('data/my_csv.csv', header=None)

	0	1	2	3	4
0	col1	col2	col3	col4	col5
1	2	a	1.4	apple	2020/1/1
2	3	b	3.4	banana	2020/1/2
3	6	c	2.5	orange	2020/1/5
4	5	d	3.2	lemon	2020/1/7

# So what if I insist on using the 'col1' and 'col2' columns as the index? Select them by position
pd.read_csv('data/my_csv.csv', header=None, index_col=[0,1])

		2	3	4
0	1
col1	col2	col3	col4	col5
2	a	1.4	apple	2020/1/1
3	b	3.4	banana	2020/1/2
6	c	2.5	orange	2020/1/5
5	d	3.2	lemon	2020/1/7
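
For contrast, here is a short sketch of both variants side by side. The CSV text is made up to stand in for data/my_csv.csv: with the default header the index can be chosen by name, while with header=None it has to be chosen by position and the old header row becomes data.

```python
import io
import pandas as pd

# Made-up CSV text standing in for data/my_csv.csv.
raw = "col1,col2,col3\n2,a,1.4\n3,b,3.4\n"

# Default header: the first row supplies the names, so names work in index_col.
by_name = pd.read_csv(io.StringIO(raw), index_col=['col1', 'col2'])

# header=None: the first row is ordinary data, so the index is chosen by position.
by_pos = pd.read_csv(io.StringIO(raw), header=None, index_col=[0, 1])

print(by_name.index.names)
print(by_pos.shape)  # one extra row, because the header line is now data
```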

1.1.3 Reading only the first two columns and first two rows of `my_table`

pd.read_table('data/my_table.txt', usecols=[0,1], nrows=2)

# Or select the columns by name: pd.read_table('data/my_table.txt', usecols=['col1', 'col2'], nrows=2)

	col1	col2
0	2	a
1	3	b

1.2 Writing Files

1.2.1 Dropping a meaningless index with `index=False`

df_csv

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020/1/1
1	3	b	3.4	banana	2020/1/2
2	6	c	2.5	orange	2020/1/5
3	5	d	3.2	lemon	2020/1/7

df_csv.to_csv('data/my_csv_saved.csv')


df_csv.to_csv('data/my_csv_saved_noindex.csv', index=False)
# Not bad. I used to delete that index column by hand... = =
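
The effect of index=False can be inspected without opening the saved file. A minimal sketch, assuming a small made-up frame: when no path is given, to_csv returns the CSV text as a string.

```python
import pandas as pd

# Small made-up frame, not the tutorial data.
df_small = pd.DataFrame({'col1': [2, 3], 'col2': ['a', 'b']})

# With no path, to_csv returns the CSV text instead of writing a file.
with_index = df_small.to_csv()
no_index = df_small.to_csv(index=False)

print(with_index.splitlines()[0])  # header line starts with an empty index field
print(no_index.splitlines()[0])    # header line holds only the real columns
```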


1.2.2 Saving a `txt` file with `to_csv`

# pandas does not define a to_table function
df_txt.to_csv('data/my_txt_saved.txt', sep='\t', index=False)

Newline and tab escape characters in Python strings

\t horizontal tab
\v vertical tab
\n line feed (newline)
\r carriage return

1.2.3 Converting a table to `markdown/latex`

df_csv.to_markdown()  # relies on the optional tabulate package
# Wow, that's neat

'|    |   col1 | col2   |   col3 | col4   | col5     |\n|---:|-------:|:-------|-------:|:-------|:---------|\n|  0 |      2 | a      |    1.4 | apple  | 2020/1/1 |\n|  1 |      3 | b      |    3.4 | banana | 2020/1/2 |\n|  2 |      6 | c      |    2.5 | orange | 2020/1/5 |\n|  3 |      5 | d      |    3.2 | lemon  | 2020/1/7 |'

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020/1/1
1	3	b	3.4	banana	2020/1/2
2	6	c	2.5	orange	2020/1/5
3	5	d	3.2	lemon	2020/1/7

df_csv.to_latex()

'\\begin{tabular}{lrlrll}\n\\toprule\n{} &  col1 & col2 &  col3 &    col4 &      col5 \\\\\n\\midrule\n0 &     2 &    a &   1.4 &   apple &  2020/1/1 \\\\\n1 &     3 &    b &   3.4 &  banana &  2020/1/2 \\\\\n2 &     6 &    c &   2.5 &  orange &  2020/1/5 \\\\\n3 &     5 &    d &   3.2 &   lemon &  2020/1/7 \\\\\n\\bottomrule\n\\end{tabular}\n'

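2. Basic Data Structures

2.1 Series

A Series generally consists of:

data: the values of the sequence
index: the index
dtype: the storage type
name: the name of the sequence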

s = pd.Series(data = [1, 'abc', {'dic1': 1}],
             index =pd.Index(['id1',2,'third'], name='my_idx'),
             dtype = 'object',
             name = 'my_name')

# Access attributes
print(s.values) # the values
print(s.index)  # the index
print(s.shape)  # the shape, a tuple holding the length of the sequence

# Get the value at an index label
s['third']

{'dic1': 1}

2.2 DataFrame

A DataFrame adds a column index on top of a Series.

2.2.1 Building a DataFrame from `data` plus row and column indexes

# Create a DataFrame
data = [[1, 'a', 1.2], [2, 'b', 2.2], [3, 'c', 3.2]]
df = pd.DataFrame(data=data, 
                  index=['row_%d'%i for i in range(3)],
                 columns=['col_0', 'col_1', 'col_2'])
df

	col_0	col_1	col_2
row_0	1	a	1.2
row_1	2	b	2.2
row_2	3	c	3.2

2.2.2 Building a DataFrame from a `column name: data` mapping plus a row index

df_2 = pd.DataFrame(data={'col_0': [1,2,3], 'col_1':list('abc'), 'col_2':[1,2,3]},
                   index=['row_%d'%i for i in range(3)])
df_2

	col_0	col_1	col_2
row_0	1	a	1
row_1	2	b	2
row_2	3	c	3

# Select a column by name
df_2['col_0']

row_0    1
row_1    2
row_2    3
Name: col_0, dtype: int64

# Get the values of the selected columns
df_2[['col_0','col_1']].values

array([[1, 'a'],
       [2, 'b'],
       [3, 'c']], dtype=object)

# Transpose
df_2.T

	row_0	row_1	row_2
col_0	1	2	3
col_1	a	b	c
col_2	1	2	3

3. Commonly Used Functions

df = pd.read_csv('data/learn_pandas.csv')
df.head()

	School	Grade	Name	Gender	Height	Weight	Transfer	Test_Number	Test_Date	Time_Record
0	Shanghai Jiao Tong University	Freshman	Gaopeng Yang	Female	158.9	46.0	N	1	2019/10/5	0:04:34
1	Peking University	Freshman	Changqiang You	Male	166.5	70.0	N	1	2019/9/4	0:04:20
2	Shanghai Jiao Tong University	Senior	Mei Sun	Male	188.9	89.0	N	2	2019/9/12	0:05:22
3	Fudan University	Sophomore	Xiaojuan Sun	Female	NaN	41.0	N	2	2020/1/3	0:04:08
4	Fudan University	Sophomore	Gaojuan You	Male	174.0	74.0	N	2	2019/11/6	0:05:22

# Keep the first 7 columns
df = df[df.columns[:7]]

3.1 Summary Statistics

3.1.1 Mean and maximum of height and weight

df_demo = df[['Height','Weight']]  # mind the extra brackets: a list of names selects a DataFrame

print(df_demo.mean())
print(df_demo.max())

Height    163.218033
Weight     55.015873
dtype: float64
Height    193.9
Weight     89.0
dtype: float64

3.1.2 Number of non-missing values of height and weight

# Note: this counts the non-missing values
df_demo.count()

Height    183
Weight    189
dtype: int64

3.1.3 Index of the maximum height and weight

df_demo.idxmax()

Height    193
Weight      2
dtype: int64
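
To see how these statistics treat missing values, here is a small sketch on a made-up three-row frame (not the learn_pandas data): mean and idxmax skip NaN, and count only counts the non-missing entries.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing height.
toy = pd.DataFrame({'Height': [158.9, np.nan, 188.9],
                    'Weight': [46.0, 70.0, 89.0]})

means = toy.mean()        # NaN is skipped by default
counts = toy.count()      # number of non-missing values per column
where_max = toy.idxmax()  # index label of the maximum per column

print(means, counts, where_max, sep='\n')
```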

3.2 Unique-Value Functions

3.2.1 Unique values of School, their number, and their frequencies

print('===== Unique values =====')
print(df['School'].unique())
print('===== Number of unique values =====')
print(df['School'].nunique())
print('===== Unique values and their counts =====')
print(df['School'].value_counts())

===== Unique values =====
['Shanghai Jiao Tong University' 'Peking University' 'Fudan University'
 'Tsinghua University']
===== Number of unique values =====
4
===== Unique values and their counts =====
Tsinghua University              69
Shanghai Jiao Tong University    57
Fudan University                 40
Peking University                34
Name: School, dtype: int64

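3.2.2 Unique combinations across multiple columns

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

subset: the columns to consider; all columns by default
keep: {'first', 'last', False}, default 'first'. 'first' keeps the first occurrence of each duplicate, 'last' keeps the last, and False drops all of them
inplace: bool, default False. Whether to modify the original data in place instead of returning a new object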

df_demo = df[['Gender','Transfer','Name']]
df_demo.drop_duplicates(['Gender', 'Transfer'])

	Gender	Transfer	Name
0	Female	N	Gaopeng Yang
1	Male	N	Changqiang You
12	Female	NaN	Peng You
21	Male	NaN	Xiaopeng Shen
36	Male	Y	Xiaojuan Qin
43	Female	Y	Gaoli Feng

df_demo.drop_duplicates(['Gender', 'Transfer'], keep='last')  # keep the last row of each duplicate group

	Gender	Transfer	Name
147	Male	NaN	Juan You
150	Male	Y	Chengpeng You
169	Female	Y	Chengquan Qin
194	Female	NaN	Yanmei Qian
197	Female	N	Chengqiang Chu
199	Male	N	Chunpeng Lv

df_demo.drop_duplicates(['Name', 'Gender'], keep=False)  # drop every row that has a duplicate

	Gender	Transfer	Name
0	Female	N	Gaopeng Yang
1	Male	N	Changqiang You
2	Male	N	Mei Sun
4	Male	N	Gaojuan You
5	Female	N	Xiaoli Qian
...	...	...	...
194	Female	NaN	Yanmei Qian
196	Female	N	Li Zhao
197	Female	N	Chengqiang Chu
198	Male	N	Chengmei Shen
199	Male	N	Chunpeng Lv

164 rows × 3 columns

duplicated returns a boolean Series that flags duplicates: True for a row that repeats an earlier one, False otherwise.

df_demo.duplicated(['Gender', 'Transfer']).head()

0    False
1    False
2     True
3     True
4     True
dtype: bool
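
The three keep options are easier to compare on a tiny table. The frame below is made up for illustration; only the drop_duplicates and duplicated calls come from the notes above.

```python
import pandas as pd

# Made-up data: rows 0, 2 and 4 share the same (Gender, Transfer) pair.
toy = pd.DataFrame({'Gender':   ['F', 'M', 'F', 'M', 'F'],
                    'Transfer': ['N', 'N', 'N', 'Y', 'N'],
                    'Name':     list('abcde')})
cols = ['Gender', 'Transfer']

first = toy.drop_duplicates(cols)               # keep the first of each group
last = toy.drop_duplicates(cols, keep='last')   # keep the last of each group
none = toy.drop_duplicates(cols, keep=False)    # drop every duplicated group
flags = toy.duplicated(cols)                    # True for repeats of an earlier row

print(first.index.tolist(), last.index.tolist(), none.index.tolist())
print(flags.tolist())
```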

3.3 Replacement Functions

3.3.1 Replacing Female/Male with 0/1

df['Gender'].replace({'Female':0, 'Male':1}).head()

0    0
1    1
2    1
3    0
4    1
Name: Gender, dtype: int64

df['Gender'].replace(['Female','Male'], [0,1]).head()

0    0
1    1
2    1
3    0
4    1
Name: Gender, dtype: int64

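3.3.2 `ffill/bfill` replacement

df.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='ffill')

to_replace: the values to be replaced
value: the replacement value
inplace: whether to modify the original data; False leaves it unchanged, True modifies it; default False
limit: limits the number of fills
regex: whether to_replace is interpreted as a regular expression; default False
method: the fill strategy; pad and ffill both fill with the previous value, bfill fills with the next value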

s = pd.Series(['a', 1, 'b', 2, 3, 4, 'a'])
s

0    a
1    1
2    b
3    2
4    3
5    4
6    a
dtype: object

s.replace([1,2], method='ffill')

0    a
1    a
2    b
3    b
4    3
5    4
6    a
dtype: object

s.replace([1,2], method='bfill')

0    a
1    b
2    b
3    3
4    3
5    4
6    a
dtype: object
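
As far as I know, the method argument of replace is deprecated in recent pandas 2.x releases, so here is a hedged sketch of one way to get the same results without it: blank out the target values with mask, then fill with ffill or bfill. This is an alternative I am suggesting, not the tutorial's approach.

```python
import pandas as pd

s = pd.Series(['a', 1, 'b', 2, 3, 4, 'a'])

# Mark the values to replace, turn them into NaN, then fill from a neighbour.
hit = s.isin([1, 2])
forward = s.mask(hit).ffill()   # take the previous value
backward = s.mask(hit).bfill()  # take the next value

print(forward.tolist())
print(backward.tolist())
```

One caveat of this sketch: values that were already NaN in the Series would be filled as well.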

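3.3.3 Conditional replacement with `where/mask`

where replaces the rows where the condition is False
mask replaces the rows where the condition is True

s = pd.Series([-1, 1.234, 100, -50])
s.where(s<0, 100)  # replaces the rows where the condition is False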

0     -1.0
1    100.0
2    100.0
3    -50.0
dtype: float64

s.mask(s<0, 50)  # replaces the rows where the condition is True

0     50.000
1      1.234
2    100.000
3     50.000
dtype: float64

s_condition= pd.Series([True,False,False,True],index=s.index)
s.mask(s_condition, -50)

0    -50.000
1      1.234
2    100.000
3    -50.000
dtype: float64

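# clip can only truncate out-of-range values to the bounds. How can out-of-range values be replaced with a custom value instead?
print(s.mask(s.clip(0,2) != s, 666))

# I have run into this in practice; back then I first checked whether each value fell outside the range and then assigned
# Writing it this way is really nice, lesson learned~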

0    666.000
1      1.234
2    666.000
3    666.000
dtype: float64
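
The same out-of-range replacement can also be written with an explicit range test. This is a sketch of an alternative spelling using Series.between, which includes both bounds by default; it is not part of the tutorial.

```python
import pandas as pd

s = pd.Series([-1, 1.234, 100, -50])

# Tutorial version: a value is out of range if clipping changes it.
via_clip = s.mask(s.clip(0, 2) != s, 666)

# Alternative: test the range directly and negate it.
via_between = s.mask(~s.between(0, 2), 666)

print(via_between.tolist())
```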

3.4 Sorting Functions

# Multi-level sorting
df = pd.read_csv('data/learn_pandas.csv')
df_demo = df[['Grade', 'Name', 'Height', 'Weight']].set_index(['Grade', 'Name'])
df_demo

		Height	Weight
Grade	Name
Freshman	Gaopeng Yang	158.9	46.0
Freshman	Changqiang You	166.5	70.0
Senior	Mei Sun	188.9	89.0
Sophomore	Xiaojuan Sun	NaN	41.0
Sophomore	Gaojuan You	174.0	74.0
...	...	...	...
Junior	Xiaojuan Sun	153.9	46.0
Senior	Li Zhao	160.9	50.0
	Chengqiang Chu	153.9	45.0
	Chengmei Shen	175.3	71.0
Sophomore	Chunpeng Lv	155.7	51.0

200 rows × 2 columns

# Sort by height; ascending by default
df_demo.sort_values('Height').head()

		Height	Weight
Grade	Name
Junior	Xiaoli Chu	145.4	34.0
Senior	Gaomei Lv	147.3	34.0
Sophomore	Peng Han	147.8	34.0
Senior	Changli Lv	148.7	41.0
Sophomore	Changjuan You	150.5	40.0

# Multi-column sort: Weight ascending, and Height descending among rows with equal Weight
df_demo.sort_values(['Weight','Height'],ascending=[True,False])

		Height	Weight
Grade	Name
Sophomore	Peng Han	147.8	34.0
Senior	Gaomei Lv	147.3	34.0
Junior	Xiaoli Chu	145.4	34.0
Sophomore	Qiang Zhou	150.5	36.0
Freshman	Yanqiang Xu	152.4	38.0
...	...	...	...
Junior	Qiang Sun	160.8	NaN
Senior	Qiang Shi	157.7	NaN
Senior	Mei Chen	153.6	NaN
Sophomore	Yanfeng Han	NaN	NaN
Junior	Chengli Zhao	NaN	NaN

200 rows × 2 columns

# Sort by index; the level names or level numbers must be given
df_demo.sort_index(level=['Grade','Name'], ascending=[True, False]).head()

		Height	Weight
Grade	Name
Freshman	Yanquan Wang	163.5	55.0
	Yanqiang Xu	152.4	38.0
	Yanqiang Feng	162.3	51.0
	Yanpeng Lv	NaN	65.0
	Yanli Zhang	165.1	52.0
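
How the ascending list pairs up with the column list is easiest to see on a few rows. A minimal sketch with made-up numbers: the first column is the primary key, and the second only breaks ties.

```python
import pandas as pd

# Made-up rows; 'a', 'b' and 'd' tie on Weight.
toy = pd.DataFrame({'Height': [145.4, 147.8, 150.5, 152.4],
                    'Weight': [34.0, 34.0, 36.0, 34.0]},
                   index=['a', 'b', 'c', 'd'])

# Weight ascending first; within equal Weight, Height descending.
ordered = toy.sort_values(['Weight', 'Height'], ascending=[True, False])
print(ordered)
```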

3.5 The `apply` Method

The apply method is typically used to iterate over the rows or columns of a DataFrame

# Only reach for apply when there is a genuinely custom need
df_demo = df[['Height', 'Weight']]
def my_mean(x):
    res = x.mean()
    return res

df_demo.apply(my_mean)

Height    163.218033
Weight     55.015873
dtype: float64

# mad returns the mean of the absolute deviations of a sequence from its mean
# e.g. for [1,3,7,10] the mean is 5.25, the absolute deviations are [4.25,2.25,1.75,4.75], and mad returns their mean, 3.25
df_demo.apply(lambda x:(x-x.mean()).abs().mean())

Height     6.707229
Weight    10.391870
dtype: float64

df_demo.mad()  # deprecated in pandas 1.5 and removed in 2.0; use the apply form above on newer versions

Height     6.707229
Weight    10.391870
dtype: float64
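
The worked numbers for [1, 3, 7, 10] can be checked step by step. A small sketch that uses only mean and abs, so it does not depend on the mad method being available:

```python
import pandas as pd

x = pd.Series([1, 3, 7, 10])

mean = x.mean()              # centre of the data
abs_dev = (x - mean).abs()   # distance of each element from the mean
mad = abs_dev.mean()         # mean absolute deviation

print(mean, abs_dev.tolist(), mad)
```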

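4. Window Objects

Rolling window: rolling
Expanding window: expanding
Exponentially weighted window: ewm

4.1 Rolling Windows

4.1.1 Rolling functions

# First call .rolling on the sequence to get a rolling object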
s = pd.Series([1,2,3,4,5])
roller = s.rolling(window=3)
roller

Rolling [window=3,center=False,axis=0]

# Run computations on the rolling object; note that the window includes the element in the current row
roller.sum()

0     NaN
1     NaN
2     6.0
3     9.0
4    12.0
dtype: float64
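
To make the window contents concrete, the rolling sum can be rebuilt by hand from shifted copies of the series. This is only an illustrative sketch; min_periods is a real rolling parameter that relaxes the rule that a full window is required.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Window of 3 = current element plus the two elements before it.
rolled = s.rolling(window=3).sum()
by_hand = s + s.shift(1) + s.shift(2)

# min_periods=1 lets incomplete windows at the start produce a value.
partial = s.rolling(window=3, min_periods=1).sum()

print(rolled.tolist(), by_hand.tolist(), partial.tolist(), sep='\n')
```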

s2 = pd.Series([1,2,6,16,30])
# Rolling covariance
roller.cov(s2)

0     NaN
1     NaN
2     2.5
3     7.0
4    12.0
dtype: float64

# Rolling correlation coefficient
roller.corr(s2)

0         NaN
1         NaN
2    0.944911
3    0.970725
4    0.995402
dtype: float64

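4.1.2 Rolling-like functions

shift: takes the value of the n-th element before the current one
diff: subtracts the n-th element before the current one
pct_change: computes the growth rate relative to the n-th element before the current one

s = pd.Series([1,3,6,10,15])

# Take the value of the n-th element before the current one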
s.shift(2)

0    NaN
1    NaN
2    1.0
3    3.0
4    6.0
dtype: float64

# Equivalent to a rolling method with window n+1
s.rolling(3).apply(lambda x:list(x)[0])

0    NaN
1    NaN
2    1.0
3    3.0
4    6.0
dtype: float64

# Difference from the n-th element before the current one
s.diff(2)

0    NaN
1    NaN
2    5.0
3    7.0
4    9.0
dtype: float64

# Equivalent to a rolling method with window n+1
s.rolling(3).apply(lambda x:list(x)[-1]-list(x)[0])

0    NaN
1    NaN
2    5.0
3    7.0
4    9.0
dtype: float64
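
pct_change is listed above but not demonstrated, so here is a brief sketch on the same series. It shows that the growth rate is just the difference divided by the earlier value.

```python
import pandas as pd

s = pd.Series([1, 3, 6, 10, 15])

# Growth rate relative to the element two positions earlier.
growth = s.pct_change(2)

# The same thing from diff and shift: (current - earlier) / earlier.
by_hand = s.diff(2) / s.shift(2)

print(growth.tolist())
```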

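# By default a rolling window extends toward earlier elements. Sometimes a window toward later elements is needed: for 1,2,3 with such a window of size 2 and a sum, the result is 3,5,NaN. How can this be implemented?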
s_demo = pd.Series([1,2,3])
s_demo.rolling(2).sum()

0    NaN
1    3.0
2    5.0
dtype: float64

s_demo.shift(2)

0    NaN
1    NaN
2    1.0
dtype: float64

s_demo.rolling(3).apply(lambda x:list(x)[0])

0    NaN
1    NaN
2    1.0
dtype: float64

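4.1.3 Practice

By default a rolling window extends toward earlier elements. Sometimes a window toward later elements is needed: for 1,2,3 with such a window of size 2 and a sum, the result is 3,5,NaN. How can this be implemented? (Hint: use shift)

This exercise had me going in circles for a long time. The rolling and shift material from above blurred together and I realized I had not really understood it, so I drew a diagram to sort out my thinking. Note that in the attempts above the rolling window is 3 while shift moves 2 steps.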

s = pd.Series([1,2,3])
s.shift(-1)+s

0    3.0
1    5.0
2    NaN
dtype: float64
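
The shift answer works for a window of 2; for larger windows one more general option is to reverse the series, roll, and reverse back. This is a sketch of that idea, not the tutorial's reference solution.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Answer from the notes: add each element to the one after it.
via_shift = s.shift(-1) + s

# General trick: reverse, apply the ordinary rolling window, reverse back.
via_reverse = s[::-1].rolling(2).sum()[::-1]

print(via_shift.tolist())
print(via_reverse.tolist())
```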

4.2 Expanding Windows

# The aggregation function is applied to each of these gradually expanding windows
s = pd.Series([1, 3, 6, 10])
s.expanding().mean()

0    1.000000
1    2.000000
2    3.333333
3    5.000000
dtype: float64

4.2.1 Cumulative maximum

s.cummax()  # cumulative maximum

0     1
1     3
2     6
3    10
dtype: int64

s.expanding().max()

0     1.0
1     3.0
2     6.0
3    10.0
dtype: float64

4.2.2 Cumulative sum

s.cumsum()  # cumulative sum

0     1
1     4
2    10
3    20
dtype: int64

s.expanding().sum()

0     1.0
1     4.0
2    10.0
3    20.0
dtype: float64

4.2.3 Cumulative product

s.cumprod()  # cumulative product

0      1
1      3
2     18
3    180
dtype: int64

from functools import reduce
reduce(lambda x,y:x*y,s.expanding())  # Argh... still not right..

0    1.0
1    NaN
2    NaN
3    NaN
dtype: float64

def multi(mylist):
    result=1
    for i in mylist:
        result=result*i
    return result

s.expanding().apply(multi)  # Finally got it, hahaha. The answer to the second exercise gave me the idea; so it can be used like this~

0      1.0
1      3.0
2     18.0
3    180.0
dtype: float64
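
The hand-written loop can also be swapped for NumPy's product function. A short sketch, assuming raw=True so that each window is passed to the function as a plain array:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 6, 10])

# np.prod multiplies all elements of each expanding window.
running_product = s.expanding().apply(np.prod, raw=True)

print(running_product.tolist())
```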

5. Exercises

5.1 Pokémon Dataset

5.1.1 Question 1

Task: sum HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed, and verify that the result equals Total

My answer

df = pd.read_csv('data/pokemon.csv')
(df.iloc[:,5:11].sum(axis=1)==df['Total']).all()

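5.1.2 Question 2

Task (a): find the number of distinct first types and the three most common ones

My answer

# df.drop_duplicates('#', keep='first')  # after deduplication: 800 rows ---> 721 rows
df.drop_duplicates('#', keep='first', inplace=True)  # modify the table in place

# 2(a) number of distinct first types ---> number of unique values
df['Type 1'].nunique()

# 2(a) the three most common types ---> unique values and their counts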
df['Type 1'].value_counts().head(3)

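Task (b): find the combinations of first and second type

# First drop the rows whose second type is missing
# Then count the unique combinations
df_dropna = df.dropna(subset=['Type 2'])  # 350 rows
df_dropna = df_dropna.drop_duplicates(['Type 1', 'Type 2'], keep='first')  # 125 rows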

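Task (c): find the type combinations that have not appeared yet

# Existing type combinations
cata = zip(df_dropna['Type 1'],df_dropna['Type 2'])
cata = list(cata)
cata

# First look at which values each of the two types takes. At first I assumed the two candidate sets differ, but the set difference turned out to be empty, i.e. both have the same unique values.
type_one = df['Type 1'].unique().tolist()
type_two = df_dropna['Type 2'].unique().tolist()
[item for item in type_one if item not in type_two]  # []

# List all combinations, using permutations() to generate the ordered pairs
import itertools
total_cata = list(itertools.permutations(type_one, 2))

np.array([item for item in total_cata if item not in cata]).shape  # (181, 2)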

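Reference answer

The reference answer is much more thorough: single-type combinations are something I did not consider
When I first wrote it I got a data type wrong, which made the set difference come out wrong

L_full = [i+' '+j for i in df['Type 1'].unique() for j in (df['Type 1'].unique().tolist() + [''])]  # + [''] covers the single-type combinations
L_part = [i+' '+j for i, j in zip(df['Type 1'], df['Type 2'].replace(np.nan, ''))]
res = set(L_full).difference(set(L_part))
len(res)  # 188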

5.1.3 Question 3

Task (a): take the physical attack; replace values above 120 with high, values below 50 with low, and the rest with mid

My answer

attack = pd.Series(df['Attack'])
attack = attack.mask(attack.values<50, 'low').mask(attack.values>120, 'high').mask((attack.values<=120)&(attack.values>=50),'mid')
attack.head()

0    low
1    mid
2    mid
4    mid
5    mid
Name: Attack, dtype: object
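
The three chained mask calls could also be collapsed into one np.select call. This is an alternative sketch on made-up boundary values, not the tutorial's reference answer; the conditions are checked in order and anything unmatched gets the default.

```python
import numpy as np
import pandas as pd

# Made-up attack values sitting right on the boundaries.
attack = pd.Series([49, 50, 120, 121], name='Attack')
a = attack.to_numpy()

# First matching condition wins; everything else falls back to 'mid'.
labels = pd.Series(
    np.select([a < 50, a > 120], ['low', 'high'], default='mid'),
    index=attack.index, name=attack.name)

print(labels.tolist())
```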

Task (b): take the first type and convert all letters to upper case, once with replace and once with apply

My answer

df['Type 1']

0        Grass
1        Grass
2        Grass
3        Grass
4         Fire
        ...   
795       Rock
796       Rock
797    Psychic
798    Psychic
799       Fire
Name: Type 1, Length: 800, dtype: object

# 3(b) Take the first type and convert all letters to upper case, once with replace and once with apply
# replace version
df['Type 1'].replace({i: str.upper(i) for i in df['Type 1']})
# apply version
df['Type 1'].apply(lambda x:str.upper(x))

0        GRASS
1        GRASS
2        GRASS
3        GRASS
4         FIRE
        ...   
795       ROCK
796       ROCK
797    PSYCHIC
798    PSYCHIC
799       FIRE
Name: Type 1, Length: 800, dtype: object

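Task (c): for each Pokémon compute the deviation of its six stats, i.e. the largest deviation of any stat from the median; add it to df and sort in descending order

My answer

# 3(c) Largest deviation of the six stats from their median per Pokémon, added to df and sorted in descending order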
df_ability = df.iloc[:,5:11]
df['result'] = df_ability.apply(lambda x:np.max((x-x.median()).abs()),1)
df.sort_values('result', ascending=False)
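
The row-wise apply can be checked against a vectorized version on a tiny table. The stats below are made up (three columns instead of six) purely to keep the arithmetic easy to follow.

```python
import pandas as pd

# Made-up stats for two Pokémon.
stats = pd.DataFrame({'HP': [45, 60], 'Attack': [49, 62], 'Defense': [49, 63]})

# Vectorized: subtract each row's median, take absolute values, then the row max.
row_median = stats.median(axis=1)
deviation = stats.sub(row_median, axis=0).abs().max(axis=1)

# Same computation with a row-wise apply, as in the answer above.
via_apply = stats.apply(lambda row: (row - row.median()).abs().max(), axis=1)

print(deviation.tolist(), via_apply.tolist())
```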

5.2 Exponentially Weighted Windows

5.2.1 Implementing ewm with an expanding window

Reference answer

s = pd.Series(np.random.randint(-1,2,30).cumsum())

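def myewm(x, alpha=0.2):
    # I knew a function had to be defined, but had no idea how to construct it.. /(ㄒoㄒ)/~~
    # Exponents for the weights
    r = np.arange(x.shape[0])[::-1]   # [::-1] is start:stop:step; a negative step walks from right to left
    # The weights that appear in the numerator
    win = (1-alpha)**r
    # Numerator / denominator
    res = (win*x).sum()/win.sum()
    return res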
s.expanding().apply(myewm).head()

0   -1.000000
1   -1.555556
2   -1.737705
3   -1.487805
4   -1.342694
dtype: float64
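
One way to gain confidence in the hand-written function is to compare it with the built-in ewm. A sketch under two assumptions: a fixed made-up series instead of the random one, and the default adjust=True, which uses the same weighted-average form.

```python
import numpy as np
import pandas as pd

def my_ewm(x, alpha=0.2):
    # Weights (1 - alpha)**k, with k = 0 for the newest observation.
    k = np.arange(len(x))[::-1]
    w = (1 - alpha) ** k
    return (w * x).sum() / w.sum()

# Fixed made-up data so the comparison is reproducible.
s = pd.Series([1.0, 2.0, 0.0, 3.0, 5.0])

manual = s.expanding().apply(my_ewm, raw=True)
builtin = s.ewm(alpha=0.2).mean()  # adjust=True is the default

print(np.allclose(manual, builtin))
```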

5.2.2 ewm as a rolling window

My answer

# The formula doesn't need to change, right? Just roll the window first, then apply the function to each window
s.rolling(window=3).apply(myewm)

0          NaN
1          NaN
2    -1.737705
3    -1.590164
4    -1.262295
5    -1.000000

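Afterword
This round was a bit hair-pulling; it felt like I spent a long, long... long time on it. The hair-pulling is real, the many hours spent studying are real, but so is the feeling that I am no longer afraid of pandas. I'm glad I signed up for this team study after some hesitation. Without the check-in deadline pushing me I probably would not have had the drive to finish. Here's to keeping it up and seeing it through~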
