Basic pandas operations (Part 3)

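This article walks through data preprocessing with Python's pandas library, covering core operations such as handling missing values and removing duplicates. It is aimed at the data cleaning and preparation stage.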

The code is adapted from the book 《python3人工智能入门到实战破冰》.

1. Handling missing values: dropping and replacing

First, create a DataFrame and give it some missing values.

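import pandas as pd
import numpy as np
# 5x3 frame of random integers in [1, 10); the values differ on every run
df = pd.DataFrame(np.random.randint(1, 10, [5, 3]), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
# Overwrite a few cells with NaN
df.loc['a', 'one'] = np.nan
df.loc['c', 'two'] = np.nan
df.loc['c', 'three'] = np.nan
df.loc['a', 'two'] = np.nan
df['four'] = 'bar'
# NaN > 0 evaluates to False, so row 'a' gets False here
df['five'] = df['one'] > 0
# Reindexing adds the new labels 'b', 'd', 'g' as rows filled entirely with NaN
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df2)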

Here is the output:

   one  two  three four   five
a  NaN  NaN    8.0  bar  False
b  NaN  NaN    NaN  NaN    NaN
c  2.0  NaN    NaN  bar   True
d  NaN  NaN    NaN  NaN    NaN
e  5.0  2.0    2.0  bar   True
f  1.0  3.0    3.0  bar   True
g  NaN  NaN    NaN  NaN    NaN
h  3.0  9.0    4.0  bar   True

The frame now contains many NaN values: rows b, d and g are entirely NaN, and rows a and c are partly NaN.
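To put a number on "many", the missing values can be counted per column and per row. The following is a minimal sketch on a small hand-built frame (the fixed values are an assumption for illustration, not the random data above); it uses `isna()` followed by `sum()`.

```python
import numpy as np
import pandas as pd

# Small fixed frame (illustrative values, not the article's random data)
df = pd.DataFrame(
    {
        "one": [np.nan, 2.0, 5.0],
        "two": [np.nan, np.nan, 2.0],
        "three": [8.0, np.nan, 2.0],
    },
    index=["a", "c", "e"],
)
# Adding the unseen label 'b' creates a row that is entirely NaN
df2 = df.reindex(["a", "b", "c", "e"])

# isna() gives a boolean frame; summing counts the True cells
missing_per_column = df2.isna().sum()
missing_per_row = df2.isna().sum(axis=1)

print(missing_per_column)
print(missing_per_row)
```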

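2. Dropping missing values

The dropna() function
1 Drop the rows (or columns) that contain missing values. Yes, you read that right: the entire row or column is removed. The optional parameter axis takes 0 or 1.
0 drops rows, 1 drops columns; the default is 0.
Result of dropping rows with missing values, print(df2.dropna(axis=0)). Note that the outputs from here on appear to come from a separate run in which a/two, c/two and c/three were set to -100, -90 and -80 instead of NaN; the random integers also differ from run to run.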

   one   two  three four  five
c  6.0 -90.0  -80.0  bar  True
e  2.0   1.0    7.0  bar  True
f  7.0   9.0    1.0  bar  True
h  4.0   2.0    3.0  bar  True

Result of dropping columns with missing values, print(df2.dropna(axis=1))

Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]

Every column contains at least one missing value, so all columns are dropped and only the index remains.
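The difference between the two axis settings is easier to see on a frame where only one column has a gap. This is a small sketch with made-up values (an assumption, not the data above); `dropna()` returns a new frame and leaves `df2` untouched.

```python
import numpy as np
import pandas as pd

# Only column 'one' has a missing value, in row 'b' (illustrative data)
df2 = pd.DataFrame(
    {
        "one": [1.0, np.nan, 3.0],
        "two": [4.0, 5.0, 6.0],
        "four": ["bar", "bar", "bar"],
    },
    index=["a", "b", "c"],
)

rows_kept = df2.dropna(axis=0)  # drops row 'b'
cols_kept = df2.dropna(axis=1)  # drops column 'one'

print(rows_kept)
print(cols_kept)
```

Neither call modifies `df2`; assign the result to a name to keep it.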
2 Drop only the rows in which every element is NaN; a row is removed only when all of its values are NaN. dropna(how='all')
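No output is shown for this case, so here is a minimal sketch comparing `how='all'` with the default `how='any'`. The frame is a hand-built assumption chosen so that the two settings give different results.

```python
import numpy as np
import pandas as pd

# Row 'b' is entirely NaN; rows 'a' and 'c' are only partly NaN (illustrative data)
df2 = pd.DataFrame(
    {"one": [1.0, np.nan, np.nan], "two": [np.nan, np.nan, 6.0]},
    index=["a", "b", "c"],
)

# how='all': a row is dropped only if every value in it is NaN
only_all_nan_dropped = df2.dropna(how="all")

# how='any' (the default): a row is dropped if it has at least one NaN
any_nan_dropped = df2.dropna(how="any")

print(only_all_nan_dropped)
print(any_nan_dropped)
```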

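3 Set a threshold with thresh: drop the rows that have fewer than 4 non-NaN values. In other words, only rows with at least 4 non-NaN values are kept.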

print(df2.dropna(thresh=4))

Result:

   one    two  three four   five
a  NaN -100.0    9.0  bar  False
c  5.0  -90.0  -80.0  bar   True
e  5.0    8.0    9.0  bar   True
f  4.0    4.0    9.0  bar   True
h  9.0    7.0    4.0  bar   True
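To check which rows pass the threshold, the non-missing values in each row can be counted with `notna().sum(axis=1)`. The sketch below uses a three-row frame with assumed values that mirror rows a, b and e above.

```python
import numpy as np
import pandas as pd

# Illustrative frame: 'a' has 4 non-missing values, 'b' has 0, 'e' has 5
df2 = pd.DataFrame(
    {
        "one": [np.nan, np.nan, 5.0],
        "two": [-100.0, np.nan, 8.0],
        "three": [9.0, np.nan, 9.0],
        "four": ["bar", np.nan, "bar"],
        "five": [False, np.nan, True],
    },
    index=["a", "b", "e"],
)

# Number of non-missing values per row
non_missing = df2.notna().sum(axis=1)

# Keep only rows with at least 4 non-missing values
kept = df2.dropna(thresh=4)

print(non_missing)
print(kept)
```

Row a survives even though it still contains one NaN, because it meets the threshold of 4.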

4 Drop the rows that have missing values in the specified columns, print(df2.dropna(subset=['one','five']))

   one   two  three four  five
c  1.0 -90.0  -80.0  bar  True
e  1.0   3.0    5.0  bar  True
f  5.0   1.0    1.0  bar  True
h  3.0   2.0    5.0  bar  True
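With `subset`, missing values in other columns are ignored. A minimal sketch with assumed values shows a row that is kept even though it has a NaN outside the listed columns.

```python
import numpy as np
import pandas as pd

# Illustrative data: 'a' is missing 'one'; 'c' is missing only 'two'
df2 = pd.DataFrame(
    {
        "one": [np.nan, 1.0, 5.0],
        "two": [-100.0, np.nan, 1.0],
        "five": [False, True, True],
    },
    index=["a", "c", "f"],
)

# Only NaN in 'one' or 'five' causes a row to be dropped
kept = df2.dropna(subset=["one", "five"])

print(kept)
```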

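3. Replacing (filling)

fillna() sets missing values to a specified value.
For example, print(df2.fillna(1)) replaces every missing value with 1; other fill values work the same way.
Result: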

   one    two  three four   five
a  1.0 -100.0    9.0  bar  False
b  1.0    1.0    1.0    1      1
c  7.0  -90.0  -80.0  bar   True
d  1.0    1.0    1.0    1      1
e  8.0    6.0    4.0  bar   True
f  4.0    8.0    1.0  bar   True
g  1.0    1.0    1.0    1      1
h  8.0    7.0    7.0  bar   True
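A single fill value is applied to every column, as the 1 in columns four and five shows. When columns need different fill values, `fillna()` also accepts a dict keyed by column name. The following is a sketch on assumed numeric data; the fill values 0 and -1 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative numeric frame with gaps in both columns
df2 = pd.DataFrame(
    {"one": [np.nan, 7.0, np.nan], "two": [-100.0, np.nan, 6.0]},
    index=["a", "b", "c"],
)

# One value for every missing cell
filled_scalar = df2.fillna(1)

# A different value per column (0 and -1 are arbitrary examples)
filled_per_column = df2.fillna({"one": 0, "two": -1})

print(filled_scalar)
print(filled_per_column)
```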

The key point: replace() substitutes one specified value for another. For example, to replace -100 with 100: print(df2.replace(-100,100))

   one    two  three four   five
a  NaN  100.0    7.0  bar  False
b  NaN    NaN    NaN  NaN    NaN
c  5.0  -90.0  -80.0  bar   True
d  NaN    NaN    NaN  NaN    NaN
e  9.0    1.0    9.0  bar   True
f  9.0    9.0    7.0  bar   True
g  NaN    NaN    NaN  NaN    NaN
h  1.0    7.0    5.0  bar   True
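`replace()` leaves NaN cells alone unless NaN itself is the value being replaced, and it can map several values in one call by passing a dict. Here is a minimal sketch on assumed data that mirrors the -100, -90 and -80 values above.

```python
import numpy as np
import pandas as pd

# Illustrative frame containing the sentinel values -100, -90 and -80
df2 = pd.DataFrame(
    {"two": [-100.0, np.nan, -90.0], "three": [9.0, np.nan, -80.0]},
    index=["a", "b", "c"],
)

# Replace a single value everywhere in the frame
single = df2.replace(-100, 100)

# Replace several values at once with an {old: new} mapping
several = df2.replace({-90: 90, -80: 80})

print(single)
print(several)
```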

Of course, we can also use the row and column selection covered earlier to replace the data at a specific position.
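As a sketch of that position-based approach, the example below assigns through `.loc`, first to a single cell and then to the cells picked out by a boolean mask. The data and the replacement values are assumptions for illustration; the work is done on a copy so the original frame stays unchanged.

```python
import numpy as np
import pandas as pd

# Illustrative frame
df2 = pd.DataFrame(
    {"one": [np.nan, 2.0], "two": [-100.0, np.nan]},
    index=["a", "c"],
)

patched = df2.copy()

# Replace one cell, addressed by row label and column label
patched.loc["a", "two"] = 100.0

# Replace every missing cell in column 'one', selected with a boolean mask
patched.loc[patched["one"].isna(), "one"] = 0.0

print(patched)
```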

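4. Removing duplicates

duplicated() and drop_duplicates().

duplicated() returns a boolean Series that marks each row as a duplicate of an earlier row or not
drop_duplicates() removes the duplicate rows, keeping the first occurrence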

print(df2.duplicated())
print(df2.drop_duplicates())

Result (rows b, d and g are all entirely NaN, so d and g are flagged as duplicates of b):

a    False
b    False
c    False
d     True
e    False
f    False
g     True
h    False
dtype: bool
   one    two  three four   five
a  NaN -100.0    4.0  bar  False
b  NaN    NaN    NaN  NaN    NaN
c  4.0  -90.0  -80.0  bar   True
e  1.0    1.0    8.0  bar   True
f  4.0    2.0    9.0  bar   True
h  1.0    7.0    2.0  bar   True
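Both functions keep the first occurrence by default; the `keep` parameter changes which copy survives. The following is a minimal sketch on assumed data in which three rows are entirely NaN, as rows b, d and g are above.

```python
import numpy as np
import pandas as pd

# Rows 'b', 'd' and 'g' are identical (all NaN); 'c' is unique (illustrative data)
df2 = pd.DataFrame(
    {
        "one": [np.nan, 4.0, np.nan, np.nan],
        "two": [np.nan, -90.0, np.nan, np.nan],
    },
    index=["b", "c", "d", "g"],
)

# True for every row that repeats an earlier row
flags = df2.duplicated()

# Default keep='first': the first of each group of identical rows survives
deduped = df2.drop_duplicates()

# keep='last': the last of each group survives instead
last_kept = df2.drop_duplicates(keep="last")

print(flags)
print(deduped)
print(last_kept)
```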