A Summary of pandas Usage in Python

License: CC 4.0 BY-SA

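This article walks through the key uses of pandas and NumPy in data processing: creating the Series and DataFrame data structures, index operations, data filtering, summary statistics, reading and writing data, handling missing values, checking for duplicates, and combining and joining data. The goal is to help readers build efficient data analysis skills.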

# These two modules go hand in hand
In [1]: import numpy as np

In [2]: import pandas as pd

pandas has two main data structures: Series and DataFrame.

Series

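A Series is a one-dimensional, dictionary-like object made up of a set of data (values) and a matching set of labels (index). One or more elements can be selected from a Series by their index labels.

Creating a Series:

Constructor: pd.Series(data, index=[...])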

In [3]: l = ['a', 'b', 'c', 'd', 'e']

In [4]: data = pd.Series(np.arange(5), l) # The first argument is the values, the second is the index.

In [5]: data
Out[5]:
a    0
b    1
c    2
d    3
e    4
dtype: int32

In [6]: data = pd.Series(l, np.arange(5))

In [7]: data
Out[7]:
0    a
1    b
2    c
3    d
4    e
dtype: object

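obj.reindex([...]): returns a new object conformed to the given index. Labels that already exist keep their values; labels that do not exist are filled with NaN. It rearranges the data to match the new labels rather than renaming them.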
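
As a minimal sketch of what reindex() does (the Series `s` and its labels are made up for illustration), the example below reorders two existing labels and requests one that is missing. The `fill_value` argument is an optional way to supply a default instead of NaN.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Reorder existing labels and request one label ('d') that does not exist.
reindexed = s.reindex(['c', 'a', 'd'])

# The same call with fill_value supplies a default instead of NaN.
filled = s.reindex(['c', 'a', 'd'], fill_value=0)

print(reindexed)
print(filled)
```

Note that `s` itself is left unchanged; reindex() always returns a new object.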

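obj.drop(): removes entries by label. The argument can be a single label or a list of labels. On a DataFrame it drops rows by default; to drop columns, pass axis=1 or axis='columns'. By default a new object is returned; with inplace=True the original object is modified in place and drop() returns None.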
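
The following sketch illustrates the drop() variants described above on a small, made-up DataFrame (the labels `r1`..`r3` and `a`..`d` are assumptions for the example).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['r1', 'r2', 'r3'],
                  columns=['a', 'b', 'c', 'd'])

no_r2 = df.drop('r2')                # drop one row; df itself is unchanged
no_bd = df.drop(['b', 'd'], axis=1)  # drop two columns
same = df.drop(columns=['b', 'd'])   # equivalent keyword form

df_copy = df.copy()
result = df_copy.drop('r1', inplace=True)  # modifies df_copy, returns None
```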

DataFrame

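A DataFrame is a two-dimensional, table-like structure with both a row index and a column index. It can be thought of as a dictionary of Series that share the same row index.

Creating a DataFrame:

Constructor: pd.DataFrame(data, columns=[...], index=[...]), where columns and index give the column labels and row labels, in the order listed.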

In [50]: col = ['math', 'Chinese', 'English']

In [47]: score = np.random.randint(60, 100, (4, 3)) # a 4x3 array of random integers from 60 to 99 (the upper bound 100 is excluded)

In [28]: index = ['Bob', 'Lily', 'Judy', 'Cindy']

In [48]: data = pd.DataFrame(score, columns=col, index=index)

In [52]: data
Out[52]:
       math  Chinese  English
Bob      69       69       66
Lily     70       70       87
Judy     61       75       75
Cindy    98       87       71

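data.loc[] selects subsets of rows and columns from a DataFrame by label. It is an indexer used with square brackets, not a function call: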

In [66]: data.loc['Bob']
Out[66]:
math       69
Chinese    69
English    66
Name: Bob, dtype: int32

In [70]: data.loc['Cindy', ['Chinese', 'math']]
Out[70]:
Chinese    87
math       98
Name: Cindy, dtype: int32

data.iloc[] selects subsets of rows and columns in the same way, but by integer position instead of by label:

In [71]: data.iloc[3, [0, 1]]
Out[71]:
math       98
Chinese    87
Name: Cindy, dtype: int32

Writing data['math'] selects a column directly and returns it as a Series.
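
Column selection combines naturally with boolean masks for filtering rows. The sketch below rebuilds the score table with the fixed values shown earlier (so the result is reproducible) and shows a few common selection forms; the thresholds are arbitrary.

```python
import pandas as pd

data = pd.DataFrame(
    {'math': [69, 70, 61, 98],
     'Chinese': [69, 70, 75, 87],
     'English': [66, 87, 75, 71]},
    index=['Bob', 'Lily', 'Judy', 'Cindy'])

math = data['math']                    # one column -> Series
two_cols = data[['math', 'English']]   # list of columns -> DataFrame
good_math = data[data['math'] >= 70]   # boolean mask selects rows
# Boolean mask for the rows, label for the column:
math_of_good_english = data.loc[data['English'] > 70, 'math']
```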

obj.sort_index(): works on both Series and DataFrame and sorts by index labels. By default (axis=0) it sorts the row index; axis=1 sorts the column labels.
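
A small sketch of the two axes, using a made-up DataFrame whose row and column labels are deliberately out of order:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  index=['y', 'x'],
                  columns=['c', 'a', 'b'])

by_rows = df.sort_index()                      # row labels become x, y
by_cols = df.sort_index(axis=1)                # column labels become a, b, c
desc = df.sort_index(axis=1, ascending=False)  # column labels become c, b, a
```

The values move together with their labels, so only the ordering changes.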

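data.sum() returns the sum of each column; data.sum(axis=1) returns the sum of each row.

data.mean() returns the mean of each column; data.mean(axis=1) returns the mean of each row.

data.describe() produces several summary statistics at once (count, mean, standard deviation, minimum, quartiles, and maximum).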
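
To make the axis convention concrete, here is a sketch that applies these methods to the score table, rebuilt with the fixed values shown earlier:

```python
import pandas as pd

data = pd.DataFrame(
    {'math': [69, 70, 61, 98],
     'Chinese': [69, 70, 75, 87],
     'English': [66, 87, 75, 71]},
    index=['Bob', 'Lily', 'Judy', 'Cindy'])

col_sum = data.sum()          # one total per subject
row_sum = data.sum(axis=1)    # one total per student
col_mean = data.mean()        # average per subject
row_mean = data.mean(axis=1)  # average per student
summary = data.describe()     # count, mean, std, min, quartiles, max

print(col_sum)
print(row_sum)
print(summary)
```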

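data.unique() removes duplicate values and returns the distinct values as an array. It is a Series method, so on a DataFrame it is applied to a single column, for example data['math'].unique().

data.value_counts() returns how many times each distinct value occurs, sorted from most to least frequent by default.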
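
A brief sketch of both methods on a small Series (the same values used in the isin() example in this article):

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4, 2, 3, 5, 2])

distinct = a.unique()      # NumPy array of distinct values, in order of appearance
counts = a.value_counts()  # Series: index = distinct values, values = frequencies

print(distinct)
print(counts)
```

Because value_counts() sorts by frequency, `counts.index[0]` is the most common value.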

a.isin(values): tests, element by element, whether each value of a is contained in values, and returns a boolean Series:

In [105]: a = pd.Series([1,2,3,4,2,3,5,2])

In [106]: m = a.isin([1,3,5])

In [107]: m
Out[107]:
0     True
1    False
2     True
3    False
4    False
5     True
6     True
7    False
dtype: bool
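
The boolean result of isin() is most often used as a filter. A minimal sketch with the same data:

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4, 2, 3, 5, 2])
mask = a.isin([1, 3, 5])

kept = a[mask]      # elements that are in the list
dropped = a[~mask]  # ~ inverts the mask: elements that are not in the list

print(kept.tolist())     # [1, 3, 3, 5]
print(dropped.tolist())  # [2, 4, 2, 2]
```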

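data = pd.read_csv(" "): reads a CSV file from the given path and stores it in the variable data. The skiprows parameter specifies rows to skip while reading, and nrows= limits reading to the first n data rows.

data.to_csv(" "): writes data to the given path in CSV format. Passing index=False leaves the row index out of the file; this is often the safer choice, because a stored index is read back as an extra column and can cause errors. index=True (the default) writes the row index to the file.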
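
The round trip can be sketched as follows. The file name `scores.csv` and the sample rows are hypothetical, and a temporary directory is used so the example leaves nothing behind.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'name': ['Bob', 'Lily', 'Judy', 'Cindy'],
                   'math': [69, 70, 61, 98]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'scores.csv')     # hypothetical file name
    df.to_csv(path, index=False)               # do not write the row index

    full = pd.read_csv(path)                   # all rows
    head = pd.read_csv(path, nrows=2)          # first two data rows only
    skipped = pd.read_csv(path, skiprows=[1])  # skip file line 1 (line 0 is the header)
```

With `skiprows=[1]` the header on line 0 is kept and the first data row (Bob) is skipped.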

a.isnull() checks each position for missing data and returns a boolean Series:

In [110]: a = pd.Series([1,2,3,4,np.nan,3,np.nan,2])

In [111]: a
Out[111]:
0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
5    3.0
6    NaN
7    2.0
dtype: float64

In [112]: a.isnull()
Out[112]:
0    False
1    False
2    False
3    False
4     True
5    False
6     True
7    False
dtype: bool

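a.dropna() filters out missing values:
On a DataFrame, calling it with no arguments drops every row that contains at least one NaN. With how='all', a row is dropped only when all of its values are NaN. With axis=1, the same rules are applied to columns instead of rows. With thresh=n, only rows that have at least n non-NaN values are kept.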

In [113]: a.dropna()
Out[113]:
0    1.0
1    2.0
2    3.0
3    4.0
5    3.0
7    2.0
dtype: float64
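
The DataFrame options are easier to compare side by side. The following is a sketch on a made-up frame; the column names and values are assumptions chosen so that each option gives a different result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, np.nan, 4.0],
                   'y': [1.0, np.nan, np.nan, np.nan],
                   'z': [1.0, 2.0, np.nan, 4.0]})

any_nan = df.dropna()                  # keeps only row 0 (no NaN at all)
all_nan = df.dropna(how='all')         # drops only row 2 (entirely NaN)
at_least_2 = df.dropna(thresh=2)       # keeps rows with >= 2 non-NaN values
cols = df.dropna(axis=1, thresh=3)     # keeps columns with >= 3 non-NaN values
```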

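data.fillna() fills in missing values:
fillna() also accepts a dictionary, which on a DataFrame maps column names to the fill value to use for each column.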

In [114]: a
Out[114]:
0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
5    3.0
6    NaN
7    2.0
dtype: float64

In [115]: a.fillna(10)
Out[115]:
0     1.0
1     2.0
2     3.0
3     4.0
4    10.0
5     3.0
6    10.0
7     2.0
dtype: float64
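
A sketch of the dictionary form on a made-up DataFrame. Filling one column with 0 and another with its own mean is just one possible choice, and ffill() is shown as an alternative that carries the previous valid value forward.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'math': [69.0, np.nan, 61.0],
                   'English': [np.nan, 87.0, np.nan]})

# A different fill value per column.
by_col = df.fillna({'math': 0, 'English': df['English'].mean()})

# Forward fill: propagate the last valid value down each column.
forward = df.ffill()
```

A leading NaN has nothing before it, so forward fill leaves it in place.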

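data.duplicated() returns a boolean Series that marks duplicates in a Series or DataFrame. By default the first occurrence of a value (or row) is False and every later repeat is True.

data.drop_duplicates() returns the data with the duplicates removed, that is, only the elements for which data.duplicated() is False.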
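
A minimal sketch on a made-up two-column DataFrame. The `subset` and `keep` arguments shown at the end are optional refinements: `subset` restricts the comparison to certain columns and `keep='last'` retains the last occurrence instead of the first.

```python
import pandas as pd

df = pd.DataFrame({'k1': ['one', 'two', 'one', 'two', 'one'],
                   'k2': [1, 1, 1, 2, 1]})

mask = df.duplicated()          # rows 2 and 4 repeat row 0
deduped = df.drop_duplicates()  # same rows as df[~mask]

# Compare on k1 only and keep the last occurrence of each value.
by_k1 = df.drop_duplicates(subset=['k1'], keep='last')
```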

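pd.cut() bins continuous values into discrete intervals. Here a list of ages is split into the bins (18, 25], (25, 35], (35, 60] and (60, 100]:

In [137]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]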

In [135]: bins = [18, 25, 35, 60, 100]

In [138]: cats = pd.cut(ages, bins)

In [139]: cats
Out[139]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [142]: cats.codes
Out[142]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [143]: cats.categories
Out[143]:
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')
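
Two optional arguments of pd.cut() are often useful with this kind of binning. The sketch below reuses the same ages and bins; the label names are hypothetical.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
labels = ['youth', 'young_adult', 'middle_aged', 'senior']  # hypothetical names

# labels= replaces the interval notation with readable category names.
named = pd.cut(ages, bins, labels=labels)
counts = pd.Series(named).value_counts()

# right=False makes the bins closed on the left: [18, 25), [25, 35), ...
left_closed = pd.cut(ages, bins, right=False)

print(counts)
```

With the default right-closed bins, age 25 falls into the first bin; with `right=False` it moves to the second.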

Some string operations:

split() splits a string on the separator passed as its argument:

In [151]: a = "I,like,  LeetCode"

In [152]: a
Out[152]: 'I,like,  LeetCode'

In [153]: b = a.split(',')

In [154]: b
Out[154]: ['I', 'like', '  LeetCode']

In [156]: pieces = [x.strip() for x in b]

In [157]: pieces
Out[157]: ['I', 'like', 'LeetCode']

join() concatenates a list of strings, using the string it is called on as the separator:

In [161]: a = ['I', ' like', ' LeetCode']

In [162]: b = ''.join(a)

In [163]: b
Out[163]: 'I like LeetCode'

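Using merge():

merge() joins two DataFrames (or named Series) on one or more key columns.

It is best to specify the key explicitly with on=. If it is omitted, merge() uses the column names the two objects have in common as keys, and raises an error if they share none.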

In [164]: a = pd.DataFrame({'key' : ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data' : np.arange(7)})

In [165]: a
Out[165]:
   data key
0     0   b
1     1   b
2     2   a
3     3   c
4     4   a
5     5   a
6     6   b

In [166]: b = pd.DataFrame({'key' : ['a', 'b', 'c'], 'data2': [0, 1, 2] })

In [167]: b
Out[167]:
   data2 key
0      0   a
1      1   b
2      2   c

In [169]: c = pd.merge(a, b, on='key')

In [170]: c
Out[170]:
   data key  data2 # column 1 comes from a, column 2 is the shared key, column 3 comes from b
0     0   b      1
1     1   b      1
2     6   b      1
3     2   a      0
4     4   a      0
5     5   a      0
6     3   c      2

If the key columns have different names in the two objects, pass left_on and right_on to merge(). For example, if the key column of b were named key2:
pd.merge(a, b, left_on='key', right_on='key2')
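
A runnable sketch of left_on/right_on follows, using two made-up frames whose key columns really do have different names. It also shows the `how` argument, which controls which keys are kept; merge() defaults to an inner join.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'd'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key2': ['a', 'b', 'c'], 'rval': [10, 20, 30]})

# Inner join (default): only keys present on both sides ('a' and 'b').
inner = pd.merge(left, right, left_on='key', right_on='key2')

# Outer join: keys from either side; gaps are filled with NaN.
outer = pd.merge(left, right, left_on='key', right_on='key2', how='outer')

# Left join: every row of left is kept, matched or not.
left_join = pd.merge(left, right, left_on='key', right_on='key2', how='left')
```

When the key names differ, both key columns appear in the result.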

Merging on the index:

Passing left_index=True (or right_index=True) to merge() uses the index of the left (or right) object as the merge key.

In [184]: left1 = pd.DataFrame({'key' : ['a', 'b', 'a', 'a', 'b', 'c'], 'value' : range(6)})

In [185]: right1 = pd.DataFrame({'group_val' : [3.5, 7]}, index = ['a', 'b'])

In [186]: left1
Out[186]:
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5

In [187]: right1
Out[187]:
   group_val
a        3.5
b        7.0

# The left object is joined on its 'key' column, the right object on its index
In [188]: pd.merge(left1, right1, left_on='key', right_index=True)
Out[188]:
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0

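NumPy's concatenate() function joins arrays along an existing axis. With the default axis=0 the arrays are stacked row-wise; with axis=1 they are placed side by side.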

In [190]: a = np.arange(9).reshape((3, 3))

In [191]: a
Out[191]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [192]: np.concatenate([a, a])
Out[192]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8],
       [0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [193]: np.concatenate([a, a], axis=1)
Out[193]:
array([[0, 1, 2, 0, 1, 2],
       [3, 4, 5, 3, 4, 5],
       [6, 7, 8, 6, 7, 8]])

In [194]: np.concatenate([a * a], axis=1) # the list holds a single array, so the result is just a * a (element-wise square)
Out[194]:
array([[ 0,  1,  4],
       [ 9, 16, 25],
       [36, 49, 64]])

Using concat():

The concat() function in pandas concatenates objects along an axis:

Note that concat() takes the objects as a list, in the form pd.concat([a, b]).

In [61]: s1 = pd.Series([0, 1], index=['a', 'b'])

In [62]: s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

In [63]: s3 = pd.Series([5, 6], index=['f', 'g'])

In [64]: pd.concat([s1, s2, s3]) # concatenate along axis 0 (the default)
Out[64]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

# Concatenate side by side (axis=1)
In [205]: pd.concat([s1, s2, s3], axis=1, sort=True)
Out[205]:
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
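
Two further options of concat() are sketched below: `keys` labels each piece in the result, and `join='inner'` keeps only the labels shared by all pieces. The series `s4` is hypothetical, added so that two inputs have overlapping index labels.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s4 = pd.Series([0, 1, 5], index=['a', 'b', 'f'])  # hypothetical, overlaps s1

# keys= on axis 0 builds a two-level index identifying each piece.
keyed = pd.concat([s1, s2], keys=['first', 'second'])

# keys= on axis 1 becomes the column names.
named_cols = pd.concat([s1, s4], axis=1, keys=['s1', 's4'])

# join='inner' keeps only index labels present in every piece.
inner = pd.concat([s1, s4], axis=1, join='inner')
```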

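Using stack():

stack() pivots the columns into the rows, producing a Series with a two-level index; unstack() pivots a level of the row index back into columns.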


In [234]: data # the original data
Out[234]:
       math  Chinese  English
Bob      69       69       66
Lily     70       70       87
Judy     61       75       75
Cindy    98       87       71

In [239]: a = data.stack()

In [240]: a
Out[240]:
Bob    math       69
       Chinese    69
       English    66
Lily   math       70
       Chinese    70
       English    87
Judy   math       61
       Chinese    75
       English    75
Cindy  math       98
       Chinese    87
       English    71
dtype: int32

In [241]: a.unstack() # the inverse of stack() above
Out[241]:
       math  Chinese  English
Bob      69       69       66
Lily     70       70       87
Judy     61       75       75
Cindy    98       87       71
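
As a closing sketch, the stacked Series can be indexed with a (row, column) tuple, and unstack() accepts a `level` argument to choose which index level becomes the columns. The table is rebuilt here with the fixed scores shown above.

```python
import pandas as pd

data = pd.DataFrame(
    {'math': [69, 70, 61, 98],
     'Chinese': [69, 70, 75, 87],
     'English': [66, 87, 75, 71]},
    index=['Bob', 'Lily', 'Judy', 'Cindy'])

stacked = data.stack()

# Each value is addressed by a (row label, column label) pair.
lily_english = stacked.loc[('Lily', 'English')]

restored = stacked.unstack()        # innermost level (subjects) -> columns
by_name = stacked.unstack(level=0)  # outer level (names) -> columns

print(lily_english)
print(by_name)
```

Unstacking level 0 puts the students in the columns, which gives the transpose of the original table.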