Pandas Summary


License: CC 4.0 BY-SA


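This article walks through using the pandas library for data processing: viewing data, selection, handling missing values, and statistical operations. It covers slicing, filtering, filling, and conversion techniques, and is aimed at both beginners and more experienced data analysts.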


1. Viewing Data

View the first and last rows of the frame:
For details, see the pandas 0.25.0 documentation.

data1 = data.head(6)  # first six rows
data2 = data.tail(6)  # last six rows
print(data1)
print('----'*50)
print(data2)

   Rank       City            State Population Date of census/estimate
0     1  London[2]   United Kingdom  8,615,246                1-Jun-14
1     2     Berlin          Germany  3,437,916               31-May-14
2     3     Madrid            Spain  3,165,235                1-Jan-14
3     4       Rome            Italy  2,872,086               30-Sep-14
4     5      Paris           France  2,273,305                1-Jan-13
5     6  Bucharest          Romania  1,883,425               20-Oct-11

---------------------------------------------------------------------------

     Rank        City            State Population Date of census/estimate
99    100  Valladolid            Spain    311,501                1-Jan-12
100   101        Bonn          Germany    309,869               31-Dec-12
101   102       Malmö           Sweden    309,105               31-Mar-13
102   103  Nottingham   United Kingdom    308,735               30-Jun-12
103   104    Katowice           Poland    308,269               30-Jun-12
104   105      Kaunas        Lithuania    306,888                1-Jan-13
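
As a small self-contained sketch of the same two calls: the frame below is a hypothetical stand-in named `cities`, built from the first five rows of the table above, because the article's own `data` frame is loaded elsewhere.

```python
import pandas as pd

# Hypothetical stand-in for the article's `data` frame (first five rows only).
cities = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "City": ["London", "Berlin", "Madrid", "Rome", "Paris"],
    "Population": ["8,615,246", "3,437,916", "3,165,235", "2,872,086", "2,273,305"],
})

first_two = cities.head(2)  # rows at positions 0 and 1
last_two = cities.tail(2)   # rows at positions 3 and 4
print(first_two)
print(last_two)
```

Both calls return new DataFrames that keep the original row labels, which is why the tail output in the article starts at label 99.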

Display the index, the columns, and the underlying NumPy data:

print(data.index)
print('--'*40)
print(data.columns)
print('--'*40)
print(data.values)

RangeIndex(start=0, stop=105, step=1)
--------------------------------------------------------------------------------
Index(['Rank', 'City', 'State', 'Population', 'Date of census/estimate'], dtype='object')
--------------------------------------------------------------------------------
[[1 'London[2]' ' United Kingdom' '8,615,246' '1-Jun-14']
 [2 'Berlin' ' Germany' '3,437,916' '31-May-14']
 [3 'Madrid' ' Spain' '3,165,235' '1-Jan-14']
 [4 'Rome' ' Italy' '2,872,086' '30-Sep-14']
 [5 'Paris' ' France' '2,273,305' '1-Jan-13']
...
 [101 'Bonn' ' Germany' '309,869' '31-Dec-12']
 [102 'Malmö' ' Sweden' '309,105' '31-Mar-13']
 [103 'Nottingham' ' United Kingdom' '308,735' '30-Jun-12']
 [104 'Katowice' ' Poland' '308,269' '30-Jun-12']
 [105 'Kaunas' ' Lithuania' '306,888' '1-Jan-13']]

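describe() gives a quick statistical summary of the data (only numeric columns are summarized, so just Rank appears here while Population is still stored as text):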

print(data.describe())

             Rank
count  105.000000
mean    53.057143
std     30.428298
min      1.000000
25%     27.000000
50%     53.000000
75%     79.000000
max    105.000000
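
A minimal sketch of why only `Rank` shows up, using a hypothetical three-row frame `cities` rather than the article's data. Passing `include="all"` asks `describe()` to report on the text columns as well.

```python
import pandas as pd

# Hypothetical sample: Population is text because of the thousands separators.
cities = pd.DataFrame({
    "Rank": [1, 2, 3],
    "City": ["London", "Berlin", "Madrid"],
    "Population": ["8,615,246", "3,437,916", "3,165,235"],
})

summary_numeric = cities.describe()            # numeric columns only
summary_all = cities.describe(include="all")   # adds count/unique/top/freq for text columns
print(summary_numeric)
print(summary_all)
```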

Transpose the data (swap rows and columns):

print(data)
print('--'*40)
print(data.T)

     Rank        City            State Population Date of census/estimate
0       1   London[2]   United Kingdom  8,615,246                1-Jun-14
1       2      Berlin          Germany  3,437,916               31-May-14
2       3      Madrid            Spain  3,165,235                1-Jan-14
3       4        Rome            Italy  2,872,086               30-Sep-14
4       5       Paris           France  2,273,305                1-Jan-13
..    ...         ...              ...        ...                     ...
100   101        Bonn          Germany    309,869               31-Dec-12
101   102       Malmö           Sweden    309,105               31-Mar-13
102   103  Nottingham   United Kingdom    308,735               30-Jun-12
103   104    Katowice           Poland    308,269               30-Jun-12
104   105      Kaunas        Lithuania    306,888                1-Jan-13

[105 rows x 5 columns]
--------------------------------------------------------------------------------
                                     0          1    ...        103         104
Rank                                   1          2  ...        104         105
City                           London[2]     Berlin  ...   Katowice      Kaunas
State                     United Kingdom    Germany  ...     Poland   Lithuania
Population                     8,615,246  3,437,916  ...    308,269     306,888
Date of census/estimate         1-Jun-14  31-May-14  ...  30-Jun-12    1-Jan-13

[5 rows x 105 columns]

Sort by an axis:

print(data.sort_index(axis=0, ascending=False))  # axis=0 sorts by the row labels (axis=1 would sort by the column labels); ascending=False gives descending order

     Rank        City            State Population Date of census/estimate
104   105      Kaunas        Lithuania    306,888                1-Jan-13
103   104    Katowice           Poland    308,269               30-Jun-12
102   103  Nottingham   United Kingdom    308,735               30-Jun-12
101   102       Malmö           Sweden    309,105               31-Mar-13
100   101        Bonn          Germany    309,869               31-Dec-12
..    ...         ...              ...        ...                     ...
4       5       Paris           France  2,273,305                1-Jan-13
3       4        Rome            Italy  2,872,086               30-Sep-14
2       3      Madrid            Spain  3,165,235                1-Jan-14
1       2      Berlin          Germany  3,437,916               31-May-14
0       1   London[2]   United Kingdom  8,615,246                1-Jun-14

Sort by values:

print(data.sort_values(['City']))

    Rank       City         State Population Date of census/estimate
91    92     Aarhus       Denmark    326,676                1-Oct-14
85    86   Alicante         Spain    334,678                1-Jan-12
22    23  Amsterdam   Netherlands    813,562               31-May-14
58    59    Antwerp       Belgium    510,610                1-Jan-14
33    34     Athens        Greece    664,046               24-May-11
..   ...        ...           ...        ...                     ...
34    35    Wrocław        Poland    632,432               31-Mar-14
82    83  Wuppertal       Germany    342,885               31-Dec-12
23    24     Zagreb       Croatia    790,017               31-Mar-11
32    33   Zaragoza         Spain    666,058                1-Jan-14
27    28       Łódź        Poland    709,757               31-Mar-14
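
The difference between sorting by labels and sorting by values can be sketched on a hypothetical five-row frame `cities` (not the article's full table):

```python
import pandas as pd

cities = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "City": ["London", "Berlin", "Madrid", "Rome", "Paris"],
    "Population": [8615246, 3437916, 3165235, 2872086, 2273305],
})

rows_reversed = cities.sort_index(axis=0, ascending=False)  # by row labels, descending
cols_sorted = cities.sort_index(axis=1)                     # by column labels, ascending
by_city = cities.sort_values("City")                        # by the values in one column

print(list(rows_reversed.index))
print(list(cols_sorted.columns))
print(list(by_city["City"]))
```

None of these calls modifies `cities`; each returns a new, reordered frame.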

2. Selection

Getting

Selecting a single column returns a Series; this is equivalent to data.State

print(data['State'])

0       United Kingdom
1              Germany
2                Spain
3                Italy
4               France
            ...       
100            Germany
101             Sweden
102     United Kingdom
103             Poland
104          Lithuania
Name: State, Length: 105, dtype: object

Selecting with [] slices the rows

print(data[:5])

   Rank       City            State Population Date of census/estimate
0     1  London[2]   United Kingdom  8,615,246                1-Jun-14
1     2     Berlin          Germany  3,437,916               31-May-14
2     3     Madrid            Spain  3,165,235                1-Jan-14
3     4       Rome            Italy  2,872,086               30-Sep-14
4     5      Paris           France  2,273,305                1-Jan-13

Selection by Label

Use a label to get a cross-section (a single row)

print(data.loc[data.index[0]])

Rank                                     1
City                             London[2]
State                       United Kingdom
Population                       8,615,246
Date of census/estimate           1-Jun-14
Name: 0, dtype: object

Select on multiple axes by label

print(data.loc[:, ['State', 'Population']])

               State Population
0     United Kingdom  8,615,246
1            Germany  3,437,916
2              Spain  3,165,235
3              Italy  2,872,086
4             France  2,273,305
..               ...        ...
100          Germany    309,869
101           Sweden    309,105
102   United Kingdom    308,735
103           Poland    308,269
104        Lithuania    306,888

[105 rows x 2 columns]

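Slicing by label

print(data.loc[1: 4, ['State', 'Population']])
print(data.loc[1: 4, 'Rank':'Population'])
print(data.loc[[1, 3], 'City':'Population'])
# With loc you can pass either a slice or individual labels. Note that a label
# slice is not like an ordinary Python slice: both endpoints are included.
# A two-dimensional array is really just a table: it has fields and values,
# laid out along two axes, rows (axis=0) and columns (axis=1).
# The axes are what we use to locate elements. Statistical work usually means
# operating on data in bulk, so the values are kept in list-like containers
# and fetched by key: each field name is a key, and its value is the whole run
# of data that follows it. Seen this way, every row and every column is a
# 'key'='value' mapping, which is a Series, and the structure formed by
# interleaving Series is a DataFrame.
# This is just my personal understanding (be kind).

The structure of this table is a DataFrame: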

type	index	Series	Series	Series
-	-	Rank	State	Population
Series	0	1	A	A
Series	1	2	B	B

Reduce the dimensions of the returned object

print(data.loc[1, ['State', 'Population']])  # sounds fancy, but it just pins down one row; returns <class 'pandas.core.series.Series'>

State           Germany
Population    3,437,916
Name: 1, dtype: object

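Get a scalar value

print(data.loc[1, 'Population'])        # like peeling an onion: one more label and you reach a single value
print(type(data.loc[1, 'Population']))  # still a str, because Population has not been converted yet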

3,437,916
<class 'str'>

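Fast access to a scalar (equivalent to the previous method)

print(data.at[1, 'Population'])        # equivalent to the loc lookup above
print(type(data.at[1, 'Population']))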

3,437,916
<class 'str'>

Selection by Position

Select by passing an integer position (this selects a row)

print(data.iloc[1])

Rank                               2
City                          Berlin
State                        Germany
Population                 3,437,916
Date of census/estimate    31-May-14
Name: 1, dtype: object

Slice by integer positions

print(data.iloc[1:3, 0:4])

   Rank    City     State Population
1     2  Berlin   Germany  3,437,916
2     3  Madrid     Spain  3,165,235

Select by passing lists of positions

print(data.iloc[[1, 3, 5], [0, 1, 2]])

   Rank       City     State
1     2     Berlin   Germany
3     4       Rome     Italy
5     6  Bucharest   Romania

Slice rows

print(data.iloc[1:3, :])

   Rank    City     State Population Date of census/estimate
1     2  Berlin   Germany  3,437,916               31-May-14
2     3  Madrid     Spain  3,165,235                1-Jan-14

Slice columns

print(data.iloc[:, 0:3])

     Rank        City            State
0       1   London[2]   United Kingdom
1       2      Berlin          Germany
2       3      Madrid            Spain
3       4        Rome            Italy
4       5       Paris           France
..    ...         ...              ...
100   101        Bonn          Germany
101   102       Malmö           Sweden
102   103  Nottingham   United Kingdom
103   104    Katowice           Poland
104   105      Kaunas        Lithuania

[105 rows x 3 columns]

Get a specific value

print(data.iloc[1, 1])
print(data.iat[1, 1])  # iat is the positional scalar accessor; at expects labels

Berlin
Berlin
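
A short sketch of the two scalar accessors on a hypothetical frame `cities`: `at` takes labels, `iat` takes integer positions, and mixing the two styles fails because no column is labelled `1`.

```python
import pandas as pd

cities = pd.DataFrame({
    "Rank": [1, 2, 3],
    "City": ["London", "Berlin", "Madrid"],
})

by_label = cities.at[1, "City"]  # row label 1, column label "City"
by_position = cities.iat[1, 1]   # second row, second column

# A positional column index passed to the label-based accessor raises an error
# (the exact exception type has varied between pandas versions).
try:
    cities.at[1, 1]
    mixed_lookup_ok = True
except (KeyError, ValueError):
    mixed_lookup_ok = False

print(by_label, by_position, mixed_lookup_ok)
```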

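Boolean Indexing

Use the values of a single column to select data

data.Population = data.Population.apply(lambda x: int(x.replace(',', '')))
# Take every value in Population, strip the thousands separators with the
# lambda to turn the strings into ints, then assign the result back to Population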
print(data[data.Population > 1000000])

    Rank          City            State  Population Date of census/estimate
0      1     London[2]   United Kingdom     8615246                1-Jun-14
1      2        Berlin          Germany     3437916               31-May-14
2      3        Madrid            Spain     3165235                1-Jan-14
3      4          Rome            Italy     2872086               30-Sep-14
4      5         Paris           France     2273305                1-Jan-13
5      6     Bucharest          Romania     1883425               20-Oct-11
6      7        Vienna          Austria     1794770                1-Jan-15
7      8   Hamburg[10]          Germany     1746342               30-Dec-13
8      9      Budapest          Hungary     1744665                1-Jan-14
9     10        Warsaw           Poland     1729119               31-Mar-14
10    11     Barcelona            Spain     1602386                1-Jan-14
11    12        Munich          Germany     1407836               31-Dec-13
12    13         Milan            Italy     1332516               30-Sep-14
13    14         Sofia         Bulgaria     1291895               14-Dec-14
14    15        Prague   Czech Republic     1246780                1-Jan-13
15    16  Brussels[17]          Belgium     1175831                1-Jan-14
16    17    Birmingham   United Kingdom     1092330               30-Jun-13
17    18       Cologne          Germany     1034175               31-Dec-13
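
As an alternative sketch (not the approach used above), the same conversion can be written with the vectorized `str.replace` and `astype`, and several conditions can be combined with `&`. The frame `cities` and the 3,000,000 threshold are hypothetical.

```python
import pandas as pd

cities = pd.DataFrame({
    "City": ["London", "Berlin", "Madrid", "Rome"],
    "State": ["United Kingdom", "Germany", "Spain", "Italy"],
    "Population": ["8,615,246", "3,437,916", "3,165,235", "2,872,086"],
})

# Strip the thousands separators, then convert the whole column to integers.
cities["Population"] = (
    cities["Population"].str.replace(",", "", regex=False).astype(int)
)

big = cities[cities["Population"] > 3_000_000]

# Each condition needs its own parentheses when combined with & or |.
big_outside_uk = cities[
    (cities["Population"] > 3_000_000) & (cities["State"] != "United Kingdom")
]

print(big)
print(big_outside_uk)
```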

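Select data with a where operation (on the numeric columns only, since comparing text with 0 raises a TypeError):
```
numeric = data[['Rank', 'Population']]
print(numeric[numeric > 0])
```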

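Filter with the isin() method

import pandas as pd

a = list(range(len(data.index)))
a = pd.Series(a, index=data.index)  # the new column's index must match the original data's index
data1 = data.copy()
data1['E'] = a
print(data1[data1['E'].isin([2, 4])])  # E holds ints, so match against ints, not '2' and '4'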

   Rank    City    State  Population Date of census/estimate  E
2     3  Madrid    Spain     3165235                1-Jan-14  2
4     5   Paris   France     2273305                1-Jan-13  4
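
A small sketch of how `isin()` compares values, on a hypothetical frame `cities`. The candidate values have to be of the same type as the column: integer values are not matched by their string forms.

```python
import pandas as pd

cities = pd.DataFrame({
    "City": ["London", "Berlin", "Madrid", "Rome", "Paris"],
    "State": ["United Kingdom", "Germany", "Spain", "Italy", "France"],
    "E": [0, 1, 2, 3, 4],
})

int_mask = cities["E"].isin([2, 4])      # matches the integers 2 and 4
str_mask = cities["E"].isin(["2", "4"])  # strings never equal the integers

selected = cities[int_mask]
in_states = cities[cities["State"].isin(["Germany", "Spain"])]

print(selected)
print(in_states)
```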

Setting

Set a new column:
```
# already inserted in the previous step
```

Set a new value by label:

data1.at[data.index[0], 'f'] = 1
print(data1)

     Rank        City            State  Population Date of census/estimate    f
0       1   London[2]   United Kingdom     8615246                1-Jun-14    1
1       2      Berlin          Germany     3437916               31-May-14    1
2       3      Madrid            Spain     3165235                1-Jan-14    2
3       4        Rome            Italy     2872086               30-Sep-14    3
4       5       Paris           France     2273305                1-Jan-13    4
..    ...         ...              ...         ...                     ...  ...
100   101        Bonn          Germany      309869               31-Dec-12  100
101   102       Malmö           Sweden      309105               31-Mar-13  101
102   103  Nottingham   United Kingdom      308735               30-Jun-12  102
103   104    Katowice           Poland      308269               30-Jun-12  103
104   105      Kaunas        Lithuania      306888                1-Jan-13  104

[105 rows x 6 columns]

Set a new value by position:

data1.iat[1,2] = 0
print(data1)

     Rank        City            State  Population Date of census/estimate    f
0       1   London[2]   United Kingdom     8615246                1-Jun-14    0
1       2      Berlin                0     3437916               31-May-14    1
2       3      Madrid            Spain     3165235                1-Jan-14    2
3       4        Rome            Italy     2872086               30-Sep-14    3
4       5       Paris           France     2273305                1-Jan-13    4
..    ...         ...              ...         ...                     ...  ...
100   101        Bonn          Germany      309869               31-Dec-12  100
101   102       Malmö           Sweden      309105               31-Mar-13  101
102   103  Nottingham   United Kingdom      308735               30-Jun-12  102
103   104    Katowice           Poland      308269               30-Jun-12  103
104   105      Kaunas        Lithuania      306888                1-Jan-13  104

[105 rows x 6 columns]

Set a group of new values with a NumPy array

import numpy as np
data1.loc[:, 'D'] = np.array([5] * len(data1))
print(data1)

     Rank        City            State  ...  Date of census/estimate    f  D
0       1   London[2]   United Kingdom  ...                 1-Jun-14    0  5
1       2      Berlin          Germany  ...                31-May-14    1  5
2       3      Madrid            Spain  ...                 1-Jan-14    2  5
3       4        Rome            Italy  ...                30-Sep-14    3  5
4       5       Paris           France  ...                 1-Jan-13    4  5
..    ...         ...              ...  ...                      ...  ... ..
100   101        Bonn          Germany  ...                31-Dec-12  100  5
101   102       Malmö           Sweden  ...                31-Mar-13  101  5
102   103  Nottingham   United Kingdom  ...                30-Jun-12  102  5
103   104    Katowice           Poland  ...                30-Jun-12  103  5
104   105      Kaunas        Lithuania  ...                 1-Jan-13  104  5

[105 rows x 7 columns]

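Set new values with a where operation:

# data2 is assumed here to be a copy of the frame that carries the f column
data2.loc[data2.f > 0, 'f'] = -data2.f  # one loc call with a mask, rather than chained assignment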
print(data2)

     Rank        City            State  Population Date of census/estimate    f
0       1   London[2]   United Kingdom     8615246                1-Jun-14    0
1       2      Berlin          Germany     3437916               31-May-14   -1
2       3      Madrid            Spain     3165235                1-Jan-14   -2
3       4        Rome            Italy     2872086               30-Sep-14   -3
4       5       Paris           France     2273305                1-Jan-13   -4
..    ...         ...              ...         ...                     ...  ...
100   101        Bonn          Germany      309869               31-Dec-12 -100
101   102       Malmö           Sweden      309105               31-Mar-13 -101
102   103  Nottingham   United Kingdom      308735               30-Jun-12 -102
103   104    Katowice           Poland      308269               30-Jun-12 -103
104   105      Kaunas        Lithuania      306888                1-Jan-13 -104

[105 rows x 6 columns]
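
Two ways to negate the positive values, sketched on a hypothetical one-column frame. `Series.where` returns a new Series, while a single `.loc` assignment with a boolean mask changes the frame in place and avoids chained indexing.

```python
import pandas as pd

frame = pd.DataFrame({"f": [0, 1, 2, 3]})

# where keeps values where the condition holds and replaces the rest.
negated = frame["f"].where(frame["f"] <= 0, -frame["f"])

# In-place variant: boolean row mask plus column label in one .loc call.
# The right-hand Series is aligned on the index, so only masked rows change.
frame.loc[frame["f"] > 0, "f"] = -frame["f"]

print(list(negated))
print(list(frame["f"]))
```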

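3. Missing Data

pandas primarily uses np.nan to represent missing data. By default these values are excluded from computations.

The reindex() method lets you change, add, or delete the index on a specified axis. It returns a copy of the original data.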

data3 = data2.reindex(index=data2.index, columns=list(data2.columns) + ['E'])
data3.loc[0:2, 'E'] = 1
print(data3)

     Rank        City            State  ...  Date of census/estimate    f    E
0       1   London[2]   United Kingdom  ...                 1-Jun-14    0  1.0
1       2      Berlin          Germany  ...                31-May-14    1  1.0
2       3      Madrid            Spain  ...                 1-Jan-14    2  1.0
3       4        Rome            Italy  ...                30-Sep-14    3  NaN
4       5       Paris           France  ...                 1-Jan-13    4  NaN
..    ...         ...              ...  ...                      ...  ...  ...
100   101        Bonn          Germany  ...                31-Dec-12  100  NaN
101   102       Malmö           Sweden  ...                31-Mar-13  101  NaN
102   103  Nottingham   United Kingdom  ...                30-Jun-12  102  NaN
103   104    Katowice           Poland  ...                30-Jun-12  103  NaN
104   105      Kaunas        Lithuania  ...                 1-Jan-13  104  NaN

[105 rows x 7 columns]
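
A minimal sketch of what `reindex()` does with labels that do not exist yet, on a hypothetical three-row frame `base`. New columns or rows come back filled with NaN, and the original frame is left untouched.

```python
import pandas as pd

base = pd.DataFrame({"Rank": [1, 2, 3], "City": ["London", "Berlin", "Madrid"]})

# Adding a column label that does not exist yet creates a column of NaN.
extended = base.reindex(columns=list(base.columns) + ["E"])
extended.loc[0:1, "E"] = 1  # label slice: rows 0 and 1, end included

# Reindexing rows works the same way; label 5 has no data, so it is all NaN.
rows = base.reindex(index=[0, 2, 5])

print(extended)
print(rows)
```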

Drop any rows that contain missing values:

data4 = data3.dropna()
print(data4)

   Rank       City            State  Population Date of census/estimate  f    E
0     1  London[2]   United Kingdom     8615246                1-Jun-14  0  0.0
1     2     Berlin          Germany     3437916               31-May-14  1  1.0
2     3     Madrid            Spain     3165235                1-Jan-14  2  2.0

Fill in missing values:

data2.loc[0, 'f'] = None
data3 = data2.fillna(value=5)
data3.f = data3.f.astype(int)  # the missing value made the column float; cast it back to int
print(data3)

     Rank        City            State  Population Date of census/estimate    f
0       1   London[2]   United Kingdom     8615246                1-Jun-14    5
1       2      Berlin          Germany     3437916               31-May-14    1
2       3      Madrid            Spain     3165235                1-Jan-14    2
3       4        Rome            Italy     2872086               30-Sep-14    3
4       5       Paris           France     2273305                1-Jan-13    4
..    ...         ...              ...         ...                     ...  ...
100   101        Bonn          Germany      309869               31-Dec-12  100
101   102       Malmö           Sweden      309105               31-Mar-13  101
102   103  Nottingham   United Kingdom      308735               30-Jun-12  102
103   104    Katowice           Poland      308269               30-Jun-12  103
104   105      Kaunas        Lithuania      306888                1-Jan-13  104

[105 rows x 6 columns]

Get a boolean mask showing where values are missing:

data2.loc[0, 'f'] = None
data3 = pd.isnull(data2)
print(data3)

      Rank   City  State  Population  Date of census/estimate      f
0    False  False  False       False                    False   True
1    False  False  False       False                    False  False
2    False  False  False       False                    False  False
3    False  False  False       False                    False  False
4    False  False  False       False                    False  False
..     ...    ...    ...         ...                      ...    ...
100  False  False  False       False                    False  False
101  False  False  False       False                    False  False
102  False  False  False       False                    False  False
103  False  False  False       False                    False  False
104  False  False  False       False                    False  False

[105 rows x 6 columns]
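
The three missing-data tools side by side, as a sketch on a hypothetical frame with a single NaN. It also shows that a reduction such as `mean()` skips the missing value by default.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "City": ["London", "Berlin", "Madrid"],
    "f": [np.nan, 1.0, 2.0],
})

mask = frame.isna()             # True where a value is missing
dropped = frame.dropna()        # drops row 0, which contains the NaN
filled = frame.fillna(value=5)  # replaces the NaN with 5
filled["f"] = filled["f"].astype(int)

mean_f = frame["f"].mean()      # NaN is excluded: (1.0 + 2.0) / 2

print(mask)
print(dropped)
print(filled)
print(mean_f)
```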

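4. Operations

Statistics (operations in general exclude missing data)

Perform a descriptive statistic

print(data2.mean(numeric_only=True).round(2))  # numeric_only skips the text columns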

Rank              53.06
Population    787679.09
f                 52.50
dtype: float64

The same operation on the other axis

print(data2.mean(axis=1, numeric_only=True).round(2))  # mean across the columns of each row; rarely needed

0      4307623.50
1      1145973.00
2      1055080.00
3       957364.33
4       757771.33
          ...    
100     103356.67
101     103102.67
102     102980.00
103     102825.33
104     102365.67
Length: 105, dtype: float64

Operations on objects that have different dimensionality and need alignment: pandas automatically broadcasts along the specified dimension.
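
A minimal sketch of alignment and broadcasting, using hypothetical values. Two Series are matched on their labels before being added, and `DataFrame.sub(..., axis="index")` subtracts one value per row from every column.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
total = s1 + s2  # labels present in only one operand become NaN

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})
offsets = pd.Series([1.0, 2.0, 3.0])
shifted = df.sub(offsets, axis="index")  # row i of every column minus offsets[i]

print(total)
print(shifted)
```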

Apply

Apply a function to the data

data.Population = data.Population.apply(lambda x: int(x.replace(',', '')))

Histogramming
For details, see: Histogramming and Discretization

s = pd.Series(np.random.randint(0, 7, size=10))
print(s)
print(s.value_counts())

0    0
1    2
2    1
3    2
4    1
5    1
6    3
7    4
8    1
9    5
dtype: int32

1    4
2    2
5    1
4    1
3    1
0    1
dtype: int64
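
For a reproducible version of the same count, this sketch fixes the ten values to the sample printed above instead of drawing random numbers. The `normalize=True` call is an addition here that reports shares rather than counts.

```python
import pandas as pd

s = pd.Series([0, 2, 1, 2, 1, 1, 3, 4, 1, 5])

counts = s.value_counts()                # most frequent value first
shares = s.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```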

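String Methods
Series is equipped with a set of string processing methods in its str attribute that make it easy to operate on each element of the array, as in the code below. For more, see: Working with text data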

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
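
The same `str` accessor could also tidy the city table used throughout this article, which has footnote markers such as `London[2]` and a leading space in the State values. The sketch below uses hypothetical Series `city` and `state` with a few of those values. Missing entries stay missing.

```python
import numpy as np
import pandas as pd

city = pd.Series(["London[2]", "Hamburg[10]", "Berlin", np.nan])
state = pd.Series([" United Kingdom", " Germany", " Germany", np.nan])

city_clean = city.str.replace(r"\[\d+\]", "", regex=True)  # drop footnote markers
state_clean = state.str.strip()                             # drop surrounding spaces

print(city_clean)
print(state_clean)
```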