Getting Started with pandas, Part 2: Some Basic pandas Functionality

License: CC 4.0 BY-SA

This post is part of a column of notes on Python for Data Analysis (《利用Python进行数据分析》).

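This article introduces the basic functionality of the pandas library, covering core operations such as reindexing, selecting and filtering data, arithmetic, function application, and sorting and ranking, to help readers process datasets efficiently.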

2. Basic Functionality

This section walks you through the fundamental mechanics of interacting with the data contained in a Series or DataFrame.

2.1 Reindexing

reindex is an important method on pandas objects; it creates a new object whose data conforms to a new index.

import numpy as np
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
print(obj)
---------------------------------------------------------
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

Calling reindex on a Series rearranges the data according to the new index, introducing missing values for any index values that were not already present:

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print(obj2)
---------------------------------------------------------
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

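For ordered data such as time series, you may want to interpolate or fill values when reindexing. The optional method parameter lets you do this; for example, ffill forward-fills the values: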

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
print(obj3)

obj4 = obj3.reindex(range(6), method='ffill')
print(obj4)
---------------------------------------------------------
0      blue
2    purple
4    yellow
dtype: object

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

With a DataFrame, reindex can alter the row index, the columns, or both. When passed only a sequence, it reindexes the rows of the result:

frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'], columns=['ohio', 'texas', 'california'])
print(frame)

frame2 = frame.reindex(['a','b','c','d'])
print(frame2)
---------------------------------------------------------
   ohio  texas  california
a     0      1           2
c     3      4           5
d     6      7           8

   ohio  texas  california
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0

The columns can be reindexed with the columns keyword:

state = ['texas', 'utah', 'california']
frame3 = frame.reindex(columns = state)
print(frame3)
---------------------------------------------------------
   texas  utah  california
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

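Parameters of the reindex method

Parameter	Description
index	New sequence to use as the index. Can be an Index instance or any other sequence-like Python data structure. An Index is used exactly as is, without any copying.
method	Interpolation (fill) method: 'ffill' fills forward, 'bfill' fills backward.
fill_value	Substitute value to use when reindexing introduces missing data.
limit	When forward- or backward-filling, the maximum size gap (in number of elements) to fill.
tolerance	When forward- or backward-filling, the maximum size gap (in absolute numeric distance) to fill for inexact matches.
level	Match a simple Index on a level of a MultiIndex; otherwise select a subset.
copy	If True, always copy the underlying data even if the new index equals the old one; if False, do not copy the data when the indexes are equivalent.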
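
As a minimal sketch of two of the parameters in the table above, the following compares fill_value with method='ffill' combined with limit. The Series s and its values are made up purely for illustration and are not part of the original examples.

```python
import pandas as pd

# Hypothetical data, chosen only to illustrate the parameters.
s = pd.Series(["blue", "purple"], index=[0, 4])

# fill_value replaces the missing values that reindexing would introduce.
filled = s.reindex(range(6), fill_value="unknown")

# limit caps how many consecutive missing positions ffill may fill.
limited = s.reindex(range(6), method="ffill", limit=1)

print(filled.tolist())
print(limited.isna().tolist())
```

With limit=1, only the first missing position after each known value is filled; positions 2 and 3 stay missing.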

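2.2 Dropping Entries from an Axis

Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. Since building that can require some data munging and set logic, the drop method instead returns a new object with the indicated value or values deleted from an axis: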

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd','e'])
print(obj)

new_obj = obj.drop('b')
print(new_obj)
---------------------------------------------------------
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
a    0.0
c    2.0
d    3.0
e    4.0
dtype: float64

With a DataFrame, index values can be deleted from either axis. Calling drop with a sequence of labels drops values from the row labels (axis 0):

data = pd.DataFrame(np.arange(16).reshape((4, 4)), 
        index=['Ohio', 'Colorado', 'Utah', 'New York'],
        columns=['one', 'two', 'three', 'four'])
print(data)

data1 = data.drop(['Colorado', 'Ohio'])
print(data1)
---------------------------------------------------------
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

You can drop values from the columns by passing axis=1 or axis='columns':

data1 = data.drop('two', axis=1)
print(data1)

data1 = data.drop(['two', 'four'], axis='columns')
print(data1)
---------------------------------------------------------
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14

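Many functions that modify the size or shape of a Series or DataFrame, such as drop, accept an inplace argument; with inplace=True they manipulate the original object directly instead of returning a new one: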

obj.drop('c',inplace=True)
print(obj)
---------------------------------------------------------
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
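
The difference between the two calling styles can be sketched as follows. This is only an illustration that reuses the shape of the example above; the variable names trimmed and result are introduced here.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.0), index=["a", "b", "c", "d", "e"])

# Default: drop returns a new object and leaves the original untouched.
trimmed = obj.drop(["d", "e"])
original_len = len(obj)

# With inplace=True the original is modified and the call returns None.
result = obj.drop("c", inplace=True)

print(len(trimmed), original_len, len(obj), result)
```

Because the in-place call returns None, assigning its result back to obj would discard the data, which is a common slip.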

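2.3 Indexing, Selection, and Filtering

Series indexing (obj[...]) works analogously to NumPy array indexing, except that you can use the Series's index values instead of only integers.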

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print(obj)

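print(obj['b'])
print(obj.iloc[1])          # positional access; obj[1] is deprecated on a label index
print(obj[2:4])
print(obj[['b', 'a', 'c']])
print(obj.iloc[[1, 3]])     # positional access with a list of integer positions
print(obj[obj<2])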
---------------------------------------------------------
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

1.0

1.0

c    2.0
d    3.0
dtype: float64

b    1.0
a    0.0
c    2.0
dtype: float64

b    1.0
d    3.0
dtype: float64

a    0.0
b    1.0
dtype: float64

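Slicing with labels behaves differently from normal Python slicing, which excludes the endpoint: with labels, the endpoint is included: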

print(obj['b':'c'])
---------------------------------------------------------
b    1.0
c    2.0
dtype: float64

Assigning with these methods modifies the corresponding section of the Series:

obj['b':'c']=5
print(obj)
---------------------------------------------------------
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
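
To make the endpoint behavior concrete, here is a small sketch contrasting a label slice with a positional slice on the same Series. The names by_label and by_position are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

# Label-based slice: both endpoints 'b' and 'c' are included.
by_label = obj.loc["b":"c"]

# Position-based slice: the stop position 2 is excluded, as in plain Python.
by_position = obj.iloc[1:2]

print(by_label.index.tolist())
print(by_position.index.tolist())
```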

Indexing into a DataFrame with a single value or a sequence retrieves one or more columns:

data = pd.DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data)
print(data['two'])
print(data[['three', 'one']])
---------------------------------------------------------
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12

Indexing like this has a few special cases. The first is slicing rows or selecting data with a boolean array:

print(data[:2])
print(data[data['three']>5])
print(data[:2][['one','two']])
---------------------------------------------------------
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

          one  two
Ohio        0    1
Colorado    4    5

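The row selection syntax data[:2] is provided as a convenience. Passing a single element or a list to a second [] operator then selects columns.

Another use case is indexing with a boolean DataFrame, such as one produced by a scalar comparison: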

data1 = data < 5
print(data1)
data[data<5]=0
print(data)
---------------------------------------------------------
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

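For label-based indexing on the rows of a DataFrame, this section introduces the special indexing operators loc and iloc. They let you select a subset of the rows and columns of a DataFrame with NumPy-like notation, using either axis labels (loc) or integer positions (iloc).
As a first example, select a single row and multiple columns by label: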

data = pd.DataFrame(np.arange(16).reshape((4, 4)), 
          index=['Ohio', 'Colorado', 'Utah', 'New York'],
          columns=['one', 'two', 'three', 'four'])
print(data)
data1 = data.loc['Colorado', ['two', 'three']]
print(data1)
---------------------------------------------------------
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

two      5
three    6
Name: Colorado, dtype: int32

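We then perform a similar selection by integer position with iloc:

data1 = data.iloc[2, [3, 0, 1]]
print(data1)
---------------------------------------------------------
four    11
one      8
two      9
Name: Utah, dtype: int32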

Besides single labels and lists of labels, both indexing operators also work with slices:

data1 = data.loc[:'Utah','two']
print(data1)
data1 = data.iloc[:, :3]
print(data1)
data1 = data.iloc[:, :3][data.three > 5]
print(data1)
---------------------------------------------------------
Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int32

          one  two  three
Ohio        0    1      2
Colorado    4    5      6
Utah        8    9     10
New York   12   13     14

          one  two  three
Colorado    4    5      6
Utah        8    9     10
New York   12   13     14

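The table below summarizes the ways to select data from a DataFrame.

Type	Description
df[val]	Select a single column or sequence of columns from the DataFrame; special-case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion)
df.loc[val]	Select a single row or subset of rows by label
df.loc[:,val]	Select a single column or subset of columns by label
df.loc[val1,val2]	Select both rows and columns by label
df.iloc[where]	Select a single row or subset of rows by integer position
df.iloc[:,where]	Select a single column or subset of columns by integer position
df.iloc[where_i,where_j]	Select both rows and columns by integer position
df.at[label_i,label_j]	Select a single scalar value by row and column label
df.iat[i,j]	Select a single scalar value by row and column integer position
reindex method	Select rows or columns by label
get_value, set_value methods	Select or set a single value by row and column label (removed in pandas 1.0; use at and iat instead)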
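
The scalar accessors and the combined row/column forms from the table can be sketched on the same data used earlier in this section. The variable names below are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Scalar access by label (at) and by integer position (iat).
by_label = data.at["Utah", "three"]
by_position = data.iat[2, 2]

# The same sub-table selected by labels and by integer positions.
sub_loc = data.loc[["Ohio", "Utah"], "one":"two"]
sub_iloc = data.iloc[[0, 2], :2]

print(by_label, by_position)
print(sub_loc.equals(sub_iloc))
```

Note that the label slice "one":"two" includes its endpoint, while the positional slice :2 excludes position 2, so both select the same two columns.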

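2.4 Integer Indexes

Working with pandas objects indexed by integers often trips up new users, because it differs somewhat from indexing built-in Python data structures such as lists and tuples. For example: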

ser = pd.Series(np.arange(3.))
print(ser)
print(ser[-1])
---------------------------------------------------------
ValueError: -1 is not in range
raise KeyError(key) from err
KeyError: -1

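In this case pandas could "fall back" on integer (positional) indexing, but doing so in general tends to introduce subtle bugs. Here the index contains 0, 1, and 2, and it is hard to infer whether the user wants label-based or position-based indexing: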

ser = pd.Series(np.arange(3.))
print(ser)
---------------------------------------------------------
0    0.0
1    1.0
2    2.0
dtype: float64

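With a non-integer index, on the other hand, there is no such ambiguity. (Newer pandas releases deprecate this positional fallback in [] and recommend ser2.iloc[-1] instead.)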

ser2 = pd.Series(np.arange(3.),index=['a','b','c'])
print(ser2)
print(ser2[-1])
---------------------------------------------------------
a    0.0
b    1.0
c    2.0
dtype: float64

2.0

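To keep things consistent, if you have an axis index containing integers, selecting by a single integer with [] is always label-oriented. Integer slices in [] remain positional, which is why ser[:1] and ser.loc[:1] below give different results. For more precise handling, use loc (for labels) or iloc (for integer positions):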

print(ser[:1])
ser1 = ser.loc[:1]
print(ser1)
---------------------------------------------------------
0    0.0
dtype: float64
0    0.0
1    1.0
dtype: float64
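
The distinction is easiest to see when the integer labels do not coincide with the positions. The following sketch uses a hypothetical Series whose index is deliberately shuffled; the data is invented for illustration.

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions.
ser = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])

by_label = ser.loc[0]       # the value stored under label 0
by_position = ser.iloc[0]   # the first value, regardless of its label
last = ser.iloc[-1]         # negative positions work with iloc

print(by_label, by_position, last)
```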

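2.5 Arithmetic and Data Alignment

The arithmetic behavior of objects with different indexes is an important pandas feature for some applications. When you add objects together, if any index pairs are not the same, the index of the result is the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels.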

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
print(s1)
print(s2)

s3 = s1 + s2
print(s3)
---------------------------------------------------------
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

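Internal data alignment introduces missing values at the label positions that do not overlap. Missing values then propagate through subsequent arithmetic.

With DataFrames, alignment is performed on both the rows and the columns: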

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)),
                   columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])

df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                   columns=list('bde'),
               index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(df1)
print(df2)
df3 = df1 + df2
print(df3)
---------------------------------------------------------
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN

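Adding df1 and df2 returns a DataFrame whose index and columns are the unions of those of each DataFrame.

In arithmetic between differently indexed objects, you may therefore want to fill with a special value, such as 0, when an axis label is found in one object but not the other: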

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))

df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list('abcde'))

print(df1)
print(df2)
df2.loc[1, 'b'] = np.nan
df3 = df2 + df1
print(df3)
---------------------------------------------------------
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0

      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0

      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN

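Adding these together with + produces NA values in the locations that do not overlap. Using the add method on df1 instead, we pass df2 together with a fill_value argument: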

df4 = df1.add(df2, fill_value=0)
print(df4)
---------------------------------------------------------

      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0   5.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0
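
One detail worth checking is what fill_value does when a value is missing on both sides. The sketch below uses two small hypothetical frames, invented for this illustration, to show that the fill only applies when exactly one operand is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: column 'b' exists only in df_right,
# and row 1 of column 'a' is missing in both frames.
df_left = pd.DataFrame({"a": [1.0, np.nan]})
df_right = pd.DataFrame({"a": [np.nan, np.nan], "b": [5.0, 6.0]})

out = df_left.add(df_right, fill_value=0)
print(out)
```

Row 0 of column a becomes 1.0 because the missing right-hand value is treated as 0, column b is carried over from df_right, and row 1 of column a stays missing because neither frame has a value there.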

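The table below lists the arithmetic methods of Series and DataFrame.

Method	Description
add, radd	Addition (+)
sub, rsub	Subtraction (-)
div, rdiv	Division (/)
floordiv, rfloordiv	Floor division (//)
mul, rmul	Multiplication (*)
pow, rpow	Exponentiation (**)

Each of these methods has a counterpart starting with the letter r, which has its arguments flipped. So the following two statements are equivalent: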

df4 = 1/df1
df5 = df1.rdiv(1)
print(df4)
print(df5)
---------------------------------------------------------
       a         b         c         d
0    inf  1.000000  0.500000  0.333333
1  0.250  0.200000  0.166667  0.142857
2  0.125  0.111111  0.100000  0.090909

       a         b         c         d
0    inf  1.000000  0.500000  0.333333
1  0.250  0.200000  0.166667  0.142857
2  0.125  0.111111  0.100000  0.090909

Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill value:

df6 = df1.reindex(columns=df2.columns, fill_value=1)
print(df6)
---------------------------------------------------------
     a    b     c     d  e
0  0.0  1.0   2.0   3.0  1
1  4.0  5.0   6.0   7.0  1
2  8.0  9.0  10.0  11.0  1

Arithmetic between a DataFrame and a Series works much like arithmetic between NumPy arrays of different dimensions.

arr = np.arange(12.).reshape((3, 4))
print(arr)
print(arr[0])
arr1 = arr - arr[0]
print(arr1)
---------------------------------------------------------
[[ 0.  1.  2.  3.]
 [ 4.  5.  6.  7.]
 [ 8.  9. 10. 11.]]
 
[0. 1. 2. 3.]

[[0. 0. 0. 0.]
 [4. 4. 4. 4.]
 [8. 8. 8. 8.]]

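When we subtract arr[0] from arr, the subtraction is performed once for each row. This is known as broadcasting. Operations between a DataFrame and a Series are similar: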

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
print(frame)
print(series)
---------------------------------------------------------
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame, broadcasting down the rows:

a1 = frame - series
print(a1)
---------------------------------------------------------
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0

If an index value is not found in either the DataFrame's columns or the Series's index, the objects are reindexed to form the union:

series2 = pd.Series(range(3), index=['b', 'e', 'f'])
print(series2)
a2 = frame + series2
print(a2)
---------------------------------------------------------
b    0
e    1
f    2
dtype: int64
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN

If you instead want to broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods. For example:

series3 = frame['d']
print(series3)
a3 = frame.sub(series3, axis='index')
print(a3)
---------------------------------------------------------
Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0

The axis you pass is the axis to match on. In the example above, we match on the DataFrame's row index (axis='index' or axis=0) and broadcast across the columns.
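
To see why the method form is needed, the sketch below compares frame.sub(..., axis='index') with the plain - operator and with the equivalent NumPy broadcast. It reuses the frame from this section; the names by_rows, manual, and naive are introduced here for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.0).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
col = frame["d"]

# Match on the row index and broadcast across the columns.
by_rows = frame.sub(col, axis="index")

# The same computation as an explicit NumPy broadcast of a column vector.
manual = frame.to_numpy() - col.to_numpy()[:, np.newaxis]

# The plain operator matches col's index against frame's columns instead,
# so no labels overlap and every entry is missing.
naive = frame - col

print(np.array_equal(by_rows.to_numpy(), manual))
print(naive.isna().all().all())
```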

2.6 Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work with pandas objects:

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)

frame1 = np.abs(frame)
print(frame1)
---------------------------------------------------------
               b         d         e
Utah    0.573853 -0.818399 -0.038307
Ohio   -0.100243  0.453171  0.572316
Texas  -0.061038 -0.045380  1.255340
Oregon -1.377192  0.718416 -0.292260

               b         d         e
Utah    0.573853  0.818399  0.038307
Ohio    0.100243  0.453171  0.572316
Texas   0.061038  0.045380  1.255340
Oregon  1.377192  0.718416  0.292260

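Another frequent operation is applying a function to the one-dimensional arrays formed by each column or row. The DataFrame apply method does exactly this: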

f = lambda x: x.max() - x.min()
frame2 = frame.apply(f)
print(frame2)
---------------------------------------------------------
b    1.951045
d    1.536815
e    1.547601
dtype: float64

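Here the function f, which computes the difference between the maximum and minimum of a Series, is invoked once on each column of frame. The result is a Series with the columns of frame as its index.

If you pass axis='columns' to apply, the function is instead invoked once per row: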

frame3 = frame.apply(f,axis='columns')
print(frame3)
---------------------------------------------------------
Utah      1.392252
Ohio      0.672559
Texas     1.316378
Oregon    2.095608
dtype: float64

Many of the most common array statistics (such as sum and mean) are DataFrame methods, so using apply to compute them is not necessary.
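
As a small sketch of that point, the range computed with apply earlier can also be obtained from the built-in max and min methods. The frame below uses deterministic values instead of random ones so that the result is reproducible; this is a choice made only for the illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.0).reshape((4, 3)), columns=list("bde"))

# Column-wise range via apply with a Python function.
via_apply = frame.apply(lambda x: x.max() - x.min())

# The same result using the built-in DataFrame reductions.
via_methods = frame.max() - frame.min()

print(via_apply.equals(via_methods))
```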

The function passed to apply need not return a scalar value; it can also return a Series with multiple values:

def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame4 = frame.apply(f)
print(frame4)
---------------------------------------------------------
            b         d        e
min -1.377192 -0.818399 -0.29226
max  0.573853  0.718416  1.25534

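Element-wise Python functions can be used too. Suppose you want to compute a formatted string from each floating-point value in frame; you can do this with the applymap method (deprecated since pandas 2.1 in favor of the equivalent DataFrame.map):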

format = lambda x:'%.2f' % x
frame5 = frame.applymap(format)
print(frame5)
---------------------------------------------------------
           b      d      e
Utah     0.57  -0.82  -0.04
Ohio    -0.10   0.45   0.57
Texas   -0.06  -0.05   1.26
Oregon  -1.38   0.72  -0.29

The name applymap was chosen because Series has a map method for applying an element-wise function:

frame6 = frame['e'].map(format)
print(frame6)
---------------------------------------------------------
Utah      -0.04
Ohio       0.57
Texas      1.26
Oregon    -0.29
Name: e, dtype: object
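
Since Series.map works element by element, one way to format a whole DataFrame without depending on applymap is to map each column inside apply. This is only a sketch of an alternative, not the approach used above; the small frame and the name fmt are made up for illustration.

```python
import pandas as pd

# Hypothetical values, rounded versions of numbers like those shown above.
frame = pd.DataFrame({"b": [0.5738, -0.1002], "e": [-0.0383, 1.2553]},
                     index=["Utah", "Ohio"])

def fmt(x):
    """Format one float with two decimal places."""
    return f"{x:.2f}"

# apply passes each column as a Series; Series.map then formats each element.
formatted = frame.apply(lambda col: col.map(fmt))
print(formatted)
```

Naming the helper fmt rather than format also avoids shadowing Python's built-in format function.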

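2.7 Sorting and Ranking

Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object: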

obj = pd.Series(range(4),index=['D', 'A', 'B', 'C'])
obj1 = obj.sort_index()
print(obj1)
---------------------------------------------------------
A    1
B    2
C    3
D    0
dtype: int64

With a DataFrame, you can sort by index on either axis:

frame = pd.DataFrame(np.arange(8.).reshape((2, 4)),
                     columns=list('dabc'),
                     index=['three', 'one'])
frame1 = frame.sort_index()
print(frame1)
---------------------------------------------------------
         d    a    b    c
one    4.0  5.0  6.0  7.0
three  0.0  1.0  2.0  3.0

Setting axis=1 sorts by the column index:

frame1 = frame.sort_index(axis=1)
print(frame1)
---------------------------------------------------------
         a    b    c    d
three  1.0  2.0  3.0  0.0
one    5.0  6.0  7.0  4.0

Data is sorted in ascending order by default, but can be sorted in descending order too:

frame1 = frame.sort_index(axis=1,ascending = False)
print(frame1)
---------------------------------------------------------
         d    c    b    a
three  0.0  3.0  2.0  1.0
one    4.0  7.0  6.0  5.0

The methods above all sort by index. To sort a Series by its values, use its sort_values method:

obj2 = pd.Series([4, 7, -3, 2])
print(obj2)
obj3 = obj2.sort_values()
print(obj3)
---------------------------------------------------------
0    4
1    7
2   -3
3    2
dtype: int64

2   -3
3    2
0    4
1    7
dtype: int64

By default, any missing values are sorted to the end of the Series:

obj4 = pd.Series([4, np.nan, 7, np.nan, -3, 2])
print(obj4)
obj5 = obj4.sort_values()
print(obj5)
---------------------------------------------------------
0    4.0
1    NaN
2    7.0
3    NaN
4   -3.0
5    2.0
dtype: float64

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to the by option of sort_values:

frame2 = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
print(frame2)
frame3 = frame2.sort_values(by='b')
print(frame3)
---------------------------------------------------------
   b  a
0  4  0
1  7  1
2 -3  0
3  2  1

   b  a
2 -3  0
3  2  1
0  4  0
1  7  1

To sort by multiple columns, pass a list of names. Rows are sorted by the first column, and ties within it are broken by the second column:

frame3 = frame2.sort_values(by=['a', 'b'])
print(frame3)
---------------------------------------------------------
   b  a
2 -3  0
0  4  0
3  2  1
1  7  1
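
When sorting by several keys, each key can have its own direction by passing a list to ascending. The sketch below reuses frame2 from above; the name mixed is introduced only for this illustration.

```python
import pandas as pd

frame2 = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort by 'a' ascending, then by 'b' descending within each value of 'a'.
mixed = frame2.sort_values(by=["a", "b"], ascending=[True, False])
print(mixed)
```

The ascending list must have one entry per column named in by.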

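Ranking assigns ranks from one through the number of valid data points in an array. The rank methods of Series and DataFrame implement this; by default, rank breaks ties by assigning each group the mean rank: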

obj = pd.Series([7,-5,7,4,2,0,4])
obj1 = obj.rank()
print(obj1)
---------------------------------------------------------
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

Ranks can also be assigned according to the order in which the values are observed in the data:

obj1 = obj.rank(method='first')
print(obj1)
---------------------------------------------------------
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

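Here, instead of the average rank 6.5, entries 0 and 2 are given ranks 6 and 7, because label 0 precedes label 2 in the data. You can rank in descending order, too: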

obj1 = obj.rank(ascending=False,method='max')
print(obj1)
---------------------------------------------------------
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

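The example above assigns tied values the maximum rank in the group. The table below lists the available tie-breaking methods.

Method	Description
'average'	Default: assign the average rank to each entry in the equal group
'min'	Use the minimum rank for the whole group
'max'	Use the maximum rank for the whole group
'first'	Assign ranks in the order the values appear in the data
'dense'	Like method='min', but ranks always increase by 1 between groups rather than by the number of equal elements in a group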
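
To compare the tie-breaking methods side by side, the sketch below ranks the same Series from this section with each method and collects the results in one DataFrame. The names methods and ranks are introduced here for illustration.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

methods = ["average", "min", "max", "first", "dense"]

# One column of ranks per tie-breaking method, all in ascending order.
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})
print(ranks)
```

The two tied 7s, for example, receive 6.5 under 'average', 6 under 'min', 7 under 'max', 6 and 7 under 'first', and 5 under 'dense'.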

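2.8 Axis Indexes with Duplicate Labels

In all the examples so far, the axis labels (index values) have been unique. While many pandas functions (like reindex) require that the labels be unique, it is not mandatory. Consider a small Series with duplicate indices: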

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
print(obj)
---------------------------------------------------------
a    0
a    1
b    2
b    3
c    4
dtype: int64

The is_unique property of the index tells you whether its labels are unique:

a = obj.index.is_unique
print(a)
---------------------------------------------------------
False

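Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while single entries return a scalar value: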

a1 = obj['a']
print(a1)
a2 = obj['c']
print(a2)
---------------------------------------------------------
a    0
a    1
dtype: int64
4

The same logic extends to indexing rows in a DataFrame:

df = pd.DataFrame(np.random.randn(4,3),index=['a','a','b','b'])
print(df)
print(df.loc['b'])
---------------------------------------------------------
          0         1         2
a -0.253438  0.320287 -2.592365
a  0.744373  1.327803 -0.792010
b -0.217219  0.980731  0.192851
b  0.455168  1.429445  0.235813

          0         1         2
b -0.217219  0.980731  0.192851
b  0.455168  1.429445  0.235813
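
Because the return type depends on whether a label is duplicated, code that must handle both cases can check is_unique first. The sketch below also shows reindex rejecting a duplicated index and one possible way to keep only the first entry per label; keeping the first entry is an assumption made for illustration, not a rule from the text.

```python
import pandas as pd

obj = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

# A duplicated label yields a Series, a unique label yields a scalar.
dup_is_series = isinstance(obj["a"], pd.Series)
single_value = obj["c"]

# reindex requires unique labels and raises ValueError otherwise.
try:
    obj.reindex(["a", "b", "c"])
    raised = False
except ValueError:
    raised = True

# Keep only the first entry for each label (an illustrative policy).
deduped = obj[~obj.index.duplicated(keep="first")]

print(dup_is_series, single_value, raised)
print(deduped.index.is_unique)
```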

That's a wrap; time to clock off!