Essential Basic Functionality in Pandas

Article link: https://blog.youkuaiyun.com/weixin_41677555/article/details/82882620

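This article covers a range of basic pandas functionality, including viewing data (head and tail), attribute access, accelerated operations, binary operations, missing-value handling, comparison operations, statistical analysis, function application, data alignment and combining, descriptive statistics, sorting by index, copying data, and dtype operations. It shows how to use pandas for efficient data processing, with usage examples for broadcasting behavior, comparison operations, filling missing values, combining data, descriptive statistics, and more.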


Overview

Head and Tail

Attributes and the raw ndarray(s)

Accelerated operations

Flexible binary operations

Matching / broadcasting behavior

Missing data / operations with fill values

Flexible comparisons

Boolean reductions

Comparing if objects are equivalent

Comparing array-like objects

Combining overlapping data sets

General DataFrame combine

Descriptive statistics

Summarizing data: describe

Index of min/max values

Value counts (histogramming) / mode

Discretization and quantiling

Function application

Tablewise function application

Row or column-wise function application

Applying elementwise functions

Aggregation API

Aggregating with multiple functions

Aggregating with a dict

Mixed dtypes

Custom describe

Transform API

Transform with multiple functions

Transforming with a dict

Reindexing and altering labels

Reindexing to align with another object

Aligning objects with each other with align

Filling while reindexing

Limits on filling while reindexing

Dropping labels from an axis

Renaming / mapping labels

.dt accessor

Vectorized string methods

Sorting

By index

By values

By indexes and values

searchsorted

Smallest / largest values

Sorting by a multi-index column

Object conversion

Performance and extensibility (gotchas)

Selecting columns based on dtype

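Overview

This article introduces some essential basic functionality in pandas, covering only what relates to the two data structures Series and DataFrame.

First, construct the objects used in the demonstrations (the examples assume `import numpy as np` and `import pandas as pd`):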

In [1]: index = pd.date_range('1/1/2000', periods=8)

In [2]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [3]: df = pd.DataFrame(np.random.randn(8, 3), index=index,
   ...:                   columns=['A', 'B', 'C'])
   ...:

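Head and Tail

These two methods give a quick look at a small sample of a Series or DataFrame. The default is 5 rows, and a different number can be passed.

head() returns the first rows and tail() returns the last rows.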

In [5]: long_series = pd.Series(np.random.randn(1000))

In [6]: long_series.head()
Out[6]: 
0    0.229453
1    0.304418
2    0.736135
3   -0.859631
4   -0.424100
dtype: float64

In [7]: long_series.tail(3)
Out[7]: 
997   -0.351587
998    1.136249
999   -0.448789
dtype: float64
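
As a deterministic counterpart to the random example above, here is a minimal sketch. The series `demo` is a made-up example, not part of the original article.

```python
import pandas as pd

# Hypothetical 10-element series, used only for illustration.
demo = pd.Series(range(10))

first_five = demo.head()    # default: first 5 rows
last_three = demo.tail(3)   # explicit row count: last 3 rows

print(first_five.tolist())
print(last_three.tolist())
```

Both methods keep the original index labels, so the tail of `demo` is still labelled 7, 8, 9.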

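Attributes and the raw ndarray(s)

pandas objects have a number of attributes that expose metadata about the underlying data.

shape: gives the axis dimensions of the object.

Axis labels: for a Series this is the index; for a DataFrame these are the index and the columns.

These axis-label attributes can be safely assigned to: assignment changes the labels of the existing object in place rather than returning a new object.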

In [8]: df[:2]
Out[8]: 
                   A         B        C
2000-01-01  0.048869 -1.360687 -0.47901
2000-01-02 -0.859661 -0.231595 -0.52775

In [9]: df.columns = [x.lower() for x in df.columns]

In [10]: df
Out[10]: 
                   a         b         c
2000-01-01  0.048869 -1.360687 -0.479010
2000-01-02 -0.859661 -0.231595 -0.527750
2000-01-03 -1.296337  0.150680  0.123836
2000-01-04  0.571764  1.555563 -0.823761
2000-01-05  0.535420 -1.032853  1.469725
2000-01-06  1.304124  1.449735  0.203109
2000-01-07 -1.032011  0.969818 -0.962723
2000-01-08  1.382083 -0.938794  0.669142

To get the actual data inside a pandas object, access the values attribute:

In [11]: s.values
Out[11]: array([-1.9339,  0.3773,  0.7341,  2.1416, -0.0112])

In [12]: df.values
Out[12]: 
array([[ 0.0489, -1.3607, -0.479 ],
       [-0.8597, -0.2316, -0.5278],
       [-1.2963,  0.1507,  0.1238],
       [ 0.5718,  1.5556, -0.8238],
       [ 0.5354, -1.0329,  1.4697],
       [ 1.3041,  1.4497,  0.2031],
       [-1.032 ,  0.9698, -0.9627],
       [ 1.3821, -0.9388,  0.6691]])

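The result of accessing values is an ndarray, and an ndarray has a dtype attribute (there are many dtypes; see the NumPy documentation).

If a DataFrame contains columns of several data types, the dtype of the returned ndarray is chosen to accommodate all of them. If any column holds strings, the dtype is object.

If the DataFrame contains only integers and floats, the dtype is float64.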

In [11]: df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 'B', 3]})

In [12]: df.values.dtype
Out[12]: dtype('O')

In [13]: df = pd.DataFrame({'a': [1, 1.8, 1], 'b': [1, 2.5, 3]})

In [14]: df.values.dtype
Out[14]: dtype('float64')
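
The dtype rule can also be checked with `to_numpy()`, which newer pandas releases recommend over `values`. This is a sketch that assumes pandas 0.24 or later; the frames `mixed` and `numeric` are made-up examples.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: one mixes integers with a string, one mixes ints and floats.
mixed = pd.DataFrame({"a": [1, 2, 1], "b": [1, "B", 3]})
numeric = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5]})

mixed_arr = mixed.to_numpy()      # falls back to a dtype that can hold everything
numeric_arr = numeric.to_numpy()  # int64 and float64 are upcast to a common dtype

print(mixed_arr.dtype)
print(numeric_arr.dtype)
```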

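Accelerated operations

pandas can use the third-party numexpr and bottleneck libraries to accelerate certain types of binary numerical and boolean operations.

These libraries are especially useful with large data sets and provide large speedups.

The following results come from a test on a data set of 100 columns x 100,000 rows: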

Operation	0.11.0 (ms)	Prior Version (ms)	Ratio to Prior
`df1 > df2`	13.32	125.35	0.1063
`df1 * df2`	21.71	36.63	0.5928
`df1 + df2`	22.04	36.50	0.6039

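Because of these gains, installing both libraries is strongly recommended. Both are used by default when installed, and each can be switched off through an option: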

pd.set_option('compute.use_bottleneck', False)
pd.set_option('compute.use_numexpr', False)
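
If the option should be changed only temporarily, `pd.option_context` is one way to do it. The sketch below only toggles and reads the option; it makes no claim about the resulting speed.

```python
import pandas as pd

before = pd.get_option("compute.use_numexpr")

# Disable numexpr acceleration for the enclosed block only.
with pd.option_context("compute.use_numexpr", False):
    inside = pd.get_option("compute.use_numexpr")

# On leaving the block the previous setting is restored.
after = pd.get_option("compute.use_numexpr")
print(before, inside, after)
```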

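Flexible binary operations

Two points need special attention when performing binary operations between pandas data structures:

1. Broadcasting between higher-dimensional (e.g. DataFrame) and lower-dimensional (e.g. Series) objects.

2. Handling of missing values in computations.

Matching / broadcasting behavior

DataFrame provides the methods add(), sub(), mul(), div() and the related reversed versions radd(), rsub(), and so on, for carrying out binary operations.

For broadcasting behavior, Series input is of primary interest.

With these methods, the axis keyword selects whether the Series is matched on the index or on the columns.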

In [14]: df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
   ....:                    'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
   ....:                    'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
   ....: 

In [15]: df
Out[15]: 
        one       two     three
a -1.101558  1.124472       NaN
b -0.177289  2.487104 -0.634293
c  0.462215 -0.486066  1.931194
d       NaN -0.456288 -1.222918

In [16]: row = df.iloc[1]

In [17]: column = df['two']

In [18]: df.sub(row, axis='columns')
Out[18]: 
        one       two     three
a -0.924269 -1.362632       NaN
b  0.000000  0.000000  0.000000
c  0.639504 -2.973170  2.565487
d       NaN -2.943392 -0.588625

In [19]: df.sub(row, axis=1)
Out[19]: 
        one       two     three
a -0.924269 -1.362632       NaN
b  0.000000  0.000000  0.000000
c  0.639504 -2.973170  2.565487
d       NaN -2.943392 -0.588625

In [20]: df.sub(column, axis='index')
Out[20]: 
        one  two     three
a -2.226031  0.0       NaN
b -2.664393  0.0 -3.121397
c  0.948280  0.0  2.417260
d       NaN  0.0 -0.766631

In [21]: df.sub(column, axis=0)
Out[21]: 
        one  two     three
a -2.226031  0.0       NaN
b -2.664393  0.0 -3.121397
c  0.948280  0.0  2.417260
d       NaN  0.0 -0.766631
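
The two broadcast directions are easier to follow with round numbers. The sketch below uses a small made-up frame (`frame`, `row`, and `col` are illustrative names, not from the article).

```python
import pandas as pd

frame = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]},
    index=["a", "b", "c"],
)
row = frame.iloc[0]   # Series labelled by column names: x=1, y=10
col = frame["x"]      # Series labelled by row names: a=1, b=2, c=3

# Match the Series on column labels: every row has `row` subtracted from it.
by_columns = frame.sub(row, axis="columns")

# Match the Series on row labels: every column has `col` subtracted from it.
by_index = frame.sub(col, axis="index")

print(by_columns)
print(by_index)
```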

For data with a MultiIndex, the level parameter selects the index level to broadcast across:

In [22]: dfmi = df.copy()

In [23]: dfmi.index = pd.MultiIndex.from_tuples([(1,'a'),(1,'b'),(1,'c'),(2,'a')],
   ....:                                        names=['first','second'])
   ....: 

In [24]: dfmi.sub(column, axis=0, level='second')
Out[24]: 
                   one      two     three
first second                             
1     a      -2.226031  0.00000       NaN
      b      -2.664393  0.00000 -3.121397
      c       0.948280  0.00000  2.417260
2     a            NaN -1.58076 -2.347391

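Series and Index also support Python's built-in divmod(), which performs floor division and the modulo operation at the same time.

For a Series or an Index it returns a two-tuple (quotient, remainder) whose elements have the same type as the left-hand operand: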

In [28]: s = pd.Series(np.arange(10))

In [29]: s
Out[29]: 
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In [30]: div, rem = divmod(s, 3)

In [31]: div
Out[31]: 
0    0
1    0
2    0
3    1
4    1
5    1
6    2
7    2
8    2
9    3
dtype: int64

In [32]: rem
Out[32]: 
0    0
1    1
2    2
3    0
4    1
5    2
6    0
7    1
8    2
9    0
dtype: int64

In [33]: idx = pd.Index(np.arange(10))

In [34]: idx
Out[34]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [35]: div, rem = divmod(idx, 3)

In [36]: div
Out[36]: Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')

In [37]: rem
Out[37]: Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64')

divmod() can also be applied elementwise:

In [38]: div, rem = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])

In [39]: div
Out[39]: 
0    0
1    0
2    0
3    1
4    1
5    1
6    1
7    1
8    1
9    1
dtype: int64

In [40]: rem
Out[40]: 
0    0
1    1
2    2
3    0
4    0
5    1
6    1
7    2
8    2
9    3
dtype: int64
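
A quick way to see what divmod() returns is to compare it with the `//` and `%` operators. This is a minimal sketch with a made-up series `values`.

```python
import pandas as pd

values = pd.Series([0, 1, 2, 3, 4, 5, 6, 7])

# divmod returns a tuple of two Series sharing the index of `values`.
quotient, remainder = divmod(values, 3)

# The pair matches floor division and modulo computed separately.
same_as_operators = quotient.equals(values // 3) and remainder.equals(values % 3)
print(quotient.tolist(), remainder.tolist(), same_as_operators)
```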

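Missing data / operations with fill values

The arithmetic methods of Series and DataFrame accept a fill_value parameter, a value to substitute when a value is missing.

Two points deserve emphasis. First, fill_value is used only where at most one of the two values at a given location is missing; if both are missing, the result is missing.

In other words, an operation between two missing values always yields a missing value.

Second, fill_value is not supported in operations that broadcast a Series against a DataFrame.

In the example below, df2 is the same as df except that the value at row a, column three is 1.0.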

In [41]: df
Out[41]: 
        one       two     three
a -1.101558  1.124472       NaN
b -0.177289  2.487104 -0.634293
c  0.462215 -0.486066  1.931194
d       NaN -0.456288 -1.222918

In [42]: df2
Out[42]: 
        one       two     three
a -1.101558  1.124472  1.000000
b -0.177289  2.487104 -0.634293
c  0.462215 -0.486066  1.931194
d       NaN -0.456288 -1.222918

In [43]: df + df2
Out[43]: 
        one       two     three
a -2.203116  2.248945       NaN
b -0.354579  4.974208 -1.268586
c  0.924429 -0.972131  3.862388
d       NaN -0.912575 -2.445837

In [44]: df.add(df2, fill_value=0)
Out[44]: 
        one       two     three
a -2.203116  2.248945  1.000000
b -0.354579  4.974208 -1.268586
c  0.924429 -0.972131  3.862388
d       NaN -0.912575 -2.445837
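
The rule about fill_value can be reduced to three cases: neither value missing, one missing, both missing. The sketch below uses made-up single-column frames `left` and `right`.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"v": [1.0, np.nan, np.nan]})
right = pd.DataFrame({"v": [10.0, 20.0, np.nan]})

plain = left + right                     # any NaN operand gives NaN
filled = left.add(right, fill_value=0)   # a NaN on one side is treated as 0

print(plain["v"].tolist())
print(filled["v"].tolist())
```

Row 2 stays missing in both results because both operands are missing there.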

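Flexible comparisons

pandas provides the comparison methods eq, ne, lt, gt, le, and ge. They behave like the arithmetic methods described above and return an object of boolean values with the same shape as the compared data.

In these comparisons, an ordering or equality comparison (lt, gt, le, ge, eq) involving NaN always gives False, whether the other value is a scalar or another NaN.

Only the not-equal comparison (ne) involving NaN gives True.

In particular, np.nan == np.nan is always False.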

In [45]: df.gt(df2)
Out[45]: 
     one    two  three
a  False  False  False
b  False  False  False
c  False  False  False
d  False  False  False

In [46]: df2.ne(df)
Out[46]: 
     one    two  three
a  False  False   True
b  False  False  False
c  False  False  False
d   True  False  False
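
The NaN rules can be checked on a pair of short series. This is a sketch with made-up data; `left` and `right` are illustrative names.

```python
import numpy as np
import pandas as pd

left = pd.Series([1.0, np.nan, np.nan])
right = pd.Series([1.0, 2.0, np.nan])

equal = left.eq(right)       # NaN never compares equal, not even to NaN
not_equal = left.ne(right)   # so any position involving NaN is "not equal"
greater = left.gt(right)     # ordering comparisons with NaN are False

print(equal.tolist(), not_equal.tolist(), greater.tolist())
```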

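Boolean reductions

pandas provides the reductions empty, any(), all(), and bool() for summarizing a boolean result.

For a DataFrame, any() and all() reduce over each column by default; the axis parameter changes the direction.

Note that a boolean reduction does not test whether a scalar is True or False; it summarizes an object whose values are booleans.

The examples below likewise operate on objects that contain only boolean values.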

In [47]: (df > 0).all()
Out[47]: 
one      False
two      False
three    False
dtype: bool

In [48]: (df > 0).any()
Out[48]: 
one      True
two      True
three    True
dtype: bool

Reductions can be chained to obtain a single value:

In [49]: (df > 0).any().any()
Out[49]: True

The empty property tests whether a pandas object is empty:

In [50]: df.empty
Out[50]: False

In [51]: pd.DataFrame(columns=list('ABC')).empty
Out[51]: True

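To evaluate a pandas object holding a single boolean element, use bool() (this method is deprecated in recent pandas releases):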

In [52]: pd.Series([True]).bool()
Out[52]: True

In [53]: pd.Series([False]).bool()
Out[53]: False

In [54]: pd.DataFrame([[True]]).bool()
Out[54]: True

In [55]: pd.DataFrame([[False]]).bool()
Out[55]: False

Warning:

Trying either of the following, as one would with ordinary Python objects, raises an error:

if df:
     ...

df and df2

ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
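
A runnable version of this warning, together with the explicit reductions that avoid it, might look like the sketch below. The frame is a made-up example, and the exact wording of the error message can differ between pandas versions.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, -1], "b": [2, 3]})

message = None
try:
    if frame:        # ambiguous: pandas refuses to guess a single truth value
        pass
except ValueError as exc:
    message = str(exc)

# State the intent explicitly instead.
any_positive = bool((frame > 0).any().any())
all_positive = bool((frame > 0).all().all())
is_empty = frame.empty

print(message)
print(any_positive, all_positive, is_empty)
```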

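Comparing if objects are equivalent

Often the same result can be computed in different ways, for example df + df and df * 2. Combining this with the boolean reductions above,

one might test whether the two results are equal like this (the outcome may be surprising):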

In [56]: df+df == df*2
Out[56]: 
     one   two  three
a   True  True  False
b   True  True   True
c   True  True   True
d  False  True   True

In [57]: (df+df == df*2).all()
Out[57]: 
one      False
two       True
three    False
dtype: bool

This happens because the objects contain missing values, and missing values never compare as equal:

In [58]: np.nan == np.nan
Out[58]: False

For this reason pandas provides the equals() method for testing equality. Missing values in corresponding locations are treated as equal.

In [59]: (df+df).equals(df*2)
Out[59]: True

Note that for the objects to be considered equal, the index must also be in the same order.

In [60]: df1 = pd.DataFrame({'col':['foo', 0, np.nan]})

In [61]: df2 = pd.DataFrame({'col':[np.nan, 0, 'foo']}, index=[2,1,0])

In [62]: df1.equals(df2)
Out[62]: False

In [63]: df1.equals(df2.sort_index())
Out[63]: True
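
The difference between `==` with all() and equals(), and the effect of index order, can be seen in a few lines. This is a sketch on a made-up series.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = a.copy()

elementwise_all = bool((a == b).all())   # the NaN position compares unequal
same = a.equals(b)                       # NaN in the same location counts as equal

reordered = b.iloc[::-1]                 # same data, index order 2, 1, 0
same_reordered = a.equals(reordered)
same_sorted = a.equals(reordered.sort_index())

print(elementwise_all, same, same_reordered, same_sorted)
```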

Comparing array-like objects

Note that the discussion here concerns array-like pandas objects, namely Series and Index.

Comparing such an object elementwise with a scalar is straightforward:

In [64]: pd.Series(['foo', 'bar', 'baz']) == 'foo'
Out[64]: 
0     True
1    False
2    False
dtype: bool

In [65]: pd.Index(['foo', 'bar', 'baz']) == 'foo'
Out[65]: array([ True, False, False], dtype=bool)

Elementwise comparison between two array-like objects of the same length is also supported:

In [66]: pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])
Out[66]: 
0     True
1     True
2    False
dtype: bool

In [67]: pd.Series(['foo', 'bar', 'baz']) == np.array(['foo', 'bar', 'qux'])
Out[67]: 
0     True
1     True
2    False
dtype: bool

Comparing two Series or Index objects of different lengths raises an error:

In [55]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])
ValueError: Series lengths must match to compare

In [56]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])
ValueError: Series lengths must match to compare

Note that this differs from NumPy, where a comparison between ndarrays can be broadcast:

In [68]: np.array([1, 2, 3]) == np.array([2])
Out[68]: array([False,  True, False], dtype=bool)

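If the two arrays cannot be broadcast, the NumPy version used here returns False (newer NumPy releases raise an error instead):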

In [69]: np.array([1, 2, 3]) == np.array([1, 2])
Out[69]: False
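
For Series of different lengths, the sketch below shows the error from `==` and, as one alternative, the flexible eq() method, which aligns on the index first. The series are made-up, and the exact error message varies by pandas version.

```python
import pandas as pd

three = pd.Series([1, 2, 3])
two = pd.Series([1, 2])

try:
    three == two          # the indexes do not match, so pandas refuses
    raised = False
except ValueError:
    raised = True

# eq() aligns the two Series by label; the unmatched label 2 compares as False.
aligned = three.eq(two)

print(raised, aligned.tolist())
```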

Combining overlapping data sets

When combining two similar data sets, we often want the missing values in the first to be filled with known values from the second.

The combine_first method does this:

In [70]: df1 = pd.DataFrame({'A' : [1., np.nan, 3., 5., np.nan],
   ....:                     'B' : [np.nan, 2., 3., np.nan, 6.]})
   ....: 

In [71]: df2 = pd.DataFrame({'A' : [5., 2., 4., np.nan, 3., 7.],
   ....:                     'B' : [np.nan, np.nan, 3., 4., 6., 8.]})
   ....: 

In [72]: df1
Out[72]: 
     A    B
0  1.0  NaN
1  NaN  2.0
2  3.0  3.0
3  5.0  NaN
4  NaN  6.0

In [73]: df2
Out[73]: 
     A    B
0  5.0  NaN
1  2.0  NaN
2  4.0  3.0
3  NaN  4.0
4  3.0  6.0
5  7.0  8.0

In [74]: df1.combine_first(df2)
Out[74]: 
     A    B
0  1.0  NaN
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0
4  3.0  6.0
5  7.0  8.0

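General DataFrame combine

The combine_first method above calls the more general DataFrame.combine.

combine takes another DataFrame and a combiner function,

aligns the input DataFrame with the calling one, and then passes pairs of Series (columns with the same name) to the combiner function. For example, to reproduce combine_first as above:

In [75]: combiner = lambda x, y: np.where(pd.isna(x), y, x)

In [76]: df1.combine(df2, combiner)
Out[76]: 
     A    B
0  1.0  NaN
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0
4  3.0  6.0
5  7.0  8.0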
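
To check that a combiner of this form gives the same values as combine_first, here is a small sketch on made-up frames (`first`, `second`, and `take_first_non_missing` are illustrative names).

```python
import numpy as np
import pandas as pd

first = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 2.0, np.nan]})
second = pd.DataFrame({"A": [9.0, 8.0, np.nan, 7.0], "B": [np.nan, 5.0, 6.0, 4.0]})

# Fill gaps in `first` with values from `second`, after aligning on the index.
patched = first.combine_first(second)

def take_first_non_missing(x, y):
    # x and y are aligned columns: keep x where present, otherwise use y.
    return np.where(pd.isna(x), y, x)

combined = first.combine(second, take_first_non_missing)

print(patched)
print(combined)
```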

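Descriptive statistics

pandas provides many methods for computing descriptive statistics. Some produce an aggregated, lower-dimensional result, such as sum and mean.

Others return an object of the same shape as the input, such as cumsum and cumprod.

In general, these methods take an axis parameter that controls the axis along which the computation runs.

axis can be given by name, for example 'index' or 'columns', or by integer, for example 0 or 1.

The examples below use the DataFrame df with columns one, two, and three from the binary-operations section.

In [77]: df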
Out[77]: 
        one       two     three
a -1.101558  1.124472       NaN
b -0.177289  2.487104 -0.634293
c  0.462215 -0.486066  1.931194
d       NaN -0.456288 -1.222918

In [78]: df.mean(0)
Out[78]: 
one     -0.272211
two      0.667306
three    0.024661
dtype: float64

In [79]: df.mean(1)
Out[79]: 
a    0.011457
b    0.558507
c    0.635781
d   -0.839603
dtype: float64

All of these methods have a skipna parameter, True by default, that controls whether missing values are excluded.

In [80]: df.sum(0, skipna=False)
Out[80]: 
one           NaN
two      2.669223
three         NaN
dtype: float64

In [81]: df.sum(axis=1, skipna=True)
Out[81]: 
a    0.022914
b    1.675522
c    1.907343
d   -1.679206
dtype: float64

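Combined with broadcasting / arithmetic behavior, various statistical procedures can be expressed concisely, such as standardization (giving the data zero mean and a standard deviation of 1):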

In [82]: ts_stand = (df - df.mean()) / df.std()

In [83]: ts_stand.std()
Out[83]: 
one      1.0
two      1.0
three    1.0
dtype: float64

In [84]: xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

In [85]: xs_stand.std(1)
Out[85]: 
a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64
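
The standardization recipe can be verified on fixed numbers. This is a sketch on a made-up frame without missing values.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 0.0, -10.0, 20.0]})

# Column-wise: frame.mean() and frame.std() are Series labelled by column,
# so plain operators broadcast them across the rows.
col_standardized = (frame - frame.mean()) / frame.std()

# Row-wise: the per-row statistics are labelled by row, so axis=0 is needed.
row_standardized = frame.sub(frame.mean(axis=1), axis=0).div(frame.std(axis=1), axis=0)

print(col_standardized.mean().round(12).tolist(), col_standardized.std().tolist())
print(row_standardized.std(axis=1).tolist())
```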

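Note that cumsum() and cumprod() preserve the positions of NaN values, which differs somewhat from expanding() and rolling().

For more information, see the pandas documentation on computational tools.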

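Common functions of this kind include count, sum, mean, median, min, max, std, and var. In the pandas version covered here, each of them also accepts an optional level parameter, which applies only when the object has a hierarchical index.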

Note that some NumPy functions, such as mean, std, and sum, exclude missing values by default when given a Series as input:

In [87]: np.mean(df['one'])
Out[87]: -0.27221094480450114

In [88]: np.mean(df['one'].values)
Out[88]: nan
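
The same contrast on a three-element series, with `np.nanmean` added as one way to skip NaN on a plain ndarray. This is a sketch with made-up data.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

via_series = np.mean(s)               # dispatches to Series.mean, which skips NaN
via_ndarray = np.mean(s.to_numpy())   # plain ndarray: NaN propagates
explicit = np.nanmean(s.to_numpy())   # NumPy's own NaN-skipping mean

print(via_series, via_ndarray, explicit)
```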

Series.nunique() returns the number of unique non-missing values in a Series.

In [89]: series = pd.Series(np.random.randn(500))

In [90]: series[20:500] = np.nan

In [91]: series[10:20]  = 5

In [92]: series.nunique()
Out[92]: 11
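
A smaller deterministic sketch of the same idea, also showing the `dropna` parameter of nunique(); the series is made-up.

```python
import numpy as np
import pandas as pd

s = pd.Series([5.0, 5.0, 1.0, 2.0, np.nan, np.nan])

without_nan = s.nunique()               # counts 5.0, 1.0, 2.0
with_nan = s.nunique(dropna=False)      # NaN counted as one extra value

print(without_nan, with_nan)
```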

Summarizing data: describe

The describe() method quickly computes a variety of summary statistics for a Series or for the columns of a DataFrame (missing values are excluded):

In [93]: series = pd.Series(np.random.randn(1000))

In [94]: series[::2] = np.nan

In [95]: series.describe()
Out[95]: 
count    500.000000
mean      -0.032127
std        1.067484
min       -3.463789
25%       -0.725523
50%       -0.053230
75%        0.679790
max        3.120271
dtype: float64

In [96]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])

In [97]: frame.iloc[::2] = np.nan

In [98]: frame.describe()
Out[98]: 
                a           b           c           d           e
count  500.000000  500.000000  500.000000  500.000000  500.000000
mean    -0.045109   -0.052045    0.024520    0.006117    0.001141
std      1.029268    1.002320    1.042793    1.040134    1.005207
min     -2.915767   -3.294023   -3.610499   -2.907036   -3.010899
25%     -0.763783   -0.720389   -0.609600   -0.665896   -0.682900
50%     -0.086033   -0.048843    0.006093    0.043191   -0.001651
75%      0.663399    0.620980    0.728382    0.735973    0.656439
max      3.400646    2.925597    3.416896    3.331522    3.007143

The percentiles parameter selects specific percentiles to include in the output (the median is included by default):

In [99]: series.describe(percentiles=[.05, .25, .75, .95])
Out[99]: 
count    500.000000
mean      -0.032127
std        1.067484
min       -3.463789
5%        -1.733545
25%       -0.725523
50%       -0.053230
75%        0.679790
95%        1.854383
max        3.120271
dtype: float64

For a non-numeric Series, describe() gives the number of unique values and the most frequently occurring value with its frequency:

In [100]: s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])

In [101]: s.describe()
Out[101]: 
count     9
unique    4
top       a
freq      5
dtype: object

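Note that on a DataFrame with mixed types, describe() restricts the summary to the numeric columns.

If there are no numeric columns, it summarizes the non-numeric columns as for a Series, giving the number of unique values and the most frequent value.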

In [102]: frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})

In [103]: frame.describe()
Out[103]: 
              b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

This behavior can be controlled by passing lists of types to the include and exclude parameters.

The special value 'all' can also be passed as include:

In [104]: frame.describe(include=['object'])
Out[104]: 
          a
count     4
unique    2
top     Yes
freq      2

In [105]: frame.describe(include=['number'])
Out[105]: 
              b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

In [106]: frame.describe(include='all')
Out[106]: 
          a         b
count     4  4.000000
unique    2       NaN
top     Yes       NaN
freq      2       NaN
mean    NaN  1.500000
std     NaN  1.290994
min     NaN  0.000000
25%     NaN  0.750000
50%     NaN  1.500000
75%     NaN  2.250000
max     NaN  3.000000
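
The describe() options discussed in this section can be exercised together in one short sketch. The frame and its column names `label` and `score` are made-up; `exclude=[np.number]` is used here as one way to select the non-numeric columns.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"label": ["Yes", "Yes", "No", "No"], "score": [0, 1, 2, 3]})

default_summary = frame.describe()                  # numeric columns only
everything = frame.describe(include="all")          # numeric and non-numeric rows
non_numeric = frame.describe(exclude=[np.number])   # drop the numeric columns

# Custom percentiles on a single numeric column.
tails = frame["score"].describe(percentiles=[0.05, 0.95])

print(default_summary)
print(everything)
print(non_numeric)
print(tails)
```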

Index of min/max values

The idxmin() and idxmax() methods on Series and DataFrame return the index labels of the minimum and maximum values:

In [107]: s1 = pd.Series(np.random.randn(5))

In [108]: s1
Out[108]: 
0   -1.649461
1    0.169660
2    1.246181