A Beginner's Guide to pandas

# 10 Minutes to pandas

An introductory pandas tutorial aimed at newcomers. For more advanced recipes, see the [pandas cookbook](http://pandas.pydata.org/pandas-docs/stable/cookbook.html#cookbook).

By convention, pandas is imported as follows:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# When plotting inside an IPython notebook, add the following magic command
%matplotlib inline
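## Creating pandas Objects

Construct a pandas Series from a Python list:

# A Series gets a default integer index automatically
s = pd.Series([1, 2, 3, np.nan, 4, 5])
s
0    1.0
1    2.0
2    3.0
3    NaN
4    4.0
5    5.0
dtype: float64

Create a DataFrame from a NumPy array, using a date sequence as the row index and 'A', 'B', 'C', 'D' as the column labels:
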
dates = pd.date_range('20160101', periods=6)
dates
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

df
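
|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2016-01-01 | -0.808397 | -1.548973 | 1.013311 | 1.981536 |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 0.216295 |
| 2016-01-05 | -0.519748 | 0.386824 | -2.775289 | -0.088892 |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 0.773003 |

Alternatively, create a DataFrame by passing a dict whose values can be converted to Series-like columns: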

df2 = pd.DataFrame({
        'A': pd.Timestamp('20160701'),
        'B': pd.Series(1, index=list(range(4)), dtype='float32'),
        'C': np.array([3] * 4, dtype='int32'),
        'D': pd.Categorical(['Test', 'Train', 'Test', 'Train']),
        'E': 1,
        'F': 'foo'
    })
df2
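
|  | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 2016-07-01 | 1.0 | 3 | Test | 1 | foo |
| 1 | 2016-07-01 | 1.0 | 3 | Train | 1 | foo |
| 2 | 2016-07-01 | 1.0 | 3 | Test | 1 | foo |
| 3 | 2016-07-01 | 1.0 | 3 | Train | 1 | foo |

Each column of df2 has its own dtype, which can be inspected through the dtypes attribute: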

df2.dtypes
A    datetime64[ns]
B           float32
C             int32
D          category
E             int64
F            object
dtype: object

## Viewing Data

View the first and last few rows:

# head(n) returns the first n rows
df.head(3)
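
|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2016-01-01 | -0.808397 | -1.548973 | 1.013311 | 1.981536 |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 |

# tail(n) returns the last n rows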
df.tail(2)
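
|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2016-01-05 | -0.519748 | 0.386824 | -2.775289 | -0.088892 |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 0.773003 |

Inspect the DataFrame's row index, column labels, and underlying data: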

df.index
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', freq='D')
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.values
array([[-0.8083965 , -1.54897301,  1.01331067,  1.98153559],
       [ 1.96654297,  0.46829396,  0.16844495, -1.47401779],
       [-1.30845444,  0.62552152, -2.46554656,  1.75779664],
       [-1.43058558, -0.73216048, -0.03483597,  0.21629514],
       [-0.51974796,  0.3868237 , -2.77528915, -0.08889186],
       [ 1.02791114, -0.31108897,  0.64672466,  0.77300274]])

Quick summary statistics:

df.describe()
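
|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
| mean | -0.178788 | -0.185264 | -0.574532 | 0.527620 |
| std | 1.372179 | 0.846927 | 1.629433 | 1.278357 |
| min | -1.430586 | -1.548973 | -2.775289 | -1.474018 |
| 25% | -1.183440 | -0.626893 | -1.857869 | -0.012595 |
| 50% | -0.664072 | 0.037867 | 0.066804 | 0.494649 |
| 75% | 0.640996 | 0.447926 | 0.527155 | 1.511598 |
| max | 1.966543 | 0.625522 | 1.013311 | 1.981536 |

Transpose the data: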

df.T
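
|  | 2016-01-01 00:00:00 | 2016-01-02 00:00:00 | 2016-01-03 00:00:00 | 2016-01-04 00:00:00 | 2016-01-05 00:00:00 | 2016-01-06 00:00:00 |
| --- | --- | --- | --- | --- | --- | --- |
| A | -0.808397 | 1.966543 | -1.308454 | -1.430586 | -0.519748 | 1.027911 |
| B | -1.548973 | 0.468294 | 0.625522 | -0.732160 | 0.386824 | -0.311089 |
| C | 1.013311 | 0.168445 | -2.465547 | -0.034836 | -2.775289 | 0.646725 |
| D | 1.981536 | -1.474018 | 1.757797 | 0.216295 | -0.088892 | 0.773003 |

Sort by an axis (here the column labels, in descending order):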

df.sort_index(axis=1, ascending=False)
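
|  | D | C | B | A |
| --- | --- | --- | --- | --- |
| 2016-01-01 | 1.981536 | 1.013311 | -1.548973 | -0.808397 |
| 2016-01-02 | -1.474018 | 0.168445 | 0.468294 | 1.966543 |
| 2016-01-03 | 1.757797 | -2.465547 | 0.625522 | -1.308454 |
| 2016-01-04 | 0.216295 | -0.034836 | -0.732160 | -1.430586 |
| 2016-01-05 | -0.088892 | -2.775289 | 0.386824 | -0.519748 |
| 2016-01-06 | 0.773003 | 0.646725 | -0.311089 | 1.027911 |

Sort by the values in a column: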

df.sort_values(by='C')
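
|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2016-01-05 | -0.519748 | 0.386824 | -2.775289 | -0.088892 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 0.216295 |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 0.773003 |
| 2016-01-01 | -0.808397 | -1.548973 | 1.013311 | 1.981536 |

## Selecting Data

### Getting

# Selecting a single column yields a Series;
# df['A'] is equivalent to df.A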
df['A']
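2016-01-01   -0.808397
2016-01-02    1.966543
2016-01-03   -1.308454
2016-01-04   -1.430586
2016-01-05   -0.519748
2016-01-06    1.027911
Freq: D, Name: A, dtype: float64

Slicing with [] selects rows. **Note: an integer slice such as df[0:3] excludes the end position, whereas a label slice such as df['20160102':'20160104'] includes both endpoints.**

df[0:3]

|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2016-01-01 | -0.808397 | -1.548973 | 1.013311 | 1.981536 |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 |

df['20160102':'20160104']

|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 0.216295 |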
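
The difference between positional and label slicing is easy to get wrong, so here is a minimal sketch using the explicit .iloc and .loc forms. The frame `demo` and its values are made up for illustration and are not part of the tutorial's data.

```python
import numpy as np
import pandas as pd

# Hypothetical 6-row frame indexed by consecutive days.
demo_dates = pd.date_range("20160101", periods=6)
demo = pd.DataFrame({"A": np.arange(6)}, index=demo_dates)

# Positional slice: Python semantics, the stop position (3) is excluded.
by_position = demo.iloc[0:3]

# Label slice: both end labels are included.
by_label = demo.loc["20160102":"20160104"]

print(by_position.index[-1])  # last row is 2016-01-03
print(by_label.index[-1])     # last row is 2016-01-04
```

Both results have three rows, but only the label slice contains the row named by its stop value.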

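Select data by label. Note: label-based access uses .loc and .at, while position-based access uses .iloc and .iat. The older .ix indexer was removed in pandas 1.0.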

df.loc['20160101']
A   -0.808397
B   -1.548973
C    1.013311
D    1.981536
Name: 2016-01-01 00:00:00, dtype: float64

Fancy indexing: select several columns by label:

df.loc[:, ['A','B']]
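
|  | A | B |
| --- | --- | --- |
| 2016-01-01 | -0.808397 | -1.548973 |
| 2016-01-02 | 1.966543 | 0.468294 |
| 2016-01-03 | -1.308454 | 0.625522 |
| 2016-01-04 | -1.430586 | -0.732160 |
| 2016-01-05 | -0.519748 | 0.386824 |
| 2016-01-06 | 1.027911 | -0.311089 |

Slice by label (both endpoints are included):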

df.loc['20160102':'20160103', ['B','C']]
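
|  | B | C |
| --- | --- | --- |
| 2016-01-02 | 0.468294 | 0.168445 |
| 2016-01-03 | 0.625522 | -2.465547 |

df.loc['20160103', ['A', 'B']]
A   -1.308454
B    0.625522
Name: 2016-01-03 00:00:00, dtype: float64

Get a single value:

print(df.loc['20160101', 'A'])
# or use the .at accessor
print(df.at[dates[0], 'A'])
-0.808396502432
-0.808396502432

Select by position: use integer coordinates to fetch a slice or a single value. Slicing here follows Python and NumPy conventions, so the end position is excluded.
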
df.iloc[3]
A   -1.430586
B   -0.732160
C   -0.034836
D    0.216295
Name: 2016-01-04 00:00:00, dtype: float64
df.iloc[3:5, 0:2]

|  | A | B |
| --- | --- | --- |
| 2016-01-04 | -1.430586 | -0.732160 |
| 2016-01-05 | -0.519748 | 0.386824 |

df.iloc[[1, 2, 4], [0, 2]]

|  | A | C |
| --- | --- | --- |
| 2016-01-02 | 1.966543 | 0.168445 |
| 2016-01-03 | -1.308454 | -2.465547 |
| 2016-01-05 | -0.519748 | -2.775289 |

df.iloc[1:3]

|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 |

df.iloc[:, 1:3]

|  | B | C |
| --- | --- | --- |
| 2016-01-01 | -1.548973 | 1.013311 |
| 2016-01-02 | 0.468294 | 0.168445 |
| 2016-01-03 | 0.625522 | -2.465547 |
| 2016-01-04 | -0.732160 | -0.034836 |
| 2016-01-05 | 0.386824 | -2.775289 |
| 2016-01-06 | -0.311089 | 0.646725 |

df.iloc[1, 1]  # equivalent to df.iat[1, 1]
0.46829396335234058

### Boolean Indexing

df[df.A > 0]
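
|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 0.773003 |

df[df > 0]

|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2016-01-01 | NaN | NaN | 1.013311 | 1.981536 |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | NaN |
| 2016-01-03 | NaN | 0.625522 | NaN | 1.757797 |
| 2016-01-04 | NaN | NaN | NaN | 0.216295 |
| 2016-01-05 | NaN | 0.386824 | NaN | NaN |
| 2016-01-06 | 1.027911 | NaN | 0.646725 | 0.773003 |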
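
Boolean conditions can also be combined. The sketch below, on a small made-up frame (`demo` is hypothetical), shows `&` and `|` with each condition in its own parentheses, and shows that indexing with a boolean frame is the same as calling `where`.

```python
import pandas as pd

# Hypothetical data, chosen so each condition selects different rows.
demo = pd.DataFrame({"A": [1, -2, 3, -4], "B": [10, 20, -30, -40]})

# Parentheses are required: & and | bind more tightly than comparisons.
both_positive = demo[(demo["A"] > 0) & (demo["B"] > 0)]
either_positive = demo[(demo["A"] > 0) | (demo["B"] > 0)]

# A boolean frame keeps the shape and masks non-matching cells with NaN.
masked = demo.where(demo > 0)

print(both_positive)
print(either_positive)
print(masked)
```

A row-wise boolean Series drops whole rows, while a boolean frame keeps every row and blanks out individual cells.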

Use the isin method to filter rows:

df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
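
|  | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | -0.808397 | -1.548973 | 1.013311 | 1.981536 | one |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 | one |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 | two |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 0.216295 | three |
| 2016-01-05 | -0.519748 | 0.386824 | -2.775289 | -0.088892 | four |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 0.773003 | three |

df2[df2['E'].isin(['one', 'three'])]

|  | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | -0.808397 | -1.548973 | 1.013311 | 1.981536 | one |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 | one |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 0.216295 | three |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 0.773003 | three |

### Setting

Adding a new column aligns the data on the index automatically:
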
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20160102', periods=6))
df['F'] = s1
df
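
|  | A | B | C | D | F |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | -0.808397 | -1.548973 | 1.013311 | 1.981536 | NaN |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 | 1.0 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 | 2.0 |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 0.216295 | 3.0 |
| 2016-01-05 | -0.519748 | 0.386824 | -2.775289 | -0.088892 | 4.0 |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 0.773003 | 5.0 |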
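
The NaN in the first row of F comes from index alignment: s1 starts one day later than the frame, so 2016-01-01 has no match, and the value for 2016-01-07 is dropped. A minimal sketch of the same rule, with hypothetical labels and values:

```python
import pandas as pd

# Hypothetical frame and Series whose labels only partly overlap.
demo = pd.DataFrame({"A": [10, 20, 30]}, index=["x", "y", "z"])
new_col = pd.Series([1, 2, 3], index=["y", "z", "w"])

# Assignment matches on index labels, not on position:
# "x" has no partner and becomes NaN, "w" is not in the frame and is ignored.
demo["F"] = new_col

print(demo)
```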

Values can also be set by label:

df.at[dates[0], 'A'] = 0
df
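
|  | A | B | C | D | F |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | 0.000000 | -1.548973 | 1.013311 | 1.981536 | NaN |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 | 1.0 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 | 2.0 |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 0.216295 | 3.0 |
| 2016-01-05 | -0.519748 | 0.386824 | -2.775289 | -0.088892 | 4.0 |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 0.773003 | 5.0 |

Set a value by position: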

df.iat[0, 1] = 0
df
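
|  | A | B | C | D | F |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | 0.000000 | 0.000000 | 1.013311 | 1.981536 | NaN |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 | 1.0 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 | 2.0 |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 0.216295 | 3.0 |
| 2016-01-05 | -0.519748 | 0.386824 | -2.775289 | -0.088892 | 4.0 |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 0.773003 | 5.0 |

Assign a NumPy array to a column: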

df.loc[:, 'D'] = np.array([5] * len(df))
df
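
|  | A | B | C | D | F |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | 0.000000 | 0.000000 | 1.013311 | 5 | NaN |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | 5 | 1.0 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 5 | 2.0 |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 5 | 3.0 |
| 2016-01-05 | -0.519748 | 0.386824 | -2.775289 | 5 | 4.0 |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 5 | 5.0 |
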
df2 = df.copy()
df2[df2>0] = -df2
df2
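
|  | A | B | C | D | F |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | 0.000000 | 0.000000 | -1.013311 | -5 | NaN |
| 2016-01-02 | -1.966543 | -0.468294 | -0.168445 | -5 | -1.0 |
| 2016-01-03 | -1.308454 | -0.625522 | -2.465547 | -5 | -2.0 |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | -5 | -3.0 |
| 2016-01-05 | -0.519748 | -0.386824 | -2.775289 | -5 | -4.0 |
| 2016-01-06 | -1.027911 | -0.311089 | -0.646725 | -5 | -5.0 |

## Missing Data

pandas uses np.nan to represent missing data. By default, missing values are excluded from computations.
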
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
df1
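
|  | A | B | C | D | F | E |
| --- | --- | --- | --- | --- | --- | --- |
| 2016-01-01 | 0.000000 | 0.000000 | 1.013311 | 5 | NaN | 1.0 |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | 5 | 1.0 | 1.0 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 5 | 2.0 | NaN |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 5 | 3.0 | NaN |

Option 1: drop every row that contains missing data: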

df1.dropna(how='any')
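
|  | A | B | C | D | F | E |
| --- | --- | --- | --- | --- | --- | --- |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | 5 | 1.0 | 1.0 |

Option 2: fill in the missing values: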

df1.fillna(value=5)
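
|  | A | B | C | D | F | E |
| --- | --- | --- | --- | --- | --- | --- |
| 2016-01-01 | 0.000000 | 0.000000 | 1.013311 | 5 | 5.0 | 1.0 |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | 5 | 1.0 | 1.0 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 5 | 2.0 | 5.0 |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 5 | 3.0 | 5.0 |

Get a boolean mask that marks where values are missing: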

df.isnull()
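# True marks a position where the value is missing

|  | A | B | C | D | F |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | False | False | False | False | True |
| 2016-01-02 | False | False | False | False | False |
| 2016-01-03 | False | False | False | False | False |
| 2016-01-04 | False | False | False | False | False |
| 2016-01-05 | False | False | False | False | False |
| 2016-01-06 | False | False | False | False | False |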
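
To make the "excluded from computations by default" rule concrete, here is a small sketch on a made-up Series (`s_demo` is hypothetical). It contrasts the default behaviour with `skipna=False`.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
s_demo = pd.Series([1.0, np.nan, 3.0])

mean_skipping = s_demo.mean()             # NaN ignored: (1 + 3) / 2
mean_strict = s_demo.mean(skipna=False)   # NaN propagates into the result
n_valid = s_demo.count()                  # counts non-missing values only

print(mean_skipping, mean_strict, n_valid)
```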
## Operations

Operations generally exclude missing data.

### Statistics

df.mean()
A   -0.044056
B    0.072898
C   -0.574532
D    5.000000
F    3.000000
dtype: float64
# The same statistic computed across each row
df.mean(axis=1)
2016-01-01    1.503328
2016-01-02    1.720656
2016-01-03    0.770304
2016-01-04    1.160484
2016-01-05    1.218357
2016-01-06    2.272709
Freq: D, dtype: float64

Operating on objects of different dimensionality requires alignment; pandas then broadcasts automatically along the specified dimension:

s = pd.Series([1,3,5,np.nan, 6, 8], index=dates).shift(2)
s
2016-01-01    NaN
2016-01-02    NaN
2016-01-03    1.0
2016-01-04    3.0
2016-01-05    5.0
2016-01-06    NaN
Freq: D, dtype: float64
df.sub(s, axis='index')
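
|  | A | B | C | D | F |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | NaN | NaN | NaN | NaN | NaN |
| 2016-01-02 | NaN | NaN | NaN | NaN | NaN |
| 2016-01-03 | -2.308454 | -0.374478 | -3.465547 | 4.0 | 1.0 |
| 2016-01-04 | -4.430586 | -3.732160 | -3.034836 | 2.0 | 0.0 |
| 2016-01-05 | -5.519748 | -4.613176 | -7.775289 | 0.0 | -1.0 |
| 2016-01-06 | NaN | NaN | NaN | NaN | NaN |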
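
In the subtraction above, the Series is matched against the row labels and then subtracted from every column, so a NaN in the Series turns the whole row into NaN. A minimal sketch with made-up numbers (`demo` and `offsets` are hypothetical names):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {"A": [10.0, 20.0, 30.0], "B": [1.0, 2.0, 3.0]},
    index=["x", "y", "z"],
)
offsets = pd.Series([1.0, np.nan, 3.0], index=["x", "y", "z"])

# axis="index" aligns the Series with the row labels and
# broadcasts it across all columns.
result = demo.sub(offsets, axis="index")

print(result)
```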
### Apply

Apply a function to the data:

df.apply(np.cumsum, axis=0)
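
|  | A | B | C | D | F |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | 0.000000 | 0.000000 | 1.013311 | 5 | NaN |
| 2016-01-02 | 1.966543 | 0.468294 | 1.181756 | 10 | 1.0 |
| 2016-01-03 | 0.658089 | 1.093815 | -1.283791 | 15 | 3.0 |
| 2016-01-04 | -0.772497 | 0.361655 | -1.318627 | 20 | 6.0 |
| 2016-01-05 | -1.292245 | 0.748479 | -4.093916 | 25 | 10.0 |
| 2016-01-06 | -0.264334 | 0.437390 | -3.447191 | 30 | 15.0 |
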
df.apply(lambda x: x.max() - x.min())
A    3.397129
B    1.357682
C    3.788600
D    0.000000
F    4.000000
dtype: float64

### Histogramming

s = pd.Series(np.random.randint(0,7,size=10))
s
0    1
1    5
2    6
3    5
4    6
5    4
6    0
7    3
8    6
9    5
dtype: int64
s.value_counts()
6    3
5    3
4    1
3    1
1    1
0    1
dtype: int64

### String Methods

s = pd.Series(['A', 'B', 'C', 'Aaba', 'BAcd', np.nan, 'CBA', 'dog', 'CAT'])
s.str.lower()
0       a
1       b
2       c
3    aaba
4    bacd
5     NaN
6     cba
7     dog
8     cat
dtype: object

## Merging

### Concat

# concat joins pandas objects together
df = pd.DataFrame(np.random.randn(10,4))
df
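
|  | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | -0.859307 | -0.723708 | -1.121663 | 1.438285 |
| 1 | -0.168126 | -0.343567 | 0.678940 | 0.394126 |
| 2 | -0.541090 | 1.908998 | -0.543378 | -0.109371 |
| 3 | -1.108110 | 0.332687 | -1.320752 | 1.022476 |
| 4 | 0.591171 | -1.259859 | 0.930266 | 0.688108 |
| 5 | -0.065470 | -0.957394 | 1.423691 | -0.295647 |
| 6 | 1.728151 | 0.162709 | 0.836916 | -0.573260 |
| 7 | -0.025487 | 0.307945 | -0.414787 | -0.045495 |
| 8 | -0.601439 | -0.167967 | -1.198304 | 0.242739 |
| 9 | 0.495473 | -0.348495 | 1.599757 | 0.184015 |
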
pieces = [df[:3], df[5:]]
pd.concat(pieces)
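
|  | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | -0.859307 | -0.723708 | -1.121663 | 1.438285 |
| 1 | -0.168126 | -0.343567 | 0.678940 | 0.394126 |
| 2 | -0.541090 | 1.908998 | -0.543378 | -0.109371 |
| 5 | -0.065470 | -0.957394 | 1.423691 | -0.295647 |
| 6 | 1.728151 | 0.162709 | 0.836916 | -0.573260 |
| 7 | -0.025487 | 0.307945 | -0.414787 | -0.045495 |
| 8 | -0.601439 | -0.167967 | -1.198304 | 0.242739 |
| 9 | 0.495473 | -0.348495 | 1.599757 | 0.184015 |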
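
Note that the concatenated result keeps the original row labels, which is why 3 and 4 are missing above. If a fresh 0..n-1 index is wanted, `ignore_index=True` renumbers the rows. A small sketch on a made-up frame (`demo` is hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({"v": range(10)})
pieces = [demo[:3], demo[5:]]

kept = pd.concat(pieces)                            # labels 0, 1, 2, 5, ..., 9
renumbered = pd.concat(pieces, ignore_index=True)   # labels 0 .. 7

print(kept.index.tolist())
print(renumbered.index.tolist())
```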
### Join (SQL-style merge)

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1,2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4,5]})
left

|  | key | lval |
| --- | --- | --- |
| 0 | foo | 1 |
| 1 | foo | 2 |

right

|  | key | rval |
| --- | --- | --- |
| 0 | foo | 4 |
| 1 | foo | 5 |

pd.merge(left, right, on='key')

|  | key | lval | rval |
| --- | --- | --- | --- |
| 0 | foo | 1 | 4 |
| 1 | foo | 1 | 5 |
| 2 | foo | 2 | 4 |
| 3 | foo | 2 | 5 |
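
The four rows above appear because the key foo occurs twice on each side, so merge produces every pairing (2 x 2). With unique keys the result has one row per match, and the `how` parameter decides what happens to keys found on only one side. A minimal sketch with hypothetical frames `left_u` and `right_u`:

```python
import pandas as pd

left_u = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right_u = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

# Default is an inner join: only keys present on both sides survive.
inner = pd.merge(left_u, right_u, on="key")

# An outer join keeps every key and fills the gaps with NaN.
outer = pd.merge(left_u, right_u, on="key", how="outer")

print(inner)
print(outer)
```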
### Append

Append rows to the end of a DataFrame:

df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df
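
|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | -0.535803 | -0.319896 | -0.313776 | -0.401106 |
| 1 | -0.231405 | 2.058233 | 0.771222 | 0.170204 |
| 2 | -1.699222 | -0.098205 | 0.465100 | 0.295165 |
| 3 | -0.273538 | -0.902247 | -0.328348 | 0.771312 |
| 4 | 0.080118 | 0.796800 | 0.564468 | 0.526290 |
| 5 | 0.485221 | 0.478245 | -0.943854 | -0.097568 |
| 6 | -0.440915 | 0.134749 | -0.840602 | -0.836712 |
| 7 | -0.283432 | -0.029233 | 1.725972 | -0.878117 |
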
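s = df.iloc[3]
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
# On older versions the original call was: df.append(s, ignore_index=True)
pd.concat([df, s.to_frame().T], ignore_index=True)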
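
|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | -0.535803 | -0.319896 | -0.313776 | -0.401106 |
| 1 | -0.231405 | 2.058233 | 0.771222 | 0.170204 |
| 2 | -1.699222 | -0.098205 | 0.465100 | 0.295165 |
| 3 | -0.273538 | -0.902247 | -0.328348 | 0.771312 |
| 4 | 0.080118 | 0.796800 | 0.564468 | 0.526290 |
| 5 | 0.485221 | 0.478245 | -0.943854 | -0.097568 |
| 6 | -0.440915 | 0.134749 | -0.840602 | -0.836712 |
| 7 | -0.283432 | -0.029233 | 1.725972 | -0.878117 |
| 8 | -0.273538 | -0.902247 | -0.328348 | 0.771312 |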
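### Grouping (groupby)

A group-by operation generally involves the following steps:

- Splitting the data into groups according to some criteria
- Applying a function to each group
- Combining the per-group results into a single structure
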
df = pd.DataFrame({
        'A': ['foo','bar','foo','bar','foo','bar','foo','foo'],
        'B': ['one','one','two','three', 'two', 'two', 'one', 'three'],
        'C': np.random.randn(8),
        'D': np.random.randn(8)
    })
df
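
|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | foo | one | 0.996471 | 0.659993 |
| 1 | bar | one | 0.990690 | -1.102114 |
| 2 | foo | two | -0.138965 | 0.236194 |
| 3 | bar | three | 0.033469 | 0.253152 |
| 4 | foo | two | -0.574320 | 0.081216 |
| 5 | bar | two | 1.992456 | 0.939238 |
| 6 | foo | one | -0.514013 | -1.610422 |
| 7 | foo | three | -0.640462 | -1.606399 |

Group by column A, then apply sum to the numeric columns of each group: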

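# Select the numeric columns explicitly; recent pandas versions no longer
# drop the string column B automatically when summing.
df.groupby('A')[['C', 'D']].sum()

| A | C | D |
| --- | --- | --- |
| bar | 3.016615 | 0.090276 |
| foo | -0.871289 | -2.239418 |

Group by several columns, which forms a hierarchical index, then apply sum: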

df.groupby(['A', 'B']).sum()
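
| A | B | C | D |
| --- | --- | --- | --- |
| bar | one | 0.990690 | -1.102114 |
|  | three | 0.033469 | 0.253152 |
|  | two | 1.992456 | 0.939238 |
| foo | one | 0.482458 | -0.950429 |
|  | three | -0.640462 | -1.606399 |
|  | two | -0.713285 | 0.317410 |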
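
The three steps listed at the start of this section (split, apply, combine) can be spelled out by hand. The sketch below does so on a small made-up frame and compares the result with the built-in call; `demo`, `partial_sums`, and `manual` are illustrative names, and this is only meant to show the idea, not how pandas implements it internally.

```python
import pandas as pd

demo = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# 1. Split: iterating a GroupBy yields (key, sub-frame) pairs.
# 2. Apply: sum column C inside each sub-frame.
partial_sums = {key: group["C"].sum() for key, group in demo.groupby("A")}

# 3. Combine: assemble the per-group results into one Series.
manual = pd.Series(partial_sums, name="C").sort_index()
manual.index.name = "A"

# The built-in one-liner performs all three steps.
builtin = demo.groupby("A")["C"].sum()

print(manual)
print(builtin)
```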
## Reshaping and Pivoting

### Stacking and Unstacking

tuples = list(zip(*[
            ['bar','bar', 'baz', 'baz', 'foo','foo', 'qux', 'qux'],
            ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two' ]
        ]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A','B'])
df2 = df[:4]
df2
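
| first | second | A | B |
| --- | --- | --- | --- |
| bar | one | -0.084595 | 1.495368 |
|  | two | -0.801703 | -0.663997 |
| baz | one | -0.108681 | -0.986022 |
|  | two | -0.524829 | 0.983664 |

# stack() moves the columns into a new innermost level of the row index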
stacked = df2.stack()
stacked
first  second
bar    one     A   -0.084595
               B    1.495368
       two     A   -0.801703
               B   -0.663997
baz    one     A   -0.108681
               B   -0.986022
       two     A   -0.524829
               B    0.983664
dtype: float64
# unstack() moves a row-index level back into the columns
stacked.unstack()
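
| first | second | A | B |
| --- | --- | --- | --- |
| bar | one | -0.084595 | 1.495368 |
|  | two | -0.801703 | -0.663997 |
| baz | one | -0.108681 | -0.986022 |
|  | two | -0.524829 | 0.983664 |

# By default unstack() works on the innermost level; pass a level number to choose another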
stacked.unstack(1)

| first |  | second=one | second=two |
| --- | --- | --- | --- |
| bar | A | -0.084595 | -0.801703 |
|  | B | 1.495368 | -0.663997 |
| baz | A | -0.108681 | -0.524829 |
|  | B | -0.986022 | 0.983664 |

stacked.unstack(0)

| second |  | first=bar | first=baz |
| --- | --- | --- | --- |
| one | A | -0.084595 | -0.108681 |
|  | B | 1.495368 | -0.986022 |
| two | A | -0.801703 | -0.524829 |
|  | B | -0.663997 | 0.983664 |
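
Because the tables above use random numbers, the round trip can be hard to follow. Here is a small deterministic sketch (the frame `demo` and its values are made up) showing that `unstack()` undoes `stack()`, and that passing a level picks which index level becomes the columns.

```python
import pandas as pd

demo = pd.DataFrame(
    {"A": [1, 2], "B": [3, 4]},
    index=pd.Index(["one", "two"], name="row"),
)

stacked_demo = demo.stack()          # Series indexed by (row, column label)
restored = stacked_demo.unstack()    # innermost level returns to the columns
swapped = stacked_demo.unstack(0)    # level 0 ("row") becomes the columns

print(stacked_demo)
print(restored)
print(swapped)
```

Here `restored` has the same layout as `demo`, while `swapped` is its transpose.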
### Pivot Tables

df = pd.DataFrame({
        'A': ['one', 'two', 'three', 'four'] * 3,
        'B': ['A', 'B', 'C'] * 4,
        'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
        'D': np.random.randn(12),
        'E': np.random.randn(12)

    })
df
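
|  | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 0 | one | A | foo | 0.319799 | -1.264188 |
| 1 | two | B | foo | 0.929552 | -0.092799 |
| 2 | three | C | foo | -2.510099 | 0.979121 |
| 3 | four | A | bar | 1.727211 | 0.083378 |
| 4 | one | B | bar | 0.636672 | -0.167700 |
| 5 | two | C | bar | 0.337749 | 0.782511 |
| 6 | three | A | foo | 0.429180 | -2.415025 |
| 7 | four | B | foo | 0.334974 | -1.997174 |
| 8 | one | C | foo | 0.248257 | -1.003121 |
| 9 | two | A | bar | 0.465319 | 1.133168 |
| 10 | three | B | bar | 0.111670 | -0.730784 |
| 11 | four | C | bar | -1.903981 | -0.089501 |
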
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
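
| A | B | C=bar | C=foo |
| --- | --- | --- | --- |
| four | A | 1.727211 | NaN |
|  | B | NaN | 0.334974 |
|  | C | -1.903981 | NaN |
| one | A | NaN | 0.319799 |
|  | B | 0.636672 | NaN |
|  | C | NaN | 0.248257 |
| three | A | NaN | 0.429180 |
|  | B | 0.111670 | NaN |
|  | C | NaN | -2.510099 |
| two | A | 0.465319 | NaN |
|  | B | NaN | 0.929552 |
|  | C | 0.337749 | NaN |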
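
In the table above every (A, B, C) combination occurs at most once, so each cell is either a single value or NaN for a combination that never appears. When several rows fall into the same cell, pivot_table aggregates them, using the mean unless `aggfunc` says otherwise. A small sketch with made-up data (`demo` is hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({
    "A": ["one", "one", "two", "two"],
    "C": ["foo", "foo", "bar", "foo"],
    "D": [1.0, 3.0, 5.0, 7.0],
})

# Two rows share (A="one", C="foo"); the default aggregation is the mean.
mean_table = pd.pivot_table(demo, values="D", index="A", columns="C")

# The same layout, aggregated with a sum instead.
sum_table = pd.pivot_table(demo, values="D", index="A", columns="C", aggfunc="sum")

print(mean_table)
print(sum_table)
```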
## Time Series

pandas provides simple, efficient functions for changing the frequency of time series data (resampling).

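
As a small deterministic illustration of frequency conversion, the sketch below downsamples six one-minute observations into three-minute bins. The values and the names `idx`, `ts_demo`, and `per_3min` are made up; the randomized example that follows works the same way at a one-second resolution.

```python
import pandas as pd

# Six observations, one per minute, starting at midnight.
idx = pd.date_range("2016-01-01", periods=6, freq="min")
ts_demo = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Each 3-minute bin is labelled by its start time and holds the sum
# of the observations that fall inside it.
per_3min = ts_demo.resample("3min").sum()

print(per_3min)
```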
rng = pd.date_range('1/1/2016', periods=100, freq='S')

ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

ts.resample('10S').sum()
2016-01-01 00:00:00    2910
2016-01-01 00:00:10    2506
2016-01-01 00:00:20    2812
2016-01-01 00:00:30    2923
2016-01-01 00:00:40    2510
2016-01-01 00:00:50    2817
2016-01-01 00:01:00    2672
2016-01-01 00:01:10    2486
2016-01-01 00:01:20    3243
2016-01-01 00:01:30    2865
Freq: 10S, dtype: int64

Time zone handling: first attach a time zone to a naive series.

rng = pd.date_range('2/2/2016 00:00', periods=5 , freq='D')

ts = pd.Series(np.random.randn(len(rng)), rng)

ts
2016-02-02   -0.662500
2016-02-03   -0.762211
2016-02-04    0.954675
2016-02-05   -0.411404
2016-02-06    0.237898
Freq: D, dtype: float64
ts_utc = ts.tz_localize('UTC')

ts_utc
2016-02-02 00:00:00+00:00   -0.662500
2016-02-03 00:00:00+00:00   -0.762211
2016-02-04 00:00:00+00:00    0.954675
2016-02-05 00:00:00+00:00   -0.411404
2016-02-06 00:00:00+00:00    0.237898
Freq: D, dtype: float64

Convert to another time zone:

ts_utc.tz_convert('Asia/Shanghai')
2016-02-02 08:00:00+08:00   -0.662500
2016-02-03 08:00:00+08:00   -0.762211
2016-02-04 08:00:00+08:00    0.954675
2016-02-05 08:00:00+08:00   -0.411404
2016-02-06 08:00:00+08:00    0.237898
Freq: D, dtype: float64

Convert between timestamp and period representations:

rng = pd.date_range('1/1/2016', periods=5, freq='M')

ts = pd.Series(np.random.randn(len(rng)), index=rng)

ts
2016-01-31   -2.143138
2016-02-29    1.683414
2016-03-31   -0.427250
2016-04-30   -0.900378
2016-05-31   -1.039857
Freq: M, dtype: float64
ps = ts.to_period()
ps
2016-01   -2.143138
2016-02    1.683414
2016-03   -0.427250
2016-04   -0.900378
2016-05   -1.039857
Freq: M, dtype: float64
ps.to_timestamp()
2016-01-01   -2.143138
2016-02-01    1.683414
2016-03-01   -0.427250
2016-04-01   -0.900378
2016-05-01   -1.039857
Freq: MS, dtype: float64

## Categoricals

df = pd.DataFrame({
        "id": [1,2,3,4,5,6],
        "raw_grade": ['a','b','b','a','a','e']
    })

df['grade'] = df['raw_grade'].astype('category')
df['grade']
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
# Rename the categories to more meaningful names
# (direct assignment to .cat.categories was removed in pandas 2.0)
df['grade'] = df['grade'].cat.rename_categories(['very good', 'good', 'bad'])
df.grade
0    very good
1         good
2         good
3    very good
4    very good
5          bad
Name: grade, dtype: category
Categories (3, object): [very good, good, bad]
# Reorder the categories and add the missing ones at the same time
df.grade = df.grade.cat.set_categories(['very bad', 'bad', 'medium', 'good', 'very good'])
df.grade
0    very good
1         good
2         good
3    very good
4    very good
5          bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
df.sort_values(by='grade')
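
|  | id | raw_grade | grade |
| --- | --- | --- | --- |
| 5 | 6 | e | bad |
| 1 | 2 | b | good |
| 2 | 3 | b | good |
| 0 | 1 | a | very good |
| 3 | 4 | a | very good |
| 4 | 5 | a | very good |
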
df.groupby('grade').size()
grade
very bad     0
bad          1
medium       0
good         2
very good    3
dtype: int64
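
Two things happen in the outputs above: sorting follows the order of the categories rather than alphabetical order, and categories with no rows still show up with a count of 0. The sketch below reproduces both effects on a made-up Series (`grades` is a hypothetical name).

```python
import pandas as pd

grades = pd.Series(["medium", "bad", "good"], dtype="category")
grades = grades.cat.set_categories(["very bad", "bad", "medium", "good", "very good"])

# Sorting uses the category order defined above...
by_category = grades.sort_values().tolist()
# ...which differs from plain alphabetical order.
by_text = sorted(grades.astype(str))

# Unused categories are reported with a zero count.
counts = grades.value_counts(sort=False)

print(by_category)
print(by_text)
print(counts)
```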

## Plotting

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))

ts = ts.cumsum()

ts.plot(grid=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7ffa41fa9908>


df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list('ABCD'))
df = df.cumsum()

plt.figure()
df.plot(grid=True)
plt.legend(loc='best')
<matplotlib.legend.Legend at 0x7ffa41de6e48>




<matplotlib.figure.Figure at 0x7ffa41f1e198>
