分类
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'e', 'e']})
df
'''
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 e
5 6 e
'''
把raw_grade转换为分类类型
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
'''
0 a
1 b
2 b
3 a
4 e
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
'''
重命名类别名为更有意义的名称
df["grade"].cat.categories = ["very good", "good", "very bad"]
df["grade"]
'''
0 very good
1 good
2 good
3 very good
4 very bad
5 very bad
Name: grade, dtype: category
Categories (3, object): [very good, good, very bad]
'''
对分类重新排序,并添加缺失的分类
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
'''
0 very good
1 good
2 good
3 very good
4 very bad
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
'''
排序是按照分类的顺序进行的,而不是字典序,也就是按照 [very bad, bad, medium, good, very good]顺序分类
df.sort_values(by="grade")
'''
id raw_grade grade
4 5 e very bad
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
'''
按分类分组时,也会显示空的分类
df.groupby("grade").size()
'''
grade
very bad 2
bad 0
medium 0
good 2
very good 2
dtype: int64
'''
绘图
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
对于DataFrame类型,plot()能很方便地画出所有列及其标签
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')
获取数据的I/O
CSV
写入一个csv文件
df.to_csv('data/foo.csv')
从一个csv文件读入
pd.read_csv('data/foo.csv')
'''
Unnamed: 0 A B C D
0 2000-01-01 -0.547915 0.758504 -0.791553 -0.014276
1 2000-01-02 -1.478652 -1.824863 0.001416 -1.067864
2 2000-01-03 -3.070727 -1.655967 -0.551411 -2.188238
3 2000-01-04 -3.317105 -1.810778 0.354000 -1.995760
4 2000-01-05 -2.822308 -2.757077 1.210057 -1.250592
5 2000-01-06 -2.223865 -3.256931 3.658672 -1.232510
6 2000-01-07 -0.559411 -2.842820 4.860993 -0.418853
7 2000-01-08 -0.524812 -2.748762 6.280814 0.170616
8 2000-01-09 0.204528 -0.865022 5.560456 0.875871
9 2000-01-10 0.688895 -1.704024 5.226062 0.444483
10 2000-01-11 1.272138 -2.478377 2.763524 -1.229797
11 2000-01-12 1.255657 -2.994073 2.869688 0.745504
12 2000-01-13 -0.001803 -2.295823 2.738087 2.640307
13 2000-01-14 0.144209 -0.926810 4.087308 0.164138
14 2000-01-15 1.304811 0.625530 3.542700 -1.499446
15 2000-01-16 0.682026 0.552441 3.201389 -0.155674
16 2000-01-17 -1.065321 -0.339092 3.792634 -1.754946
17 2000-01-18 -1.268418 -3.512773 3.656896 -0.132124
18 2000-01-19 -1.936390 -1.069188 2.554887 0.413930
19 2000-01-20 -1.138678 -2.993870 3.094198 -0.956574
20 2000-01-21 -0.096721 -4.141108 3.341248 -0.436886
21 2000-01-22 -0.678763 -5.012663 0.823576 1.303377
22 2000-01-23 -2.498525 -6.075938 0.302908 -0.294204
23 2000-01-24 -0.937785 -6.768487 -0.078521 -0.165799
24 2000-01-25 -0.252338 -6.882860 0.545684 -1.108641
25 2000-01-26 -0.772653 -7.080129 0.764647 -3.041450
26 2000-01-27 -0.348831 -6.035375 0.507673 -2.522836
27 2000-01-28 1.751721 -5.756738 -0.469014 -2.825118
28 2000-01-29 0.799938 -4.559172 -0.086763 -3.243101
29 2000-01-30 1.149650 -4.433331 1.476689 -2.479956
... ... ... ... ... ...
970 2002-08-28 0.466756 38.780786 -6.743120 -15.206051
971 2002-08-29 1.903199 36.687258 -5.720742 -14.335689
972 2002-08-30 1.432801 37.186065 -6.274877 -14.879621
973 2002-08-31 2.218021 36.200395 -5.638178 -15.625389
974 2002-09-01 3.405318 36.987365 -4.441912 -14.162031
975 2002-09-02 3.854252 38.283545 -4.845949 -15.449302
976 2002-09-03 4.328033 37.123804 -6.006344 -14.158929
977 2002-09-04 5.219103 37.203486 -6.035182 -14.461389
978 2002-09-05 4.884547 38.192836 -6.447570 -13.541204
979 2002-09-06 5.817435 37.669998 -5.983732 -12.238303
980 2002-09-07 7.953334 37.503576 -4.783659 -12.092902
981 2002-09-08 8.575181 37.377622 -3.154240 -12.895208
982 2002-09-09 8.242240 38.155500 -3.285132 -13.483775
983 2002-09-10 9.699693 37.421024 -1.763105 -13.397276
984 2002-09-11 7.228027 38.070563 -1.250151 -14.455654
985 2002-09-12 7.276336 38.503057 -1.110047 -15.362900
986 2002-09-13 6.955453 37.038613 -0.571942 -14.527590
987 2002-09-14 6.232800 37.254276 -1.893050 -14.385401
988 2002-09-15 5.331802 36.671504 -1.782282 -13.645305
989 2002-09-16 5.825638 36.920733 -2.718137 -13.875984
990 2002-09-17 5.972137 36.785577 -3.923763 -13.281418
991 2002-09-18 5.154685 34.789858 -4.426825 -13.805227
992 2002-09-19 4.228035 34.400350 -4.431980 -14.520423
993 2002-09-20 3.047640 36.244469 -5.093584 -15.106412
994 2002-09-21 2.984361 38.095349 -5.560222 -16.584216
995 2002-09-22 3.187098 37.169742 -6.132250 -14.910771
996 2002-09-23 1.772470 36.399488 -6.257600 -13.668261
997 2002-09-24 1.886272 34.543307 -6.644915 -15.213739
998 2002-09-25 2.918438 33.722114 -5.736130 -14.919711
999 2002-09-26 2.608268 33.473087 -5.111581 -15.969800
1000 rows × 5 columns
'''
HDF5
HDFStores的读写
写入一个HDF5 Store
df.to_hdf('data/foo.h5', 'df')
从一个HDF5 Store读入
pd.read_hdf('data/foo.h5', 'df')
Excel
MS Excel的读写
写入一个Excel文件
df.to_excel('data/foo.xlsx', sheet_name='Sheet1')
从一个excel文件读入
pd.read_excel('data/foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])