第一部分:http://blog.youkuaiyun.com/linxid/article/details/79249874
接上次介绍的内容,继续介绍pandas入门。
1. 合并
1.1 concat
pandas提供各种方便的工具,以便方便的将Series、Dataframe、Panel对象和各种集合,根据索引和相关的代数合并在一起,通过merge或join类型操作符。
列索引相同,直接进行合并即可,行索引并在一起。
In [73]: df = pd.DataFrame(np.random.randn(10, 4))
In [74]: df
Out[74]:
0 1 2 3
0 -0.548702 1.467327 -1.015962 -0.483075
1 1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952 0.991460 -0.919069 0.266046
3 -0.709661 1.669052 1.037882 -1.705775
4 -0.919854 -0.042379 1.247642 -0.009920
5 0.290213 0.495767 0.362949 1.548106
6 -1.131345 -0.089329 0.337863 -0.945867
7 -0.932132 1.956030 0.017587 -0.016692
8 -0.575247 0.254161 -1.143704 0.215897
9 1.193555 -0.077118 -0.408530 -0.862495
# break it into pieces
In [75]: pieces = [df[:3], df[3:7], df[7:]]
#数据被划分开
In [76]: pd.concat(pieces)
#根据索引将片段合并在一起
Out[76]:
0 1 2 3
0 -0.548702 1.467327 -1.015962 -0.483075
1 1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952 0.991460 -0.919069 0.266046
3 -0.709661 1.669052 1.037882 -1.705775
4 -0.919854 -0.042379 1.247642 -0.009920
5 0.290213 0.495767 0.362949 1.548106
6 -1.131345 -0.089329 0.337863 -0.945867
7 -0.932132 1.956030 0.017587 -0.016692
8 -0.575247 0.254161 -1.143704 0.215897
9 1.193555 -0.077118 -0.408530 -0.862495
1.2 Join
SQL风格的合并。依据列相同的索引,进行合并,相应值相同,则进行扩充。
In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
In [78]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
In [79]: left
Out[79]:
key lval
0 foo 1
1 foo 2
In [80]: right
Out[80]:
key rval
0 foo 4
1 foo 5
In [81]: pd.merge(left, right, on='key')
Out[81]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
另一个例子,是,行索引相同,对应的值进行相应的合并。
In [82]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
In [83]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
In [84]: left
Out[84]:
key lval
0 foo 1
1 bar 2
In [85]: right
Out[85]:
key rval
0 foo 4
1 bar 5
In [86]: pd.merge(left, right, on='key')
Out[86]:
key lval rval
0 foo 1 4
1 bar 2 5
在合并时,要注意,如何取并集,关于索引如何做处理。
1.3 拼接
将行拼接到一个DataFrame。
In [87]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
In [88]: df
Out[88]:
A B C D
0 1.346061 1.511763 1.627081 -0.990582
1 -0.441652 1.211526 0.268520 0.024580
2 -1.577585 0.396823 -0.105381 -0.532532
3 1.453749 1.208843 -0.080952 -0.264610
4 -0.727965 -0.589346 0.339969 -0.693205
5 -0.339355 0.593616 0.884345 1.591431
6 0.141809 0.220390 0.435589 0.192451
7 -0.096701 0.803351 1.715071 -0.708758
In [89]: s = df.iloc[3]
In [90]: df.append(s, ignore_index=True)
Out[90]:
A B C D
0 1.346061 1.511763 1.627081 -0.990582
1 -0.441652 1.211526 0.268520 0.024580
2 -1.577585 0.396823 -0.105381 -0.532532
3 1.453749 1.208843 -0.080952 -0.264610
4 -0.727965 -0.589346 0.339969 -0.693205
5 -0.339355 0.593616 0.884345 1.591431
6 0.141809 0.220390 0.435589 0.192451
7 -0.096701 0.803351 1.715071 -0.708758
8 1.453749 1.208843 -0.080952 -0.264610
2. Grouping
通过groupby,我们可以实现下列步骤:
- 基于某些准则将数据划分成组;
- 对每一个组独立应用一个函数;
- 将结果合并成一个数据结构
创建数据:
In [91]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
....: 'foo', 'bar', 'foo', 'foo'],
....: 'B' : ['one', 'one', 'two', 'three',
....: 'two', 'two', 'one', 'three'],
....: 'C' : np.random.randn(8),
....: 'D' : np.random.randn(8)})
....:
In [92]: df
Out[92]:
A B C D
0 foo one -1.202872 -0.055224
1 bar one -1.814470 2.395985
2 foo two 1.018601 1.552825
3 bar three -0.595447 0.166599
4 foo two 1.395433 0.047609
5 bar two -0.392670 -0.136473
6 foo one 0.007207 -0.561757
7 foo three 1.928123 -1.623033
2.1:df.groupby('A').sum()
对数据分组然后应用sum函数:
In [93]: df.groupby('A').sum()
Out[93]:
C D
A
bar -2.802588 2.42611
foo 3.146492 -0.63958
2.2:df.groupby(['A','B']).sum()
对多列分组,形成一个等级索引,然后应用函数,也就是函数应用准则,更加细分,分组再进一层。
In [94]: df.groupby(['A','B']).sum()
Out[94]:
C D
A B
bar one -1.814470 2.395985
three -0.595447 0.166599
two -0.392670 -0.136473
foo one -1.195665 -0.616981
three 1.928123 -1.623033
two 2.414034 1.600434
3.时间序列(Time Series)
pandas有简单但强大并且有效函数用与,试行重采样,在频率变换期间(例如:将每秒的数据,转换成每5分钟的数据)。这在金融应用中是非常常见的,但并仅限于此。
#生成一个时间序列,默认是间隔是day(天),freq为间隔单位,periods为间隔数,'1/1/2012'是起始日期
In [108]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
#建立一个随机数Series对象,和rng长度相同,且rng为其索引
In [109]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
#对ts对象进行重采样,也就是按不同的频率(周期)进行采样。采样后求和
In [110]: ts.resample('5Min').sum()
Out[110]:
2012-01-01 25083
Freq: 5T, dtype: int64
- 时区表示:
In [111]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
#建立一个以rng为索引的序列
In [112]: ts = pd.Series(np.random.randn(len(rng)), rng)
In [113]: ts
Out[113]:
2012-03-06 0.464000
2012-03-07 0.227371
2012-03-08 -0.496922
2012-03-09 0.306389
2012-03-10 -2.290613
Freq: D, dtype: float64
#确定时区
In [114]: ts_utc = ts.tz_localize('UTC')
In [115]: ts_utc
Out[115]:
2012-03-06 00:00:00+00:00 0.464000
2012-03-07 00:00:00+00:00 0.227371
2012-03-08 00:00:00+00:00 -0.496922
2012-03-09 00:00:00+00:00 0.306389
2012-03-10 00:00:00+00:00 -2.290613
Freq: D, dtype: float64
- 换到另一个时区:
In [116]: ts_utc.tz_convert('US/Eastern')
Out[116]:
2012-03-05 19:00:00-05:00 0.464000
2012-03-06 19:00:00-05:00 0.227371
2012-03-07 19:00:00-05:00 -0.496922
2012-03-08 19:00:00-05:00 0.306389
2012-03-09 19:00:00-05:00 -2.290613
Freq: D, dtype: float64
- 转换时间表示间隔:
In [117]: rng = pd.date_range('1/1/2012', periods=5, freq='M')
In [118]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [119]: ts
Out[119]:
2012-01-31 -1.134623
2012-02-29 -1.561819
2012-03-31 -0.260838
2012-04-30 0.281957
2012-05-31 1.523962
Freq: M, dtype: float64
In [120]: ps = ts.to_period()
In [121]: ps
Out[121]:
2012-01 -1.134623
2012-02 -1.561819
2012-03 -0.260838
2012-04 0.281957
2012-05 1.523962
Freq: M, dtype: float64
In [122]: ps.to_timestamp()
Out[122]:
2012-01-01 -1.134623
2012-02-01 -1.561819
2012-03-01 -0.260838
2012-04-01 0.281957
2012-05-01 1.523962
Freq: MS, dtype: float64
4. 绘图(plotting)
简单实例:
* Series对象:
import matplotlib.pyplot as plt
#此语句可以保证,图片自己出现在Jupyter Notebook页面中
%matplotlib inline
ts = pd.Series(np.random.randn(20),index=pd.date_range('2/7/2018'.periods=20))
ts = ts.cumsum()
ts.plot()
* DataFrame对象:
DataFrame对象实现起来会更方便,可以对每一列,画图。
In [138]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
In [139]: df = df.cumsum()
In [140]: plt.figure(); df.plot(); plt.legend(loc='best')
5. 读入读出数据:
CSV
因为大部分情况下我们都是读取.csv数据,数据分析结果要求也是.csv数据。
* df.to_csv(‘file_name’):
将会创建一个.csv文档用于存储df数据
In [141]: df.to_csv('foo.csv')
- df.read_csv(‘file_name’):
In [142]: pd.read_csv('foo.csv')
Out[142]:
Unnamed: 0 A B C D
0 2000-01-01 0.266457 -0.399641 -0.219582 1.186860
1 2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
2 2000-01-03 -1.734933 0.530468 2.060811 -0.515536
3 2000-01-04 -1.555121 1.452620 0.239859 -1.156896
4 2000-01-05 0.578117 0.511371 0.103552 -2.428202
5 2000-01-06 0.478344 0.449933 -0.741620 -1.962409
6 2000-01-07 1.235339 -0.091757 -1.543861 -1.084753
.. ... ... ... ... ...
993 2002-09-20 -10.628548 -9.153563 -7.883146 28.313940
994 2002-09-21 -10.390377 -8.727491 -6.399645 30.914107
995 2002-09-22 -8.985362 -8.485624 -4.669462 31.367740
996 2002-09-23 -9.558560 -8.781216 -4.499815 30.518439
997 2002-09-24 -9.902058 -9.340490 -4.386639 30.105593
998 2002-09-25 -10.216020 -9.480682 -3.933802 29.758560
999 2002-09-26 -11.856774 -10.671012 -3.216025 29.369368
[1000 rows x 5 columns]