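Is this chapter mainly about machine learning?
Well, it is called machine learning, but it really only covers how to use numpy, pandas, scipy, and matplotlib; there are no instructions for higher-level libraries such as tensorflow, caffe, or keras.
The result of np.logspace(0,1) is not just two lines; the book has obviously elided most of it. The real result is: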
array([ 1. , 1.04811313, 1.09854114, 1.1513954 , 1.20679264,
1.26485522, 1.32571137, 1.38949549, 1.45634848, 1.52641797,
1.59985872, 1.67683294, 1.75751062, 1.84206997, 1.93069773,
2.02358965, 2.12095089, 2.22299648, 2.32995181, 2.44205309,
2.55954792, 2.6826958 , 2.8117687 , 2.9470517 , 3.0888436 ,
3.23745754, 3.39322177, 3.55648031, 3.72759372, 3.90693994,
4.09491506, 4.29193426, 4.49843267, 4.71486636, 4.94171336,
5.17947468, 5.42867544, 5.68986603, 5.96362332, 6.25055193,
6.55128557, 6.86648845, 7.19685673, 7.54312006, 7.90604321,
8.28642773, 8.68511374, 9.10298178, 9.54095476, 10. ])
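To check the point about the elided output, here is a minimal sketch that counts the samples. It relies only on the documented default of num=50 for np.logspace; the variable name values is my own.

```python
import numpy as np

# np.logspace(start, stop) returns num=50 samples by default,
# evenly spaced on a log scale between 10**start and 10**stop.
values = np.logspace(0, 1)

print(values.size)            # how many samples were generated
print(values[0], values[-1])  # the two endpoints, 10**0 and 10**1
print(values)                 # the full array, as listed above
```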
8.1.3
The assignment to B is clearly wrong; it should be: B = np.array([n for n in range(4)])
Compare:
np.repeat(A,2)
array([1, 1, 2, 2, 3, 3, 4, 4])
np.tile(A, 2)
array([[1, 2, 1, 2],
[3, 4, 3, 4]])
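The comparison above can be reproduced with the short sketch below. The value of A is an assumption on my part, inferred from the two outputs shown; the post does not print it.

```python
import numpy as np

# Assumed value of A, inferred from the repeat/tile outputs above.
A = np.array([[1, 2], [3, 4]])
# The corrected assignment for B.
B = np.array([n for n in range(4)])

# repeat without an axis flattens A, then repeats each element twice.
repeated = np.repeat(A, 2)
# tile repeats the whole array twice along the last axis.
tiled = np.tile(A, 2)

print(B)
print(repeated)
print(tiled)
```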
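Reference page: https://docs.scipy.org/doc/scipy/reference/linalg.html
As that page shows, linalg does not provide a dot method, so there must be a typo here. The author was probably writing too fast, because the code is a mess: one equals sign is enough, and the initial call to solve on X is bound to fail because solve is never defined. Enough complaining; moving on.
Also, since no random seed is set, the printed result cannot be checked by rerunning the code.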
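A rough sketch of how the example could be made repeatable, assuming the book's intent was to solve a random linear system. I use np.linalg.solve and np.dot from NumPy here rather than the book's code; the seed value 0 and the 3x3 size are arbitrary choices of mine.

```python
import numpy as np

# A fixed seed makes the random matrix identical on every run.
rng = np.random.default_rng(0)
A = rng.random((3, 3))
b = rng.random(3)

# Solve A x = b, then verify with np.dot (dot lives in NumPy, not scipy.linalg).
x = np.linalg.solve(A, b)
residual = np.dot(A, x) - b

print(x)
print(np.allclose(residual, 0))
```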
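In the sparse matrix part, there was probably a from numpy import * earlier, because some of the functions clearly come from numpy but are called without the library prefix. Just keep that in mind.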
from scipy import sparse as sp
import numpy as np
A = np.array([[1,0,0],[0,2,0],[0,0,3]])
C = sp.csr_matrix(A)
print(A)
print(C)
print(C.toarray())
print(C * C.todense())
print(C.dot(C).todense())  # use the sparse matrix's own dot; np.dot is not sparse-aware
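As for the scipy optimization methods that follow, I should study them carefully later; quite a few are the same algorithms as in matlab. Worth practicing.
pandas: a major error in the data reference!
The example uses iris.data from the start, but no link is given anywhere, so here is the real one: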
https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
How careless were the original author, the translator, and the proofreader?
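For reference, a minimal sketch of loading the file with the column names used in the outputs of this post. The file has no header row, so the names are passed explicitly. To keep the sketch runnable offline, three rows copied from the head of the file stand in for the download; with network access the URL above can be passed to read_csv instead.

```python
import io
import pandas as pd

# Column names as they appear in the outputs of this post.
columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'Cat']

# Stand-in for the downloaded file: the first three rows of iris.data.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

# header=None because the file starts directly with data rows.
data = pd.read_csv(sample, header=None, names=columns)

print(data.describe())
print(data['Cat'].value_counts())
```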
Result of data.describe():
       sepal length  sepal width  petal length  petal width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
It is because of this result that the sepal_len_cnt we print later gives the following output:
5.0 10
6.3 9
5.1 9
6.7 8
5.7 8
5.5 7
5.8 7
6.4 7
6.0 6
4.9 6
6.1 6
5.4 6
5.6 6
6.5 5
4.8 5
7.7 4
6.9 4
5.2 4
6.2 4
4.6 4
7.2 3
6.8 3
4.4 3
5.9 3
6.6 2
4.7 2
7.6 1
7.4 1
4.3 1
7.9 1
7.3 1
7.0 1
4.5 1
5.3 1
7.1 1
Name: sepal length, dtype: int64
The later one is clearly wrong as well: it should not be data['Iris-setosa'] but data['Cat'].
The result is:
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
Name: Cat, dtype: int64
The result of sntsosa[:5] is:
sepal length sepal width petal length petal width Cat
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
Below are the results of the Dow Jones example, excluding head and resample:
1453438639
7.621739999999999
DatetimeIndex(['2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
'2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
'2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
'2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
'2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
'2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
'2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
'2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
'2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
'2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
'2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
'2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
'2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
'2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
'2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
'2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
'2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
'2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
'2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
'2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
'2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
'2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
'2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
'2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
'2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28'],
dtype='datetime64[ns]', name='date', freq=None)
Int64Index([ 7, 14, 21, 28, 4, 11, 18, 25, 4, 11, 18, 25, 7, 14, 21, 28, 4,
11, 18, 25, 4, 11, 18, 25, 7, 14, 21, 28, 4, 11, 18, 25, 4, 11,
18, 25, 7, 14, 21, 28, 4, 11, 18, 25, 4, 11, 18, 25, 7, 14, 21,
28, 4, 11, 18, 25, 4, 11, 18, 25, 7, 14, 21, 28, 4, 11, 18, 25,
4, 11, 18, 25, 7, 14, 21, 28, 4, 11, 18, 25, 4, 11, 18, 25, 7,
14, 21, 28, 4, 11, 18, 25, 4, 11, 18, 25, 7, 14, 21, 28],
dtype='int64', name='date')
Int64Index([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3,
3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2,
3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2,
2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1,
2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1],
dtype='int64', name='date')
Int64Index([2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011],
dtype='int64', name='date')
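The day, month, and year listings above come from attributes of the DatetimeIndex. A small sketch on the first four dates of that index; the name idx is mine, and newer pandas releases print a plain Index rather than Int64Index.

```python
import pandas as pd

# The first four weekly dates from the index printed above.
idx = pd.DatetimeIndex(
    ['2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28'], name='date'
)

# Each attribute returns one integer per timestamp.
print(idx.day)
print(idx.month)
print(idx.year)
```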
Running resample triggers a deprecation warning: how in .resample() is deprecated
Reference page: http://pandas.pydata.org/pandas-docs/version/0.18.0/whatsnew.html#resample-api
Simply change the code to: print(stockdata.resample('M').sum())
This gives the following result:
quarter volume percent_change_price ... percent_change_next_weeks_price days_to_next_dividend percent_return_next_dividend
date ...
2011-01-31 36 6779916771 19.637287 ... 34.302458 2618 18.519712
2011-02-28 32 5713027799 28.553732 ... -4.583387 1637 13.819996
2011-03-31 32 5535580114 -7.317345 ... 3.263918 1560 13.930990
[3 rows x 8 columns]
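A minimal sketch of the resample-then-aggregate style on made-up data: twelve weekly rows on the same dates as the example, with a volume of 1 each. I use the 'MS' (month start) rule because the spelling of the month-end alias differs between pandas versions ('M' as in this post, 'ME' in newer releases), so the row labels here are the first day of each month rather than the last.

```python
import pandas as pd

# Twelve Fridays starting 2011-01-07, matching the dates in the example.
dates = pd.date_range('2011-01-07', periods=12, freq='W-FRI', name='date')
# Made-up data: a volume of 1 per week, so monthly sums are easy to verify.
weekly = pd.DataFrame({'volume': [1] * 12}, index=dates)

# Call the aggregation as a method instead of passing how='sum'.
monthly = weekly.resample('MS').sum()
print(monthly)
```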
The later steps raise a similar warning: FutureWarning: convert_objects is deprecated. To re-infer data dtypes for object columns, use Series.infer_objects()
Change the three lines here to:
stockdata_new.open = pd.to_numeric(stockdata_new.open.str.replace('$', "", regex=False))
stockdata_new.close = pd.to_numeric(stockdata_new.close.str.replace('$', "", regex=False))
(stockdata_new.close - stockdata_new.open).infer_objects()
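The same cleanup on a tiny made-up frame, as a sketch. The two rows reuse open and close prices from the output in this post, with the leading "$" put back; the names prices and change are mine. Passing regex=False makes "$" a literal character rather than the regex end-of-string anchor.

```python
import pandas as pd

# Two rows with dollar-prefixed strings, as in the raw Dow Jones file.
prices = pd.DataFrame({
    'open': ['$15.82', '$16.71'],
    'close': ['$16.42', '$15.97'],
})

# Strip the literal "$" and convert each column to a numeric dtype.
for col in ['open', 'close']:
    prices[col] = pd.to_numeric(prices[col].str.replace('$', '', regex=False))

change = prices['close'] - prices['open']
print(prices.dtypes)
print(change)
```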
The last two head calls in this part of the code print:
date
2011-01-07 12.656
2011-01-14 13.368
2011-01-21 12.952
2011-01-28 12.696
2011-02-04 12.944
Name: newopen, dtype: float64
stock open high low close volume newopen
date
2011-01-07 AA 15.82 $16.72 $15.78 16.42 239655616 12.656
2011-01-14 AA 16.71 $16.71 $15.64 15.97 242963398 13.368
2011-01-21 AA 16.19 $16.38 $15.60 15.79 138428495 12.952
2011-01-28 AA 15.87 $16.63 $15.82 16.13 151379173 12.696
2011-02-04 AA 16.18 $17.39 $16.18 17.14 154387761 12.944
I have no idea why the following four lines are there; just delete them:
plt.subplot(2, 2, 3)
plt.plot(x, y, 'g--')
plt.subplot(2, 2, 4)
plt.plot(x, y, 'r-*')
This chapter produces four figures:

Two extra pieces of code appear later; they produce the following two figures:

