[Python for Data Analysis] CH04 NumPy Basics -- Arrays and Vectorized Computation

Article link: https://blog.youkuaiyun.com/jaskson/article/details/50678049

NumPy Basics: Arrays and Vectorized Computation

NumPy, short for Numerical Python, is the fundamental package required for high
performance scientific computing and data analysis.

ndarray, a fast and space-efficient multidimensional array
Mathematical functions for fast operations on entire arrays of data without having to write loops
Tools for reading array data from disk and writing it back
Linear algebra, random number generation, and Fourier transform capabilities
Tools for integrating code written in C, C++, and Fortran

Basic setup

%matplotlib inline
from __future__ import division
from numpy.random import randn
import numpy as np
np.set_printoptions(precision=4, suppress=True)

NumPy ndarray: A Multidimensional Array Object

Basic usage

data = randn(2, 3)
data *10
data + data
data.shape
data.dtype

Creating ndarray

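array
array accepts any sequence-like object and creates a new NumPy array containing the passed data.
zeros and ones
zeros and ones create arrays of the given shape filled entirely with 0s and 1s, respectively.
empty
empty creates an array without initializing its values to any particular value.
arange
arange is the array-valued version of the built-in range.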

# array
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
# two-dimensional: nested sequences
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)

# zeros, ones
a1 = np.zeros(10)
a2 = np.ones((2, 3))

# empty (the contents are arbitrary, not guaranteed to be zeros)
np.empty(10)

# arange
np.arange(15)

Function	Description
array	Convert input data (list, tuple, array, or other sequence type) to an ndarray either by inferring a dtype or explicitly specifying a dtype. Copies the input data by default.
asarray	Convert input to ndarray, but do not copy if the input is already an ndarray
arange	Like the built-in range but returns an ndarray instead of a list.
ones, ones_like	Produce an array of all 1’s with the given shape and dtype. ones_like takes another array and produces a ones array of the same shape and dtype.
zeros, zeros_like	Like ones and ones_like but producing arrays of 0’s instead
empty, empty_like	Create new arrays by allocating new memory, but do not populate with any values like ones and zeros
eye, identity	Create a square N x N identity matrix (1’s on the diagonal and 0’s elsewhere)
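
A small sketch of the copy behaviour described in the table, using made-up variable names (base, copied, aliased): array copies its input by default, while asarray hands back the same ndarray when no conversion is needed.

```python
import numpy as np

base = np.arange(6, dtype=np.float64).reshape(2, 3)

copied = np.array(base)      # array copies its input by default
aliased = np.asarray(base)   # asarray returns the input itself when it is already an ndarray

base[0, 0] = 99.0            # visible through aliased, not through copied

ones_like_base = np.ones_like(base)   # same shape and dtype as base, filled with 1s
identity3 = np.eye(3)                 # 3 x 3 identity matrix
```

After the assignment, copied[0, 0] is still 0.0, whereas aliased is the very same object as base.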

Data Types for ndarrays

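The dtype mainly determines how much memory each element occupies. The number in a dtype name is its size in bits: a double-precision float (Python's float) takes 8 bytes, hence float64.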
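
As a quick check of the bit-size naming, the itemsize and nbytes attributes report the memory used per element and by the element data as a whole. The array names below are only for illustration.

```python
import numpy as np

f64 = np.zeros(10, dtype=np.float64)
i32 = np.zeros(10, dtype=np.int32)

f64.itemsize   # bytes per element: 64 bits / 8 = 8
i32.itemsize   # bytes per element: 32 bits / 8 = 4
f64.nbytes     # bytes used by the element data: 10 elements * 8 bytes
```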

arr1 = np.array([1,2,3],dtype = np.float64)
arr2 = np.array([1,2,3],dtype = np.int32)
arr1.dtype
arr2.dtype

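Casting between dtypes

A dtype can be assigned in three ways:
1. Inferred by default at creation
2. Specified explicitly at creation
3. Converted with arr.astype(dtype), where dtype is a given dtype or another array's arr2.dtype
astype always creates a new array (a copy of the data), whether or not the dtype actually changes.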


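# 1. dtype inferred by default at creation
arr = np.arange(1, 6)
# 2. dtype specified at creation (np.bytes_ is the current name of the old np.string_)
numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.bytes_)
# 3. changing the dtype
float_arr = arr.astype(np.float64)  # cast the integer array to float64
numeric_strings.astype(float)
# If a cast fails for some reason (e.g. a string that is not a number), a ValueError is raised.
# NumPy is smart enough to alias Python types (float, int, ...) to the equivalent dtypes.

# using another array's dtype
arr1 = np.arange(10)
arr2 = randn(2, 3)
arr1.astype(arr2.dtype), arr1.dtype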
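
The following sketch, with illustrative names, checks three properties of astype: it returns a new array even for an unchanged dtype, a float-to-integer cast truncates the decimal part, and a failed string-to-float cast raises ValueError.

```python
import numpy as np

ints = np.arange(5)
same = ints.astype(ints.dtype)   # dtype unchanged, but still a new array
same[0] = 100                    # does not touch ints

floats = np.array([3.7, -1.2, 0.5])
truncated = floats.astype(np.int32)   # decimal part is dropped (truncation toward zero)

try:
    np.array(['1.5', 'abc']).astype(float)   # 'abc' cannot be parsed as a float
    cast_failed = False
except ValueError:
    cast_failed = True
```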

Operations between Arrays and Scalars

As in R and MATLAB, the operators *, +, - and / all act element-wise.

arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr
# array-array operations
arr + arr
arr - arr
arr * arr
arr / arr

# array-scalar operations
1 / arr
arr ** 0.5

Basic Indexing and Slicing

One dimension

Array slices are views on the original array,
and any modifications to the view will be reflected in the source array.

arr = np.arange(10)
arr
arr[5]
arr[5:8]
arr[5:8] = 12
arr

arr_slice = arr[5:8]
arr_slice[1] = 12345
arr
arr_slice[:] = 64
arr

To get a copy of a slice instead of a view, copy it explicitly:

arr[5:8].copy()
arr_slice_copy = arr[5:8].copy()
arr_slice_copy[1] = 1
arr_slice_copy
arr
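
One way to confirm which objects share data is np.shares_memory. This is a minimal sketch; the names view and copy are illustrative.

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]          # basic slicing returns a view
copy = arr[5:8].copy()   # an explicit copy owns its own data

shares_view = np.shares_memory(arr, view)   # True
shares_copy = np.shares_memory(arr, copy)   # False

view[:] = -1   # writes through to arr
copy[:] = 99   # leaves arr untouched
```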

Higher Dimension

The elements at each index are no longer scalars but rather corresponding arrays

arr2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
arr2d[2]
arr2d[0][2],arr2d[0,2]

arr3d = np.array([[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]])
arr3d
arr3d.shape
arr3d[0]
arr3d[0] = 42
arr3d[1, 0]

Indexing with slices

Slicing a higher-dimensional array also returns a view of the original array.

arr[1:6]
arr2d
# a single slice selects along the rows (axis 0)
arr2d[:2]
# two slices select rows and columns respectively
arr2d[:2, 1:]
arr2d[1, :2]

Boolean Indexing

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = randn(7, 4)
names
data

names == 'Bob'
data[names == 'Bob'] 
data[names == 'Bob', 2:]
data[names == 'Bob', 3]

mask = (names == 'Bob') | (names == 'Will')
# the Python keywords and / or do not work with boolean arrays; use & and | instead
mask
data[mask]

data[data<0] = 0
data
data[names!='Joe'] = 7
data
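
A short sketch of two details of boolean indexing: negating a mask with ~, and the fact that selecting with a boolean array returns a copy rather than a view. The data array here is a deterministic stand-in for randn(7, 4), chosen only so the result is easy to check.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.arange(28).reshape(7, 4)   # stand-in for randn(7, 4)

not_bob = ~(names == 'Bob')                          # same mask as names != 'Bob'
bob_or_will = (names == 'Bob') | (names == 'Will')   # combine masks with |

selected = data[bob_or_will]   # boolean indexing returns a copy
selected[:] = 0                # so data itself is unchanged
```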

Fancy Indexing

Indexing using integer arrays

arr = np.empty((8, 4))
for i in range(8):
    arr[i] = i
arr

arr[[4, 3, 0, 6]]
arr[[-3,-5,-7]]

arr = np.arange(32).reshape((8, 4))
arr
arr[[1, 5, 7, 2], [0, 3, 1, 2]]
arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]
arr[np.ix_([1, 5, 7, 2], [0, 3, 1, 2])]
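
The difference between the last three expressions is easiest to see from the result shapes. In this sketch (rows and cols are just convenience names), passing two index lists picks individual elements, while np.ix_ selects the full rectangular block.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

pairs = arr[rows, cols]           # elements (1,0), (5,3), (7,1), (2,2): shape (4,)
block = arr[np.ix_(rows, cols)]   # every row/column combination: shape (4, 4)
chained = arr[rows][:, cols]      # same block, built in two indexing steps
```

The diagonal of block holds exactly the elements in pairs.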

Transposing arrays and swapping axes

arr = np.arange(15).reshape((3, 5))
arr
arr.T

arr = np.random.randn(6, 3)
np.dot(arr.T, arr)

transpose() and swapaxes() are not needed for now.
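
For reference only, a minimal sketch of what the two methods do on a 3-D array; the shapes are arbitrary.

```python
import numpy as np

arr = np.arange(24).reshape((2, 3, 4))

t = arr.transpose((1, 0, 2))   # reorder the axes: new shape (3, 2, 4)
s = arr.swapaxes(1, 2)         # swap the last two axes: new shape (2, 4, 3)
```

Both return views on the same data, so no copying takes place.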

Universal Functions: Element-wise Array Functions

Universal functions (ufuncs) are fast functions that operate element-wise on arrays.

arr = np.arange(10)
np.sqrt(arr)
np.exp(arr)

Taking multiple arrays as arguments (binary ufuncs)

x = randn(8)
y = randn(8)
x
y
np.maximum(x, y) # element-wise maximum

Returning multiple arrays

arr = randn(7) * 5
np.modf(arr)
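
With fixed inputs the two outputs of modf are easier to read than with random data. A small sketch, using hand-picked values:

```python
import numpy as np

values = np.array([2.5, -1.25, 0.0, 7.75])
frac, whole = np.modf(values)   # fractional parts first, integral parts second

# frac  -> 0.5, -0.25, 0.0, 0.75
# whole -> 2.0, -1.0, 0.0, 7.0
```

Both parts carry the sign of the input, so frac + whole reproduces the original values.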

Unary functions

Function	Description
abs, fabs	Compute the absolute value element-wise for integer, floating point, or complex values. Use fabs as a faster alternative for non-complex-valued data
sqrt	Compute the square root of each element. Equivalent to arr ** 0.5
square	Compute the square of each element. Equivalent to arr ** 2
exp	Compute the exponential e^x of each element
log, log10, log2, log1p	Natural logarithm (base e), log base 10, log base 2, and log(1 + x), respectively
sign	Compute the sign of each element: 1 (positive), 0 (zero), or -1 (negative)
ceil	Compute the ceiling of each element, i.e. the smallest integer greater than or equal to each element
floor	Compute the floor of each element, i.e. the largest integer less than or equal to each element
rint	Round elements to the nearest integer, preserving the dtype
modf	Return fractional and integral parts of array as separate arrays
isnan	Return boolean array indicating whether each value is NaN (Not a Number)
isfinite, isinf	Return boolean array indicating whether each element is finite (non-inf, non-NaN) or infinite, respectively
cos, cosh, sin, sinh, tan, tanh	Regular and hyperbolic trigonometric functions
arccos, arccosh, arcsin, arcsinh, arctan, arctanh	Inverse trigonometric functions
logical_not	Compute truth value of not x element-wise. Equivalent to ~arr for boolean arrays.

Binary functions

Function	Description
add	Add corresponding elements in arrays
subtract	Subtract elements in second array from first array
multiply	Multiply array elements
divide, floor_divide	Divide or floor divide (truncating the remainder)
power	Raise elements in first array to powers indicated in second array
maximum, fmax	Element-wise maximum. fmax ignores NaN
minimum, fmin	Element-wise minimum. fmin ignores NaN
mod	Element-wise modulus (remainder of division)
copysign	Copy sign of values in second argument to values in first argument
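
The NaN handling that separates maximum from fmax can be sketched with two small hand-made arrays:

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0])
b = np.array([2.0, 5.0, np.nan])

propagating = np.maximum(a, b)   # NaN wherever either input is NaN
ignoring = np.fmax(a, b)         # NaN is ignored when the other value is a number
```

Here ignoring contains 2.0, 5.0, 3.0, while propagating is NaN in the last two positions.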

Data processing using arrays

Vectorization replaces explicit loops with array expressions, which is generally much faster.

Expressing conditional logic as array operations

pure python
result = [xv if cv else yv for xv, yv, cv in zip(x, y, c)]

numpy

result = np.where(c,x,y)
arr = randn(4, 4)
arr
np.where(arr > 0, 2, -2)
np.where(arr > 0, 2, arr) # set only positive values to 2
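
To make the comparison concrete, here is a small sketch with explicit inputs (xarr, yarr and cond are illustrative names) showing that the list comprehension and np.where pick the same values.

```python
import numpy as np

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

# pure Python: take x where cond is True, otherwise y
loop_result = [x if c else y for x, y, c in zip(xarr, yarr, cond)]

# vectorized equivalent
vector_result = np.where(cond, xarr, yarr)
```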

Mathematical and statistical methods

mean

arr = np.random.randn(5, 4) # normally-distributed data
arr.mean()
np.mean(arr)
arr.sum()

Aggregating along an axis: axis=0 gives one result per column, axis=1 gives one result per row.
```
arr.mean(axis=1)
arr.sum(0)
```
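
The axis convention is easy to mix up, so a tiny array with known values may help. This is only an illustrative sketch.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

col_sums = arr.sum(axis=0)     # collapse the rows: one value per column -> 5, 7, 9
row_means = arr.mean(axis=1)   # collapse the columns: one value per row -> 2.0, 5.0
```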

cumsum, cumprod

arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
arr.cumsum(0)
arr.cumprod(1)

Method	Description
sum	Sum of all the elements in the array or along an axis. Zero-length arrays have sum 0.
mean	Arithmetic mean. Zero-length arrays have NaN mean.
std, var	Standard deviation and variance, respectively, with optional degrees of freedom adjustment (default denominator n).
min, max	Minimum and maximum.
argmin, argmax	Indices of minimum and maximum elements, respectively.
cumsum	Cumulative sum of elements starting from 0
cumprod	Cumulative product of elements starting from 1
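
The "degrees of freedom adjustment" mentioned for std and var is the ddof argument. A minimal sketch with a hand-picked sample whose population standard deviation is exactly 2:

```python
import numpy as np

sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_std = sample.std()            # denominator n (the default)
sample_std = sample.std(ddof=1)   # denominator n - 1
max_index = sample.argmax()       # index of the largest element
```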

Methods for boolean arrays

Counting positive values

arr = randn(100)
(arr > 0).sum() # Number of positive values

any tests whether at least one value is True; all tests whether every value is True

bools = np.array([False, False, True, False])
bools.any()
bools.all()

Sorting

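arr.sort() sorts the array in place
```
arr = randn(8)
arr
arr.sort()
arr
```
arr.sort(1) sorts a multidimensional array in place along the given axis
```
arr = randn(5, 3)
arr.sort(1)  # sort the values within each row
arr
```
np.sort() returns a sorted copy and leaves the input unchanged
```
np.sort(arr)
```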
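
The in-place versus copy distinction can be verified directly. A minimal sketch with fixed values (names are illustrative):

```python
import numpy as np

arr = np.array([3.0, 1.0, 2.0, 5.0, 4.0])

sorted_copy = np.sort(arr)   # new array; arr keeps its original order
unchanged = arr.tolist()     # snapshot taken before the in-place sort
result = arr.sort()          # sorts arr itself and returns None

grid = np.array([[3, 1, 2],
                 [9, 7, 8]])
grid.sort(1)                 # each row is sorted independently
```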

Unique and other set logic

np.unique(arr)

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
np.unique(names)
ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])
np.unique(ints)

np.in1d(arr1,arr2)

values = np.array([6, 0, 0, 3, 2, 5, 6])
np.in1d(values, [2, 3, 6])

Method	Description
unique(x)	Compute the sorted, unique elements in x
intersect1d(x, y)	Compute the sorted, common elements in x and y
union1d(x, y)	Compute the sorted union of elements
in1d(x, y)	Compute a boolean array indicating whether each element of x is contained in y
setdiff1d(x, y)	Set difference, elements in x that are not in y
setxor1d(x, y)	Set symmetric differences; elements that are in either of the arrays, but not both
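
A sketch of the remaining set functions on two small arrays. Note one substitution: np.isin is used here in place of in1d for the membership test, since it is the newer spelling of the same operation; that choice is mine, not the original notes'.

```python
import numpy as np

x = np.array([6, 0, 0, 3, 2, 5, 6])
y = np.array([2, 3, 6, 7])

common = np.intersect1d(x, y)   # 2, 3, 6
either = np.union1d(x, y)       # 0, 2, 3, 5, 6, 7
only_x = np.setdiff1d(x, y)     # 0, 5
one_side = np.setxor1d(x, y)    # 0, 5, 7
member = np.isin(x, y)          # element-wise: is each x value contained in y?
```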

File input and output with arrays

Storing arrays on disk in binary format

arr = np.arange(10)
np.save('some_array', arr)
np.load('some_array.npy')

np.savez('array_archive.npz', a=arr, b=arr)
arch = np.load('array_archive.npz')
arch['b'] #dict-like

Saving and loading text files

In practice, the read_csv and read_table functions in pandas are more commonly used for this.

arr = np.loadtxt('array_ex.txt', delimiter=',')
arr
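
Since array_ex.txt may not be at hand, the sketch below first writes a small comma-separated file with np.savetxt and then reads it back. The file contents are made up for illustration.

```python
import os
import tempfile

import numpy as np

table = np.array([[0.5, 1.25],
                  [-2.0, 3.0]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'array_ex.txt')
    np.savetxt(path, table, delimiter=',')       # write as comma-separated text
    restored = np.loadtxt(path, delimiter=',')   # read it back into a 2-D array
```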

Linear algebra

1. Matrix multiplication (the equivalent of A %*% B in R)
```python
x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1, 7], [8, 9]])
x
y
x.dot(y) # equivalently np.dot(x, y)
```

2. Inverse and QR decomposition
```python
from numpy.linalg import inv, qr
X = randn(5, 5)
mat = X.T.dot(X)
inv(mat)
mat.dot(inv(mat))
q, r = qr(mat)
r
```

Function	Description
diag	Return the diagonal (or off-diagonal) elements of a square matrix as a 1D array, or convert a 1D array into a square matrix with zeros on the off-diagonal
dot	Matrix multiplication
trace	Compute the sum of the diagonal elements
det	Compute the matrix determinant
eig	Compute the eigenvalues and eigenvectors of a square matrix
inv	Compute the inverse of a square matrix
pinv	Compute the Moore-Penrose pseudo-inverse of a matrix
qr	Compute the QR decomposition
svd	Compute the singular value decomposition (SVD)
solve	Solve the linear system Ax = b for x, where A is a square matrix
lstsq	Compute the least-squares solution to y = Xb

Random number generation

samples = np.random.normal(size=(4, 4))
samples

from random import normalvariate
N = 1000000
%timeit samples = [normalvariate(0, 1) for _ in range(N)]  # range replaces Python 2's xrange
%timeit np.random.normal(size=N)

Function	Description
seed	Seed the random number generator
permutation	Return a random permutation of a sequence, or return a permuted range
shuffle	Randomly permute a sequence in place
rand	Draw samples from a uniform distribution
randint	Draw random integers from a given low-to-high range
randn	Draw samples from a normal distribution with mean 0 and standard deviation 1 (MATLAB-like interface)
binomial	Draw samples from a binomial distribution
normal	Draw samples from a normal (Gaussian) distribution
beta	Draw samples from a beta distribution
chisquare	Draw samples from a chi-square distribution
gamma	Draw samples from a gamma distribution
uniform	Draw samples from a uniform [0, 1) distribution

Example: Random Walks

pure python

import random
position = 0
walk = [position]
steps = 1000
for i in range(steps):  # range replaces Python 2's xrange
    step = 1 if random.randint(0, 1) else -1
    position += step
    walk.append(position)

numpy

np.random.seed(12345)
nsteps = 1000
draws = np.random.randint(0, 2, size=nsteps)
steps = np.where(draws > 0, 1, -1)
walk = steps.cumsum()

A first look at the random walk

walk.min()
walk.max()

Find the first step at which the walk reaches 10 or -10

(np.abs(walk)>=10).argmax()
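
One caveat with this idiom: argmax on a boolean array returns 0 both when the first element is True and when no element is True at all. The sketch below uses a short hand-made walk (not a random one) to show why an any() check is worth adding.

```python
import numpy as np

walk = np.array([1, 2, 1, 2, 3, 2, 3, 4, 3, 2])   # hand-made walk for illustration

crossed3 = np.abs(walk) >= 3
first_cross3 = crossed3.argmax()   # index of the first True -> 4

crossed10 = np.abs(walk) >= 10
never = crossed10.argmax()         # 0, although the walk never reaches 10
reached10 = crossed10.any()        # False: check this before trusting argmax
```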

Simulating many random walks at once

nwalks = 5000
nsteps = 1000
draws = np.random.randint(0, 2, size=(nwalks, nsteps)) # 0 or 1
steps = np.where(draws > 0, 1, -1)
walks = steps.cumsum(1) # cumulative sum along each row (one walk per row)
walks

A first look at the random walks

walks.max()
walks.min()

hits30 = (np.abs(walks) >= 30).any(1)
hits30
hits30.sum() # Number that hit 30 or -30

crossing_times = (np.abs(walks[hits30]) >= 30).argmax(1)
crossing_times.mean()

Random walks with normally distributed steps

steps = np.random.normal(loc=0, scale=0.25,
                         size=(nwalks, nsteps))
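
The notes stop at generating the steps. One plausible continuation, assuming the same cumsum and crossing-time analysis as for the coin-flip walks, is sketched here; the seed is arbitrary.

```python
import numpy as np

np.random.seed(12345)
nwalks, nsteps = 5000, 1000

steps = np.random.normal(loc=0, scale=0.25, size=(nwalks, nsteps))
walks = steps.cumsum(1)   # one walk per row

hits30 = (np.abs(walks) >= 30).any(1)                      # which walks ever reach +/-30
crossing_times = (np.abs(walks[hits30]) >= 30).argmax(1)   # first crossing step for those walks
```

Because walks[hits30] only keeps walks that actually cross, argmax is safe to use here.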