Python Data Processing Review

Snowbooo

Last modified 2024-11-20 12:00:44

First published 2024-10-20 15:14:34

Article link: https://blog.youkuaiyun.com/inn40/article/details/142894546

For my own review only.

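1. Python data processing

1.1 Tuples (tuple)

A tuple is a fixed-length, immutable Python sequence object.

Once a tuple has been created, the objects stored in its slots cannot be replaced.

However, if one of the objects in the tuple is itself mutable (a list, for example), that object can still be modified in place.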
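A minimal sketch of the distinction above, using a made-up tuple: assigning to a slot fails, while a list stored inside the tuple can still be changed in place.

```python
tup = ('foo', [1, 2], True)

# Rebinding a slot of the tuple is not allowed.
try:
    tup[2] = False
except TypeError as exc:
    error = str(exc)

# The list stored in slot 1 is mutable, so it can be changed in place.
tup[1].append(3)
print(tup)  # ('foo', [1, 2, 3], True)
```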

1.1.1 Tuples: unpacking

In [18]: tup = (4, 5), 6, 7, 8

In [19]: (a, b), *rest= tup
In [20]: b, *rest

Out[20]: (5, 6, 7, 8)

# Unpacking assigns the elements of a tuple to a tuple-like pattern of variables

In [21]: a, b = 1, 2

In [24]: b, a = a, b
In [25]: a

Out[25]: 2

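# Swapping variable names is also a form of unpacking

In [27]: seq = [(1, 2, 3), (4, 5, 6)]
In [28]: for a, b, c in seq:
   ....:     print('a={0}, b={1}, c={2}'.format(a, b, c))

a=1, b=2, c=3
a=4, b=5, c=6

# Iterating over a sequence of tuples or lists also unpacks each element

1.2 Lists (list)

1.2.1 Lists: sorting with sort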

In [61]: a = [7, 2, 5, 1, 3]
In [62]: a.sort()
In [63]: a

Out[63]: [1, 2, 3, 5, 7]

In [64]: b = ['saw', 'small', 'He', 'foxes', 'six']
In [65]: b.sort(key=len)
In [66]: b

Out[66]: ['He', 'saw', 'six', 'small', 'foxes']

1.2.2 Lists: enumerate

The enumerate function returns a sequence of (i, value) tuples:

some_list = ['foo', 'bar', 'baz']

mapping = {}

for i, v in enumerate(some_list):
    mapping[v] = i

mapping

Out: {'foo': 0, 'bar': 1, 'baz': 2}

1.2.3 Lists: zip

zip pairs up the elements of several lists, tuples, or other sequences into a sequence of tuples (in Python 3 it returns a lazy iterator, so wrap it in list to see the result):

In [89]: seq1 = ['foo', 'bar', 'baz']
In [90]: seq2 = ['one', 'two', 'three']

In [91]: zipped = zip(seq1, seq2)
In [92]: list(zipped)

Out[92]: [('foo', 'one'), ('bar', 'two'), ('baz', 'three')]

zip accepts any number of sequences; the number of tuples it produces is determined by the shortest sequence:

In [93]: seq3 = [False, True]
In [94]: list(zip(seq1, seq2, seq3))

Out[94]: [('foo', 'one', False), ('bar', 'two', True)]

zip can also be used to "unzip" a sequence, that is, to turn a list of rows into a list of columns:

In [96]: pitchers = [('Nolan', 'Ryan'),
   ....:             ('Roger', 'Clemens'),
   ....:             ('Schilling', 'Curt')]
In [97]: first_names, last_names = zip(*pitchers)
In [98]: first_names

Out[98]: ('Nolan', 'Roger', 'Schilling')
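As a small illustration of the row-to-column idea, the sketch below transposes a made-up list of rows with zip(*rows) and then transposes it back; the data is invented for the example.

```python
rows = [(1, 'a', True), (2, 'b', False), (3, 'c', True)]

# zip(*rows) passes each row as a separate argument, so the i-th output
# tuple collects the i-th element of every row (a transpose).
columns = list(zip(*rows))
print(columns)  # [(1, 2, 3), ('a', 'b', 'c'), (True, False, True)]

# Zipping the columns again restores the original rows.
restored = list(zip(*columns))
print(restored == rows)  # True
```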

When an index is needed as well, combine zip with enumerate:

In [89]: seq1 = ['foo', 'bar', 'baz']
In [90]: seq2 = ['one', 'two', 'three']

In [95]: for i, (a, b) in enumerate(zip(seq1, seq2)):
   ....:     print('{0}: {1}, {2}'.format(i, a, b))

0: foo, one

1: bar, two

2: baz, three

1.3 Python syntax: comprehensions

1.3.1 Set comprehensions

set_comp = {expr for value in collection if condition}

For example, collecting the distinct string lengths that occur in a list:

In [154]: strings = ['a', 'as', 'bat', 'car', 'dove', 'python']

In [156]: unique_lengths = {len(x) for x in strings}
In [157]: unique_lengths

Out[157]: {1, 2, 3, 4, 6}

1.3.2 List comprehensions:

[expr for val in collection if condition]

For example, upper-casing every string that is longer than two characters:

In [154]: strings = ['a', 'as', 'bat', 'car', 'dove', 'python']

In [155]: [x.upper() for x in strings if len(x) > 2]

Out[155]: ['BAT', 'CAR', 'DOVE', 'PYTHON']

1.3.3 Dict comprehensions:

dict_comp = {key-expr : value-expr for value in collection if condition}

For example, building a lookup table that maps each string to its position in the list:

In [154]: strings = ['a', 'as', 'bat', 'car', 'dove', 'python']

In [159]: loc_mapping = {val : index for index, val in enumerate(strings)}
In [160]: loc_mapping

Out[160]: {'a': 0, 'as': 1, 'bat': 2, 'car': 3, 'dove': 4, 'python': 5}

1.3.4 Nested comprehensions:

For example, flattening a list of integer tuples into a flat list of integers. With explicit loops:

In [164]: some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

flattened = []
for tup in some_tuples:
    for x in tup:
        flattened.append(x)

The same thing as a nested comprehension (the for clauses appear in the same order as the nested loops):

In [165]: flattened = [x for tup in some_tuples for x in tup]
In [166]: flattened

Out[166]: [1, 2, 3, 4, 5, 6, 7, 8, 9]
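Two small variations on the flattening example may help, sketched here on the same some_tuples data: a filter goes after the last for clause, and nesting a comprehension inside another keeps the nested structure instead of flattening it.

```python
some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# Flatten and keep only the even numbers; the if clause comes last.
evens = [x for tup in some_tuples for x in tup if x % 2 == 0]
print(evens)  # [2, 4, 6, 8]

# A comprehension inside a comprehension produces a list of lists.
doubled = [[x * 2 for x in tup] for tup in some_tuples]
print(doubled)  # [[2, 4, 6], [8, 10, 12], [14, 16, 18]]
```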

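1.3.5 Advantages of functions:

They reduce the amount of repeated code that has to be written and make programs easier to read.

They make code easier to test, reuse, and maintain, because each piece of logic lives in one place.

1.4 Python file operations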

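2. NumPy basics

2.1 Creating NumPy arrays

import numpy as np

Syntax	Function
numpy.array(list/tuple, [dtype=numpy.float32])	Creates an ndarray from a list, tuple, or similar object; if dtype is omitted it is inferred from the data (integers give an integer array)
numpy.arange(start, stop, step, dtype)	Creates evenly spaced values with the given step over [start, stop) and returns an ndarray
numpy.linspace(start, stop, num, dtype)	Creates num evenly spaced values between start and stop (stop is included by default) and returns an ndarray

arange is the array version of Python's built-in range function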

In [32]: np.arange(10)

Out[32]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [32]: np.arange(10).reshape((2,5))

Out[32]:

array([[0, 1, 2, 3, 4],

       [5, 6, 7, 8, 9]])

linspace is the "count" counterpart of arange: instead of a step, you give the number of points

In [32]: np.linspace(1,10,4)

Out[32]: array([ 1., 4., 7., 10.])

In [32]: np.linspace(1,10,4,endpoint=False)

Out[32]: array([1. , 3.25, 5.5 , 7.75])
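The relationship between the two functions can be checked numerically. The sketch below reuses the values from the examples above; the variable names are only for illustration.

```python
import numpy as np

start, stop, num = 1, 10, 4

# With the endpoint included there are num - 1 intervals.
with_end = np.linspace(start, stop, num)
step_with_end = (stop - start) / (num - 1)   # 3.0

# With endpoint=False there are num intervals and stop is left out.
without_end = np.linspace(start, stop, num, endpoint=False)
step_without_end = (stop - start) / num      # 2.25

# The second case matches arange with the same step.
same = np.allclose(without_end, np.arange(start, stop, step_without_end))
print(with_end, without_end, same)
```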

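2.1.1 Creating ndarrays: attributes

Attribute	Description
ndarray.ndim	Rank, i.e. the number of axes (dimensions)
ndarray.shape	Dimensions of the array; for a matrix, (n rows, m columns)
ndarray.size	Total number of elements, equal to the product of the entries of .shape (n*m for a matrix)
ndarray.dtype	Element type of the ndarray

For example, let a be the following array:

array([[[11, 12],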

        [13, 14]],

       [[21, 22],

        [23, 24]],

       [[31, 32],

        [33, 34]]])

In [17]: a.ndim

Out[17]: 3

In [17]: a.shape

Out[17]: (3, 2, 2)

In [17]: a.shape[0]

Out[17]: 3

In [17]: a.size

Out[17]: 12

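2.1.2 Creating ndarrays: concatenate

Syntax	Parameters	Function
numpy.concatenate((a, b), axis)	a, b: arrays; axis: the axis along which to join	Joins arrays such as a and b, whose shapes match except along axis, into a new array
numpy.empty(shape, dtype=float, order='C')	order: 'C' (row-major) or 'F' (column-major)	Creates an uninitialized array with the given shape and dtype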

In [32]: a=np.linspace(1,10,4)

In [32]: b=np.linspace(1,10,4,endpoint=False)

In [32]: np.concatenate([a,b])

Out[32]: array([ 1. , 4. , 7. , 10. , 1. , 3.25, 5.5 , 7.75])

In [32]: np.concatenate((a,b) ,axis=0)

Out[32]: array([ 1. , 4. , 7. , 10. , 1. , 3.25, 5.5 , 7.75])
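The one-dimensional example above only has axis 0 available. A short sketch with made-up 2-D arrays shows how the axis argument changes the result:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

# axis=0 stacks rows: shapes (2, 2) and (1, 2) give (3, 2).
rows = np.concatenate((a, b), axis=0)

# axis=1 stacks columns, so the row counts must match;
# b.T has shape (2, 1), giving a (2, 3) result.
cols = np.concatenate((a, b.T), axis=1)

print(rows.shape, cols.shape)  # (3, 2) (2, 3)
print(cols)
```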

2.1.3 Creating ndarrays: zeros, full, eye

zeros creates an array of zeros with the given length or shape

In [29]: np.zeros(10)

Out[29]: array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [30]: np.zeros((3, 6))

Out[30]:

array([[ 0., 0., 0., 0., 0., 0.],

[ 0., 0., 0., 0., 0., 0.],

[ 0., 0., 0., 0., 0., 0.]])

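Syntax	Parameters	Function
numpy.full(shape, val)	shape: a tuple; val: the fill value	Creates an array of the given shape in which every element is val
numpy.full_like(a, val)	a: an existing array; val: the fill value	Creates an array with the same shape as a in which every element is val
numpy.eye(n)	n: number of rows and columns	Creates an n*n identity matrix with ones on the diagonal and zeros elsewhere

In [32]: b = np.full((3, 5), 10)

In [33]: b

Out[33]:

array([[10, 10, 10, 10, 10],

       [10, 10, 10, 10, 10],

       [10, 10, 10, 10, 10]])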

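2.1.4 Changing ndarray dimensions

Syntax	Parameters	Function
numpy.reshape(arr, newshape)	arr: the array to reshape; newshape: an int or tuple of ints, which must be compatible with the original shape	Returns an array with the new shape without changing the data
arr.resize(shape)	arr: the array to resize; shape: the new shape	Like reshape, but modifies the array in place and returns None (the function numpy.resize(arr, shape) instead returns a new array)
arr.flatten(order)	arr: the original array	Flattens the array into a new one-dimensional copy; changes to the result do not affect the original

reshape returns an array with the new shape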

a=np.arange(1,4).reshape((3,1))*10+np.arange(1,5)

a

Out:

array([[11, 12, 13, 14],

       [21, 22, 23, 24],

       [31, 32, 33, 34]])

a.reshape(2,3,2)

Out:

array([[[11, 12],

        [13, 14],

       [21, 22]],

      [[23, 24],

       [31, 32],

       [33, 34]]])

resize changes the shape of the array in place

a=np.arange(1,4).reshape((3,1))*10+np.arange(1,5)

a

Out:

array([[11, 12, 13, 14],

       [21, 22, 23, 24],

       [31, 32, 33, 34]])

a.resize(2,2,3)  # resizes a in place and returns None, so there is no output

a

Out:

array([[[11, 12, 13],

        [14, 21, 22]],

       [[23, 24, 31],

        [32, 33, 34]]])

flatten flattens the array to one dimension

a=np.arange(1,4).reshape((3,1))*10+np.arange(1,5)

a

Out:

array([[11, 12, 13, 14],

[21, 22, 23, 24],

[31, 32, 33, 34]])

a.flatten()

Out:

array([11, 12, 13, 14, 21, 22, 23, 24, 31, 32, 33, 34])
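One practical difference between these methods is whether the result shares memory with the original. The sketch below assumes a freshly built, contiguous array, for which reshape returns a view, while flatten always copies.

```python
import numpy as np

a = np.arange(1, 4).reshape((3, 1)) * 10 + np.arange(1, 5)

view = a.reshape(2, 6)   # for a contiguous array this is a view of a
copy = a.flatten()       # flatten always returns a copy

view[0, 0] = -1          # writes through to a
copy[1] = -2             # does not touch a

print(a[0, 0], a[0, 1])            # -1 12
print(np.shares_memory(a, view))   # True
print(np.shares_memory(a, copy))   # False
```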

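2.1.5 Changing ndarray dimensions: transpose

Syntax	Parameters	Function
ndarray.T		Transposes the array by reversing the order of its axes
ndarray.transpose(*axes)	axes: the original axis numbers listed in their new order, e.g. (1, 2, 0)	Permutes the axes of a higher-dimensional array

T transposes a matrix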

a=np.arange(1,4).reshape((3,1))*10+np.arange(1,5)

a

Out:

array([[11, 12, 13, 14],

       [21, 22, 23, 24],

       [31, 32, 33, 34]])

a.T

Out:

array([[11, 21, 31],

       [12, 22, 32],

       [13, 23, 33],

       [14, 24, 34]])

a=np.arange(1,4).reshape((3,1))*10+np.arange(1,5)

b=a.reshape(3,2,2)

b

Out:

array([[[11, 12],

        [13, 14]],

       [[21, 22],

        [23, 24]],

       [[31, 32],

        [33, 34]]])

b.T

Out:

array([[[11, 21, 31],

        [13, 23, 33]],

       [[12, 22, 32],

        [14, 24, 34]]])

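transpose permutes the axes of a higher-dimensional array; each argument names the original axis that moves to that position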

a=np.arange(1,4).reshape((3,1))*10+np.arange(1,5)

b=a.reshape(3,2,2)

b

Out:

array([[[11, 12],

        [13, 14]],

       [[21, 22],

        [23, 24]],

       [[31, 32],

        [33, 34]]])

b.transpose(2,1,0)

Out:

array([[[11, 21, 31],

        [13, 23, 33]],

       [[12, 22, 32],

        [14, 24, 34]]])

b.transpose(0,2,1)

Out:

array([[[11, 13],

        [12, 14]],

       [[21, 23],

        [22, 24]],

       [[31, 33],

        [32, 34]]])

b.transpose(1,2,0)

Out:

array([[[11, 21, 31],

        [12, 22, 32]],

       [[13, 23, 33],

        [14, 24, 34]]])

2.2 NumPy array operations

2.2.1 Arithmetic

In [51]: arr = np.array([[1., 2., 3.], [4., 5., 6.]])
In [52]: arr

Out[52]:

array([[ 1., 2., 3.],

[ 4., 5., 6.]])

In [53]: arr * arr

Out[53]:

array([[ 1., 4., 9.],

[ 16., 25., 36.]])
In [54]: arr - arr

Out[54]:

array([[ 0., 0., 0.],

[ 0., 0., 0.]])

Arithmetic between an array and a scalar propagates the scalar to every element:

In [55]: 1 / arr

Out[55]:

array([[ 1. , 0.5 , 0.3333],

[ 0.25 , 0.2 , 0.1667]])

In [56]: arr ** 0.5

Out[56]:

array([[ 1. , 1.4142, 1.7321],

[ 2. , 2.2361, 2.4495]])

Comparing arrays of the same size produces a boolean array:

In [51]: arr = np.array([[1., 2., 3.], [4., 5., 6.]])

In [57]: arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])

In [59]: arr2 > arr

Out[59]:

array([[False, True, False],

[ True, False, True]], dtype=bool)

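NumPy broadcasting

Broadcasting is how NumPy performs arithmetic on arrays of different shapes. Arithmetic on arrays is normally carried out element by element.

If two arrays a and b have the same shape, i.e. a.shape == b.shape, then a*b multiplies the corresponding elements of a and b. This requires the same number of dimensions and the same length along each dimension.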

import numpy as np

a = np.array([1,2,3,4])
b = np.array([10,20,30,40])
c = a * b
print (c)

#[ 10 40 90 160]

When the two arrays in an operation have different shapes, NumPy applies broadcasting automatically. For example:

import numpy as np

a = np.array([[ 0, 0, 0],
[10,10,10],
[20,20,20],
[30,30,30]])
b = np.array([0,1,2])
print(a + b)

#[[ 0  1  2]
# [10 11 12]
# [20 21 22]
# [30 31 32]]
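The rule behind the example is that shapes are compared from the trailing dimension backwards, and a dimension of size 1 (or a missing one) is stretched to match. The following sketch checks a few shape combinations; the arrays are made up for illustration.

```python
import numpy as np

a = np.zeros((4, 3))
b = np.array([0, 1, 2])           # shape (3,)
col = np.arange(4).reshape(4, 1)  # shape (4, 1)

# Shapes are compared from the right; a dimension of size 1
# (or a missing one) is stretched to match the other array.
print((a + b).shape)     # (4, 3)
print((a + col).shape)   # (4, 3)
print((col + b).shape)   # (4, 3)

# Incompatible trailing dimensions raise an error.
try:
    a + np.arange(4)     # (4, 3) vs (4,)
except ValueError as exc:
    message = str(exc)
print(message)
```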

2.2.2 Indexing and slicing

https://www.runoob.com/numpy/numpy-indexing-and-slicing.html

Exercises from the slides

arr = np.arange(10)

arr[5:8:2] = 12

arr

Out: array([ 0, 1, 2, 3, 4, 12, 6, 12, 8, 9])

In [72]: arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
In [73]: arr2d[-1]

Out[73]: array([7, 8, 9])

In [74]: arr2d[2][1]

Out[74]: 8

In [75]: arr2d[1, 2]

Out[75]: 6

In [76]: arr3d = np.array([[[1, 2, 3], [4, 5, 6]],

[[7, 8, 9], [10, 11, 12]]])

In [84]: arr3d[:1]

Out[84]: array([[[1, 2, 3], [4, 5, 6]]])

In [84]: arr3d[:, :1]

Out[84]: array([[[1, 2, 3]], [[7, 8, 9]]])

In [84]: arr3d[:2, 1:]

Out[84]: array([[[ 4, 5, 6]], [[10, 11, 12]]])

In [84]: arr3d[:2, 1:] = 0

In [84]: arr3d

Out[84]: array([[[1, 2, 3], [0, 0, 0]], [[7, 8, 9], [0, 0, 0]]])

# The examples below use the original arr3d again (re-create it as in In [76] before running them)

In [84]: arr3d[:, 1, 0]

Out[84]: array([ 4, 10])

In [84]: arr3d[:, 1:2, 0:1]

Out[84]: array([[[ 4]],

[[10]]])


In [84]: arr3d[1, :, 0]

Out[84]: array([ 7, 10])

In [84]: arr3d[1, 0, :]

Out[84]: array([7, 8, 9])

In [84]: arr3d[1, :2]

Out[84]: array([[ 7, 8, 9], [10, 11, 12]])

In [84]: arr3d[:2, 0]

Out[84]: array([[1, 2, 3], [7, 8, 9]])

2.2.3 Boolean indexing

The Python keywords and and or do not work with boolean arrays; use & and | instead

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will'])

data = np.arange(1,6).reshape((5,1))*10+np.arange(1,5)

names == 'Bob'

Out: array([ True, False, False, True, False])

data[names == 'Bob']

Out: array([[11, 12, 13, 14],

            [41, 42, 43, 44]])

data[names == 'Bob', 2:]

Out: array([[13, 14],

          [43, 44]])

mask = (names == 'Bob') | (names == 'Will')

data[mask]

Out: array([[11, 12, 13, 14],

            [31, 32, 33, 34],

            [41, 42, 43, 44],

            [51, 52, 53, 54]])

data[data%10 < 2] = 0

data

Out: array([[ 0, 12, 13, 14],

            [ 0, 22, 23, 24],

            [ 0, 32, 33, 34],

            [ 0, 42, 43, 44],

            [ 0, 52, 53, 54]])

data[names == 'Joe'] = 7.5

data

Out: array([[ 0, 12, 13, 14],

            [ 7, 7, 7, 7],  # data has an integer dtype, so 7.5 is truncated to 7

            [ 0, 32, 33, 34],

            [ 0, 42, 43, 44],

            [ 0, 52, 53, 54]])
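A short sketch of combining conditions, rebuilt on the same names and data arrays as above: each comparison needs its own parentheses, ~ negates a mask, and the keyword and fails on arrays.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will'])
data = np.arange(1, 6).reshape((5, 1)) * 10 + np.arange(1, 5)

# Each comparison needs its own parentheses, because & and |
# bind more tightly than != and >.
mask = (names != 'Bob') & (data[:, 0] > 25)
print(mask)                # [False False  True False  True]

# ~ negates a boolean array.
print(data[~mask][:, 0])   # [11 21 41]

# `and` asks for a single truth value, which is ambiguous for an array.
try:
    (names == 'Bob') and (names == 'Will')
except ValueError as exc:
    message = str(exc)
print(message)
```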

2.2.4 Fancy indexing

Fancy indexing means indexing with arrays (or lists) of integers; each integer is a position along the axis

arr = np.empty((5, 4))

for i in range(5):
    arr[i] = i

arr

Out: array([[0., 0., 0., 0.],

            [1., 1., 1., 1.],

            [2., 2., 2., 2.],

            [3., 3., 3., 3.],

            [4., 4., 4., 4.]])

arr[[4, 3, 0, 2]]  # note the double brackets: arr[4, 3, 0, 2] raises IndexError on this 2-D array

Out：

array([[4., 4., 4., 4.],

       [3., 3., 3., 3.],

       [0., 0., 0., 0.],

       [2., 2., 2., 2.]])

arr=np.arange(1,4).reshape((3,1))*10+np.arange(1,5)

arr

# np.arange(1, 4) creates the array [1, 2, 3], and reshape((3, 1)) turns it into a 2-D array with 3 rows and 1 column. Multiplying by 10 gives [[10], [20], [30]]. Adding np.arange(1, 5), i.e. [1, 2, 3, 4], triggers NumPy broadcasting, so [1, 2, 3, 4] is added to every row, producing the 3x4 array arr:

Out：array([[11, 12, 13, 14],

            [21, 22, 23, 24],

            [31, 32, 33, 34]])

arr[[1, 2, 0, 2], [0, 3, 1, 2]]

Out：array([21, 34, 12, 33])

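# This is fancy indexing with two lists: the first list [1, 2, 0, 2] gives row indices and the second list [0, 3, 1, 2] gives column indices. The two lists are paired up element by element, so the operation selects:

arr[1, 0], which is 21
arr[2, 3], which is 34
arr[0, 1], which is 12
arr[2, 2], which is 33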

arr[[1, 2, 0, 2]][:, [0, 3, 1, 2]]

Out：

array([[21, 24, 22, 23],

       [31, 34, 32, 33],

       [11, 14, 12, 13],

       [31, 34, 32, 33]])

# In the second step, : selects all rows and [0, 3, 1, 2] is a list of column indices, selecting columns 0, 3, 1, and 2 in that order

arr[[1, 2, 0, 2]]

out：

array([[21, 22, 23, 24],

         [31, 32, 33, 34],

         [11, 12, 13, 14],

         [31, 32, 33, 34]])
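The three selections above can be compared side by side. The sketch below also uses np.ix_, which is not mentioned in these notes, as a one-step alternative to the chained form.

```python
import numpy as np

arr = np.arange(1, 4).reshape((3, 1)) * 10 + np.arange(1, 5)
rows = [1, 2, 0, 2]
cols = [0, 3, 1, 2]

pairs = arr[rows, cols]              # element-wise pairs -> shape (4,)
block_chained = arr[rows][:, cols]   # two steps -> shape (4, 4)
block_ix = arr[np.ix_(rows, cols)]   # one step, same (4, 4) block

print(pairs)                                    # [21 34 12 33]
print(np.array_equal(block_chained, block_ix))  # True
```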

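2.3 Functions

2.3.1 Unary universal functions

Function	Description
abs	Element-wise absolute value for integer, floating-point, or complex values
sqrt	Square root of each element (equivalent to arr ** 0.5)
square	Square of each element (equivalent to arr ** 2)
exp	Exponential e^x of each element
log, log10, log2, log1p	Natural logarithm (base e), base-10 logarithm, base-2 logarithm, and log(1 + x), respectively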

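2.3.2 Binary universal functions

Function	Description
add	Adds the corresponding elements of the arrays
subtract	Subtracts the elements of the second array from the first
multiply	Multiplies the corresponding elements of the arrays
divide, floor_divide	Division, or floor division (dropping the remainder)
power	Raises each element of the first array to the power given by the corresponding element of the second
maximum	Element-wise maximum; fmax ignores NaN
minimum	Element-wise minimum; fmin ignores NaN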

2.3.3 where

numpy.where(condition, x, y)

is the vectorized version of the ternary expression x if condition else y

a = np.arange(10)

np.where(a, 1, -1)

Out:

array([-1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

np.where(a > 5, 1, -a)

Out:

array([ 0, -1, -2, -3, -4, -5, 1, 1, 1, 1])

np.where([[True, False], [True, True]],

         [[1, 2],[3, 4]], [[9, 8], [7, 6]])

Out：

array([[1, 8], [3, 4]])

With only condition and no x or y, where returns the coordinates of the elements that are nonzero (True)

a = np.array([2, 4, 6, 8, 10])

np.where(a > 5)

Out：

(array([2, 3, 4], dtype=int64),)

a[np.where(a > 5)]

Out：

array([ 6, 8, 10])

np.where([[0, 1], [1, 0]])

Out：

(array([0, 1], dtype=int64), array([1, 0], dtype=int64))

a = np.arange(12).reshape(2,2,3)

a

Out：

array([[[ 0, 1, 2],

        [ 3, 4, 5]],

       [[ 6, 7, 8],

        [ 9, 10, 11]]])

np.where(a > 5)

Out：

(array([1, 1, 1, 1, 1, 1], dtype=int64),

array([0, 0, 0, 1, 1, 1], dtype=int64),

array([0, 1, 2, 0, 1, 2], dtype=int64))

# That is, the elements of a that satisfy a > 5 are at indices (1, 0, 0), (1, 0, 1), (1, 0, 2), (1, 1, 0), (1, 1, 1), and (1, 1, 2). These are the elements 6, 7, 8, 9, 10, 11.

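2.3.4 Statistical functions

Function	Description
sum	Sum of all elements, or of the elements along an axis
mean, median	Arithmetic mean, median
std, var	Standard deviation and variance
min, max	Minimum and maximum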

arr = np.arange(12).reshape(3, 4)

arr

Out：

array([[ 0, 1, 2, 3],

[ 4, 5, 6, 7],

[ 8, 9, 10, 11]])

arr.mean()

Out：

5.5

np.mean(arr)

Out：

5.5

arr.mean(axis=1)  # mean of each row

Out：

array([1.5, 5.5, 9.5])

arr.sum(axis=0)  # sum of each column

Out：

array([12, 15, 18, 21])

2.3.5 Sorting with sort

arr = np.random.randn(3, 5)

arr

Out：array([[-0.8388, 0.4352, -0.5578, -0.5675, -0.3726],

           [-0.9266, 1.7551, 1.2098, 1.27 , -0.9744],

           [-0.6347, -0.3957, -0.2894, -0.7343, -0.7285]])

arr.sort()  # sorts arr in place; np.sort(arr) would return a sorted copy and leave arr unchanged

arr

Out：

array([[-0.8388, -0.5675, -0.5578, -0.3726, 0.4352],

       [-0.9744, -0.9266, 1.2098, 1.27 , 1.7551],

       [-0.7343, -0.7285, -0.6347, -0.3957, -0.2894]])

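An in-place sort modifies the array itself. By default it sorts along the last axis, here within each row, which is the same as arr.sort(1)

Sorting within each column:

arr.sort(0)  # in place; np.sort(arr, axis=0) gives the same result as a copy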

arr

Out：

array([[-0.9744, -0.9266, -0.6347, -0.3957, -0.2894],

       [-0.8388, -0.7285, -0.5578, -0.3726, 0.4352],

       [-0.7343, -0.5675, 1.2098, 1.27 , 1.7551]])

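How can all the values be sorted together?

Sorting along axis 0 and then along axis 1 only orders each column and each row; it does not give a global ordering. Pass axis=None so that np.sort flattens the array first, then restore the shape:

np.sort(arr, axis=None).reshape(arr.shape)

Out：

array([[-0.9744, -0.9266, -0.8388, -0.7343, -0.7285],

       [-0.6347, -0.5675, -0.5578, -0.3957, -0.3726],

       [-0.2894, 0.4352, 1.2098, 1.27 , 1.7551]])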

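2.4 Advanced math

2.4.1 Linear algebra: solve

solve

np.linalg.solve(A, b) solves the linear matrix equation Ax = b. The coefficient matrix A must be square (and non-singular); the result is the exact solution, up to floating-point error.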
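A minimal sketch of solve on a made-up 2x2 system (the coefficients are chosen only so that the answer is easy to check by hand):

```python
import numpy as np

# Made-up system:  3x + y = 9,  x + 2y = 8
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)
print(x)                      # [2. 3.]

# Check the solution by substituting it back.
print(np.allclose(A @ x, b))  # True
```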

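2.4.2 Pseudorandom numbers

Function	Purpose	Parameters
rand(d0, …, dn)	Uniformly distributed random numbers in [0, 1)	dn is the size of the n-th dimension
randn(d0, …, dn)	Standard normal random numbers	dn is the size of the n-th dimension
randint(low[, high, size, dtype])	Random integers in [low, high)	low: lowest value (inclusive); high: upper bound (exclusive); size: shape of the output
random_sample([size])	Random floats in [0, 1)	size: shape of the output; may be an int or a tuple
permutation, shuffle	permutation returns a randomly permuted copy; shuffle shuffles the sequence in place	Multi-dimensional arrays are shuffled along the first axis only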

np.random.rand(2,4)

Out:

array([[0.4786, 0.0501, 0.4536, 0.1171],

[0.8101, 0.3879, 0.2795, 0.1238]])

np.random.randn(2,4)

Out:

array([[ 0.7203, 0.3809, 1.0034, -2.3156],

[ 0.4572, -0.0259, -3.3993, -0.9747]])

np.random.randint(1,100,[2,4])

Out:

array([[84, 28, 27, 88], [56, 19, 42, 77]])
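Because these numbers are pseudorandom, fixing the seed makes the output repeatable. The sketch below uses np.random.seed with an arbitrary seed value and also contrasts permutation with shuffle.

```python
import numpy as np

np.random.seed(12345)
first = np.random.randint(1, 100, [2, 4])

np.random.seed(12345)          # same seed, same sequence
second = np.random.randint(1, 100, [2, 4])

print(np.array_equal(first, second))  # True

# permutation returns a shuffled copy; shuffle works in place.
arr = np.arange(5)
perm = np.random.permutation(arr)
np.random.shuffle(arr)
print(sorted(perm.tolist()))  # [0, 1, 2, 3, 4]
```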

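3. Pandas basics

3.1 Series and DataFrame

3.1.1 Creating a Series

A Series is a one-dimensional array-like object made up of a sequence of values (of any NumPy dtype) and an associated set of data labels, called its index.

import pandas as pd

import numpy as np

1. From a list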

s1 = pd.Series([4, 7, -5, 3])

s1

Out:

0    4

1    7

2   -5

3    3

dtype: int64

2. From a list with an explicit index

s2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

s2

Out:

d    4

b    7

a   -5

c    3

dtype: int64

3. From a dict

info = {'name':'小明','age':18,'class':'三班'}  # named info rather than dict, to avoid shadowing the built-in

s3 = pd.Series(info, index=['name','age','class','sex'])

s3

Out：

name      小明

age       18

class     三班

sex       NaN

dtype: object

3.1.2 Using Series

1. isnull and notnull detect missing values

s3

out:

name 小明

age       18

class     三班

sex       NaN

dtype: object

s3.isnull()

Out：

name     False

age      False

class    False

sex       True

dtype: bool

s3.notnull()

Out：

name      True

age       True

class     True

sex      False

dtype: bool

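2. Checking for an index label

s2  # here s2 is a different Series from the one above, e.g. pd.Series(np.arange(1, 6), index=list('abcde'))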

Out:

a    1

b    2

c    3

d    4

e    5

dtype: int32

2 in s2  # in tests the index labels, not the values

Out：

False

'b' in s2

Out：

True

3. Selecting several values by index

s3

Out：

name      小明

age       18

class     三班

sex       NaN

dtype: object

s3[['name','age']]  # selects the same rows as s3.iloc[[0, 1]]

Out：

name    小明

age     18

dtype: object

s3['name':'class']  # label slices include the end label, so this matches s3.iloc[0:3]

Out：

name     小明

age      18

class    三班

dtype: object
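The difference between label-based and position-based slicing is easy to get wrong, so here is a small sketch on a made-up Series, using the explicit .loc and .iloc accessors:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = s.loc['a':'c']    # the end label 'c' is included
by_position = s.iloc[0:2]    # the end position 2 is excluded

print(list(by_label.index))        # ['a', 'b', 'c']
print(list(by_position.index))     # ['a', 'b']
print(s.loc[['a', 'd']].tolist())  # [10, 40]
```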

4. Selecting data with a boolean index

s2

Out:

a    1

b    2

c    3

d    4

e    5

dtype: int32

s2>3

Out：

a    False

b    False

c    False

d     True

e     True

dtype: bool

s2[s2>3]

Out：

d    4

e    5

dtype: int32

5. Arithmetic results keep the link between index and values (and s2 itself is unchanged)

s2+2

Out：

a    3

b    4

c    5

d    6

e    7

dtype: int32

s2[s2>3]

Out：

d    4

e    5

dtype: int32

6. The name attribute

s2.name = 'temp'  # name of the Series

s2.index.name = 'year'  # name of its index

s2

Out：

year

a    1

b    2

c    3

d    4

e    5

Name: temp, dtype: int32

7. Browsing the data

s2.head()

Out：

year

a    1

b    2

c    3

d    4

e    5

Name: temp, dtype: int32

s2.tail(2)

Out：

year

d    4

e    5

Name: temp, dtype: int32

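3.1.3 Creating a DataFrame

A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on).

Internally, the data is stored as one or more two-dimensional blocks (rather than as a list, dict, or other collection of one-dimensional structures).

1. From a dict of Series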

pd1 = pd.DataFrame({'a':pd.Series(np.arange(3)),

'b':pd.Series(np.arange(3,5))})

pd1

Out:

a b

0 0 3.0

1 1 4.0

2 2 NaN

2. From a dict of dicts

data1 = {'a':{'apple':3.6,'banana':5.6},

       'b':{'apple':3,'banana':5},

       'c':{'apple':3.2}}

pd2 = pd.DataFrame(data1)

pd2

Out:

a b c

apple 3.6 3 3.2

banana 5.6 5 NaN

3. From a 2-D ndarray

arr1 = np.arange(12).reshape(4,3)

pd3 = pd.DataFrame(arr1)

pd3

Out:

    0 1 2

0 0 1 2

1 3 4 5

2 6 7 8

3 9 10 11

4. From a list of dicts

l1 = [{'apple':3.6,'banana':5.6},{'apple':3.2}]

pd4 = pd.DataFrame(l1)

pd4

Out:

apple banana

0 3.6 5.6

1 3.2 NaN

5. From a list of Series

l2 = [pd.Series(np.arange(3)),pd.Series(np.arange(2))]

pd5 = pd.DataFrame(l2)

pd5

Out:

    0    1 2

0 0.0 1.0 2.0

1 0.0 1.0 NaN

3.1.4 Using DataFrame

1. Built on a NumPy array

pd5 = pd.DataFrame(np.arange(9).reshape(3,3),

index=['a','c','b'],columns=['A','B','C'])

pd5

Out:

A B C

a 0 1 2

c 3 4 5

b 6 7 8

2. Browsing the data

pd5.head(1)

3. Rearranging data (passing a new index and columns reindexes the frame)

pd.DataFrame(pd5,index=('a','b','c'),columns=('A','B','D'))

Out:

A B D

a 0 1 NaN

b 6 7 NaN

c 3 4 NaN

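4. name attributes

pd5.name = 'temp'  # only sets a plain Python attribute; unlike a Series, a DataFrame has no name of its own, so it does not appear in the output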

pd5.index.name = 'year'

pd5.columns.name = 'state'

pd5

Out:

state A B C

year

a 0 1 2

c 3 4 5

b 6 7 8

3.2 Pandas index operations

3.2.1 Reindexing: reindex

1. reindex creates a new object that conforms to the new index

ps1

Out:

a    0

b    1

c    2

d    3

e    4

dtype: int64

ps2 = ps1.reindex(['a','d','b','c','e','f'])

ps2

Out:

a    0.0

d    3.0

b    1.0

c    2.0

e    4.0

f    NaN

dtype: float64

pd1

Out:

A B C

a 0 1 2

b 3 4 5

c 6 7 8

# By default, reindex rebuilds the row index

pd2 = pd1.reindex(['a','c','d','b'])

pd2

Out:

A B C

a 0.0 1.0 2.0

c 6.0 7.0 8.0

d NaN NaN NaN

b 3.0 4.0 5.0

pd3 = pd1.reindex(columns =

              ['C','B','A'])

pd3

Out:

C B A

a 2 1 0

b 5 4 3

c 8 7 6

pd4 = pd1.reindex(['a','c','d','b'],

          columns=['A','C','B'])

pd4

Out:

A C B

a 0.0 2.0 1.0

c 6.0 8.0 7.0

d NaN NaN NaN

b 3.0 5.0 4.0
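In the outputs above, labels that did not exist before become NaN, which also turns integer data into floats. As a sketch, the fill_value parameter of reindex supplies a replacement instead; ps1 is rebuilt here with assumed values matching the output shown earlier.

```python
import numpy as np
import pandas as pd

ps1 = pd.Series(np.arange(5), index=list('abcde'))

# Without fill_value the new label 'f' becomes NaN and the dtype turns float.
with_nan = ps1.reindex(['a', 'd', 'b', 'c', 'e', 'f'])

# With fill_value the gap is filled with the given value.
filled = ps1.reindex(['a', 'd', 'b', 'c', 'e', 'f'], fill_value=0)

print(with_nan['f'])      # nan
print(filled.tolist())    # [0, 3, 1, 2, 4, 0]
```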

2. Adding an index entry

ps1

Out:

a    0

b    1

c    2

d    3

e    4

dtype: int64

ps1['g'] = 9

ps1

Out:

a    0

b    1

c    2

d    3

e    4

g    9

dtype: int64

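3. Deleting index entries

Both drop and del can delete columns from a DataFrame. The differences:

- drop is a pandas method, whereas del is a Python statement;

- drop works on both rows and columns, whereas del only works on columns;

- drop can remove several items at once and returns a new object by default, whereas del removes one item at a time, in place.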

pd1

Out:

E A B C 4

a 9 0 1 2 10

b 99 3 4 5 11

c 999 6 7 8 12

d 1 1 1 1 1

# del works on columns only

del pd1[4]  # the column label is the integer 4; del pd1['4'] raises KeyError

pd1

Out:

E A B C

a 9 0 1 2

b 99 3 4 5

c 999 6 7 8

d 1 1 1 1

# drop works on rows by default

pd1.drop(['a','d'])  # same as pd1.drop(['a','d'], axis=0)

Out:

E A B C

b 99 3 4 5

c 999 6 7 8

pd1.drop('A', axis=1)  # same as pd1.drop('A', axis='columns')

Out:

E B C

a 9 1 2

b 99 4 5

c 999 7 8

d 1 1 1

4. Looking up by index

5. Advanced indexing

ps1

Out:

a    888

b      1

c      2

d      3

e      4

dtype: int64

ps1['a':'c']  # same as ps1.loc['a':'c'] and ps1.iloc[0:3]

Out:

a    888

b      1

c      2

dtype: int64

pd1

Out:

A B C

a 0 1 2

b 3 4 5

c 6 7 8

pd1.loc['a',['B','C']]  # same as pd1.iloc[0,1:] and pd1.iloc[0,[1,2]]

Out:

B    1

C    2

Name: a, dtype: int32

3.2.2 Modifying pandas objects

1. By index

ps1

Out:

a    0

b    1

c    2

d    3

e    4

dtype: int64

ps1['a'] = 999  # same as ps1.iloc[0] = 999

ps1

Out:

a    999

b      1

c      2

d      3

e      4

dtype: int64

pd1

Out:

A B C

a 0 1 2

b 3 4 5

c 6 7 8

# [] on a DataFrame refers to columns by default

pd1['A'] = [9,10,11]

pd1

Out:

A B C

a 9 1 2

b 10 4 5

c 11 7 8

2. By attribute access

pd1.A = 6

pd1

Out:

A B C

a 6 1 2

b 6 4 5

c 6 7 8

3. Adding a new column

pd1['D'] = [1,2,3]

pd1

Out:

A B C D

a 0 1 2 1

c 3 4 5 2

b 6 7 8 3

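3.3 Pandas functions

3.3.1 Aligned arithmetic

Method	Description
A.add(B), B.radd(A)	Addition (+)
A.sub(B), B.rsub(A)	Subtraction (-)
A.div(B), B.rdiv(A)	Division (/)
A.floordiv(B), B.rfloordiv(A)	Floor division (//)
A.mul(B), B.rmul(A)	Multiplication (*)
A.pow(B), B.rpow(A)	Exponentiation (**)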
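A brief sketch of why the method forms exist alongside the operators: they accept a fill_value for labels that appear on only one side. The two Series below are made up for illustration.

```python
import pandas as pd

A = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
B = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

plain = A + B                    # labels present on only one side give NaN
filled = A.add(B, fill_value=0)  # the missing side is treated as 0

print(plain.tolist())   # [nan, 12.0, 23.0, nan]
print(filled.tolist())  # [1.0, 12.0, 23.0, 30.0]

# The r-prefixed methods swap the operands: A.rsub(B) is B - A.
print(A.rsub(B, fill_value=0).tolist())  # [-1.0, 8.0, 17.0, 30.0]
```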

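3.3.2 Mixed operations

arr = np.arange(12).reshape(3, 4)

arr - arr[:, 0]   # ValueError: shapes (3, 4) and (3,) cannot be broadcast together

arr - arr[:, :0]  # ValueError: arr[:, :0] is empty, with shape (3, 0)

arr - arr[:, :1]  # works: arr[:, :1] keeps the first column as a (3, 1) array

Out:

array([[0, 1, 2, 3],

       [0, 1, 2, 3],

       [0, 1, 2, 3]])

Example: arr - arr[:, :1]

This subtracts the first column of arr from every column of arr. Step by step:

arr[:, :1] is the first column of arr kept as a 3x1 array, i.e. [[0], [4], [8]].

Broadcasting stretches it across all four columns, so the element-wise results are:

Column 1: [0, 4, 8] - [0, 4, 8] = [0, 0, 0]
Column 2: [1, 5, 9] - [0, 4, 8] = [1, 1, 1]
Column 3: [2, 6, 10] - [0, 4, 8] = [2, 2, 2]
Column 4: [3, 7, 11] - [0, 4, 8] = [3, 3, 3]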

df1

Out:

A B C

a 0 1 2

b 3 4 5

c 6 7 8

d 9 10 11

s3 =df1.iloc[0]

s3

Out:

A    0

B    1

C    2

Name: a, dtype: int32

df1-s3

Out:

A B C

a 0 0 0

b 3 3 3

c 6 6 6

d 9 9 9

s4 = df1['A']

s4

Out:

a    0

b    3

c    6

d    9

Name: A, dtype: int32

df1.sub(s4,axis=0)

Out:

A B C

a 0 1 2

b 0 1 2

c 0 1 2

d 0 1 2

3.3.3 Function mapping

boolean=[True,False]

gender=["男","女"]

data=pd.DataFrame({

"height":np.random.randint(150,190,5),

"weight":np.random.randint(40,90,5),

"smoker":[boolean[x] for x in np.random.randint(0,2,5)],

"gender":[gender[x] for x in np.random.randint(0,2,5)],

"age":np.random.randint(16,24,5)})

data

df

Out：

A B C

0 -1.401923 -0.909947 4.132582

1 -0.842740 0.669914 1.435051

2 -0.572756 -1.228377 0.911506

df.applymap(lambda x:"%.2f" % x**2)  # element-wise; from pandas 2.1 on, DataFrame.map is the preferred name for applymap

Out:

A B C

0 1.97 0.83 17.08

1 0.71 0.45 2.06

2 0.33 1.51 0.83

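3.3.4 Sorting

Method	Description
A.sort_index()	Sorts by the index labels: the row index by default, the column labels with axis=1
A.sort_values()	Sorts by the data values. For a DataFrame, the by parameter names the column(s) (or, with axis=1, the row(s)) to sort on; it does not sort by the index or column labels themselves.
A.rank()	Computes the rank of each value along an axis (0 or 1), starting from 1; the method parameter controls how ties are ranked.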

s1 = pd.Series(np.arange(4),index=list('dbca'))

s1

Out:
d    0

b    1

c    2

a    3

dtype: int32

Sorting by index

s1.sort_index()

Out:

a    3

b    1

c    2

d    0

dtype: int32

s1.sort_index(ascending = False)

Out:

d    0

c    2

b    1

a    3

dtype: int32

pd1 = pd.DataFrame(np.arange(12).reshape(4,3),

                index=list('bdca'),columns = list('BCA'))

pd1

Out:
B C A

b 0 1 2

d 3 4 5

c 6 7 8

a 9 10 11

pd1.sort_index(axis=1)

Out:

A B C

b 2 0 1

d 5 3 4

c 8 6 7

a 11 9 10

Sorting by value

Missing values are placed last by default

s1

Out:
d    0.0

b    1.0

c    2.0

a    NaN

dtype: float64

s1.sort_values(ascending=False)

Out:

c    2.0

b    1.0

d    0.0

a    NaN

dtype: float64

frame

Out:
b a c

0 4.3 0 -2.0

1 7.0 1 5.0

2 -3.0 0 -2.5

frame.rank(axis='columns')

Out:

b a c

0 3.0 2.0 1.0

1 3.0 1.0 2.0

2 1.0 3.0 2.0

frame.rank(axis='rows')

Out:

b a c

0 2.0 1.5 2.0

1 3.0 3.0 3.0

2 1.0 1.5 1.0

3.3.5 Unique values & membership tests

s1 = pd.Series([2,6,8,9,8,6],index=['a','a','c','c','c','c'])

s1

Out:
a    2

a    6

c    8

c    9

c    8

c    6

dtype: int64

s2=s1.unique()

s2

Out:

array([2, 6, 8, 9], dtype=int64)

# Check whether the index labels are unique

s1.index.is_unique

Out：False

# Count how many times each value occurs in the Series

s1.value_counts()

Out:

6    2

8    2

2    1

9    1

dtype: int64

s1.isin([8])

Out：

a    False

a    False

c     True

c    False

c     True

c    False

dtype: bool

# Test against several values at once

s1.isin([8,2])

Out：

a     True

a    False

c     True

c    False

c     True

c    False

dtype: bool

3.3.6 Missing values

df3 = pd.DataFrame([np.random.randn(3), [1., 2., np.nan],

[np.nan, 4., np.nan]])

df3

Out:
0 1 2

0 -1.911606 1.952048 -0.608502

1 1.000000 2.000000 NaN

2 NaN 4.000000 NaN

# isnull() marks which values are missing

df3.isnull()

Out:

0 1 2

0 False False False

1 False False True

2 True False True

# dropna() discards missing data (by default, rows containing any missing value)

df3.dropna()

Out:

0 1 2

0 -1.911606 1.952048 -0.608502

df3.dropna(axis=1)

Out:

1

0 1.952048

1 2.000000

2 4.000000

# fillna() fills in missing data

df3.fillna(-9999)

Out:

0 1 2

0 -1.911606 1.952048 -0.608502

1 1.000000 2.000000 -9999.000000

2 -9999.000000 4.000000 -9999.000000
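dropna and fillna take a few parameters that the examples above do not show. The sketch below uses a frame shaped like df3, with fixed numbers in place of the random row so the results are repeatable.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame([[0.5, 1.5, 2.5],
                    [1.0, 2.0, np.nan],
                    [np.nan, 4.0, np.nan]])

# how='all' drops only rows in which every value is missing (none here).
print(len(df3.dropna(how='all')))        # 3

# thresh=2 keeps rows with at least two non-missing values.
print(list(df3.dropna(thresh=2).index))  # [0, 1]

# A dict gives a different fill value per column.
filled = df3.fillna({0: 0.0, 2: -1.0})
print(filled.loc[2].tolist())            # [0.0, 4.0, -1.0]
```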

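3.3.7 Statistical functions

Method	Description
sum	Sum
mean	Mean
median	Median
mad	Mean absolute deviation from the mean (removed in pandas 2.0)
var	Variance
std	Standard deviation

Method	Description
cumsum	Cumulative sum
cummin, cummax	Cumulative minimum and cumulative maximum
prod	Product along an axis
cumprod	Cumulative product
diff	First-order difference (useful for time series)
pct_change	Percent change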

df

Out:
one two

a 1.4 NaN

b 7.1 -4.5

c 2.0 NaN

df.cumsum()

Out:

one two

a 1.4 NaN

b 8.5 -4.5

c 10.5 NaN

df.cummax(axis=1)

Out:

one two

a 1.4 NaN

b 7.1 7.1

c 2.0 NaN

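3.4 Reading and writing data with pandas

read_csv: loads delimited data from a file, URL, or file-like object. The default delimiter is a comma.

read_table: loads delimited data from a file, URL, or file-like object. The default delimiter is a tab ('\t').

Parameter	Description
sep	The delimiter to use. If omitted, read_csv splits on commas. A delimiter longer than one character, other than '\s+', is treated as a regular expression and handled by the Python parsing engine; regex delimiters are prone to ignoring quoted data.
names	List of column names for the result; if the file has no header row, also pass header=None.
index_col	Column number(s) or name(s) to use as the row index; passing a sequence produces a multi-level row index. If the file is malformed, with a delimiter at the end of each line, index_col=False stops pandas from using the first column as the row index.
skiprows	Number of lines to skip at the start of the file, or a list of line numbers (starting from 0) to skip.
nrows	Number of rows to read, counted from the start of the file.
chunksize	Number of rows per chunk when reading the file in pieces.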
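The parameters above can be tried without a file on disk by reading from an in-memory string. This is only a sketch: the semicolon-separated text and the column names are invented for the example.

```python
import io
import pandas as pd

text = ("# exported by a made-up tool\n"
        "1;alice;3.5\n2;bob;4.0\n3;carol;2.5\n4;dave;5.0\n")

# skiprows=1 skips the comment line; header=None with names supplies
# column names; index_col picks the row index; nrows limits the rows read.
df = pd.read_csv(io.StringIO(text), sep=';', skiprows=1, header=None,
                 names=['id', 'name', 'score'], index_col='id', nrows=3)
print(df)

# chunksize returns an iterator of DataFrames instead of a single one.
reader = pd.read_csv(io.StringIO(text), sep=';', skiprows=1, header=None,
                     names=['id', 'name', 'score'], chunksize=2)
sizes = [len(chunk) for chunk in reader]
print(sizes)  # [2, 2]
```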
