numpy pandas matplotlib

Latest recommended article published on 2025-06-24 17:21:50

树下小憩

License: CC 4.0 BY-SA

Category: python

Article link: https://blog.youkuaiyun.com/qq_40911564/article/details/88916091

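This article introduces the basics of data processing in Python, covering NumPy array operations, Pandas data cleaning and preprocessing techniques, and data visualization. It is aimed at beginners who want to pick up data processing skills quickly.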

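numpy.mean()

numpy.mean(a, axis=None, dtype=None, out=None, keepdims=False)
Computes the arithmetic mean.
The parameter used most often is axis. For an m × n matrix:
axis not set: the mean of all m × n elements, returned as a single scalar
axis = 0: collapses the rows and averages each column, returning n values (shape (n,), or (1, n) with keepdims=True)
axis = 1: collapses the columns and averages each row, returning m values (shape (m,), or (m, 1) with keepdims=True)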
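A small sketch of the three axis settings described for numpy.mean. The 2 × 3 matrix and the variable names are made up for illustration.

```python
import numpy as np

# A 2 x 3 example matrix (m = 2 rows, n = 3 columns); the values are arbitrary.
m = np.array([[1, 2, 3],
              [4, 5, 6]])

total_mean = np.mean(m)                           # mean of all m*n elements, a scalar
col_means = np.mean(m, axis=0)                    # one mean per column, shape (3,)
row_means = np.mean(m, axis=1)                    # one mean per row, shape (2,)
row_means_2d = np.mean(m, axis=1, keepdims=True)  # same values, shape (2, 1)

print(total_mean)          # 3.5
print(col_means)           # [2.5 3.5 4.5]
print(row_means)           # [2. 5.]
print(row_means_2d.shape)  # (2, 1)
```

keepdims=True is useful when the result has to broadcast against the original matrix, for example when subtracting each row's mean from that row.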

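numpy.array()

numpy.array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)

object: an array, any object exposing the array interface, or any (nested) sequence.

dtype: the desired data type of the array; optional.

copy: optional, default True; whether the object is copied.

order: memory layout: 'C' (row-major), 'F' (column-major), 'A' (any) or 'K' (keep; the default).

subok: by default the returned array is forced to be a base-class array; if True, subclasses are passed through.

ndmin: the minimum number of dimensions the returned array should have.

ndim: attribute giving the number of dimensions

shape: attribute giving the size along each dimension (rows and columns for a matrix)

dtype: attribute giving the data type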

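import numpy as np
a = np.array([[1,3,4],[3,6,9],[22,1,1]])
print(a)          # print the matrix a
print(a.ndim)     # number of dimensions of a
print(a.shape)    # size of a along the row and column dimensions
print(a.shape[0]) # shape[0] is the number of rows
print(a.shape[1]) # shape[1] is the number of columns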
 
[[ 1  3  4]
 [ 3  6  9]
 [22  1  1]]
2
(3, 3)
3
3

astype: data type conversion

arr = np.array([1,2,3])
arr.dtype
dtype('int64')

fl_arr = arr.astype(np.float64)
fl_arr.dtype
dtype('float64')

Indexing

arr = np.array([[1, 2], [3, 4], [6, 7]])

arr[2]
array([6, 7])

arr[2][0]
6

arr[2][0] can also be written as arr[2, 0]

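codecs.open(filepath, mode, encoding)

filepath: path of the file

mode: how to open the file: 'r' to read, 'w' to write, 'r+' to read and write

encoding: the file's encoding; use utf-8 for files containing Chinese text

pd.read_json

Reads JSON data into a pandas object

pd.read_json(codecs.open(train_filename, mode='r', encoding='utf-8'))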

Simple exercise


import numpy as np
import pandas as pd

# Data visualization code
from titanic_visualizations import survival_stats
from IPython.display import display
%matplotlib inline

# Load the data set
in_file = 'titanic_data.csv'
full_data = pd.read_csv(in_file)

# Show the first few passenger records
display(full_data.head())

np.ones():
Returns an array of the given shape and data type, filled with ones

np.ones(5, dtype = int)

array([1, 1, 1, 1, 1])

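Best Python libraries for data science

NumPy: the most fundamental one; fast
SciPy: built on NumPy, designed for scientific and engineering computing
Pandas: structured data analysis, built on NumPy
matplotlib: the most widely used plotting library
scikit-learn: machine learning modules

Matrix operations: two matrices that are added or subtracted must have the same number of rows and columns; the elements at matching positions are added or subtracted
Array multiplication (element-wise): multiplies the elements at matching positions

[a  b]   [e  f]   [ae  bf]
[c  d] * [h  i] = [ch  di]

Matrix multiplication:
Concepts: a matrix is a rectangular array, i.e. a two-dimensional array; vectors and scalars are special cases of matrices
Vector: a 1 × n or n × 1 matrix
Scalar: a 1 × 1 matrix
Array: an n-dimensional array, a generalization of the matrix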
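The notes name both element-wise multiplication and matrix multiplication. A minimal sketch contrasting the two on a pair of 2 × 2 arrays follows; the values and variable names are arbitrary.

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])
y = np.array([[5, 6],
              [7, 8]])

elementwise = x * y      # multiplies entries at matching positions
matrix_product = x @ y   # rows of x times columns of y (same as np.matmul(x, y))

print(elementwise)
# [[ 5 12]
#  [21 32]]
print(matrix_product)
# [[19 22]
#  [43 50]]
```

For two 2-D arrays, np.dot(x, y) gives the same result as x @ y.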

unique

np.unique(a)
Removes duplicates from a one-dimensional array or list and returns a new NumPy array of the unique elements, sorted in ascending order

import numpy as np
A = [1, 2, 2, 5,3, 4, 3]
a = np.unique(A)
B= (1, 2, 2,5, 3, 4, 3)
b= np.unique(B)
C= ['fgfh','asd','fgfh','asdfds','wrh']
c= np.unique(C)
print(a)
print(b)
print(c)
# Output: [1 2 3 4 5]
# [1 2 3 4 5]
# ['asd' 'asdfds' 'fgfh' 'wrh']

c, s = np.unique(b, return_index=True)
return_index=True additionally returns, for each element of the new array, the index of its first occurrence in the original input; these indices are stored in s.

a, s= np.unique(A, return_index=True)
print(a)
print(s)
# Result
# [1 2 3 4 5]
# [0 1 4 5 3]

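tolist()

tolist() converts an array into a nested list (a list whose elements are lists)

u = np.array([[1,2],[3,4]]) # array
m = u.tolist() # convert to a list

m.remove(m[0]) # remove m[0]

m = np.array(m) # convert back to an array

numpy.transpose()

numpy.transpose() transposes an array, permuting its axes in the requested order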
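A short sketch of numpy.transpose on a 2-D and a 3-D array. The shapes and the axis order (1, 0, 2) are chosen only for illustration.

```python
import numpy as np

t = np.arange(6).reshape(2, 3)   # shape (2, 3)
t_T = np.transpose(t)            # shape (3, 2); same result as t.T

cube = np.arange(24).reshape(2, 3, 4)
# axes=(1, 0, 2) swaps the first two axes and leaves the last one in place
cube_swapped = np.transpose(cube, (1, 0, 2))  # shape (3, 2, 4)

print(t_T.shape)           # (3, 2)
print(cube_swapped.shape)  # (3, 2, 4)
```

Without an axes argument, transpose reverses the order of all axes, which for a matrix is the ordinary transpose.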

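Two commonly used tools: Series and DataFrame

Series

A one-dimensional array object that holds values together with an index; in an interactive session the index is shown on the left and the values on the right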

pd.Series(np.ones(5, dtype = int))

0    1
1    1
2    1
3    1
4    1
dtype: int64

iterrows()

Iterates over the rows of a DataFrame as (index, row) pairs

import pandas as pd
inp = [{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}]
df = pd.DataFrame(inp)
print(df)

Iterating over the rows of the DataFrame
for index, row in df.iterrows():
    print(row["c1"], row["c2"])

Creating a DataFrame from a NumPy array

dates = pd.date_range('today', periods=6)  # use a date range as the index
num_arr = np.random.randn(6, 4)  # a random NumPy array as the data
columns = ['A', 'B', 'C', 'D']  # a list used as the column names
df1 = pd.DataFrame(num_arr, index=dates, columns=columns)
df1

Output


                                 A	          B	        C 	       D
2018-12-03 09:12:24.214936	-0.082675	-0.067416	0.149819	-1.534260
2018-12-04 09:12:24.214936	-0.142809	0.124314	-2.247123	0.842353
2018-12-05 09:12:24.214936	1.432994	-1.560499	0.776462	0.911158
2018-12-06 09:12:24.214936	-1.148185	0.075871	-0.023444	-1.034936
2018-12-07 09:12:24.214936	0.351215	-0.307880	-0.128473	0.238551
2018-12-08 09:12:24.214936	0.438020	0.132465	-1.474799	1.013236
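The df2 used in the following examples is never defined in these notes. Below is a minimal sketch that rebuilds it from the df2.values output shown further down. It assumes the column names animal, age, visits and priority seen in the later tables, and a default integer index. Some later tables show the labels a to j instead, which would correspond to passing index=list('abcdefghij').

```python
import numpy as np
import pandas as pd

# Reconstructed from the printed df2.values; not the original author's code.
data = {
    'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
    'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
    'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
df2 = pd.DataFrame(data)

print(df2.head())
```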

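df2.head() # shows 5 rows by default; pass the number of rows to preview if needed

View the last 3 rows of a DataFrame

df1.tail(3)

View the index of a DataFrame

df2.index

View the column names of a DataFrame

df1.columns

View the values of a DataFrame

df2.values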
array([['cat', 2.5, 1, 'yes'],
       ['cat', 3.0, 3, 'yes'],
       ['snake', 0.5, 2, 'no'],
       ['dog', nan, 3, 'yes'],
       ['dog', 5.0, 2, 'no'],
       ['cat', 2.0, 3, 'no'],
       ['snake', 4.5, 1, 'no'],
       ['cat', nan, 1, 'yes'],
       ['dog', 7.0, 2, 'no'],
       ['dog', 3.0, 1, 'no']], dtype=object)

View summary statistics of a DataFrame

df2.describe()

	        age	      visits
count	8.000000	10.000000
mean	3.437500	1.900000
std	    2.007797	0.875595
min	    0.500000	1.000000
25%	    2.375000	1.000000
50%	    3.000000	2.000000
75%  	4.625000	2.750000
max	    7.000000	3.000000

DataFrame transpose
Swaps the row index and the column index (columns)

df2.T


          0	 1	 2	   3	4	5	6	7	8	9
animal	cat	cat	snake	dog	dog	cat	snake	cat	dog	dog
age	     2.5	3	0.5	NaN	5	2	4.5	NaN	7	3
visits	 1	3	2	3	2	3	1	1	2	1
priority  yes	yes	no	yes	no	no	no	yes	no	no

Sort a DataFrame by a column

df2.sort_values(by='age')  # sort by age in ascending order

	animal	age	visits	priority
2	snake	0.5	2	no
5	cat	2.0	3	no
0	cat	2.5	1	yes
1	cat	3.0	3	yes
9	dog	3.0	1	no
6	snake	4.5	1	no
4	dog	5.0	2	no
8	dog	7.0	2	no
3	dog	NaN	3	yes
7	cat	NaN	1	yes

Select from a DataFrame by label (single column)

df2.age  # equivalent to df2['age']

Select from a DataFrame by label (multiple columns)

df2[['age', 'animal']]  # pass a list of column names

	age	animal
0	2.5	cat
1	3.0	cat
2	0.5	snake
3	NaN	dog
4	5.0	dog
5	2.0	cat
6	4.5	snake
7	NaN	cat
8	7.0	dog
9	3.0	dog

Select from a DataFrame by position

df2.iloc[1:3]  # rows 2 and 3 (positions 1 and 2)


    animal	age	visits	priority
1	cat 	3.0	   3	    yes
2	snake	0.5	   2	    no

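Copying a DataFrame

# Create a copy of the DataFrame so the data set can be used by several independent workflows
df3 = df2.copy()
df3

Check whether DataFrame elements are missing

df3.isnull()  # True where the value is missing

Add a column

num = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], index=df3.index)

df3['No.'] = num  # add a new column named 'No.'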
df3

       animal	age	   visits	priority	No.
0	     cat	2.5	       1	yes	         0
1	     cat	3.0	       3	yes	         1
2	    snake	0.5	    2	    no	         2
3	      dog	NaN	     3	    yes	         3

a = np.random.randn(6):

A one-dimensional array with 6 elements
array([ 1.01918956, -0.07630487,  0.670206  ,  0.11643249, -0.8786764 ,
        0.53472156])

a = np.random.randn(6,4)

A 2-D array: 6 is the number of rows, 4 the number of columns
array([[-1.46502821,  0.35496013,  1.14157859,  0.75489192],
       [ 1.05971693, -0.95212416, -0.69018161,  1.17303578],
       [ 0.64833749, -1.27319532,  1.02585485,  0.16928617],
       [ 1.23933293, -0.69070133,  0.73011837, -1.64986667],
       [ 1.58974146, -0.21465733,  0.08269329,  0.06843603],
       [ 1.34074247,  0.56058411,  1.91461851, -1.61984593]])

a = np.random.randn(6,4,2)

A 3-D array:
6 blocks, each with 4 rows and 2 columns
array([[[-0.78424652,  1.42846068],
        [ 0.19196262, -0.10176529],
        [-0.79707897,  1.09714884],
        [-0.07525452, -0.91242259]],

       [[ 0.29680667,  0.25644389],
        [-1.44185937, -0.47235719],
        [ 0.69264438, -0.26749929],
        [ 0.78584973,  0.35513566]],

       [[ 0.70035031, -0.35090556],
        [ 0.05236121, -0.43648547],
        [-0.85058387,  0.63819206],
        [-0.1499679 , -0.19824881]],

       [[-1.12006453, -0.0466797 ],
        [ 2.25716264, -0.46989652],
        [ 0.8327038 ,  0.35963639],
        [ 0.82197041, -0.45308462]],

       [[-0.66498114, -0.37932524],
        [ 0.73012345,  1.94071078],
        [-0.10040431,  2.001614  ],
        [ 0.24226774,  0.07251607]],

       [[ 0.11695915, -0.27854957],
        [-1.32175319,  0.4192623 ],
        [ 1.15859023, -0.06911767],
        [-0.72360086, -0.25954412]]])

Modify a DataFrame value by position

# Change the value in row 2, column 1 (the animal column) from 'cat' to 2
df3.iat[1, 0] = 2  # positions start at 0, so this is 1, 0
df3

Modify a DataFrame value by label

df3.loc['f','age']=3
df3
df3

   animal	age	visits	priority	No.
a	   cat	2.5	1	yes	1
b	   2	3.0	3	yes	2
c	   snake	0.5	2	no	3
d 	   dog	NaN	3	yes	45
e	   dog	5.0	2	no	6
f	   cat	3.0	3	no	4
g	   snake	4.5	1	no	0
h	   cat	NaN	1	yes	7
i	   dog	7.0	2	no	8
j	   dog	3.0	1	no	9

Mean of a DataFrame

df3.mean(numeric_only=True)  # skip the non-numeric columns

age       3.5625
visits    1.9000
No.       8.5000
dtype: float64

Convert strings to lowercase

string = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca',
                    np.nan, 'CABA', 'dog', 'cat'])
print(string)
string.str.lower()

0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

Convert strings to uppercase

string.str.upper()

Handling missing values in a DataFrame

Filling missing values

# Make a copy first, then find the missing values and fill them
df4  = df3.copy()
print(df4)
df4.fillna(value=1000000)
  animal  age  visits priority  No.
a    cat  2.5       1      yes    1
b      2  3.0       3      yes    2
c  snake  0.5       2       no    3
d    dog  NaN       3      yes   45
e    dog  5.0       2       no    6
f    cat  3.0       3       no    4
g  snake  4.5       1       no    0
h    cat  NaN       1      yes    7
i    dog  7.0       2       no    8
j    dog  3.0       1       no    9



animal	age	visits	priority	No.
a	cat	2.5	1	yes	1
b	2	3.0	3	yes	2
c	snake	0.5	2	no	3
d	dog	1000000.0	3	yes	45
e	dog	5.0	2	no	6
f	cat	3.0	3	no	4
g	snake	4.5	1	no	0
h	cat	1000000.0	1	yes	7
i	dog	7.0	2	no	8
j	dog	3.0	1	no	9

Drop rows that contain missing values

df5 = df4.copy()
print(df5)
df5.dropna(how='any')

  animal  age  visits priority  No.
a    cat  2.5       1      yes    1
b      2  3.0       3      yes    2
c  snake  0.5       2       no    3
d    dog  NaN       3      yes   45
e    dog  5.0       2       no    6
f    cat  3.0       3       no    4
g  snake  4.5       1       no    0
h    cat  NaN       1      yes    7
i    dog  7.0       2       no    8
j    dog  3.0       1       no    9

animal	age	visits	priority	No.
a	cat	2.5	1	yes	1
b	2	3.0	3	yes	2
c	snake	0.5	2	no	3
e	dog	5.0	2	no	6
f	cat	3.0	3	no	4
g	snake	4.5	1	no	0
i	dog	7.0	2	no	8
j	dog	3.0	1	no	9

Aligning DataFrames on a given column

left = pd.DataFrame({'key': ['foo1', 'foo2'], 'one': [1, 2]})
right = pd.DataFrame({'key': ['foo2', 'foo3'], 'two': [4, 5]})

print(left)
print(right)

# Join on the key column; only foo2 appears in both frames, so the result has a single row
pd.merge(left, right, on='key')
    key  one
0  foo1    1
1  foo2    2
    key  two
0  foo2    4
1  foo3    5
key	one	two
0	foo2	2	4

DataFrame file operations

Writing a CSV file

df3.to_csv('animal.csv')
print("Write succeeded.")

Reading a CSV file

df_animal = pd.read_csv("animal.csv")
df_animal

	Unnamed: 0	animal	age	visits	priority	No.
0	a	cat	2.5	1	yes	1
1	b	2	3.0	3	yes	2
2	c	snake	0.5	2	no	3
3	d	dog	NaN	3	yes	45
4	e	dog	5.0	2	no	6
5	f	cat	3.0	3	no	4
6	g	snake	4.5	1	no	0
7	h	cat	NaN	1	yes	7
8	i	dog	7.0	2	no	8
9	j	dog	3.0	1	no	9

Writing an Excel file

df3.to_excel("animal.xlsx",sheet_name = "Sheet1")
print("Write succeeded")

Reading an Excel file


pd.read_excel('animal.xlsx', 'Sheet1', index_col=None, na_values=['NA'])



animal	age	visits	priority	No.
a	cat	2.5	1	yes	1
b	2	3.0	3	yes	2
c	snake	0.5	2	no	3
d	dog	NaN	3	yes	45
e	dog	5.0	2	no	6
f	cat	3.0	3	no	4
g	snake	4.5	1	no	0
h	cat	NaN	1	yes	7
i	dog	7.0	2	no	8
j	dog	3.0	1	no	9

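Advanced topics

Time series index

Create a Series indexed by every day of 2018, with random values

dti = pd.date_range(start='2018-01-01', end='2018-12-31', freq='D')
s = pd.Series(np.random.randn(len(dti)), index=dti)  # randn (standard normal) matches the negative values in the output below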
s



2018-01-01   -0.362893
2018-01-02    0.138087
2018-01-03   -0.463004
2018-01-04    1.789864
2018-01-05    0.210887
2018-01-06    0.345864
2018-01-07   -1.518800
2018-01-08    0.158643
2018-01-09    0.375234
2018-01-10    0.749695
2018-01-11   -0.727865
2018-01-12    0.924250
2018-01-13   -0.529816
2018-01-14   -1.467113
2018-01-15   -0.446632
2018-01-16    0.335176
2018-01-17    1.858967
2018-01-18   -0.150502
2018-01-19   -0.210001
2018-01-20    1.273068
2018-01-21   -0.256310
2018-01-22   -0.058372
2018-01-23    0.218605
2018-01-24   -1.205350
2018-01-25   -2.282698
2018-01-26   -1.168071
2018-01-27    0.625110
2018-01-28    0.908457
2018-01-29    0.053577
2018-01-30   -1.146219
                ...   
2018-12-02   -1.582800
2018-12-03   -1.717248
2018-12-04    1.177545
2018-12-05    0.492441
2018-12-06   -1.176167
2018-12-07   -0.658026
2018-12-08    0.264001
2018-12-09    0.008442
2018-12-10   -1.088271
2018-12-11    0.627408
2018-12-12   -0.101300
2018-12-13    0.087599
2018-12-14   -0.007576
2018-12-15    1.972353
2018-12-16   -0.747422
2018-12-17   -0.320284
2018-12-18    0.260397
2018-12-19    0.416506
2018-12-20    0.286775
2018-12-21   -1.839240
2018-12-22   -0.357600
2018-12-23   -1.236817
2018-12-24    0.453598
2018-12-25   -0.642702
2018-12-26   -0.907909
2018-12-27   -0.753226
2018-12-28    0.387926
2018-12-29   -0.451129
2018-12-30    1.660053
2018-12-31    1.170976
Freq: D, Length: 365, dtype: float64

Sum the values of s that fall on a Wednesday

s[s.index.weekday==2].sum()

Compute the monthly mean of s

s.resample('M').mean()
2018-01-31   -0.036959
2018-02-28    0.141772
2018-03-31    0.101216
2018-04-30    0.395403
2018-05-31    0.279149
2018-06-30   -0.048444
2018-07-31   -0.115449
2018-08-31    0.079646
2018-09-30    0.113726
2018-10-31    0.067740
2018-11-30    0.378494
2018-12-31   -0.142576
Freq: M, dtype: float64

Resampling a Series to a different time unit (seconds to minutes)

s = pd.date_range('today', periods=100, freq='S')

ts = pd.Series(np.random.randint(0, 500, len(s)), index=s)

ts.resample('Min').sum()

DataFrame MultiIndex

Creating a DataFrame from a MultiIndex

Create a MultiIndex DataFrame indexed by letters = ['A', 'B'] and numbers = list(range(6)), with random values.
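The notes state this exercise without code. One possible sketch uses pd.MultiIndex.from_product; the level names (letter, number) and column names (hello, world) are placeholders chosen for illustration, not taken from the original.

```python
import numpy as np
import pandas as pd

letters = ['A', 'B']
numbers = list(range(6))

# Cartesian product of the two lists: ('A', 0), ('A', 1), ..., ('B', 5)
mi = pd.MultiIndex.from_product([letters, numbers], names=['letter', 'number'])

# 12 rows of random values; the two column names are hypothetical
df_mi = pd.DataFrame(np.random.rand(len(mi), 2), index=mi, columns=['hello', 'world'])

print(df_mi.loc['A'])       # the 6 rows whose first level is 'A'
print(df_mi.loc[('B', 3)])  # a single row, returned as a Series
```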

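NumPy functions: using arange() and reshape():

arange() generates a one-dimensional array
reshape() gives an array a new shape, for example turning a one-dimensional array into a multi-dimensional one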

import numpy as np

print('Default 1-D array:', np.arange(5))
print('1-D array with custom start:', np.arange(1, 5))
print('1-D array with custom start and step:', np.arange(2, 10, 2))
print('2-D array:', np.arange(8).reshape((2, 4)))
print('3-D array:', np.arange(60).reshape((3, 4, 5)))

print('3-D array of random integers in a given range:', np.random.randint(1, 8, size=(3, 4, 5)))

Output

Default 1-D array: [0 1 2 3 4]
1-D array with custom start: [1 2 3 4]
1-D array with custom start and step: [2 4 6 8]
2-D array: [[0 1 2 3]
 [4 5 6 7]]
3-D array: [[[ 0  1  2  3  4]
  [ 5  6  7  8  9]
  [10 11 12 13 14]
  [15 16 17 18 19]]

 [[20 21 22 23 24]
  [25 26 27 28 29]
  [30 31 32 33 34]
  [35 36 37 38 39]]

 [[40 41 42 43 44]
  [45 46 47 48 49]
  [50 51 52 53 54]
  [55 56 57 58 59]]]
3-D array of random integers in a given range: [[[2 3 2 1 5]
  [6 5 5 6 7]
  [4 4 6 5 3]
  [2 2 3 5 6]]

 [[2 1 2 4 4]
  [1 4 2 1 4]
  [4 4 3 4 2]
  [4 1 4 4 1]]

 [[6 2 2 7 6]
  [2 6 1 5 5]
  [2 6 7 2 1]
  [3 3 1 4 2]]]
[[[3 3 5 6]
  [2 1 6 6]
  [1 1 3 5]]

 [[7 6 5 3]
  [5 6 5 4]
  [6 5 7 1]]]

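Data cleaning

The data we receive often does not meet the requirements of the final processing step: it contains many missing values and bad records, so it has to be cleaned first.

Filling in missing values by interpolation

FlightNumber has missing values. The numbers increase in steps of 10, so fill in the missing values to complete the data and convert the column to int.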

df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm',
                               'Budapest_PaRis', 'Brussels_londOn'],
                   'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
                   'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )',
                               '12. Air France', '"Swiss Air"']})
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)
df


	From_To	FlightNumber	RecentDelays	Airline
0	LoNDon_paris	10045	[23, 47]	KLM(!)
1	MAdrid_miLAN	10055	[]	<Air France> (12)
2	londON_StockhOlm	10065	[24, 43, 87]	(British Airways. )
3	Budapest_PaRis	10075	[13]	12. Air France
4	Brussels_londOn	10085	[67, 32]	"Swiss Air"

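Splitting a column
From_To should really be two separate columns, From and To. Split From_To on the underscore into two columns and store them in a new table.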

temp = df.From_To.str.split("_",expand=True)
temp.columns = ['From', 'To']
temp

From	To
0	LoNDon	paris
1	MAdrid	miLAN
2	londON	StockhOlm
3	Budapest	PaRis
4	Brussels	londOn

Normalizing strings
The place names are not capitalized consistently (for example, londON should be London), so the data needs to be normalized.

temp['From'] = temp['From'].str.capitalize()
temp['To'] = temp['To'].str.capitalize()


     From	To
0	London	Paris
1	Madrid	Milan
2	London	Stockholm
3	Budapest	Paris
4	Brussels	London

Dropping the bad column and adding the cleaned data

Drop the original From_To column and add the cleaned From and To columns.

df = df.drop("From_To",axis=1)
df = df.join(temp)
df


	FlightNumber	RecentDelays	Airline	From	To
0	10045	[23, 47]	KLM(!)	London	Paris
1	10055	[]	<Air France> (12)	Madrid	Milan
2	10065	[24, 43, 87]	(British Airways. )	London	Stockholm
3	10075	[13]	12. Air France	Budapest	Paris
4	10085	[67, 32]	"Swiss Air"	Brussels	London

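Removing extraneous characters

Many entries in the Airline column contain extra characters that would seriously affect later analysis, so they need to be cleaned up.

df['Airline'] = df['Airline'].str.extract(
    r'([a-zA-Z\s]+)', expand=False).str.strip()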
df

	FlightNumber	RecentDelays	Airline	From	To
0	10045	[23, 47]	KLM	London	Paris
1	10055	[]	Air France	Madrid	Milan
2	10065	[24, 43, 87]	British Airways	London	Stockholm
3	10075	[13]	Air France	Budapest	Paris
4	10085	[67, 32]	Swiss Air	Brussels	London

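Normalizing the format

RecentDelays stores lists of varying length, which makes later analysis awkward. Here the lists are expanded so that the elements at the same position form one column, with NaN filling the gaps.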

delays = df['RecentDelays'].apply(pd.Series)

delays.columns = ['delay_{}'.format(n)
                  for n in range(1, len(delays.columns)+1)]

df = df.drop('RecentDelays', axis=1).join(delays)
df

	FlightNumber	Airline	From	To	delay_1	delay_2	delay_3
0	10045	KLM	London	Paris	23.0	47.0	NaN
1	10055	Air France	Madrid	Milan	NaN	NaN	NaN
2	10065	British Airways	London	Stockholm	24.0	43.0	87.0
3	10075	Air France	Budapest	Paris	13.0	NaN	NaN
4	10085	Swiss Air	Brussels	London	67.0	32.0	NaN

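Data preprocessing

Binning values into intervals

The math scores of some students in a class are shown below

df=pd.DataFrame({'name':['Alice','Bob','Candy','Dany','Ella','Frank','Grace','Jenny'],'grades':[58,83,79,65,93,45,61,88]})

What we really care about is whether each student passed, so split the scores by whether they are greater than 60.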

import pandas as pd
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Candy', 'Dany', 'Ella',
                            'Frank', 'Grace', 'Jenny'], 'grades': [58, 83, 79, 65, 93, 45, 61, 88]})


def choice(x):
    if x > 60:
        return 1
    else:
        return 0


df.grades = pd.Series(map(lambda x: choice(x), df.grades))
df

	name	grades
0	Alice	0
1	Bob	1
2	Candy	1
3	Dany	1
4	Ella	1
5	Frank	0
6	Grace	1
7	Jenny	1
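The same pass/fail split can also be written without a helper function by comparing the whole column at once. A sketch under the same threshold (greater than 60) follows; the names scores and passed are introduced here for illustration.

```python
import pandas as pd

scores = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Candy', 'Dany', 'Ella', 'Frank', 'Grace', 'Jenny'],
    'grades': [58, 83, 79, 65, 93, 45, 61, 88],
})

# The comparison yields a boolean Series; astype(int) maps True/False to 1/0
scores['passed'] = (scores['grades'] > 60).astype(int)

print(scores)
```

Writing the result to a new column keeps the original grades available, whereas the version above overwrites them.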

Removing duplicates

A DataFrame with a single column A is shown below

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})

Try to remove the consecutive duplicates in column A.

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
print(df)
df.loc[df['A'].shift() != df['A']]



    A
0   1
1   2
2   2
3   3
4   4
5   5
6   5
7   5
8   6
9   7
10  7

    A
0	1
1	2
3	3
4	4
5	5
8	6
9	7

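Data normalization

Sometimes the values in different columns of a DataFrame differ too much in scale and need to be normalized.
Min-max normalization is a simple and common approach; the formula is:

Y = (X - X_min) / (X_max - X_min)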

import numpy as np
def normalization(df):
    numerator = df.sub(df.min())
    denominator = (df.max()).sub(df.min())
    Y = numerator.div(denominator)
    return Y


df = pd.DataFrame(np.random.random(size=(5, 3)))
print(df)
normalization(df)



          0         1         2
0  0.015459  0.516737  0.789905
1  0.571452  0.894519  0.653478
2  0.884207  0.823804  0.936545
3  0.781063  0.685239  0.241068
4  0.469597  0.840986  0.594445

        0	       1	            2
0	0.000000	0.000000	0.789152
1	0.639993	1.000000	0.592988
2	1.000000	0.812814	1.000000
3	0.881273	0.446028	0.000000
4	0.522749	0.858295	0.508107

Plotting with Pandas

The most direct way to understand what the data contains is to plot it.

Series visualization

%matplotlib inline
ts = pd.Series(np.random.randn(100), index=pd.date_range('today', periods=100))
ts = ts.cumsum()
ts.plot()

DataFrame line plot

df = pd.DataFrame(np.random.randn(100, 4), index=ts.index,
                 columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.plot()

DataFrame scatter plot

df = pd.DataFrame({"xs": [1, 5, 2, 8, 1], "ys": [4, 2, 1, 9, 6]})
df = df.cumsum()
df.plot.scatter("xs", "ys", color='red', marker="*")

DataFrame bar chart

df = pd.DataFrame({"revenue": [57, 68, 63, 71, 72, 90, 80, 62, 59, 51, 47, 52],
                   "advertising": [2.1, 1.9, 2.7, 3.0, 3.6, 3.2, 2.7, 2.4, 1.8, 1.6, 1.3, 1.9],
                   "month": range(12)
                   })

ax = df.plot.bar("month", "revenue", color="yellow")
df.plot("month", "advertising", secondary_y=True, ax=ax)

display

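1. pd.set_option('display.expand_frame_repr', False)

True allows a wide DataFrame to wrap across several lines; False disables wrapping.

2. pd.set_option('display.max_rows', 10)

pd.set_option('display.max_columns', 10)

The maximum number of rows and columns displayed; anything beyond the limit is replaced by an ellipsis. max_columns counts DataFrame columns. With many columns and wrapping disabled, the output becomes hard to read.

3. pd.set_option('display.precision', 5)

Number of digits shown after the decimal point

4. pd.set_option('display.large_repr', 'truncate')

'truncate' shows a truncated frame, 'info' shows a summary view; 'truncate' is the usual choice

5. pd.set_option('display.max_colwidth', 5)

Maximum column width in characters

6. pd.set_option('display.chop_threshold', 0.5)

Values whose absolute value is below 0.5 are displayed as 0.0

7. pd.set_option('display.colheader_justify', 'left')

Whether column headers are left- or right-aligned

8. pd.set_option('display.width', 200)

Maximum number of characters displayed per line. The default of 80 does not suit wide screens; 200 is a common choice.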
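pd.set_option changes these settings for the whole session. When a setting is only needed for one print, pd.option_context applies it temporarily. A minimal sketch follows; the 3 × 20 frame and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(np.arange(60).reshape(3, 20))

before = pd.get_option('display.max_columns')

# Inside the with-block at most 6 columns are shown; the rest collapse into '...'
with pd.option_context('display.max_columns', 6, 'display.width', 200):
    inside = pd.get_option('display.max_columns')
    print(wide)

# On leaving the block the previous value is restored automatically
after = pd.get_option('display.max_columns')
print(inside, after == before)
```

pd.reset_option('display.max_columns') restores the library default for a setting that was changed with pd.set_option.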

value_counts

value_counts() is a quick way to see which distinct values a column contains and how many times each of them occurs

On a Series

import pandas as pd
import numpy as np
from pandas import DataFrame
from pandas import Series
s1=Series(["timo","mike","anni","timo"])
s1.value_counts()

timo    2
mike    1
anni    1
dtype: int64

DataFrame

import pandas as pd
import numpy as np
from pandas import DataFrame
from pandas import Series
df1= DataFrame(
                {"handsome":["timo","anni","timo"],
                "smart":["mike","anni","mike"]}
                )
print(df1)
df1.apply(pd.Series.value_counts)  # a DataFrame needs apply to run value_counts() on every column

 handsome smart
0     timo  mike
1     anni  anni
2     timo  mike

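Sorting with sort_values

Sorts a data set by the values in a given field; the function can sort by the data in given columns or in given rows

DataFrame.sort_values(by='##', axis=0, ascending=True, inplace=False, na_position='last')

Parameter	Description
by	column name(s) to sort by (axis=0 or 'index'), or index label(s) (axis=1 or 'columns')
axis	axis=0 or 'index' sorts the rows by the values in the given columns; axis=1 or 'columns' sorts the columns by the values in the given rows; default axis=0
ascending	whether to sort in ascending order; default True
inplace	whether to replace the original data with the sorted result; default False
na_position	{'first', 'last'}; where missing values are placed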
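A small sketch of the by, ascending and na_position parameters from the table above. The pets frame is a made-up four-row example.

```python
import numpy as np
import pandas as pd

pets = pd.DataFrame({
    'animal': ['cat', 'snake', 'dog', 'cat'],
    'age': [2.5, 0.5, np.nan, 3.0],
    'visits': [1, 2, 3, 3],
})

# Descending by age, with the missing value placed first
by_age_desc = pets.sort_values(by='age', ascending=False, na_position='first')

# Sort by visits (descending), breaking ties by age (ascending, NaN last by default)
by_two = pets.sort_values(by=['visits', 'age'], ascending=[False, True])

print(by_age_desc.index.tolist())  # [2, 3, 0, 1]
print(by_two.index.tolist())       # [3, 2, 1, 0]
```

Both calls return new frames; pets itself stays in its original order because inplace defaults to False.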

loc and iloc

https://blog.youkuaiyun.com/qq_21840201/article/details/80725433

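isin()

Can be used as a filter condition together with loc and iloc
Takes a list and checks, for each element of the column, whether it is in that list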
https://www.jianshu.com/p/805f20ac6e06
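A minimal sketch of isin() as a filter condition with loc, on a made-up frame:

```python
import pandas as pd

pets = pd.DataFrame({
    'animal': ['cat', 'snake', 'dog', 'cat'],
    'visits': [1, 2, 3, 3],
})

# Boolean mask: True where the animal is one of the listed values
mask = pets['animal'].isin(['cat', 'dog'])

selected = pets.loc[mask]    # rows whose animal is in the list
excluded = pets.loc[~mask]   # ~ inverts the mask: rows not in the list

print(selected)
print(excluded)
```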

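set_index()

Sets the index using one or more existing columns; returns a new object by default

DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)

drop: default True; whether to remove the columns that become the new index.

append: whether to append the columns to the existing index.

inplace: whether to modify the DataFrame in place instead of creating a new one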

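reset_index()

DataFrame.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
For a DataFrame with a MultiIndex, returns a new DataFrame with the index labels moved into columns named after the index levels, defaulting to "level_0", "level_1", etc. when a level has no name. For a standard index, the index name is used if set; otherwise the default "index" is used, or "level_0" if "index" is already taken.
level: int, str, tuple or list, default None

Remove only the given levels from the index. All levels are removed by default

drop: bool, default False

Do not try to insert the index into the DataFrame columns. This resets the index to the default integer index.

inplace: bool, default False

Modify the DataFrame in place (do not create a new object)

col_level: int or str, default 0

If the columns have multiple levels, determines which level the labels are inserted into. By default they are inserted into the first level.

col_fill: object, default ''

If the columns have multiple levels, determines how the other levels are named. If None, the index name is repeated.

reset_index restores the index, turning it back into the default integer index
level controls which index level is reset
drop=True
The drop parameter can be used to avoid adding the old index as a column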
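A short sketch tying set_index and reset_index together, including the level and drop parameters described above. The flights frame reuses a few values from the cleaning example, and the variable names are chosen for illustration.

```python
import pandas as pd

flights = pd.DataFrame({
    'From': ['London', 'Madrid', 'London'],
    'To': ['Paris', 'Milan', 'Stockholm'],
    'FlightNumber': [10045, 10055, 10065],
})

# Move two columns into a two-level index (drop=True removes them from the columns)
indexed = flights.set_index(['From', 'To'])

restored = indexed.reset_index()             # both levels become columns again
dropped = indexed.reset_index(drop=True)     # old index is discarded entirely
only_to = indexed.reset_index(level='To')    # only the 'To' level is moved back

print(list(restored.columns))  # ['From', 'To', 'FlightNumber']
print(list(dropped.columns))   # ['FlightNumber']
print(list(only_to.columns))   # ['To', 'FlightNumber']
```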