03_Python data analysis: numpy

Latest recommended article published on 2025-04-12 12:19:16


License: CC 4.0 BY-SA

Tags: #python #data analysis #numpy


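This article introduces the basics of numpy, Python's scientific computing library: creating arrays, working with data types, array shapes, arithmetic, the broadcasting rule, the concept of axes, reading data, transposition, indexing and slicing, modifying values, boolean indexing, the ternary operation, the clip function, and statistical functions. Worked examples show numpy processing CSV data, performing matrix operations, and computing statistics, followed by hands-on exercises.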

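Previous article: Python data analysis with matplotlib
Why learn numpy
numpy has three main advantages:

It is fast
It is convenient
It is the foundational library for scientific computing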

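1. What is numpy?

numpy is a foundational library for scientific computing in Python. It focuses on numerical computation, underlies most other Python scientific computing libraries, and is mostly used for numerical operations on large, multidimensional arrays.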

2. numpy basics

2.1 Creating arrays (matrices) with numpy

# coding=utf-8
import numpy as np
import random

# Create arrays with numpy; the result has type ndarray
t1 = np.array([1, 2, 3])
print(t1)
print(type(t1))

t2 = np.array(range(10))
print(t2)
print(type(t2))

# np.arange(start, stop, step) works like range but returns an ndarray
t3 = np.arange(4, 10, 2)
print(t3)
print(type(t3))

print(t3.dtype)
print("*" * 100)
# Data types in numpy

# "i1" is a one-byte signed integer (int8)
t4 = np.array(range(1, 4), dtype="i1")
print(t4)
print(t4.dtype)

# The bool type in numpy
t5 = np.array([1, 1, 0, 1, 0, 0], dtype=bool)
print(t5)
print(t5.dtype)

# Change the data type
t6 = t5.astype("int8")
print(t6)
print(t6.dtype)

# Floating-point numbers in numpy
t7 = np.array([random.random() for _ in range(10)])
print(t7)
print(t7.dtype)

# Round to two decimal places
t8 = np.round(t7, 2)
print(t8)

2.2 More common data types in numpy

Common data types include int, float, string, and so on.

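2.3 Working with data types

This covers specifying the data type when creating an array, changing the data type of an existing array, and setting the number of decimal places of floating-point values.
That raises a question: how do you keep a fixed number of decimal places in plain Python?
Reference: https://blog.youkuaiyun.com/whjstudy1/article/details/79528720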
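The three operations above, plus the plain-Python question, can be sketched as follows. This is a minimal illustration with made-up values; the variable names are only for this example.

```python
import numpy as np

# Specify the dtype at creation time ("i1" is a one-byte signed integer).
small_ints = np.array([1, 2, 3], dtype="i1")

# astype returns a converted copy; the original array keeps its dtype.
as_floats = small_ints.astype("float64")

# np.round rounds every element to the given number of decimal places.
values = np.array([0.12345, 2.71828, 3.14159])
rounded = np.round(values, 2)

# Plain Python: round() returns a number, string formatting returns text.
python_rounded = round(3.14159, 2)
formatted = "%.2f" % 3.14159
formatted_f = f"{3.14159:.2f}"

print(small_ints.dtype, as_floats.dtype)
print(rounded)
print(python_rounded, formatted, formatted_f)
```

Note that round() and np.round change the stored value, while string formatting only changes how the number is displayed.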

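2.4 Array shape

The main operations are inspecting an array's shape with .shape and changing it with .reshape.

Reshaping returns a new array and leaves the original array's shape unchanged.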
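A short sketch of .shape and .reshape on a small example array (the array contents are arbitrary):

```python
import numpy as np

t = np.arange(12)            # one dimension, shape (12,)
t2 = t.reshape((3, 4))       # a new array object with shape (3, 4)
t3 = t2.reshape((2, 2, 3))   # the total element count must stay 12
flat = t2.flatten()          # back to one dimension, as a copy

# t itself still has its original shape.
print(t.shape, t2.shape, t3.shape, flat.shape)
```

If the requested shape does not match the number of elements, for example t.reshape((5, 3)) here, numpy raises a ValueError.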

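2.4 Arithmetic between arrays and numbers

Arrays with different shapes cannot be combined element by element unless their shapes are compatible, for example when a two-dimensional array is combined with a single row of the same column count or a single column of the same row count.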
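The cases described above can be tried on a small array. This is a sketch with arbitrary values, not output from the article's dataset:

```python
import numpy as np

t = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# A number is applied to every element of the array.
plus_one = t + 1
doubled = t * 2
halved = t / 2                   # true division gives a float array

# Arrays with the same shape combine element by element.
same_shape_sum = t + t

# A (2, 3) array combined with a length-3 row or a (2, 1) column.
row = np.array([10, 20, 30])
column = np.array([[100], [200]])
with_row = t + row               # the row is added to each row of t
with_column = t + column         # the column is added to each column of t

print(with_row)
print(with_column)
```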

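2.5 The broadcasting rule

Two arrays are broadcast-compatible if, comparing their shapes from the trailing (last) dimension backwards, each pair of axis lengths is either equal or one of them is 1. Broadcasting then takes place over the missing dimensions and the dimensions of length 1.
How should we read this?
Think of the number of dimensions as the number of entries in shape.
So here are two questions:
Can an array of shape (3,3,3) be combined with an array of shape (3,2)?
Can an array of shape (3,3,2) be combined with an array of shape (3,2)?
What is this good for?
For example: subtracting each column's mean from the data in that column.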
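One way to check the two shape questions is simply to let numpy try the operation. The helper can_broadcast below is a hypothetical name introduced for this sketch, not a numpy function; the column-mean example uses arbitrary data.

```python
import numpy as np

def can_broadcast(shape_a, shape_b):
    """Return True if arrays of these two shapes can be added together."""
    try:
        np.zeros(shape_a) + np.zeros(shape_b)
        return True
    except ValueError:
        # numpy raises ValueError when the shapes cannot be broadcast.
        return False

# Trailing dimensions 3 and 2 differ and neither is 1.
print(can_broadcast((3, 3, 3), (3, 2)))
# Trailing dimensions (3, 2) match; the missing leading dimension is filled in.
print(can_broadcast((3, 3, 2), (3, 2)))

# Subtract each column's mean from that column.
data = np.arange(12, dtype=float).reshape(4, 3)
column_means = data.mean(axis=0)   # shape (3,), one mean per column
centered = data - column_means     # (4, 3) minus (3,) broadcasts over the rows
print(centered)
```

Without broadcasting, the column means would first have to be repeated into a full (4, 3) array by hand.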

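2.6 Axes (axis)

For a two-dimensional array, axis 0 runs down the rows and axis 1 runs across the columns, so a reduction such as sum(axis=0) returns one result per column and sum(axis=1) returns one result per row.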
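A small sketch makes the direction of each axis concrete (the array is arbitrary):

```python
import numpy as np

t = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# axis=0 collapses the rows: one result per column.
sum_axis0 = t.sum(axis=0)

# axis=1 collapses the columns: one result per row.
sum_axis1 = t.sum(axis=1)

print(sum_axis0, sum_axis0.shape)
print(sum_axis1, sum_axis1.shape)
```

A useful check is that the axis passed to the function is the one that disappears from the result's shape.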

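2.7 Reading data with numpy

CSV: Comma-Separated Values file
Displayed as: a table
Source file: formatted text in which newlines separate rows and commas separate columns; each line is one record

Because CSV is easy to display, read, and write, it is widely used to store and transfer small and medium-sized datasets. For teaching convenience we will work with CSV files often, although working with data from a database is also easy.

import numpy as np
np.loadtxt(fname, dtype=float, delimiter=None, skiprows=0, usecols=None, unpack=False)

Parameters
fname: the file to read (a path string or an open file object)
dtype: the data type of the resulting array; the default is float
delimiter: the string that separates values; the default is any whitespace, so pass "," for CSV
skiprows: the number of lines to skip at the start of the file, typically a header row
usecols: the indices of the columns to read
unpack: if True, the result is transposed so that each column of the file becomes a row of the array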
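The effect of skiprows, usecols, and unpack can be seen without any file on disk by reading from an in-memory string. The CSV text below is made up for illustration; only the column names follow the dataset used later in this article.

```python
import io
import numpy as np

# Hypothetical sample data: a header line followed by three records.
csv_text = (
    "views,likes,dislikes,comment_total\n"
    "100,10,1,5\n"
    "200,20,2,8\n"
    "300,30,3,9\n"
)

# One row per record; skiprows=1 skips the header line.
rows = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=int, skiprows=1)

# unpack=True transposes the result: one row per column of the file.
columns = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=int,
                     skiprows=1, unpack=True)

# usecols keeps only the listed columns (here views and likes).
views_and_likes = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=int,
                             skiprows=1, usecols=(0, 1))

print(rows.shape, columns.shape, views_and_likes.shape)
```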

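2.8 Transposition in numpy

Transposition is a transformation: for a numpy array it swaps the data across the diagonal, which makes the data more convenient to process.
All three methods, transpose(), .T, and swapaxes(1, 0), transpose a two-dimensional array, so for such arrays transposing and swapping the axes have the same effect.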

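# coding=utf-8
import numpy as np

'''
Transposition is a transformation: for a numpy array it swaps the data
across the diagonal, which makes the data more convenient to process.
Common methods: transpose(), .T, and swapaxes(1, 0)
'''

'''
# Ways to transpose
t=np.arange(24).reshape(4,6)
print(t)
print(t.transpose())
print(t.T)
print(t.swapaxes(1,0))
'''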

us_file_path = "./youtube_video_data/US_video_data_numbers.csv"
uk_file_path = "./youtube_video_data/GB_video_data_numbers.csv"

# t1 is the transposed form of t2 because unpack=True
t1 = np.loadtxt(us_file_path, delimiter=",", dtype=int, unpack=True)
t2 = np.loadtxt(us_file_path, delimiter=",", dtype=int)
# print(t1)
print("*" * 100)
print(t2)
print("*" * 100)
# Select one row
print(t2[2])

# Select several consecutive rows
print("*" * 100)
print(t2[2:])

# Select several non-consecutive rows
print("*" * 100)
print(t2[[2, 8, 10]])

# print(t2[1,:])
# print(t2[2:,:])
# print(t2[[2,10,3],:])

# Select one column
print("*" * 100)
print(t2[:, 0])

# Select several consecutive columns
print("*" * 100)
print(t2[:, 2:])

# Select several non-consecutive columns
print("*" * 100)
print(t2[:, [0, 2]])

# Select by row and column: the value in the 3rd row, 4th column
print("*" * 100)
a = t2[2, 3]
print(a)
print(type(a))

# Select several rows and columns: rows 3 to 5, columns 2 to 4
# The result is the block where those rows and columns intersect
print("*" * 100)
b = t2[2:5, 1:4]
print(b)

# Select several non-adjacent points
# The selected positions are (0, 0), (2, 1), and (2, 3)
print("*" * 100)
c = t2[[0, 2, 2], [0, 1, 3]]
print(c)

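Exercise
We have CSV files with the view, like, dislike, and comment counts (["views", "likes", "dislikes", "comment_total"]) of more than 1,000 YouTube videos each from the UK and the US. Let's apply what we have just learned to work with them.

Data source: https://www.kaggle.com/datasnaek/youtube/data

Note: to copy a directory into another directory, first switch to the destination directory, then run the command: cp -rf <source> .
cp -rf ~/Documents/DataAnalysis/day03/code/youtube_video_data .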
# coding="utf-8"
'''
We now want to combine the two countries' data from the earlier example
and analyze them together, while keeping the country information
(which country each record came from). How can we do that?
'''
import numpy as np

us_data="./youtube_video_data/US_video_data_numbers.csv"
uk_data="./youtube_video_data/GB_video_data_numbers.csv"

# 1. Load each country's data
us_data = np.loadtxt(us_data,delimiter=",",dtype=int)
uk_data = np.loadtxt(uk_data,delimiter=",",dtype=int)
print(us_data.shape[0],uk_data.shape[0])
'''
1688 1600
'''

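# 2. Add the country information
# Build one column of all 0s (for the US) and one of all 1s (for the UK)
zeros_data=np.zeros((us_data.shape[0],1)).astype(int)
ones_data=np.ones((uk_data.shape[0],1)).astype(int)

# print(zeros_data)
# Append the column of 0s or 1s to each dataset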
us_data=np.hstack((us_data,zeros_data))
print(us_data)
'''
[[4394029  320053    5931   46245       0]
 [7860119  185853   26679       0       0]
 [5845909  576597   39774  170708       0]
 ...
 [ 142463    4231     148     279       0]
 [2162240   41032    1384    4737       0]
 [ 515000   34727     195    4722       0]]
'''
uk_data=np.hstack((uk_data,ones_data))
print(uk_data)
'''
[[7426393   78240   13548     705       1]
 [ 494203    2651    1309       0       1]
 [ 142819   13119     151    1141       1]
 ...
 [ 109222    4840      35     212       1]
 [ 626223   22962     532    1559       1]
 [  99228    1699      23     135       1]]
'''

# 3. Stack the two datasets vertically
final_data=np.vstack((us_data,uk_data))
print(final_data)
'''
[[4394029  320053    5931   46245       0]
 [7860119  185853   26679       0       0]
 [5845909  576597   39774  170708       0]
 ...
 [ 109222    4840      35     212       1]
 [ 626223   22962     532    1559       1]
 [  99228    1699      23     135       1]]
'''

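How, then, can we combine this with the matplotlib we learned earlier to present the UK and US data?

What should we consider when we see a question like this?
What result do we want to show, what problem are we solving, and which kind of chart should we choose?
What further processing does the data need?
Write the code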

# coding="utf-8"
import numpy as np
import matplotlib.pyplot as plt

us_data="./youtube_video_data/US_video_data_numbers.csv"
uk_data="./youtube_video_data/GB_video_data_numbers.csv"

t_us = np.loadtxt(us_data,delimiter=",",dtype=int)

# Take the comment counts (the last column)
t_us_comments=t_us[:,-1]
# Keep only the values below 5000
t_us_comments=t_us_comments[t_us_comments<5000]
# Compute the number of bins
d = 250 # bin width
bin_nums = (max(t_us_comments) - min(t_us_comments)) // d

# Plot
plt.figure(figsize=(20,8),dpi=80)
plt.hist(t_us_comments,bin_nums,density=False)
plt.savefig('./t_04.png')
plt.show()

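2.9 Indexing and slicing in numpy

With the data we have just loaded, what should we do if we want only one particular column (or row)?
It is simple, and works much like indexing a Python list.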
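The same selections can be tried on a small array whose values make the result easy to check. This sketch uses np.arange data rather than the CSV files:

```python
import numpy as np

t = np.arange(24).reshape(4, 6)   # rows 0..3, columns 0..5

row_2 = t[2]                      # one row
rows_from_1 = t[1:]               # consecutive rows
rows_0_and_3 = t[[0, 3]]          # non-consecutive rows

col_0 = t[:, 0]                   # one column
cols_from_2 = t[:, 2:]            # consecutive columns
cols_0_and_2 = t[:, [0, 2]]       # non-consecutive columns

value = t[2, 3]                   # a single element: row 2, column 3
block = t[1:3, 1:4]               # intersection of rows 1..2 and columns 1..3

# Two index lists are paired up: positions (0, 0), (2, 1), and (2, 3).
points = t[[0, 2, 2], [0, 1, 3]]

print(row_2, col_0, value)
print(block)
print(points)
```

Unlike a nested Python list, a numpy array takes the row and column selections together inside one pair of brackets, separated by a comma.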

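2.10 Modifying values in numpy

Changing the values of a row or column is easy, but what if the condition is more complex?
For example, suppose we want to replace every number in t that is less than 10 with 3.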
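Both kinds of assignment can be sketched on a small array, assuming t is an integer array like the one below:

```python
import numpy as np

t = np.arange(24).reshape(4, 6)

# Assign to a whole column: every element in column 2 becomes 0.
t[:, 2] = 0

# Assign by condition: every element smaller than 10 becomes 3.
t[t < 10] = 3

print(t)
```

Both assignments modify t in place; take a copy with t.copy() first if the original values are still needed.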

2.11 Boolean indexing in numpy
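A comparison on an array produces an array of booleans with the same shape, and that boolean array can itself be used as an index. A minimal sketch with arbitrary data:

```python
import numpy as np

t = np.arange(24).reshape(4, 6)

# The comparison is evaluated element by element.
mask = t > 20

# Indexing with the mask keeps the elements where the mask is True.
selected = t[mask]

# Conditions are combined with & and |, each wrapped in parentheses.
even_and_large = t[(t > 10) & (t % 2 == 0)]

print(mask.dtype, mask.shape)
print(selected)
print(even_and_large)
```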

2.12 The ternary operation in numpy
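The array counterpart of Python's `a if condition else b` is np.where(condition, x, y), which picks from x where the condition holds and from y elsewhere. A minimal sketch with arbitrary values:

```python
import numpy as np

t = np.arange(8).reshape(2, 4)   # [[0, 1, 2, 3], [4, 5, 6, 7]]

# Elements below 4 become 0, all others become 10.
replaced = np.where(t < 4, 0, 10)

# x and y may also be arrays: keep small values, replace the rest with 10.
kept_small = np.where(t < 4, t, 10)

print(replaced)
print(kept_small)
```

np.where returns a new array and leaves t unchanged.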

2.13 clip in numpy
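clip limits values to a range: anything below the lower bound is raised to it, and anything above the upper bound is lowered to it. A minimal sketch, with bounds chosen arbitrarily:

```python
import numpy as np

t = np.arange(12).reshape(3, 4)

# Values below 3 become 3, values above 8 become 8.
clipped = t.clip(3, 8)

print(clipped)
```

Like np.where, clip returns a new array rather than modifying t.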

2.14 nan and inf in numpy

Points to note about nan in numpy
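A few properties of nan and inf that are easy to trip over can be checked directly. This is a sketch with a made-up array:

```python
import numpy as np

# nan (not a number) and inf (infinity) are both floats.
print(type(np.nan), type(np.inf))

# nan is not equal to anything, including itself.
nan_equals_nan = (np.nan == np.nan)

t = np.array([1.0, 2.0, np.nan, 4.0, np.nan])

# Because nan != nan, t != t is True exactly at the nan positions.
nan_count_by_inequality = np.count_nonzero(t != t)

# np.isnan is the more readable way to find nan values.
nan_count_by_isnan = np.count_nonzero(np.isnan(t))

# Any arithmetic involving nan gives nan.
total = np.sum(t)

# The nan-aware variants skip the missing values.
total_ignoring_nan = np.nansum(t)

print(nan_equals_nan, nan_count_by_isnan, total, total_ignoring_nan)
```

Since one nan turns a whole sum or mean into nan, missing values are usually replaced (for example with the mean) before computing statistics.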

3. Common statistical functions in numpy
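The most commonly used statistics can be computed over the whole array or along one axis. A sketch on a small arbitrary array:

```python
import numpy as np

t = np.array([[1, 2, 3],
              [4, 5, 6]])

total = t.sum()                 # sum of all elements
column_sums = t.sum(axis=0)     # one sum per column
mean = t.mean()                 # mean of all elements
row_means = t.mean(axis=1)      # one mean per row
median = np.median(t)
maximum = t.max()
minimum = t.min()
value_range = np.ptp(t)         # peak to peak: maximum minus minimum
std = t.std()                   # standard deviation

print(total, mean, median, maximum, minimum, value_range, std)
```

The standard deviation measures how far the values spread around the mean: the larger it is, the more the data fluctuates.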

3.1 Filling missing values in an ndarray with the mean
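One possible approach is to go through the array column by column and replace each nan with the mean of the non-nan values in that column. The function name fill_nan_with_column_mean and the sample array are assumptions made for this sketch.

```python
import numpy as np

def fill_nan_with_column_mean(t):
    """Return a copy of t with each nan replaced by its column's mean."""
    t = t.copy()                       # leave the caller's array untouched
    for i in range(t.shape[1]):
        column = t[:, i]               # a view, so writes go into t
        nan_mask = np.isnan(column)
        if nan_mask.any():
            # Mean of the values in this column that are not nan.
            column[nan_mask] = column[~nan_mask].mean()
    return t

# Sample data with two missing values in the middle row.
t = np.arange(12, dtype=float).reshape(3, 4)
t[1, 2:] = np.nan

filled = fill_nan_with_column_mean(t)
print(t)
print(filled)
```

The array must have a float dtype, because nan cannot be stored in an integer array. A column that consists entirely of nan has no valid values to average and would stay nan in this sketch.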
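Learning includes summarizing. At this point, consider the following questions:

How do you select one or more rows (or columns) of data?
How do you assign values to the selected rows or columns?
How do you replace values greater than 10 with 10?
How is np.where used?
How is np.clip used?
How do you transpose (swap axes)?
How do you read data from and save data to CSV?
What are np.nan and np.inf?
How many of the common statistical functions do you remember?
What does the standard deviation tell you about the data?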
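Exercises
Using the data on the 1,000-plus YouTube videos each from the UK and the US together with matplotlib from earlier, plot a histogram of the comment counts for each country.
To understand the relationship between comment counts and like counts for YouTube videos in the UK, how should the chart be drawn?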

# coding="utf-8"
import numpy as np
import matplotlib.pyplot as plt

us_data="./youtube_video_data/US_video_data_numbers.csv"
uk_data="./youtube_video_data/GB_video_data_numbers.csv"

t_us = np.loadtxt(us_data,delimiter=",",dtype=int)

# Take the comment counts (the last column)
t_us_comments=t_us[:,-1]
# Keep only the values below 5000
t_us_comments=t_us_comments[t_us_comments<5000]
# Compute the number of bins
d = 250 # bin width
bin_nums = (max(t_us_comments) - min(t_us_comments)) // d

# Plot
plt.figure(figsize=(20,8),dpi=80)
plt.hist(t_us_comments,bin_nums,density=False)
plt.savefig('./t_04.png')
plt.show()

# coding="utf-8"
import numpy as np
import matplotlib.pyplot as plt

us_data = "./youtube_video_data/US_video_data_numbers.csv"
uk_data = "./youtube_video_data/GB_video_data_numbers.csv"

t_uk = np.loadtxt(uk_data, delimiter=",", dtype='int')
# Keep only the rows whose like count (column 1) is below 500,000
t_uk=t_uk[t_uk[:,1]<500000]

t_uk_comment=t_uk[:,-1]
t_uk_like=t_uk[:,1]

plt.figure(figsize=(20,8),dpi=80)
plt.scatter(t_uk_like,t_uk_comment)
plt.savefig('./t_05.png')
plt.show()