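Data Visualization with matplotlib and seaborn
3-3-1 Data visualization in Python
matplotlib is Python's standard plotting library; seaborn is built on top of matplotlib and gives its figures a more polished look.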
3-3-2 Environment setup
# Run in PyCharm
# Libraries for numerical computation
import numpy as np
import pandas as pd
# Plotting library
from matplotlib import pyplot as plt
3-3-3 Drawing a line plot with pyplot
A line plot shows at a glance how the data changes.
# Run in PyCharm
# Libraries for numerical computation
import numpy as np
import pandas as pd
# Plotting library
from matplotlib import pyplot as plt
x = np.array([0,1,2,3,4,5,6,7,8,9])
y = np.array([2,3,4,3,5,4,6,7,4,8])
plt.plot(x, y, color = 'black')
plt.title("lineplot matplotlib")
plt.xlabel("x")
plt.ylabel("y")
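plt.show()  # required in PyCharm; without it the figure does not appear
To save the figure to a file, call plt.savefig before plt.show(); in a script, saving after the figure window has been closed typically writes an empty image.
plt.savefig("filename")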
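The ordering of savefig and show can be sketched as follows. This is a minimal example under two assumptions that are not in the original notes: the non-interactive "Agg" backend is selected so the script also runs without a display, and the output goes to a hypothetical file name in the system temp directory.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import numpy as np
from matplotlib import pyplot as plt

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([2, 3, 4, 3, 5, 4, 6, 7, 4, 8])

plt.plot(x, y, color="black")
plt.title("lineplot matplotlib")
plt.xlabel("x")
plt.ylabel("y")

# Hypothetical output location, chosen only for this illustration
out_path = os.path.join(tempfile.gettempdir(), "lineplot_example.png")

# Save first, while the figure still exists
plt.savefig(out_path)
# In PyCharm with an interactive backend, plt.show() would go here, after savefig
plt.close()

saved_size = os.path.getsize(out_path)
print(out_path, saved_size)
```

With an interactive backend, drop the matplotlib.use line and call plt.show() after plt.savefig.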
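3-3-4 Drawing a line plot with seaborn and pyplot
Import seaborn under the name sns and call sns.set(); the appearance of the figure then changes.
# Run in PyCharm
# Libraries for numerical computation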
import numpy as np
import scipy as sp
import seaborn as sns
import pandas as pd
# Plotting library
from matplotlib import pyplot as plt
x = np.array([0,1,2,3,4,5,6,7,8,9])
y = np.array([2,3,4,3,5,4,6,7,4,8])
sns.set()
plt.plot(x, y, color = 'black')
plt.title("lineplot matplotlib")
plt.xlabel("x")
plt.ylabel("y")
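plt.show()  # required in PyCharm; without it the figure does not appear
3-3-5 Drawing a histogram with seaborn
seaborn's functions are powerful and sometimes add labels automatically.
sns.distplot() draws the histogram: bins = 5 splits the data into 5 groups and counts the frequency in each, and kde = False turns off kernel density estimation. Note that distplot has been deprecated since seaborn 0.11 in favor of histplot and displot.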
import numpy as np
import scipy as sp
import seaborn as sns
import pandas as pd
# Plotting library
from matplotlib import pyplot as plt
fish_data = np.array([2,3,3,4,4,4,4,5,5,6])
print(fish_data)#[2 3 3 4 4 4 4 5 5 6]
sns.distplot(fish_data, bins = 5,
color='black',kde =False)
plt.show()
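To see what "split into 5 groups and count each group" means numerically, the counts can be reproduced without drawing anything. The sketch below uses np.histogram, which performs equal-width binning; treating it as equivalent to what the plot shows is an assumption based on the data and bins = 5 used above, not something the original notes state.

```python
import numpy as np

fish_data = np.array([2, 3, 3, 4, 4, 4, 4, 5, 5, 6])

# Five equal-width bins between the minimum (2) and the maximum (6)
counts, edges = np.histogram(fish_data, bins=5)

# Each bin is half-open [left, right), except the last one, which includes its right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count}")
```

The counts come out as 1, 2, 4, 2, 1, which are the bar heights of the histogram.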
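3-3-6 Smoothing the histogram with kernel density estimation
Kernel density estimation addresses the problem that the shape of a histogram changes drastically with the bin width. With bins = 1, for example, the histogram shows nothing about the features of the data.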
sns.distplot(fish_data, bins = 1,
color='black',kde =False)
plt.show()
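Dropping the kde and bins arguments adds a smooth curve, the kernel density estimate. The histogram is then scaled so that its total area equals 1.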
fish_data = np.array([2,3,3,4,4,4,4,5,5,6])
# print(fish_data)#[2 3 3 4 4 4 4 5 5 6]
sns.distplot(fish_data,color='black')
plt.show()
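The idea behind the smooth curve can be sketched by hand: place a small Gaussian bump on every observation and average the bumps. The function name gaussian_kde_sketch and the bandwidth choice (Scott's rule of thumb) are assumptions made for this illustration; seaborn's own estimate may use different settings, so the curve need not match the plot exactly.

```python
import numpy as np

fish_data = np.array([2, 3, 3, 4, 4, 4, 4, 5, 5, 6], dtype=float)


def gaussian_kde_sketch(data, grid, bandwidth):
    """Average of Gaussian bumps centred on each observation (hypothetical helper)."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernels.sum(axis=1) / (len(data) * bandwidth)


# Assumption: Scott's rule of thumb for the bandwidth
bandwidth = fish_data.std(ddof=1) * len(fish_data) ** (-1 / 5)

# Evaluate the density on a grid wide enough to cover the tails
grid = np.linspace(fish_data.min() - 5 * bandwidth,
                   fish_data.max() + 5 * bandwidth, 1001)
density = gaussian_kde_sketch(fish_data, grid, bandwidth)

# Area under the curve by the trapezoid rule
kde_area = float(np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid)))

# A density-scaled histogram also has total area 1
hist_density, edges = np.histogram(fish_data, bins=5, density=True)
hist_area = float(np.sum(hist_density * np.diff(edges)))

print(round(kde_area, 4), round(hist_area, 4))
```

Both areas are 1 up to numerical error, which is why the curve and the rescaled histogram can share one vertical axis.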
3-3-7 Histograms of two variables
Histograms of several variables can be drawn in the same figure.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
sns.set()
path = "D:\\【源码】用Python动手学统计学\\pystat-code-2021-01-25\\3-3-2-fish_multi_2.csv"
fish_multi = pd.read_csv(path)
# print(fish_multi)
# species length
# 0 A 2
# 1 A 3
# 2 A 3
# 3 A 4
# 4 A 4
# 5 A 4
# 6 A 4
# 7 A 5
# 8 A 5
# 9 A 6
# 10 B 5
# 11 B 6
# 12 B 6
# 13 B 7
# 14 B 7
# 15 B 7
# 16 B 7
# 17 B 8
# 18 B 8
# 19 B 9
print(fish_multi.groupby("species").describe())
# length
# count mean std min 25% 50% 75% max
# species
# A 10.0 4.0 1.154701 2.0 3.25 4.0 4.75 6.0
# B 10.0 7.0 1.154701 5.0 6.25 7.0 7.75 9.0
length_a = fish_multi.query('species == "A"')["length"]
length_b = fish_multi.query('species == "B"')["length"]
sns.distplot(length_a, bins=5,
             color='black', kde=False)
# Use a different color for species B so the two histograms can be told apart
sns.distplot(length_b, bins=5,
             color='gray', kde=False)
plt.show()
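The CSV path above is specific to the author's machine. As a sketch, the same data frame can be rebuilt from the 20 rows printed in the comments, which makes the later examples runnable without the file. Building it inline is an assumption made here for convenience; the original notes load it with pd.read_csv.

```python
import pandas as pd

# Rows copied from the printed output of fish_multi
fish_multi = pd.DataFrame({
    "species": ["A"] * 10 + ["B"] * 10,
    "length": [2, 3, 3, 4, 4, 4, 4, 5, 5, 6,
               5, 6, 6, 7, 7, 7, 7, 8, 8, 9],
})

# Per-species summary, a subset of what describe() reports
summary = fish_multi.groupby("species")["length"].agg(["count", "mean", "std"])
print(summary)

# Same selection as in the notes
length_a = fish_multi.query('species == "A"')["length"]
length_b = fish_multi.query('species == "B"')["length"]
```

The summary should agree with the describe() output shown above: both species have 10 rows and the same standard deviation, with means of 4 and 7.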
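3-3-8 Code for visualizing multiple variables
A histogram is mainly a single-variable plot, so strictly each variable should get its own figure, which quickly becomes tedious. seaborn's multi-variable plotting functions share a common calling pattern:
sns.function_name(
    x = "name of the column for the x axis",
    y = "name of the column for the y axis",
    data = data_frame,
    other arguments
)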
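A concrete instance of this calling pattern might look like the sketch below. It assumes the fish data is rebuilt inline rather than read from the CSV, and it selects the non-interactive "Agg" backend so that it runs without a display; neither choice comes from the original notes.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import pandas as pd
import seaborn as sns

# Same values as the fish_multi CSV printed earlier
fish_multi = pd.DataFrame({
    "species": ["A"] * 10 + ["B"] * 10,
    "length": [2, 3, 3, 4, 4, 4, 4, 5, 5, 6,
               5, 6, 6, 7, 7, 7, 7, 8, 8, 9],
})

# x and y are column names; data is the data frame that holds them
ax = sns.boxplot(x="species", y="length", data=fish_multi, color="gray")

# seaborn labels the axes with the column names
xlabel, ylabel = ax.get_xlabel(), ax.get_ylabel()
print(xlabel, ylabel)
```

Swapping sns.boxplot for another function that accepts x, y and data, such as sns.violinplot or sns.barplot, leaves the rest of the call unchanged.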
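3-3-9 Box plot
This data mixes a categorical variable with a quantitative variable. Such data is often shown with a box plot, also called a box-and-whisker plot.
Read the plot alongside the output of describe().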
sns.boxplot(x = "species",y = "length",
data = fish_multi,color='gray')
plt.show()
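To connect the box plot with describe(), the numbers behind the box for species A can be computed directly. This sketch assumes the common 1.5 × IQR whisker rule, which is the default in seaborn and matplotlib; the variable names are illustrative.

```python
import numpy as np

# Lengths of species A, copied from the data printed earlier
length_a = np.array([2, 3, 3, 4, 4, 4, 4, 5, 5, 6])

# Box: first quartile, median, third quartile (the 25%, 50%, 75% columns of describe())
q1, median, q3 = np.percentile(length_a, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points within 1.5 * IQR of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
whisker_low = length_a[length_a >= lower_fence].min()
whisker_high = length_a[length_a <= upper_fence].max()

# Anything beyond the fences would be drawn as an individual outlier point
outliers = length_a[(length_a < lower_fence) | (length_a > upper_fence)]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```

For this data the box spans 3.25 to 4.75 with the median at 4, the whiskers end at the minimum 2 and the maximum 6, and no point is flagged as an outlier.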
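3-3-10 Violin plot
A violin plot resembles a box plot, but replaces the box with the result of kernel density estimation; the smooth outline is the estimated density. It also shows where the frequency peaks, so it works as a combination of a histogram and a box plot.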
sns.violinplot(x = "species",y = "length",
data = fish_multi,color='gray')
plt.show()
# length
# count mean std min 25% 50% 75% max
# species
# A 10.0 4.0 1.154701 2.0 3.25 4.0 4.75 6.0
# B 10.0 7.0 1.154701 5.0 6.25 7.0 7.75 9.0
3-3-11 Bar plot
sns.barplot(x = "species",y = "length",
data = fish_multi,color='gray')
plt.show()
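The height of each bar is the mean. The black line is called an error bar and represents the confidence interval (95% by default in seaborn).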
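The bar height and an error bar of this kind can be approximated by hand with a percentile bootstrap. This is a rough sketch: the seed, the number of resamples and the variable names are assumptions, and seaborn's own routine may differ in detail, so the interval will not match the plot digit for digit.

```python
import numpy as np

# Lengths of species A, copied from the data printed earlier
length_a = np.array([2, 3, 3, 4, 4, 4, 4, 5, 5, 6])

# Bar height: the sample mean
bar_height = length_a.mean()

# Percentile bootstrap: resample with replacement and collect the means
rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n_boot = 1000
resamples = rng.choice(length_a, size=(n_boot, length_a.size), replace=True)
boot_means = resamples.mean(axis=1)

# Middle 95% of the bootstrap means gives the ends of the error bar
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(bar_height, ci_low, ci_high)
```

The interval brackets the mean of 4; a wider interval signals more uncertainty about the mean.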
3-3-12 Scatter plot
path = "D:\\【源码】用Python动手学统计学\\pystat-code-2021-01-25\\3-2-3-cov.csv"
cov_data = pd.read_csv(path)
print(cov_data)
# x y
# 0 18.5 34
# 1 18.7 39
# 2 19.1 41
# 3 19.7 38
# 4 21.5 45
# 5 21.7 41
# 6 21.8 52
# 7 22.0 44
# 8 23.4 44
# 9 23.8 49
sns.jointplot(x = "x",y = "y",
data = cov_data,color='black')
plt.show()
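pearsonr is the correlation coefficient and p is the result of the hypothesis test. The correlation coefficient is 0.76, and the points trend toward the upper right overall. Older seaborn versions print these two values on the figure; recent versions no longer annotate the joint plot.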
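Since newer seaborn versions may not display it, the correlation coefficient of 0.76 quoted above can be checked directly from the printed data. The sketch uses np.corrcoef on values copied from the output of print(cov_data); it covers only the coefficient, not the p-value.

```python
import numpy as np

# Values copied from the printed cov_data
x = np.array([18.5, 18.7, 19.1, 19.7, 21.5, 21.7, 21.8, 22.0, 23.4, 23.8])
y = np.array([34, 39, 41, 38, 45, 41, 52, 44, 44, 49])

# Pearson correlation: the off-diagonal entry of the 2 x 2 correlation matrix
r = float(np.corrcoef(x, y)[0, 1])
print(round(r, 2))
```

A positive value close to 1 matches the upward-sloping cloud of points in the scatter plot.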
3-3-13 Scatter plot matrix
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
sns.set()
iris = sns.load_dataset("iris")
print(iris.head(n = 3))
# sepal_length sepal_width petal_length petal_width species
# 0 5.1 3.5 1.4 0.2 setosa
# 1 4.9 3.0 1.4 0.2 setosa
# 2 4.7 3.2 1.3 0.2 setosa
print(iris.groupby("species").mean())
# sepal_length sepal_width petal_length petal_width
# species
# setosa 5.006 3.428 1.462 0.246
# versicolor 5.936 2.770 4.260 1.326
# virginica 6.588 2.974 5.552 2.026
sns.pairplot(iris,hue="species",palette='gray')
plt.show()  # required in PyCharm; without it the figure does not appear
References
Shinya Baba (马场真哉, Japan), translated by Wu Haotian (吴昊天). 用Python动手学统计学 [M]. 1st ed. Posts & Telecom Press (人民邮电出版社), 2021-06-01.
Runoob (菜鸟) website, Python 3.