Python Visualization: Data Visualization

Article link: https://blog.youkuaiyun.com/yu0046/article/details/123744141

Data Visualization

libraries to plot:

matplotlib (plt)

pandas .plot()

seaborn (sns)

bar plot

rug plots

(easy to see distributions)

histograms

a bar plot in which the bar areas represent proportions

default: shows counts on the y-axis.

‘density=True’: rescales the bars so that their total area sums to 1

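To decide the bin width (and from it the number of bins), use the Freedman-Diaconis rule:
$h = 2\,\frac{\mathrm{IQR}(x)}{\sqrt[3]{n}}$

where $n$ is the number of observations; the number of bins is then $\lceil (\max x - \min x) / h \rceil$.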

IQR: the interquartile range, a descriptive statistic that measures the gap between the third quartile and the first quartile.

IQR = third quartile - first quartile

The bin edges can be generated with np.linspace and passed to the bins argument.
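
As a rough illustration of the bin rule and of ‘density=True’, here is a minimal sketch that computes Freedman-Diaconis bin edges with np.linspace and checks that the density histogram has total area 1. The function name `fd_bin_edges` and the sample data are made up for this example.

```python
import numpy as np

def fd_bin_edges(x):
    """Evenly spaced bin edges based on the Freedman-Diaconis width 2 * IQR / n^(1/3)."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    width = 2 * (q3 - q1) / np.cbrt(x.size)
    if width <= 0:
        # Degenerate case (IQR of zero): fall back to a single bin
        return np.linspace(x.min(), x.max(), 2)
    n_bins = max(1, int(np.ceil((x.max() - x.min()) / width)))
    # n_bins bins need n_bins + 1 edges; rounding up makes the real width slightly smaller
    return np.linspace(x.min(), x.max(), n_bins + 1)

rng = np.random.default_rng(0)
sample = rng.normal(loc=120, scale=15, size=1000)  # made-up data, not from the notes

edges = fd_bin_edges(sample)
heights, _ = np.histogram(sample, bins=edges, density=True)
total_area = np.sum(heights * np.diff(edges))  # bar height times bar width, summed
```

The same `edges` array can be passed as `bins=edges` to a plotting call. NumPy can also apply this rule directly with `np.histogram(sample, bins='fd')`.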

Density curves

import seaborn as sns
sns.displot(bweights, kind='kde')  # figure-level interface
sns.kdeplot(bweights)              # axes-level interface, draws the same curve

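KDE: kernel density estimation

The density estimation problem
Recovering the probability density function of a random variable from a given set of samples is one of the basic problems of probability and statistics. Approaches to it include parametric estimation and nonparametric estimation.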

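Parametric estimation
Parametric estimation can be further divided into parametric regression analysis and parametric discriminant analysis. In parametric regression analysis, one assumes that the data follow some specific form, such as linear, linearizable, or exponential, and then searches the target family of functions for a particular solution, i.e. determines the unknown parameters of the regression model. In parametric discriminant analysis, one must assume that the randomly valued data samples used as the basis for discrimination follow a specific distribution within each possible class. Experience and theory show that these basic assumptions of parametric models often differ substantially from the actual physical model, so these methods do not always give satisfactory results.

Parametric estimation: maximum likelihood estimation (MLE)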
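
To make the idea of parametric estimation concrete, here is a minimal sketch of MLE under the assumption that the data come from a normal distribution. In that case the estimates have a closed form: the sample mean, and the standard deviation computed with a divisor of n. The function name `normal_mle` and the numbers are made up for illustration.

```python
import math
import statistics

def normal_mle(samples):
    """Maximum likelihood estimates of a normal distribution's mean and standard deviation."""
    mu = statistics.fmean(samples)
    # The MLE of the variance divides by n (not n - 1)
    variance = sum((x - mu) ** 2 for x in samples) / len(samples)
    return mu, math.sqrt(variance)

mu_hat, sigma_hat = normal_mle([2, 4, 4, 4, 5, 5, 7, 9])
```

If the normality assumption is wrong, these two numbers can describe the data poorly, which is the weakness of parametric models noted above.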

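Nonparametric estimation methods
Because of these shortcomings, Rosenblatt and Parzen proposed a nonparametric approach, kernel density estimation. It uses no prior knowledge about the data distribution and imposes no assumptions on it, studying the distribution's characteristics from the data sample itself, and has therefore received a great deal of attention in both statistical theory and applications.

Kernel density estimation is used in probability theory to estimate an unknown density function. It is a nonparametric estimation method, proposed by Rosenblatt (1956) and Emanuel Parzen (1962), and is also known as the Parzen window. Ruppert and Cline proposed a modified kernel density estimator based on a clustering algorithm for the data set's density function.
Original source: https://blog.youkuaiyun.com/pipisorry/article/details/53635895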
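
The following is a minimal sketch of a kernel density estimate written from scratch with NumPy, assuming a Gaussian kernel and a hand-picked bandwidth. The function name `kde_gaussian`, the bandwidth of 5.0, and the sample data are assumptions made for this example; seaborn chooses its bandwidth automatically and does not require any of this code.

```python
import numpy as np

def kde_gaussian(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One row per grid point, one column per sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (samples.size * bandwidth)

rng = np.random.default_rng(1)
weights = rng.normal(120, 15, size=500)  # made-up stand-in for a variable like bweights

grid = np.linspace(weights.min() - 60, weights.max() + 60, 2000)
density = kde_gaussian(weights, grid, bandwidth=5.0)
area = np.sum(density) * (grid[1] - grid[0])  # Riemann sum, should be close to 1
```

A smaller bandwidth gives a bumpier curve that follows individual points, and a larger one gives a smoother curve that can hide modes.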

Describing Distributions:

terminology

modes:

local or global maxima of the distribution (unimodal, bimodal, multimodal)

skewness:

skewed right, skewed left, symmetric.
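
Skewness can also be checked numerically. Below is a small sketch of the moment-based skewness coefficient (mean cubed deviation divided by the cubed standard deviation): positive for a right-skewed sample, negative for a left-skewed one, and zero for a symmetric one. The function name and the toy data are made up for illustration.

```python
import statistics

def sample_skewness(values):
    """Moment-based skewness: third central moment over the 3/2 power of the second."""
    mu = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

right = [1, 2, 2, 3, 3, 3, 20]   # long tail on the right
left = [-v for v in right]       # mirror image: long tail on the left
symmetric = [1, 2, 3, 4, 5]

skew_right = sample_skewness(right)
skew_left = sample_skewness(left)
skew_sym = sample_skewness(symmetric)
```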

tails

outliers

box plot and violin plots
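
As a sketch of what a box plot encodes, the snippet below computes the quartiles, the whisker ends, and the points flagged as outliers. It assumes the common 1.5 × IQR fence convention (the default in matplotlib); the function name `box_plot_stats` and the data are made up for this example.

```python
import numpy as np

def box_plot_stats(x, whis=1.5):
    """Quartiles, whisker ends, and outliers using fences at whis * IQR beyond the box."""
    x = np.asarray(x, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = x[(x >= low_fence) & (x <= high_fence)]
    outliers = x[(x < low_fence) | (x > high_fence)]
    return {
        "q1": q1, "median": median, "q3": q3,
        # Whiskers stop at the most extreme points that are still inside the fences
        "whisker_low": inside.min(), "whisker_high": inside.max(),
        "outliers": np.sort(outliers),
    }

summary = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
```

In this toy sample the value 50 lies beyond the upper fence, so it would be drawn as a separate point rather than as part of the whisker.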


Comparing Quantitative Distributions

overlaid histograms and density curves

side by side box plots and violin plots

Relationships between two quantitative variables

scatter plots:

derive relationship between pairs of numerical variables.

sns.lmplot(data=births, x='x', y='y', ci=None)  # scatter plot with a fitted regression line, no confidence band
sns.jointplot(data=births, x='x', y='y')        # scatter plot with marginal histograms
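
The line that a regression plot draws over a scatter plot is a least-squares fit. A minimal sketch of that fit with NumPy, on made-up data (the true slope of 3 and intercept of 1 are assumptions chosen for the example), might look like this:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 1.0 + rng.normal(0, 0.5, size=200)  # made-up linear relationship plus noise

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares straight line
r = np.corrcoef(x, y)[0, 1]                 # Pearson correlation coefficient
```

The fitted `slope` and `intercept` summarize the direction and steepness of the relationship, while `r` indicates how tightly the points cluster around the line.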

hex plots:

two-dimensional version of a histogram

sns.jointplot(data = births, x = 'x', y = 'y', kind = 'hex')
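
To see the binning idea in numbers, here is a rough sketch using np.histogram2d. Note the swap: it bins into rectangles rather than hexagons, so it is only an analogue of a hex plot, and the data are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0, 1, size=5000)
y = 0.5 * x + rng.normal(0, 1, size=5000)  # made-up correlated data

# density=True rescales the counts so that the binned surface integrates to 1
density, x_edges, y_edges = np.histogram2d(x, y, bins=20, density=True)
cell_area = np.outer(np.diff(x_edges), np.diff(y_edges))
total = np.sum(density * cell_area)
```

Each cell of `density` plays the role of one hexagon's colour: darker cells hold more points per unit area.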

contour plots:

two-dimensional version of density curve

sns.jointplot(data = births, x = 'x', y = 'y', kind = 'kde', fill = True)

Interactive visualization

D3: https://d3js.org/

Tableau: https://www.tableau.com/

Mpld3: https://mpld3.github.io/

Bokeh: https://bokeh.org/

Plotly: https://plotly.com/

Course: 6.859 : Interactive Data Visualization

http://vis.mit.edu/classes/6.859/

Network Data Analysis and visualization

terminology:

Centrality

a measure of nodes’ importance in the network

degree centrality
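
A minimal sketch of degree centrality on a small undirected graph stored as an adjacency dictionary. It assumes the common normalization of dividing each degree by n - 1 (the largest possible degree); the graph and function name are made up for illustration.

```python
def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}

# A star: A is linked to everyone, the others only to A
star = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A"},
}
degree_scores = degree_centrality(star)
```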


betweenness centrality

captures being part of shortest paths


To get a large number of reposts in a short time, pick nodes with high betweenness.
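
One way to compute betweenness is Brandes' algorithm. The sketch below is a plain-Python version for an undirected, unweighted graph and returns unnormalized scores (the number of shortest paths between other node pairs that pass through each node, with ties shared). The function name and the toy graph are made up for this example.

```python
from collections import deque

def betweenness_centrality(graph):
    """Unnormalized betweenness for an undirected, unweighted graph (Brandes' algorithm)."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        # BFS from source: count shortest paths (sigma) and record predecessors
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, passing each node's share to its predecessors
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every unordered pair was counted once from each endpoint
    return {v: s / 2 for v, s in score.items()}

# A chain A - B - C - D: only the two middle nodes sit on paths between others
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
between_scores = betweenness_centrality(chain)
```

In the chain, B and C each carry two shortest paths between other pairs, while the end nodes carry none.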

Closeness centrality

captures being close to all other nodes
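
A minimal sketch of closeness centrality for an unweighted graph, assuming the usual definition of (number of reachable nodes) divided by (sum of shortest-path distances to them). Distances come from a breadth-first search; the function name and the toy graph are made up for illustration.

```python
from collections import deque

def closeness_centrality(graph, node):
    """Reachable-node count divided by the total shortest-path distance from `node`."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())
    # An isolated node has no distances to sum, so give it a score of 0
    return (len(dist) - 1) / total if total > 0 else 0.0

chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
close_end = closeness_centrality(chain, "A")     # distances 1, 2, 3
close_middle = closeness_centrality(chain, "B")  # distances 1, 1, 2
```

The middle node scores higher because it is, on average, fewer steps away from everyone else.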


Community detection

many network visualization libraries:

Gephi (written in Java)

gephi.org

igraph

graph-tool (cannot be installed natively on Windows; requires Linux or macOS)

cytoscape

shp format: the shapefile format

other geospatial file formats: GeoJSON, KML, GPKG

usually used to store geographic data such as maps.

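import geopandas as gpd
import pandas as pd

# If the file is already geospatial (shapefile, GeoJSON, ...), read it directly
world = gpd.read_file(path)

# If the file is a CSV, load it with pandas, then build point geometries from the longitude/latitude columns
birds_df = pd.read_csv("../input/geospatial-learn-course-data/purple_martin.csv", parse_dates=['timestamp'])
birds = gpd.GeoDataFrame(birds_df, geometry=gpd.points_from_xy(birds_df["location-long"], birds_df["location-lat"]))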