pandas.DataFrame.corr & scatter_matrix: computing the correlation coefficients between attributes

qq_40949544

License: CC 4.0 BY-SA

Category: python

Article link: https://blog.youkuaiyun.com/qq_40949544/article/details/88224799

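This post explores a housing-price dataset with pandas. It computes the correlations between attributes with DataFrame.corr, then uses matplotlib and the scatter_matrix function to draw a matrix of pairwise scatter plots, showing how the median house value relates to key factors such as median income and total number of rooms.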

Official documentation

Reference blog post

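# housing is assumed to be a DataFrame that has already been loaded.
# corr() computes the pairwise Pearson correlation coefficients.
# numeric_only=True (pandas >= 1.5) skips non-numeric columns; pandas >= 2.0 raises an error on them otherwise.
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix)

# Rank every attribute by its correlation with the median house value.
print(corr_matrix["median_house_value"].sort_values(ascending=False))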
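The two calls above can be tried without the housing data. The following is a minimal sketch on a synthetic DataFrame: the column values, the `region` text column, and the name `demo` are all made up for illustration and are not part of the original dataset. It assumes pandas 1.5 or later, where `corr` accepts `numeric_only`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (hypothetical values, not the real dataset).
rng = np.random.default_rng(0)
n = 200
income = rng.uniform(1.0, 10.0, size=n)

demo = pd.DataFrame({
    "median_income": income,
    # House value is constructed to rise with income, plus noise.
    "median_house_value": 40_000 * income + rng.normal(0, 20_000, size=n),
    # Generated independently of the other columns.
    "housing_median_age": rng.integers(1, 52, size=n).astype(float),
    # A text column, to show what numeric_only=True leaves out.
    "region": rng.choice(["north", "south"], size=n),
})

# Pairwise Pearson correlations over the numeric columns only.
corr_matrix = demo.corr(numeric_only=True)

# Sort the target column so the strongest positive correlations come first.
ranked = corr_matrix["median_house_value"].sort_values(ascending=False)
print(ranked)
```

The target column always correlates perfectly with itself, so it heads the ranking; the attributes worth a closer look are the ones right below it. Pearson correlation only measures linear association, so a value near zero does not rule out a non-linear relationship.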

pandas.plotting.scatter_matrix official documentation

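import matplotlib.pyplot as plt
# scatter_matrix lives in pandas.plotting; the old pandas.tools.plotting module has been removed.
from pandas.plotting import scatter_matrix

# Plot each of these attributes against every other one.
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
plt.show()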
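As a self-contained illustration of the same call, here is a sketch that draws a scatter matrix for synthetic data. The values, the name `demo`, and the choice of the `Agg` backend (so the script also runs without a display) are assumptions made for this example; `alpha` and `diagonal` are optional parameters of `scatter_matrix`, not settings used in the original post.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in for three housing attributes (hypothetical values).
rng = np.random.default_rng(1)
n = 300
income = rng.uniform(1.0, 10.0, size=n)

demo = pd.DataFrame({
    "median_house_value": 40_000 * income + rng.normal(0, 20_000, size=n),
    "median_income": income,
    "total_rooms": rng.uniform(500, 5000, size=n),
})

# One scatter plot per pair of columns; the diagonal shows each column's histogram.
axes = scatter_matrix(demo, figsize=(9, 6), alpha=0.3, diagonal="hist")

# The function returns a 2-D array of matplotlib Axes, one per cell of the grid.
print(axes.shape)
plt.close("all")
```

In the resulting grid, a strongly correlated pair such as value and income shows up as a narrow upward band, while an unrelated pair looks like a shapeless cloud.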
