数据分析之----数值模型描述统计

最新推荐文章于 2024-05-10 08:34:51 发布

原创

最新推荐文章于 2024-05-10 08:34:51 发布 · 1.2k 阅读

1 ·

CC 4.0 BY-SA版权

处理普通文本

读取文本：read_csv() read_table()

方法参数	参数解释
filepath_or_buffer	文件路径
sep	列之间的分隔符。read_csv()默认为为’,’, read_table()默认为’\t’
header	默认将首行设为列名。`header=None`时应手动给出列名。
names	`header=None`时设置此字段使用列表初始化列名。
index_col	将某一列作为行级索引。若使用列表，则设置复合索引。
usecols	选择读取文件中的某些列。设置为为相应列的索引列表。
skiprows	跳过行。可选择跳过前n行或给出跳过的行索引列表。
encoding	编码。

写入文本：dataFrame.to_csv()

方法参数	参数解释
filepath_or_buffer	文件路径
sep	列之间的分隔符。默认为’,’
na_rep	写入文件时dataFrame中缺失值的内容。默认空字符串。
columns	定义需要写入文件的列。
header	是否需要写入表头。默认为True。
index	会否需要写入行索引。默认为True。
encoding	编码。

案例：读取电信数据集。

pd.read_csv('../data/CustomerSurvival.csv', header=None, index_col=0)

处理JSON

读取json：read_json()

方法参数	参数解释
filepath_or_buffer	文件路径
encoding	编码。

案例：读取电影评分数据：

pd.read_json('../data/ratings.json')

写入json：to_json()

方法参数	参数解释
filepath_or_buffer	文件路径；若设置为None，则返回json字符串
orient	设置面向输出格式：[‘records’, ‘index’, ‘columns’, ‘values’]

案例：

data = {
   
   'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['s1','s2','s3','s4'])
df.to_json(orient='records')

其他文件读取方法参见：https://www.pypandas.cn/docs/user_guide/io.html

数值型描述统计（统计学）

算数平均值

$S = [s_1, s_2, …, s_n] $

样本中的每个值都是真值与误差的和。

$\frac{(s_1 + s_2 + ... + s_n) }{n}$

算数平均值表示对真值的无偏估计。

m = np.mean(array)
m = array.mean()
m = df.mean(axis=0)

案例：针对电影评分数据做均值分析：

mean = ratings['John Carson'].mean()
mean = np.mean(ratings['John Carson'])
means = ratings.mean(axis=1)

加权平均值

求平均值时，考虑不同样本的重要性，可以为不同的样本赋予不同的权重。

样本： $S = [s_1, s_2, s_3 ... s_n]$

权重： $W =[w_1, w_2, w_3 ... w_n]$

加权平均值：
$\frac{s_1w_1 + s_2w_2 + ... + s_nw_n}{w_1+w_2+...+w_n}$
代码实现：

a = np.average(array, weights=volumes)

案例：自定义权重，求加权平均。

# 加权均值
w = np.array([3,1,1,1,1,1,1])
np.average(ratings

最低0.47元/天解锁文章

200万优质内容无限畅学