Probability And Statistics In Python: Distributions And Sampling

mmい

License: CC 4.0 BY-SA

Category: Data Mining (dataquest)

Article link: https://blog.youkuaiyun.com/zm714981790/article/details/51227968

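This post analyzes a dataset of income by US county, looks at median income at different education levels, and uses random sampling to examine the distribution of the data and the effect of education on income.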

Dataset

The dataset used in this post is US income data by county. Its columns are:

id – the county id.
county – the name and state of the county.
pop_over_25 – the number of adults over age 25.
median_income – the median income for residents over age 25 in the county.
median_income_no_hs – median income for residents without a high school education.
median_income_hs – median income for high school graduates who didn’t go to college.
median_income_some_college – median income for residents who went to college but didn’t graduate.
median_income_college – median income for college graduates.
median_income_graduate_degree – median income for those with a masters or other graduate degree.

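Compute two simple quantities: the county with the lowest median income, and the county with the lowest median income among counties with more than 500,000 residents over age 25. (This filter-then-idxmin pattern is simple but very useful, so it is worth learning.)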

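# `income` is assumed to be a pandas DataFrame already loaded from the county income data.
# idxmin() returns the index label of the row with the smallest median income.
lowest_income_county = income["county"][income["median_income"].idxmin()]
# Keep only counties with more than 500,000 residents over age 25, then repeat the lookup.
high_pop = income[income["pop_over_25"] > 500000]
lowest_income_high_pop_county = high_pop["county"][high_pop["median_income"].idxmin()]
'''
lowest_income_county: 'Starr County, Texas'
lowest_income_high_pop_county: 'Miami-Dade County, Florida'
'''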
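
To make the pattern above easier to try in isolation, here is a minimal sketch on a made-up four-row table. The DataFrame `toy` and its values are invented for illustration and are not taken from the real dataset; only the column names match the ones listed earlier.

```python
import pandas as pd

# Toy data (made up for illustration; not the real county dataset).
toy = pd.DataFrame({
    "county": ["A County", "B County", "C County", "D County"],
    "pop_over_25": [600000, 1200, 750000, 40000],
    "median_income": [31000, 18000, 27000, 22000],
})

# idxmin() returns the index label of the smallest value; .loc looks that label up.
lowest = toy.loc[toy["median_income"].idxmin(), "county"]

# The boolean mask keeps only the large counties. The original index labels
# are preserved, so the label returned by idxmin() is still valid for .loc.
large = toy[toy["pop_over_25"] > 500000]
lowest_large = large.loc[large["median_income"].idxmin(), "county"]

print(lowest)        # B County
print(lowest_large)  # C County
```

Using `.loc[label, column]` is equivalent here to the chained `frame["county"][label]` lookup, and it makes it explicit that the lookup is by index label rather than by position.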

Random Numbers

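At first I was unsure what the `_` in `[random.randint(0, 10) for _ in range(10)]` means. It is just a conventional name for a loop variable whose value is never used: only the number of iterations matters.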

import random

# Returns a random integer between the numbers 0 and 10, inclusive.
num = random.randint(0, 10)

# Generate a sequence of 10 random numbers between the values of 0 and 10.
random_sequence = [random.randint(0, 10) for _ in range(10)]

# Sometimes, when we generate a random sequence, we want it to be the same sequence whenever the program is run.
# An example is when you use random numbers to select a subset of the data, and you want other people
# looking at the same data to get the same subset.
# We can ensure this by setting a random seed.
# A random seed is an integer that is used to "seed" a random number generator.
# After a random seed is set, the numbers generated after will follow the same sequence.
random.seed(10)
print([random.randint(0,10) for _ in range(5)])
random.seed(10)
# Same sequence as above.
print([random.randint(0,10) for _ in range(5)])
random.seed(11)
# Different seed means different sequence.
print([random.randint(0,10) for _ in range(5)])
random.seed(20)
new_sequence = [random.randint(0, 10) for _ in range(10)]
'''
[9, 0, 6, 7, 9]
[9, 0, 6, 7, 9]
[7, 8, 7, 7, 8]
'''
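
As a complement to the seeding example above, the sketch below shows the same idea with a private generator object. The helper name `draw` is hypothetical; `random.Random` is the standard-library class behind the module-level functions. The exact numbers produced can differ between Python versions, so none are shown here.

```python
import random

def draw(seed, n=5):
    # A private generator: seeding it does not touch the module-level random state.
    rng = random.Random(seed)
    # `_` is a throwaway name: the loop value is never used.
    return [rng.randint(0, 10) for _ in range(n)]

first = draw(10)
second = draw(10)
other = draw(11)

print(first == second)  # True: the same seed always gives the same sequence
print(first, other)     # a different seed generally gives a different sequence
```

A private generator is handy when several parts of a program need reproducible randomness without resetting each other's state.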

random.sample (random sampling)

# Let's say that we have some data on how much shoppers spend in a store.
shopping = [300, 200, 100, 600, 20]

# We want to sample the data, and only select 4 elements.
random.seed(1)
shopping_sample = random.sample(shopping, 4)

# 4 random items from the shopping list.
print(shopping_sample)
'''
[200, 300, 20, 600]
'''
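
One property worth spelling out is that `random.sample` draws without replacement, so no position in the list is picked twice. The sketch below contrasts it with `random.choices`, which draws with replacement. The variable names are illustrative, and the printed values depend on the Python version.

```python
import random

rng = random.Random(1)
shopping = [300, 200, 100, 600, 20]

# Without replacement: 4 distinct positions from the list.
without_replacement = rng.sample(shopping, 4)

# With replacement: the same element may be drawn more than once.
with_replacement = rng.choices(shopping, k=4)

print(without_replacement)
print(with_replacement)

# Asking for more items than the list holds is an error for sample().
try:
    rng.sample(shopping, 6)
except ValueError as exc:
    print("cannot sample 6 items from 5:", exc)
```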

Set the seed to 1, then generate a medium sample of 100 die rolls and a large sample of 10,000 die rolls, and plot how many times each face comes up.

import matplotlib.pyplot as plt

# A function that returns the result of a die roll.
def roll():
    return random.randint(1, 6)

random.seed(1)
medium_sample = [roll() for _ in range(100)]

plt.hist(medium_sample, 6)
plt.show()

random.seed(1)
large_sample = [roll() for _ in range(10000)]

plt.hist(large_sample, 6)
plt.show()

[Figure: histograms of the 100-roll and the 10,000-roll samples]

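The larger the sample, the closer the observed frequencies of the six faces are to one another. In other words, as the sample grows, the observed frequencies approach the true probability of 1/6 per face.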
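
The same observation can be checked numerically instead of by eye. The following is a rough sketch, not part of the original analysis: the helpers `face_frequencies` and `max_deviation` are hypothetical names, and the exact deviations depend on the seed and Python version.

```python
import random
from collections import Counter

def face_frequencies(n_rolls, seed=1):
    # Roll a fair six-sided die n_rolls times and return the share of each face.
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 7)}

def max_deviation(freqs):
    # Largest gap between an observed frequency and the true probability 1/6.
    return max(abs(f - 1 / 6) for f in freqs.values())

for n in (100, 10000):
    freqs = face_frequencies(n)
    print(n, round(max_deviation(freqs), 4))
```

With 10,000 rolls the largest deviation is typically well under 0.02, while with 100 rolls deviations of several percentage points are common.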

Randomness of Sampling

The first block of code below samples sequentially; the second samples at random.

def get_sample_mean(start, end):  # mean of median_income over rows start..end-1
    return income["median_income"][start:end].mean()

def find_mean_incomes(row_step):  # sequential sampling: one mean per block of row_step consecutive rows
    mean_median_sample_incomes = []
    # Iterate over the indices of the income rows
    # Starting at 0, and counting in blocks of row_step (0, row_step, row_step * 2, etc).
    for i in range(0, income.shape[0], row_step):
        # Find the mean median for the row_step counties from i to i+row_step.
        mean_median_sample_incomes.append(get_sample_mean(i, i+row_step))
    return mean_median_sample_incomes  # about income.shape[0] / row_step means (32 for 3,143 rows and a step of 100)

nonrandom_sample = find_mean_incomes(100)  # step of 100: each mean covers 100 consecutive rows
plt.hist(nonrandom_sample, 20)
plt.show()

import random
def select_random_sample(count):
    random_indices = random.sample(range(0, income.shape[0]), count)
    return income.iloc[random_indices]

random.seed(1)
# Draw 1,000 random samples of 100 rows each and take the mean of each, giving 1,000 means.
random_sample = [select_random_sample(100)["median_income"].mean() for _ in range(1000)]
plt.hist(random_sample, 20)
plt.show()

[Figure: histograms of the sequential block means and of the random-sample means]

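The first histogram does not represent the distribution of the data well, because sequential blocks follow the row order of the file and are not random draws from the whole dataset. The second histogram is considerably more accurate.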
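
The difference between the two approaches is easiest to see on data whose row order carries meaning. The sketch below uses synthetic values stored in sorted order as a stand-in; that ordering is an assumption made for illustration, since the post does not say how the real file is ordered. The helpers `block_means` and `random_means` are hypothetical names.

```python
import random
import statistics

# Synthetic values stored in sorted order (an assumption for illustration).
values = list(range(1000))

def block_means(data, step):
    # Sequential sampling: one mean per block of `step` consecutive items.
    return [statistics.mean(data[i:i + step]) for i in range(0, len(data), step)]

def random_means(data, size, repeats, seed=1):
    # Random sampling: each mean comes from `size` items drawn without replacement.
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(data, size)) for _ in range(repeats)]

sequential = block_means(values, 100)
randomized = random_means(values, 100, 200)

print("overall mean:", statistics.mean(values))                   # 499.5
print("sequential range:", min(sequential), max(sequential))      # 49.5 949.5
print("random range:", round(min(randomized), 1), round(max(randomized), 1))
```

The sequential block means span almost the entire range of the data, while the random-sample means stay close to the overall mean, which is why only the random samples give a fair picture of the population.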
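Does education level affect income? We measure this with the ratio of the median income of high school graduates to that of college graduates. The means of the random samples cluster in a narrow band close to (mostly just below) 0.675, meaning that high school graduates earn roughly two-thirds of what college graduates earn.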

def select_random_sample(count):
    random_indices = random.sample(range(0, income.shape[0]), count)
    return income.iloc[random_indices]

random.seed(1)
mean_ratios = []
for i in range(1000):
    sample = select_random_sample(100)
    # Ratio of high school to college median income for each sampled county.
    ratios = sample["median_income_hs"] / sample["median_income_college"]
    mean_ratios.append(ratios.mean())

plt.hist(mean_ratios, 20)
plt.show()

[Figure: histogram of the mean high school to college income ratios]

Statistical Significance

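Statistical significance is estimated here as the fraction of the 1,000 random-sample means that are at least as large as a reference value, in this case 0.675. By convention, if that fraction is above 0.05, a value that large could plausibly arise from random sampling alone and is not considered statistically significant; if it is below 0.05, it is considered significant. Our significance value is 0.014, which is below 0.05, so a mean ratio of 0.675 or higher is unlikely to be the product of random sampling alone.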

# Count the sample means that are at least 0.675, then convert the count to a fraction.
mean_higher = len([m for m in mean_ratios if m >= .675])
significance_value = mean_higher / len(mean_ratios)
print(significance_value)
'''
0.014
'''
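
The calculation above can be wrapped in a small reusable function. This is only a sketch: the name `empirical_significance` and the list `toy_means` are made up for illustration and do not come from the real data.

```python
def empirical_significance(sample_means, threshold):
    """Return the fraction of sample means that are >= threshold."""
    if not sample_means:
        raise ValueError("sample_means must not be empty")
    return sum(m >= threshold for m in sample_means) / len(sample_means)

# Ten made-up sample means; only one of them (0.68) reaches the threshold.
toy_means = [0.64, 0.65, 0.66, 0.66, 0.67, 0.67, 0.67, 0.68, 0.66, 0.65]

p = empirical_significance(toy_means, 0.675)
print(p)         # 0.1
print(p < 0.05)  # False: with these toy numbers, 0.675 is not unusually high
```

With the real `mean_ratios` list, `empirical_significance(mean_ratios, 0.675)` would reproduce the value printed above.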