Probability And Statistics In Python: Distributions And Sampling

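This article analyzes a dataset of income by US county, examines median income at different education levels, and uses random sampling to check the distribution of the data and the effect of education on income.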


Dataset

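The dataset used here is US income data by county. The code below assumes it has already been loaded into a pandas DataFrame named income. Its columns are: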

id – the county id.
county – the name and state of the county.
pop_over_25 – the number of adults over age 25.
median_income – the median income for residents over age 25 in the county.
median_income_no_hs – median income for residents without a high school education.
median_income_hs – median income for high school graduates who didn’t go to college.
median_income_some_college – median income for residents who went to college but didn’t graduate.
median_income_college – median income for college graduates.
median_income_graduate_degree – median income for those with a masters or other graduate degree.

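  • Compute a few simple quantities: the county with the lowest median income, and the county with the lowest median income among counties with more than 500,000 residents over age 25. (This filter-then-idxmin pattern is simple but very useful, so it is worth learning.)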
lowest_income_county = income["county"][income["median_income"].idxmin()]
high_pop = income[income["pop_over_25"] > 500000]
lowest_income_high_pop_county = high_pop["county"][high_pop["median_income"].idxmin()]
'''
lowest_income_county :'Starr County, Texas'
lowest_income_high_pop_county:'Miami-Dade County, Florida'
'''
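
To make the lookup pattern above concrete without the real dataset, here is a minimal plain-Python sketch of the same logic. The three rows are made up for illustration, and `index_of_min` is a hypothetical stand-in for what `idxmin()` does: it returns the position (index label) of the smallest value, which is then used to look up the county name. Note that after filtering, the returned index still refers to the original rows.

```python
# Made-up columns for illustration only; these are not values from the real dataset.
county = ["A County", "B County", "C County"]
pop_over_25 = [900_000, 12_000, 650_000]
median_income = [31_000, 18_000, 27_000]

def index_of_min(values, candidates):
    """Return the index in `candidates` whose entry in `values` is smallest."""
    return min(candidates, key=lambda i: values[i])

# Step 1: minimum over all rows.
all_rows = range(len(county))
lowest = county[index_of_min(median_income, all_rows)]

# Step 2: keep only the large counties, then take the minimum within that subset.
# The indices kept here are still positions in the original columns.
high_pop = [i for i in all_rows if pop_over_25[i] > 500_000]
lowest_high_pop = county[index_of_min(median_income, high_pop)]

print(lowest)           # B County
print(lowest_high_pop)  # C County
```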

Random Numbers

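  • In [random.randint(0, 10) for _ in range(10)], the _ is an ordinary variable name. By convention it signals that the loop variable is never used; the expression simply calls random.randint ten times.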
import random

# Returns a random integer between the numbers 0 and 10, inclusive.
num = random.randint(0, 10)

# Generate a sequence of 10 random numbers between the values of 0 and 10.
random_sequence = [random.randint(0, 10) for _ in range(10)]

# Sometimes, when we generate a random sequence, we want it to be the same sequence whenever the program is run.
# An example is when you use random numbers to select a subset of the data, and you want other people
# looking at the same data to get the same subset.
# We can ensure this by setting a random seed.
# A random seed is an integer that is used to "seed" a random number generator.
# After a random seed is set, the numbers generated after will follow the same sequence.
random.seed(10)
print([random.randint(0,10) for _ in range(5)])
random.seed(10)
# Same sequence as above.
print([random.randint(0,10) for _ in range(5)])
random.seed(11)
# Different seed means different sequence.
print([random.randint(0,10) for _ in range(5)])
random.seed(20)
new_sequence = [random.randint(0, 10) for _ in range(10)]
'''
[9, 0, 6, 7, 9]
[9, 0, 6, 7, 9]
[7, 8, 7, 7, 8]
'''
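
As a small check of the two points above (the `_` loop variable and seeding), the sketch below shows that replacing `_` with a named variable changes nothing, and that a separate `random.Random` object seeded with the same value produces the same sequence as the module-level functions. The variable names are illustrative only.

```python
import random

# `_` is an ordinary variable name; by convention it marks a loop variable
# whose value is never used.
random.seed(10)
with_underscore = [random.randint(0, 10) for _ in range(5)]

random.seed(10)
with_named_var = [random.randint(0, 10) for i in range(5)]

print(with_underscore == with_named_var)  # True: same seed, same sequence

# A separate generator object keeps its own state, so seeding it does not
# affect other code that relies on the module-level functions.
rng = random.Random(10)
from_own_generator = [rng.randint(0, 10) for _ in range(5)]
print(from_own_generator == with_underscore)  # True
```

Using a dedicated `random.Random` instance is one way to keep a reproducible sequence local to a piece of code instead of reseeding the shared global generator.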

random.sample (random sampling)

# Let's say that we have some data on how much shoppers spend in a store.
shopping = [300, 200, 100, 600, 20]

# We want to sample the data, and only select 4 elements.
random.seed(1)
shopping_sample = random.sample(shopping, 4)

# 4 random items from the shopping list.
print(shopping_sample)
'''
[200, 300, 20, 600]
'''
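
random.sample draws without replacement, so each position in the list can be picked at most once. The sketch below contrasts it with random.choices, which draws with replacement, and shows that asking for more items than the list holds raises an error. The seed value and the use of a local generator are arbitrary choices for illustration.

```python
import random

shopping = [300, 200, 100, 600, 20]
rng = random.Random(1)  # local generator; the seed value is arbitrary

# Without replacement: each position in the list is picked at most once.
without_replacement = rng.sample(shopping, 4)

# With replacement: the same element may be drawn more than once.
with_replacement = rng.choices(shopping, k=4)

print(without_replacement)
print(with_replacement)

# Asking for more items than the list holds is an error for sample().
try:
    rng.sample(shopping, 6)
except ValueError as exc:
    print("cannot sample 6 of 5 items:", exc)
```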
  • Set the seed to 1, then generate a medium sample of 100 die rolls and a large sample of 10,000 die rolls, and plot how often each face appears.
import matplotlib.pyplot as plt

# A function that returns the result of a die roll.
def roll():
    return random.randint(1, 6)

random.seed(1)
medium_sample = [roll() for _ in range(100)]

plt.hist(medium_sample, 6)
plt.show()

random.seed(1)
large_sample = [roll() for _ in range(10000)]

plt.hist(large_sample, 6)
plt.show()


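  • The larger the sample, the more nearly equal the frequencies of the six faces become. In other words, as the sample grows, the observed frequencies get closer to the true probability (1/6 per face for a fair die).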
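
The same observation can be checked numerically rather than by eye. The sketch below counts how often each face appears and reports the largest gap between an observed share and 1/6. The helper names `face_frequencies` and `max_deviation` are made up for this example, and the exact numbers depend on the seed.

```python
import random
from collections import Counter

def face_frequencies(n_rolls, seed):
    """Roll a fair six-sided die n_rolls times and return the share of each face."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 7)}

def max_deviation(freqs):
    """Largest absolute gap between an observed share and the true value 1/6."""
    return max(abs(share - 1 / 6) for share in freqs.values())

for n in (100, 10_000):
    freqs = face_frequencies(n, seed=1)
    print(n, round(max_deviation(freqs), 4))
```

With 10,000 rolls the largest gap is typically well under one percentage point, while with 100 rolls gaps of several percentage points are common.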

Randomness of Sampling

  • The first approach below samples sequentially; the second samples at random.
def get_sample_mean(start, end): # Mean of median_income over rows start..end-1
    return income["median_income"][start:end].mean()

def find_mean_incomes(row_step): # Sequential sampling: walk the N = 3143 rows in blocks of row_step and average each block
    mean_median_sample_incomes = []
    # Iterate over the indices of the income rows
    # Starting at 0, and counting in blocks of row_step (0, row_step, row_step * 2, etc).
    for i in range(0, income.shape[0], row_step):
        # Find the mean median for the row_step counties from i to i+row_step.
        mean_median_sample_incomes.append(get_sample_mean(i, i+row_step))
    return mean_median_sample_incomes # One mean per block, i.e. about N / row_step values

nonrandom_sample = find_mean_incomes(100) # Block size 100: each value is the mean of 100 consecutive rows
plt.hist(nonrandom_sample, 20) 
plt.show()

import random
def select_random_sample(count):
    random_indices = random.sample(range(0, income.shape[0]), count)
    return income.iloc[random_indices]

random.seed(1)
# Repeat 1000 times: draw 100 random rows and take their mean, giving 1000 means.
random_sample = [select_random_sample(100)["median_income"].mean() for _ in range(1000)]
plt.hist(random_sample, 20)
plt.show()


  • The first histogram does not represent the distribution of the data well; the second is comparatively more accurate.
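
One possible reason for the difference is that neighbouring rows in a file tend to be similar, so each sequential block reflects only one part of the data. The sketch below illustrates this on synthetic data; the trend in `values` is an assumption made purely for illustration and says nothing about how the real income file is ordered. The helper names are also made up.

```python
import random
import statistics

rng = random.Random(1)

# Synthetic stand-in for an income column whose rows are ordered so that
# neighbouring rows are similar (an assumption made purely for illustration).
values = [30_000 + 10 * i + rng.gauss(0, 2_000) for i in range(3_000)]

def block_means(data, step):
    """Mean of each consecutive block of `step` rows."""
    return [statistics.mean(data[i:i + step]) for i in range(0, len(data), step)]

def random_sample_means(data, size, repeats):
    """Mean of `repeats` random samples of `size` rows each."""
    return [statistics.mean(rng.sample(data, size)) for _ in range(repeats)]

sequential = block_means(values, 100)
randomized = random_sample_means(values, 100, 200)

print("overall mean:", round(statistics.mean(values)))
print("spread of sequential block means:", round(statistics.stdev(sequential)))
print("spread of random sample means:", round(statistics.stdev(randomized)))
```

On data like this, the sequential block means are spread far more widely than the random sample means, because each block sees only a narrow slice of the ordering while each random sample mixes rows from the whole range.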

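  • Does education level affect income? One way to measure this is the ratio of the median income of high school graduates to that of college graduates. Most of the sample means cluster around 0.675, that is, the income of high school graduates is about 67.5% of that of college graduates.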

def select_random_sample(count):
    random_indices = random.sample(range(0, income.shape[0]), count)
    return income.iloc[random_indices]

random.seed(1)
mean_ratios = []
for i in range(1000):
    sample = select_random_sample(100)
    # Ratio of high school to college median income.
    ratios = sample["median_income_hs"] / sample["median_income_college"]
    mean_ratios.append(ratios.mean())

plt.hist(mean_ratios, 20)
plt.show()


Statistical Significance

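  • Statistical significance is estimated here as the fraction of the 1,000 sample mean ratios that are at least 0.675. By convention, a value above 0.05 means the result could plausibly be due to chance, so the hypothesis is rejected; a value below 0.05 means a chance result is unlikely. The significance value obtained here is 0.014, which is below 0.05, so the conclusion above is accepted.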
# Count the sample means that are at least 0.675, then divide by the number of samples.
mean_higher = len([m for m in mean_ratios if m >= .675])
significance_value = mean_higher / len(mean_ratios)
print(significance_value)
'''
0.014
'''
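
The calculation above can be wrapped in a small reusable function. This is only a sketch: `share_at_least` is a made-up name, and the example ratios are invented for illustration rather than taken from the income data.

```python
def share_at_least(values, threshold):
    """Fraction of `values` that are greater than or equal to `threshold`."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(v >= threshold for v in values) / len(values)

# Made-up mean ratios for illustration; not results from the income data.
example_ratios = [0.60, 0.65, 0.70, 0.68]
print(share_at_least(example_ratios, 0.675))  # 0.5
```

Applied to the list built earlier, `share_at_least(mean_ratios, 0.675)` would compute the same quantity as `significance_value`.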