Dataset
The dataset used here is US income data by county. Its fields are:
id – the county id.
county – the name and state of the county.
pop_over_25 – the number of adults over age 25.
median_income – the median income for residents over age 25 in the county.
median_income_no_hs – median income for residents without a high school education.
median_income_hs – median income for high school graduates who didn’t go to college.
median_income_some_college – median income for residents who went to college but didn’t graduate.
median_income_college – median income for college graduates.
median_income_graduate_degree – median income for those with a masters or other graduate degree.
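- Compute a few simple quantities: the county with the lowest median income, and the county with the lowest median income among counties with more than 500,000 residents over age 25. (This filter-then-idxmin idiom is worth learning: it is simple but very useful.)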
lowest_income_county = income["county"][income["median_income"].idxmin()]
high_pop = income[income["pop_over_25"] > 500000]
lowest_income_high_pop_county = high_pop["county"][high_pop["median_income"].idxmin()]
'''
lowest_income_county :'Starr County, Texas'
lowest_income_high_pop_county:'Miami-Dade County, Florida'
'''
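The filter-then-idxmin pattern above can be tried on a tiny table. The sketch below is illustrative only: the three rows and their numbers are made up rather than taken from the income data, and it assumes pandas is installed.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "county": ["A County", "B County", "C County"],
    "pop_over_25": [900000, 12000, 650000],
    "median_income": [31000, 18000, 27000],
})

# idxmin() returns the index label of the smallest value,
# and .loc looks up the county name at that label.
lowest = toy.loc[toy["median_income"].idxmin(), "county"]

# A boolean mask keeps only the rows with a large adult population.
big = toy[toy["pop_over_25"] > 500000]
lowest_big = big.loc[big["median_income"].idxmin(), "county"]

print(lowest)      # B County
print(lowest_big)  # C County
```

The filtered frame keeps its original index labels, which is why the label returned by idxmin() can be used directly to look up the county name.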
Random Numbers
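- A note on the _ in [random.randint(0, 10) for _ in range(10)]: it is an ordinary variable name. By convention it marks a loop variable whose value is never used; here the loop only needs to run 10 times.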
import random
# Returns a random integer between the numbers 0 and 10, inclusive.
num = random.randint(0, 10)
# Generate a sequence of 10 random numbers between the values of 0 and 10.
random_sequence = [random.randint(0, 10) for _ in range(10)]
# Sometimes, when we generate a random sequence, we want it to be the same sequence whenever the program is run.
# An example is when you use random numbers to select a subset of the data, and you want other people
# looking at the same data to get the same subset.
# We can ensure this by setting a random seed.
# A random seed is an integer that is used to "seed" a random number generator.
# After a random seed is set, the numbers generated after will follow the same sequence.
random.seed(10)
print([random.randint(0,10) for _ in range(5)])
random.seed(10)
# Same sequence as above.
print([random.randint(0,10) for _ in range(5)])
random.seed(11)
# Different seed means different sequence.
print([random.randint(0,10) for _ in range(5)])
random.seed(20)
new_sequence = [random.randint(0, 10) for _ in range(10)]
'''
[9, 0, 6, 7, 9]
[9, 0, 6, 7, 9]
[7, 8, 7, 7, 8]
'''
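The seeding behaviour shown above can also be checked in code instead of by eye. This is a minimal sketch: the helper name draw is hypothetical, and it uses a separate random.Random instance so that seeding does not disturb the global generator used elsewhere in these notes.

```python
import random

def draw(seed, n=5):
    # A private generator: seeding it does not touch the global random state.
    rng = random.Random(seed)
    # `_` is a throwaway name: the loop value itself is never used.
    return [rng.randint(0, 10) for _ in range(n)]

first = draw(10)
second = draw(10)
other = draw(11)

print(first == second)  # True: same seed, same sequence
print(first, other)
```

The exact numbers can differ between Python versions, so only the equality of the two same-seed sequences is relied on here.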
random.sample (random sampling)
# Let's say that we have some data on how much shoppers spend in a store.
shopping = [300, 200, 100, 600, 20]
# We want to sample the data, and only select 4 elements.
random.seed(1)
shopping_sample = random.sample(shopping, 4)
# 4 random items from the shopping list.
print(shopping_sample)
'''
[200, 300, 20, 600]
'''
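random.sample draws without replacement, so no element is picked twice. For contrast, the sketch below puts it next to random.choices, which draws with replacement; the comparison is an addition to these notes, not part of the original exercise.

```python
import random

shopping = [300, 200, 100, 600, 20]
random.seed(1)

# Without replacement: each element can be picked at most once.
without_replacement = random.sample(shopping, 4)

# With replacement: the same element may appear several times.
with_replacement = random.choices(shopping, k=4)

print(without_replacement)
print(with_replacement)
```

Because sampling is without replacement, random.sample raises ValueError when asked for more items than the list contains.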
- Set the seed to 1, then generate a medium sample of 100 die rolls and a large sample of 10,000 die rolls, and plot how many times each face comes up.
import matplotlib.pyplot as plt
# A function that returns the result of a die roll.
def roll():
    return random.randint(1, 6)
random.seed(1)
medium_sample = [roll() for _ in range(100)]
plt.hist(medium_sample, 6)
plt.show()
random.seed(1)
large_sample = [roll() for _ in range(10000)]
plt.hist(large_sample, 6)
plt.show()
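- The larger the sample, the closer the observed frequencies of the six faces are to one another. In other words, a larger sample brings the observed frequencies closer to the true probability (1/6 per face).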
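The same observation can be made numerically, without a plot. The sketch below counts faces with collections.Counter and reports the largest gap between an observed frequency and 1/6; the helper face_frequencies is a hypothetical name introduced for this example.

```python
import random
from collections import Counter

def roll():
    return random.randint(1, 6)

def face_frequencies(n):
    # Relative frequency of each face over n rolls.
    counts = Counter(roll() for _ in range(n))
    return {face: counts[face] / n for face in range(1, 7)}

random.seed(1)
medium = face_frequencies(100)
random.seed(1)
large = face_frequencies(10000)

# Largest gap between an observed frequency and the true probability 1/6.
medium_gap = max(abs(f - 1 / 6) for f in medium.values())
large_gap = max(abs(f - 1 / 6) for f in large.values())
print(medium_gap, large_gap)
```

With 10,000 rolls the gap is expected to be a small fraction of a percentage point, while with 100 rolls it is typically several percentage points.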
Randomness in Sampling
- The first snippet samples sequentially (consecutive blocks of rows); the second samples at random.
def get_sample_mean(start, end):  # mean of the median incomes for rows start..end-1
    return income["median_income"][start:end].mean()
def find_mean_incomes(row_step):  # sequential sampling: one mean per consecutive block of row_step rows
    mean_median_sample_incomes = []
    # Iterate over the indices of the income rows
    # Starting at 0, and counting in blocks of row_step (0, row_step, row_step * 2, etc).
    for i in range(0, income.shape[0], row_step):
        # Find the mean median for the row_step counties from i to i+row_step.
        mean_median_sample_incomes.append(get_sample_mean(i, i+row_step))
    return mean_median_sample_incomes  # one mean per block, i.e. ceil(N / row_step) values
nonrandom_sample = find_mean_incomes(100)  # step of 100: with N = 3143 rows this gives 32 block means
plt.hist(nonrandom_sample, 20)
plt.show()
import random
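def select_random_sample(count):
    # Pick `count` distinct row positions at random and return those rows.
    random_indices = random.sample(range(0, income.shape[0]), count)
    return income.iloc[random_indices]
random.seed(1)
# Draw 1000 samples of 100 random counties each and take the mean of each, giving 1000 means.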
random_sample = [select_random_sample(100)["median_income"].mean() for _ in range(1000)]
plt.hist(random_sample, 20)
plt.show()
The first histogram does not represent the distribution of the data well; the second is considerably more accurate.
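One plausible reason for the difference is row order: if neighbouring rows are similar, the mean of a block of consecutive rows reflects that block and not the whole table. The sketch below exaggerates this with a fully sorted stand-in column; the data is hypothetical and only the standard library is used.

```python
import random
import statistics

# Hypothetical stand-in for the income column: 3000 values in sorted order.
values = list(range(3000))

# Sequential sampling: means of consecutive blocks of 100 values.
block_means = [statistics.mean(values[i:i + 100])
               for i in range(0, len(values), 100)]

# Random sampling: means of 1000 random samples of 100 values each.
random.seed(1)
random_means = [statistics.mean(random.sample(values, 100))
                for _ in range(1000)]

print(statistics.mean(values))          # 1499.5
print(statistics.stdev(block_means))    # spread of the block means
print(statistics.stdev(random_means))   # spread of the random-sample means
```

The block means are spread across the whole range of the data, while the random-sample means stay close to the overall mean of 1499.5.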
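Does education level affect income? One way to measure this is the ratio of the median income of high school graduates to that of college graduates. Most of the sampled mean ratios cluster around 0.675, meaning that high school graduates earn on average about 67.5% of what college graduates earn.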
def select_random_sample(count):
    random_indices = random.sample(range(0, income.shape[0]), count)
    return income.iloc[random_indices]
random.seed(1)
mean_ratios = []
for i in range(1000):
    sample = select_random_sample(100)
    # Ratio of high school graduate income to college graduate income, per county.
    ratios = sample["median_income_hs"] / sample["median_income_college"]
    mean_ratios.append(ratios.mean())
plt.hist(mean_ratios, 20)
plt.show()
Statistical Significance
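- Statistical significance is measured here as the fraction of the sampled mean ratios that are at or above the value being tested (0.675 in the code below). If that fraction is greater than 0.05, the result may well be due to random chance, and the conclusion above is rejected. The significance value we get is 0.014, which is below 0.05, so the conclusion above is accepted.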
# Count the sampled mean ratios that are at least 0.675.
mean_higher = len([m for m in mean_ratios if m >= .675])
significance_value = mean_higher / len(mean_ratios)  # true division (Python 3)
print(significance_value)
'''
0.014
'''
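The calculation above can be wrapped in a small function and checked on numbers small enough to verify by hand. This is a sketch: the function name significance and the toy values are hypothetical, not taken from the income data.

```python
def significance(sample_means, observed):
    # Fraction of sampled means that are at least as large as the observed value.
    at_least = sum(1 for m in sample_means if m >= observed)
    return at_least / len(sample_means)

# Hypothetical mean ratios, invented for illustration.
toy_means = [0.60, 0.62, 0.64, 0.66, 0.70]
value = significance(toy_means, 0.675)
print(value)  # 0.2
```

Only one of the five toy values is at or above 0.675, so the function returns 1/5. With the 1000 sampled means from the notes, a fraction below 0.05 is what is treated as significant.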