Dataquest Study Summary [7]

Latest recommended article published 2020-02-24 09:36:03

sodleave

Category: Python data analysis

Article link: https://blog.youkuaiyun.com/sodleave/article/details/72582294

Continuing Step 5: Statistics And Linear Algebra / Probability And Statistics In Python: Intermediate

Introduction to probability

Calculating Probabilities

Dataset: Bike Sharing Dataset (linked in the original post)

Floor division uses //, for example 5//4 = 1

Compute a factorial with math.factorial(N)

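import math
p = .6  # probability of success on a single trial
q = .4  # probability of failure, 1 - p
def calc_prob(total, days):
    # Probability of one specific sequence with `days` successes
    per_pro = (p ** days) * (q ** (total - days))
    # Number of such sequences: total! / (days! * (total - days)!)
    num = math.factorial(total) / math.factorial(days) / math.factorial(total - days)
    return per_pro * num
prob_8 = calc_prob(10, 8)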
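
As a cross-check on the hand-written formula above, the binomial coefficient can also be computed with math.comb (available from Python 3.8), which stays in exact integer arithmetic. This is only a sketch; the function name binomial_pmf is chosen for illustration and does not come from the course.

```python
import math

def binomial_pmf(n, k, p):
    """Probability of exactly k successes in n independent trials."""
    # math.comb(n, k) equals n! / (k! * (n - k)!) as an exact integer.
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Same inputs as calc_prob(10, 8) with p = 0.6.
prob_8 = binomial_pmf(10, 8, 0.6)
print(prob_8)
```

Passing p as an argument instead of reading module-level p and q keeps the function reusable for other probabilities.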

Probability distributions

import math
# Each item in this list represents one k, starting from 0 and going up to and including 30.
outcome_counts = list(range(31))
# Binomial distribution implemented by hand
def calc_prob(N,k,p,q):
    prob=(p**k)*(q**(N-k))
    count=math.factorial(N)/math.factorial(k)/math.factorial(N-k)
    return prob*count
outcome_probs=[]
for i in outcome_counts:
    outcome_probs.append(calc_prob(30,i,.39,.61))

# Solving the binomial distribution with scipy
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom
# Create a range of numbers from 0 to 30, with 31 elements (each number has one entry).
# numpy.linspace is used here because the scipy.linspace alias is deprecated.
outcome_counts = np.linspace(0, 30, 31)
outcome_probs = binom.pmf(outcome_counts, 30, 0.39)
plt.bar(outcome_counts, outcome_probs)
plt.show()

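# Binomial distribution: mean Np, variance Npq
# When the number of trials is large enough, the binomial distribution is approximately normal
# Cumulative probability: binom.cdf()
# The sum of all the probabilities to the left of k, including k.
left = binom.cdf(k, N, p)
# The sum of all probabilities to the right of k (excluding k).
right = 1 - left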
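
To make the cumulative probability and the Np / Npq formulas concrete, here is a minimal standard-library sketch that sums the probability of each outcome directly. The cutoff k = 16 is a hypothetical value picked for illustration; N and p match the scipy example above.

```python
import math

N, p = 30, 0.39          # same N and p as the scipy example above
q = 1 - p
k = 16                   # hypothetical cutoff, chosen only for illustration

# Probability of each outcome 0..N from the binomial formula.
pmf = [math.comb(N, i) * p**i * q**(N - i) for i in range(N + 1)]

left = sum(pmf[:k + 1])  # P(X <= k), the quantity a binomial CDF gives
right = 1 - left         # P(X > k)

# Mean and variance computed from the definition, to compare with Np and Npq.
mean = sum(i * pr for i, pr in enumerate(pmf))
variance = sum((i - mean) ** 2 * pr for i, pr in enumerate(pmf))
print(left, right, mean, variance)
```

The computed mean and variance should agree with N*p and N*p*q up to floating-point error.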

Significance Testing: the concepts of p-value and confidence interval

Chi-squared tests

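numpy.random.random((a, b)) generates random numbers in [0.0, 1.0) and returns an ndarray of shape (a, b). The shape is passed as a single tuple; numpy.random.rand(a, b) is the variant that takes the dimensions as separate arguments.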
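
A quick shape check, as a sketch: numpy.random.random takes the output shape as one tuple, while numpy.random.rand takes the dimensions as separate arguments. The dimensions a and b below are placeholder values.

```python
import numpy as np

a, b = 2, 3  # hypothetical dimensions, for illustration only

arr = np.random.random((a, b))   # shape passed as a single tuple
arr2 = np.random.rand(a, b)      # dimensions passed as separate arguments

print(arr.shape, arr2.shape)
# Every value lies in the half-open interval [0.0, 1.0).
print(((arr >= 0.0) & (arr < 1.0)).all())
```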

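# Building the chi-squared sampling distribution by hand
import numpy
import matplotlib.pyplot as plt
chi_squared_values = []
for i in range(1000):
    # 32561 random numbers in [0.0, 1.0); values below 0.5 become 0, the rest become 1
    numbers = numpy.random.random(32561,)
    numbers = numpy.where(numbers < 0.5, 0, 1)
    male_count = 32561 - numpy.sum(numbers)
    female_count = numpy.sum(numbers)
    # Expected count for each category is 32561 / 2 = 16280.5
    male_diff = (male_count - 16280.5) ** 2 / 16280.5
    female_diff = (female_count - 16280.5) ** 2 / 16280.5
    chi_squared_values.append(male_diff + female_diff)
plt.hist(chi_squared_values)
plt.show()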

# Computing the chi-squared value with scipy
import numpy as np
from scipy.stats import chisquare
observed = np.array([5, 10, 15])
expected = np.array([7, 11, 12])
chisquare_value, pvalue = chisquare(observed, expected)
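
For reference, the statistic itself is the sum of (observed - expected)^2 / expected over the categories. A minimal sketch that recomputes it by hand for the same observed and expected counts:

```python
observed = [5, 10, 15]
expected = [7, 11, 12]

# Chi-squared statistic: sum over categories of (O - E)^2 / E.
chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# With one set of categories, degrees of freedom = number of categories - 1.
dof = len(observed) - 1
print(chisq, dof)
```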

Multi category chi-squared tests
pandas.crosstab builds a cross-tabulation: the frequency of each combination of categories across DataFrame columns

import pandas
table = pandas.crosstab(income["sex"], [income["race"]])
print(table)

scipy.stats.chi2_contingency returns the chi-squared statistic, the p-value, the degrees of freedom, and the table of expected frequencies

from scipy.stats import chi2_contingency
table=pandas.crosstab(income['sex'],[income['race']])
chisq_value, pvalue, df, expected= chi2_contingency(table)
pvalue_gender_race=pvalue
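
To show where the expected frequencies and degrees of freedom come from, here is a hand computation on a small table. The counts are hypothetical and are not taken from the income data. A 2 x 3 table is used because, for 2 x 2 tables, chi2_contingency applies Yates' continuity correction by default, so a plain hand computation would not match there.

```python
# Hypothetical counts (2 rows x 3 columns), not taken from the income data.
table = [
    [20, 30, 50],
    [30, 30, 40],
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Sum (observed - expected)^2 / expected over every cell.
chisq_value = sum(
    (obs - exp) ** 2 / exp
    for obs_row, exp_row in zip(table, expected)
    for obs, exp in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(table) - 1) * (len(table[0]) - 1)
print(expected, chisq_value, df)
```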

Guided Project: Winning Jeopardy

Code and dataset: linked in the original post

list.remove() modifies the list in place and removes the first matching item; it returns None
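
A short sketch of that behavior, using a made-up list of words:

```python
words = ["the", "answer", "the", "clue"]  # hypothetical token list

result = words.remove("the")   # removes only the first "the"
print(words)    # ['answer', 'the', 'clue']
print(result)   # None

# Removing a value that is not present raises ValueError,
# so guard with a membership test when unsure.
if "question" in words:
    words.remove("question")
```

Because the return value is None, writing words = words.remove(x) would discard the list.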

Solving Systems of Equations with Matrices/vectors

# Matrix row operations
import numpy as np
matrix = np.asarray([
    [2, 1, 25],
    [3, 2, 40]
], dtype=np.float32)
matrix[0]*=2
matrix[0]-=matrix[1]
matrix[1]-=(matrix[0]*3)
matrix[1]/=2
# Swap two rows (shown for the syntax; this matrix only has rows 0 and 1)
matrix[[0,1]] = matrix[[1,0]]
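
The augmented matrix above encodes the system 2x + y = 25 and 3x + 2y = 40. As a way to check the result of the row operations, the same system can be handed to numpy.linalg.solve; the names A and b below are illustrative.

```python
import numpy as np

# Coefficient matrix and right-hand side of
#   2x +  y = 25
#   3x + 2y = 40
A = np.array([[2.0, 1.0],
              [3.0, 2.0]])
b = np.array([25.0, 40.0])

solution = np.linalg.solve(A, b)
print(solution)  # x and y, in that order
```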

# Plotting several vectors
import numpy as np
import matplotlib.pyplot as plt
# We're going to plot three vectors
# The first will start at origin 0,0, then go over 1 and up 2
# The second will start at origin 1,2, then go over 3 and up 2
# The third will start at origin 0,0, then go over 4 and up 4
X = [0,1,0]
Y = [0,2,0]
U = [1,3,4]
V = [2,2,4]
plt.quiver(X, Y, U, V, angles='xy', scale_units='xy', scale=1)
plt.xlim([0,6])
plt.ylim([0,6])
plt.show()
# Matrix multiplication: numpy.dot(A, B)
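
A small sketch of matrix multiplication with numpy.dot, contrasted with element-wise multiplication. The matrices A and B are arbitrary example values.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

product = np.dot(A, B)   # matrix product; A @ B gives the same result
elementwise = A * B      # element-wise product, a different operation

print(product)
print(elementwise)
```

For 2-D arrays, the number of columns in A must equal the number of rows in B.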