Multiple-choice questions (10 × 2 points)
1. Rules on Python code formatting requirements
2. Use of Python strings (slicing with [start index : end index]; indexing: positive indices start at 0, negative indices start at -1)
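For instance, a quick sketch of how positive/negative indices and slice bounds behave (the string value is arbitrary):
s = "python"
print(s[0], s[-1])   # positive indices start at 0, negative at -1 -> p n
print(s[1:4])        # from start index 1 up to, but not including, 4 -> yth
print(s[:3], s[3:])  # omitted bounds default to the ends -> pyt hon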
3. Python comments (a single-line comment starts with the hash sign '#'; a multi-line comment can comment out several lines at once, including just one line; there are two syntaxes for multi-line comments: ''' ''' or """ """)
4. Ways to import (
import math
from math import pi
from math import *
)
5. Identifying data types (Python's built-in data types include:
integer (int), float (float), string (str), list (list, square brackets), tuple (tuple, parentheses), dictionary (dict, key-value pairs separated by colons, in curly braces), set (set, curly braces, unique elements only))
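A minimal sketch showing a literal of each built-in type mentioned above (the example values are arbitrary):
n = 10                   # int
f = 3.14                 # float
s = "hello"              # str
lst = [1, 2, 3]          # list: square brackets, mutable
tup = (1, 2, 3)          # tuple: parentheses, immutable
dct = {'a': 1, 'b': 2}   # dict: key-value pairs separated by colons
st = {1, 2, 3}           # set: curly braces, unique elements only
print(type(n), type(f), type(s), type(lst), type(tup), type(dct), type(st))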
6. Common operations on Python sequences (in tests membership; not in tests non-membership)
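For example (arbitrary values):
nums = [1, 2, 3]
print(2 in nums)          # True: 2 is a member of the sequence
print(5 not in nums)      # True: 5 is not a member
print('py' in 'python')   # membership also works on strings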
7. Common libraries for data analysis (NumPy, SciPy, requests, Scrapy, Pillow, Twisted, Matplotlib, Pygame, IPython)
8. Basic indexing and slicing (same as question 2)
9. Accessing a dictionary
1) Access values by key ("key-value" pairs)
2) Iterating over a dictionary:
d = {1: 1, 2: 'aa', 'D': 'ee', 'Ty': 45}   # renamed from dict to avoid shadowing the built-in
for item in d.items():
    print(item)
10. Simple traversal of a list with a loop
1) A plain for loop: for item in lst: print(item)
2) Using range() and len():
cities = ["Guangzhou", "Beijing", "Shanghai", "Nanjing"]
for i in range(len(cities)):
    print(i + 1, cities[i])
3) Using enumerate(); this way the index number can be printed or left out:
cities = ["Guangzhou", "Beijing", "Shanghai", "Nanjing"]
for i, city in enumerate(cities):
    print(i + 1, city)
Fill-in-the-blank questions (15 × 2 points)
1. Functions involved in linear regression in machine learning
Loss function, cost function, objective function
2. Method for finding the minimum of the cost function in linear regression
Gradient descent
3. Common descriptions of machine-learning algorithms
1) Linear regression*
2) Logistic regression*
3) Supervised learning
4) Unsupervised learning
5) Reinforcement learning
4. Algorithms commonly used for classification; a special function whose values fall between 0 and 1
(Normalization: mapping the data into the 0-1 range makes further processing simpler and faster; it belongs to the field of digital signal processing. Example 1: {2.5, 3.5, 0.5, 1.5} becomes {0.3125, 0.4375, 0.0625, 0.1875} after normalization. Solution: 2.5 + 3.5 + 0.5 + 1.5 = 8, then 2.5/8 = 0.3125, 3.5/8 = 0.4375, 0.5/8 = 0.0625, 1.5/8 = 0.1875.)
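A minimal sketch reproducing the worked example above: each value is divided by the sum so that the results fall within 0-1.
data = [2.5, 3.5, 0.5, 1.5]
total = sum(data)                        # 8
normalized = [v / total for v in data]   # [0.3125, 0.4375, 0.0625, 0.1875]
print(normalized)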
5. For what kind of data are accuracy, recall and precision unsuitable (give the English terms)?
When the sample size is very small.
准确率 = Accuracy, 召回率 = Recall, 查准率 = Precision
6. Formulas for accuracy, precision and recall
                         Predicted Positive    Predicted Negative
Actual Positive (true)           TP                    FN
Actual Negative (false)          FP                    TN
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * Precision * Recall / (Precision + Recall)
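A small sketch computing the four metrics from confusion-matrix counts. The counts used here match Example 1 of the fishing question below (TP = 700 carp caught, FP = 300 non-carp caught, FN = 700 carp missed, TN = 300 non-carp left in the pond).
def classification_metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(classification_metrics(tp=700, fp=300, fn=700, tn=300))
# (0.5, 0.7, 0.5, 0.5833...) -> accuracy 50%, precision 70%, recall 50%, F1 ≈ 58.3%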
3. Short-answer question (1 × 10 points)
Fish-catching question: how changes in what is classified as the catch affect the various metrics.
Example 1. A pond contains 1400 carp, 300 shrimp and 300 turtles.
The goal is to catch carp. One big net is cast and brings in 700 carp, 200 shrimp and 100 turtles.
The metrics are then:
Precision = 700 / (700 + 200 + 100) = 70%
Recall = 700 / 1400 = 50%
F1 = 70% * 50% * 2 / (70% + 50%) ≈ 58.3%
Example 2. Same pond: 1400 carp, 300 shrimp, 300 turtles.
If every carp, shrimp and turtle in the pond is caught (2000 animals in total), the metrics become:
Precision = 1400 / (1400 + 300 + 300) = 70%
Recall = 1400 / 1400 = 100%
F1 = 70% * 100% * 2 / (70% + 100%) ≈ 82.35%
1. Raising the classification threshold (samples are less easily predicted as positive): Precision may rise (because FP may fall),
but Recall falls or stays the same (because TP falls or stays the same while FN rises or stays the same).
2. Lowering the classification threshold (more samples are predicted as positive): Precision may fall (because FP may rise), while Recall may rise (because FN may fall).
Raising precision usually lowers recall, and vice versa.
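A small sketch of this trade-off on made-up scores (the labels and predicted probabilities below are invented purely for illustration): raising the threshold removes a false positive, so precision rises while recall drops.
import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1])  # hypothetical predicted probabilities

def precision_recall_at(threshold):
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)

for t in (0.35, 0.75):
    p, r = precision_recall_at(t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.35: precision=0.80, recall=1.00
# threshold=0.75: precision=1.00, recall=0.50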
4. Basic programming questions (5 × 4 points)
1. Output the longest word in an English sentence, together with its length
def max_len_word(_in):
    # Strip commas and periods, then split the sentence into words
    words = _in.replace(',', '').replace('.', '').split(' ')
    max_len = 0
    max_word = ''
    for w in words:
        l = len(w)
        if l > max_len:
            max_len = l
            max_word = w
    print(max_word, max_len)
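Example call (using the function above):
max_len_word("I love Python, it is powerful.")  # prints: powerful 8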
2. Sorting with an anonymous (lambda) function
Sorting dictionary items: by key and by value respectively
list1 = [('david', 90), ('mary', 90), ('sara', 80), ('lily', 95)]
def dict_sort(list1):
    L1 = sorted(list1, key=lambda x: x[0])  # sort by key (name)
    L2 = sorted(list1, key=lambda x: x[1])  # sort by value (score)
    print(L1)
    print(L2)
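Calling it on list1 defined above prints the two orderings:
dict_sort(list1)
# [('david', 90), ('lily', 95), ('mary', 90), ('sara', 80)]   sorted by name
# [('sara', 80), ('david', 90), ('mary', 90), ('lily', 95)]   sorted by score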
3. Write a simple recursive function (factorial)
def x(num):
    # Recursive factorial: n! = n * (n-1)!, with 0! = 1 as the base case
    if num == 0:
        return 1
    else:
        return num * x(num - 1)
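For example:
print(x(5))  # 5! = 120
print(x(0))  # base case: 1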
4. Using the numpy module, create an array, change its shape, and index/slice it
import numpy as np
a = np.array([
    [0, 1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10, 11]
])
Method 1: b = a.reshape((2, 2, 3))
Method 2: a.resize((3, 4))
Difference: method 1 (reshape) does not change a's own shape; it returns a reshaped view of the data, whereas resize modifies the array in place.
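A short sketch of this difference, plus the indexing/slicing the question asks for (resize is applied to a copy here so the slicing examples still see the original 2 x 6 array; calling a.resize() directly would fail once the view b references a's data):
import numpy as np

a = np.array([[0, 1, 2, 3, 4, 5],
              [6, 7, 8, 9, 10, 11]])

b = a.reshape((2, 2, 3))   # returns a reshaped view; a keeps its (2, 6) shape
print(a.shape, b.shape)    # (2, 6) (2, 2, 3)

c = a.copy()
c.resize((3, 4))           # changes the array itself, in place
print(c.shape)             # (3, 4)

print(a[0, 1])             # single element -> 1
print(a[1, :])             # second row -> [ 6  7  8  9 10 11]
print(a[:, 1:3])           # columns 1-2 of every row -> [[1 2] [7 8]]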
5. Define and use a function that takes a variable number of arguments
def stu_info(name, age, sex, *tips):
    print("Name:", name)
    print("Age:", age)
    print("Sex:", sex)
    print("Notes:", tips)
5. Comprehensive programming questions (2 × 10 points)
1. Linear regression on the diabetes dataset
from matplotlib import pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
# Use single-feature linear regression on the diabetes dataset
plt.rcParams['font.family'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False

def get_data():
    diabetes_data = load_diabetes()
    print(diabetes_data['feature_names'])
    # Keep only the BMI feature (column 2), reshaped to (n_samples, 1)
    data_bmi = diabetes_data.data[:, 2].reshape(-1, 1)
    print(data_bmi.shape)
    data_y = diabetes_data.target[:]
    print(data_y)
    return data_bmi, data_y

def draw_scatter(x, y):
    plt.scatter(x, y, label='BMI vs. diabetes target')
    plt.legend()
    plt.show()

def build_and_train(x, y):
    model = LinearRegression()
    model.fit(x, y)
    print(f"Model intercept: {model.intercept_:.2f}")
    print(f"Model coefficient: {model.coef_[0]:.2f}")
    return model

def predict_and_plot(model, x, y):
    pred_y = model.predict(x)
    plt.scatter(x, y, label='data')  # redraw the data so the fitted line is shown against it
    plt.plot(x, pred_y, color='magenta', linewidth=3, label='fitted line')
    plt.legend()
    plt.show()
    print(model.score(x, y))  # R^2 of the fit

if __name__ == "__main__":
    x, y = get_data()
    draw_scatter(x, y)
    model = build_and_train(x, y)
    predict_and_plot(model, x, y)
2. Build a logistic regression classifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
import numpy as np
import matplotlib.pyplot as plt

def get_data():
    cancer = load_breast_cancer()
    # print(cancer['data'].shape)
    # print(cancer['target'].shape)
    # Split features and targets into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(cancer['data'],
                                                        cancer['target'],
                                                        test_size=0.25,
                                                        shuffle=True,
                                                        random_state=49)
    return x_train, x_test, y_train, y_test

def build_and_train(x_train, y_train):
    print("Start training")
    model = LogisticRegression(max_iter=3000)
    model.fit(x_train, y_train)
    print(f"Classifier accuracy on the training set: {model.score(x_train, y_train):.2%}")
    return model

def evaluate_with_cv(model, x_train, y_train):
    print("Start cross-validation")
    scores = cross_val_score(model, x_train, y_train, cv=5, scoring='accuracy')
    print(f"Mean 5-fold cross-validated Accuracy on the training set: {np.mean(scores):.2%}")
    scores = cross_val_score(model, x_train, y_train, cv=5, scoring='precision')
    print(f"Mean 5-fold cross-validated Precision on the training set: {np.mean(scores):.2%}")
    scores = cross_val_score(model, x_train, y_train, cv=5, scoring='recall')
    print(f"Mean 5-fold cross-validated Recall on the training set: {np.mean(scores):.2%}")
    return model

def evaluate_on_testset(model, x_test, y_test):
    print("Start testing")
    score = model.score(x_test, y_test)
    print(f"Accuracy on the test set: {score:.2%}")

def draw_confusion_matrix(model, x_test, y_test):
    y_pred = model.predict(x_test)
    cm = confusion_matrix(y_test, y_pred)
    plt.rcParams['font.family'] = 'SimHei'
    plt.matshow(cm)
    plt.title("Confusion matrix")
    plt.colorbar()
    plt.xlabel("Predicted label")  # matshow columns correspond to predictions
    plt.ylabel("True label")       # matshow rows correspond to the true classes
    for i in range(len(cm)):
        for j in range(len(cm)):
            plt.annotate(cm[i, j],
                         xy=(j, i),  # x = column (predicted), y = row (true)
                         horizontalalignment='center',
                         verticalalignment='center')
    plt.savefig('breast_cancer_cm_fig_2021.png', bbox_inches='tight')  # save before show(), or the saved figure is blank
    plt.show()
    print(cm)
    tn, fp, fn, tp = cm.ravel()
    print(tn, fp, fn, tp)

if __name__ == '__main__':
    x_train, x_test, y_train, y_test = get_data()
    model = build_and_train(x_train, y_train)
    evaluate_with_cv(model, x_train, y_train)
    evaluate_on_testset(model, x_test, y_test)
    draw_confusion_matrix(model, x_test, y_test)