ml note

This note covers fundamental machine learning algorithms (logistic regression, support vector machines, decision trees, and others), together with the relevant formulas and Python code examples.

Loss functions: http://www.csie.ntu.edu.tw/~cjlin/liblinear/  http://www.ics.uci.edu/~dramanan/teaching/ics273a_winter08/lectures/lecture14.pdf

   Linear models: squared loss, absolute-value loss.

  Classification: hinge loss (SVM), logistic loss (logistic regression).


  logistic regression:

h(x) = 1 / (1 + exp(-w^T x))

logistic loss:

loss = log(1 + exp(-y * w^T x)),   with y ∈ {1, -1}
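
A minimal NumPy sketch of the sigmoid and the logistic loss above (the weight vector and toy data are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, X, y):
    # mean of log(1 + exp(-y * w^T x)) over all samples, labels y in {1, -1}
    margins = y * (X @ w)
    return np.mean(np.log1p(np.exp(-margins)))

# toy data: 4 samples, 2 features
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = np.array([0.5, 0.5])
print(logistic_loss(w, X, y))
```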


SVM classification: max(0, 1 - y(w·x + b)) + a*||w||^2, i.e. maximize the margin (a is the regularization coefficient).
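
A matching sketch of the regularized hinge objective (the coefficient a and the toy data are illustrative):

```python
import numpy as np

def svm_objective(w, b, X, y, a=0.01):
    # mean hinge loss max(0, 1 - y(w.x + b)) plus the L2 penalty a*||w||^2
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return np.mean(hinge) + a * np.dot(w, w)

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(svm_objective(np.array([0.5, 0.5]), 0.0, X, y))
```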


Naive Bayes: purely probabilistic; features are handled differently depending on whether they are discrete or continuous (continuous features are modeled with a Gaussian to compute the probability).

   The key steps: first compute the mean and stdev of every feature under each class yj, then P(yj | x) ∝ Π_i p(xi | yj) * p(yj), where p(yj) is the fraction of samples belonging to class yj. A Python sketch of this follows.
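
A minimal Gaussian Naive Bayes sketch following the steps above (the dataset is assumed to be a list of rows with the class label in the last column; function names are illustrative):

```python
import math

def gaussian_pdf(x, mean, stdev):
    # p(x | class) under a Gaussian with that class's per-feature mean/stdev
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)

def summarize_by_class(dataset):
    # per class: (mean, stdev) of each feature, plus the class prior p(yj)
    rows_by_class = {}
    for row in dataset:
        rows_by_class.setdefault(row[-1], []).append(row[:-1])
    summaries = {}
    for label, rows in rows_by_class.items():
        stats = []
        for col in zip(*rows):
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / max(len(col) - 1, 1)
            stdev = math.sqrt(var) if var > 0 else 1e-9
            stats.append((mean, stdev))
        summaries[label] = (stats, len(rows) / float(len(dataset)))
    return summaries

def nb_predict(summaries, row):
    # P(yj | x) is proportional to p(yj) * product of p(xi | yj); pick the argmax
    best_label, best_prob = None, -1.0
    for label, (stats, prior) in summaries.items():
        prob = prior
        for value, (mean, stdev) in zip(row, stats):
            prob *= gaussian_pdf(value, mean, stdev)
        if prob > best_prob:
            best_label, best_prob = label, prob
    return best_label
```

Usage: summaries = summarize_by_class(train); nb_predict(summaries, test_row[:-1]).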

Decision trees: 1) Information gain (ID3): H(D) = -Σ_k p_k log p_k, where k ranges over the classes; H(D|A) = Σ_i p(Di) * ( -Σ_k p(Dik) log p(Dik) ), where p(Di) = |Di| / |D|, p(Dik) = |Dik| / |Di|, and |Di| is the number of samples taking value Di of feature A. The gain is gain = H(D) - H(D|A); at each step compare the gain of the remaining features and pick the one with the largest gain as the node. The resulting tree can be multi-way, with one branch per value Di of the chosen feature.

               2) Gain ratio (C4.5): the only difference from 1) is that gainR = gain / HA(D) decides which feature is chosen as the node, where HA(D) = -Σ_i p(Di) log p(Di) (see the sketch after this list).


             3) CART (classification and regression tree): pick the feature/split with the smallest Gini index as the node. gini(D) = Σ_k p_k (1 - p_k),

                gini(D|A) = Σ_i p(Di) * Σ_k p(Dik)(1 - p(Dik))   (binary tree)

               Compare the Gini of each feature A under each of its candidate split values, and choose the (feature, value) pair with the smallest Gini as a binary split node.
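
A small sketch of the entropy, information-gain, and gain-ratio computations above (log base 2; the toy dataset and feature are made up for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    # H = -sum(p_k * log2 p_k) over the class proportions p_k
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain_and_ratio(dataset, feature_index):
    # gain = H(D) - H(D|A); gain ratio = gain / HA(D)
    labels = [row[-1] for row in dataset]
    n = len(dataset)
    groups = {}
    for row in dataset:
        groups.setdefault(row[feature_index], []).append(row[-1])
    h_d_given_a = sum(len(g) / n * entropy(g) for g in groups.values())
    h_a = -sum((len(g) / n) * math.log2(len(g) / n) for g in groups.values())
    gain = entropy(labels) - h_d_given_a
    return gain, (gain / h_a if h_a > 0 else 0.0)

# toy data: [outlook, class]
data = [['sunny', 'no'], ['sunny', 'no'], ['overcast', 'yes'],
        ['rain', 'yes'], ['rain', 'no'], ['overcast', 'yes']]
print(info_gain_and_ratio(data, 0))
```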

# CART on the Bank Note dataset
from random import seed
from random import randrange
from csv import reader


# Load a CSV file
def load_csv(filename):
    file = open(filename, "r")
    lines = reader(file)
    # skip any blank rows (e.g. a trailing newline)
    dataset = [row for row in lines if row]
    return dataset


# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())


# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split


# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0


# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores


# Split a dataset based on an attribute and an attribute value
def test_split(index, value, dataset):
    left, right = list(), list()
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right


# Calculate the Gini index for a split dataset
def gini_index(groups, class_values):
    gini = 0.0
    for class_value in class_values:
        for group in groups:
            size = len(group)
            if size == 0:
                continue
            proportion = [row[-1] for row in group].count(class_value) / float(size)
            gini += (proportion * (1.0 - proportion))
    return gini


# Select the best split point for a dataset
def get_split(dataset):
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score, b_groups = 999, 999, 999, None
    for index in range(len(dataset[0]) - 1):
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
    return {'index': b_index, 'value': b_value, 'groups': b_groups}


# Create a terminal node value
def to_terminal(group):
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)


# Create child splits for a node or make terminal
def split(node, max_depth, min_size, depth):
    left, right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    # check for max depth
    if depth >= max_depth:
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return
    # process left child
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        node['left'] = get_split(left)
        split(node['left'], max_depth, min_size, depth + 1)
    # process right child
    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        node['right'] = get_split(right)
        split(node['right'], max_depth, min_size, depth + 1)


# Build a decision tree
def build_tree(train, max_depth, min_size):
    root = get_split(train)
    split(root, max_depth, min_size, 1)
    return root


# Make a prediction with a decision tree
def predict(node, row):
    if row[node['index']] < node['value']:
        if isinstance(node['left'], dict):
            return predict(node['left'], row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return predict(node['right'], row)
        else:
            return node['right']


# Classification and Regression Tree Algorithm
def decision_tree(train, test, max_depth, min_size):
    tree = build_tree(train, max_depth, min_size)
    predictions = list()
    for row in test:
        prediction = predict(tree, row)
        predictions.append(prediction)
    return predictions


# Test CART on Bank Note dataset
seed(1)
# load and prepare data
filename = 'data_banknote_authentication.csv'
dataset = load_csv(filename)
# convert string attributes to float
for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)
# evaluate algorithm
n_folds = 5
max_depth = 5
min_size = 10
scores = evaluate_algorithm(dataset, decision_tree, n_folds, max_depth, min_size)
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores) / float(len(scores))))

Note: the gini_index calculation in this program leaves out the p(Di) = |Di| / |D| weighting of each group, which lowers the resulting accuracy quite a bit.
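
A sketch of a group-size-weighted variant that restores the p(Di) factor (same inputs as gini_index above; the function name is mine):

```python
def gini_index_weighted(groups, class_values):
    # weight each group's impurity by p(Di) = |Di| / |D|
    n_instances = float(sum(len(group) for group in groups))
    gini = 0.0
    for group in groups:
        size = len(group)
        if size == 0:
            continue
        score = 0.0
        for class_value in class_values:
            proportion = [row[-1] for row in group].count(class_value) / float(size)
            score += proportion * (1.0 - proportion)
        gini += (size / n_instances) * score
    return gini
```

Swapping this in for gini_index inside get_split should recover the missing p(Di) factor.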


https://docs.python.org/2/library/re.html#regular-expression-objects  // Python library reference site


spark with ml //pyspark

1) Initialize: sc = SparkContext('local[4]', 'app name')

2) Form an RDD: rdd = sc.textFile(filename)  or  rdd = sc.parallelize(list/array)

3) RDD transformations: map(), groupByKey(), reduceByKey(), sortByKey(), sortBy(a function we write ourselves), filter(), combineByKey()

4) Merging two RDDs: join(), leftOuterJoin(), rightOuterJoin(), fullOuterJoin(). Note that a join result only keeps two value columns, so before joining, pack the value columns into one tuple per key, (key, (v1, v2, ...)), or data will be lost after the merge. union() simply appends one RDD to the other, much like push_back in C++.

5) To make the result reproducible across runs in Spark, sorting the data with sortByKey alone is not enough; instead, first merge key + value into a single string, then use sortBy, so every run produces the same ordering.

6) Action functions: take()  // the result may differ between runs; takeOrdered(number, key=sortfunction)  // if key is not set, it returns the smallest elements. A short pyspark sketch of points 1)-6) follows, before point 7).
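
A minimal pyspark sketch of the basics above, assuming a local SparkContext; the file name 'data.txt' and the toy key-value data are illustrative:

```python
from pyspark import SparkContext

sc = SparkContext('local[4]', 'rdd basics')

# form RDDs: from a text file and from a Python list
lines = sc.textFile('data.txt')
pairs = sc.parallelize([('a', 1), ('b', 2), ('a', 3)])

# transformations: map, reduceByKey, filter, sortByKey
word_pairs = lines.map(lambda line: (line.split(' ')[0], 1))
counts = (pairs.reduceByKey(lambda x, y: x + y)
               .filter(lambda kv: kv[1] > 1)
               .sortByKey())

# join: pack the value columns into one tuple per key before joining
left = sc.parallelize([('a', (1, 'x')), ('b', (2, 'y'))])
right = sc.parallelize([('a', 10), ('c', 30)])
joined = left.join(right)          # ('a', ((1, 'x'), 10)); keys missing on either side are dropped
outer = left.leftOuterJoin(right)  # keeps ('b', ((2, 'y'), None))

# actions
print(counts.collect())
print(pairs.take(2))                               # order is not guaranteed
print(right.takeOrdered(1, key=lambda kv: kv[1]))  # smallest by value
```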

7) Collaborative filtering

   1) Split the input movie ratings: trainingRDD, validationRDD, testRDD = ratingsRDD.randomSplit([6, 2, 2], seed=0)

   2) from pyspark.mllib.recommendation import ALS

      1) Choose the model with the lowest validation error:

seed = 5, regularization parameter lambda_ = 0.1, rank chosen from 4, 8, 12; computeError is the RMSE = sqrt( Σ_i (h(x_i) - y_i)^2 / n )

ranks = [4, 8, 12]
errors = [0] * len(ranks)
err = 0
minError = float('inf')
bestRank = -1
for rank in ranks:
    model = ALS.train(trainingRDD, rank, seed=seed, iterations=iterations,
                      lambda_=regularizationParameter)
    predictedRatingsRDD = model.predictAll(validationForPredictRDD)
    error = computeError(predictedRatingsRDD, validationRDD)
    errors[err] = error
    err += 1
    print('For rank %s the RMSE is %s' % (rank, error))
    if error < minError:
        minError = error
        bestRank = rank
   2) Take the model with the best (lowest) RMSE, then run it on testRDD to get the final prediction results.
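
computeError is not defined in the note; the following is a plausible sketch under the assumption that both RDDs contain (user, product, rating) triples, as ALS's predictAll returns:

```python
import math

def computeError(predictedRDD, actualRDD):
    # key both RDDs by (user, product), join them, and average the squared rating differences
    predictedPairs = predictedRDD.map(lambda r: ((r[0], r[1]), r[2]))
    actualPairs = actualRDD.map(lambda r: ((r[0], r[1]), r[2]))
    squaredErrors = (predictedPairs.join(actualPairs)
                                   .map(lambda kv: (kv[1][0] - kv[1][1]) ** 2))
    count = squaredErrors.count()
    return math.sqrt(squaredErrors.reduce(lambda a, b: a + b) / float(count))
```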


     

Andrew Ng's "Machine Learning" course on Coursera is one of the most popular introductory machine learning courses worldwide. It covers both the fundamentals and practical applications of machine learning, including supervised learning, unsupervised learning, neural networks, support vector machines, clustering, dimensionality reduction, and more. Course notes are an important study aid, usually compiled by learners from the lectures themselves or adapted from summaries shared by the community. An overview of part of the core content follows:

### Course Overview

#### 1. Definition of machine learning
The course opens with two definitions of machine learning:
- Arthur Samuel (1959) defined machine learning as "the field of study that gives computers the ability to learn without being explicitly programmed".
- Tom Mitchell (1998) gave a more formal definition: a program is said to learn from experience $E$ with respect to task $T$ and performance measure $P$ if its performance on $T$, as measured by $P$, improves with experience $E$ [^3].

#### 2. Linear regression
Week 2 covers single-variable and multivariate linear regression. Linear regression is a supervised learning method for predicting continuous output values. The model is fit to the training data by minimizing a cost function such as the mean squared error, and gradient descent is widely used to optimize the parameters [^4].

#### 3. Logistic regression and regularization
Week 3 introduces logistic regression for classification problems. Logistic regression uses the sigmoid function to map a linear output to a probability and optimizes the parameters by maximum likelihood. Regularization techniques (such as L1 and L2) are introduced to prevent overfitting [^4].

#### 4. Neural networks
The course presents the basic structure and training of neural networks, including forward propagation and backpropagation. Neural networks can handle complex non-linear problems and are the core foundation of deep learning [^2].

#### 5. Unsupervised learning and clustering
The unsupervised learning part covers clustering methods such as K-Means and dimensionality reduction techniques such as principal component analysis (PCA). It also mentions the "cocktail party problem", i.e. algorithms that separate an individual voice or piece of music from a mixture of sounds [^5].

#### 6. Support vector machines (SVM)
SVM is a powerful classifier that finds the optimal hyperplane maximizing the margin between classes. The course covers the mathematical foundations of SVM and its use with different kernel functions.

#### 7. Learning theory and diagnostics
The course also covers how to evaluate model performance, diagnose overfitting and underfitting, use cross-validation and learning curves, and other practical techniques that help learners improve their models [^1].

---

### Example: linear regression (Python)

A simple linear regression example using Scikit-Learn:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# generate synthetic data
X = np.random.rand(100, 1) * 10
y = 2.5 * X + 1.2 + np.random.randn(100, 1) * 2

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# predict
y_pred = model.predict(X_test)

# visualize
plt.scatter(X_test, y_test, color='blue', label='True values')
plt.plot(X_test, y_pred, color='red', label='Predicted values')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()
```

---