<Machine Learning Notes 01><scikit-learn 01> Machine Learning Fundamentals

  1. Purpose
    I plan to study machine learning. To consolidate and review what I learn in a timely manner, and to exchange ideas with fellow learners, I have decided to document the process as shared notes; it also serves as a way to keep myself motivated.

  2. Structure, approach & references:
    a. From an engineer's standpoint, solving a problem usually involves two processes: making building blocks and assembling them. In machine learning, inventing new algorithms and packaging them up is the block-making part; applying those algorithms to real engineering problems and integrating them into concrete projects is the assembling part. An engineer should understand how the blocks are made, but it is even more important to know how to use them. These notes will therefore try to cover the principles of each algorithm and apply it to practical projects.
    b. A very steep learning curve is discouraging; the way to flatten it is to learn in an upward spiral. Start by combining study with practice: learn by doing on real projects, strengthen your grasp of complex theory and concepts, and build the ability to solve real problems with machine learning. After a project is done, revisiting those complicated formulas may feel pleasantly familiar and far less intimidating.
    c. scikit-learn is a machine learning library for Python. It supports classification, regression, dimensionality reduction, clustering, and other algorithms, and includes modules for feature extraction, data preprocessing, and model evaluation. It is well documented and quick to pick up, making it an excellent tool for getting started with machine learning.
    d. Reference book: "Mastering Machine Learning With scikit-learn" is recommended; both the Chinese and English (non-scanned) editions, together with the source code and data, are available in CSDN's downloads section.

  3. Setting up the Python and scikit-learn development environment
    The Anaconda Python distribution is strongly recommended; it supports Windows, Mac OS, Linux, and other platforms. After installation it already includes scikit-learn, NumPy, pandas, and other libraries, which saves the trouble of configuring the environment yourself; a quick sanity check is shown below.
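    A minimal sanity check, assuming a standard Anaconda install (version numbers will vary by release):

      # Verify that the bundled libraries import and report their versions.
      import sklearn
      import numpy
      import pandas

      print("scikit-learn:", sklearn.__version__)
      print("NumPy:", numpy.__version__)
      print("pandas:", pandas.__version__)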

  4. Machine learning fundamentals:
    a. The foundation of machine learning is generalization: inferring unknown regularities from known data;
    b. Tom Mitchell's definition: "A program can be said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E";

  5. Supervised and unsupervised learning
    a. Supervised learning: the input and output observations come already labeled as pairs; the program learns from examples with correct answers and must respond correctly to similar, unseen problems. The input variables (the measured phenomena) are called features or explanatory variables; the output is called the response variable;
    b. Unsupervised learning: the observations are unlabeled; the program must discover patterns in the unlabeled data;
    c. Reinforcement learning: a semi-supervised case; the program receives feedback on its decisions, but the feedback is not directly tied to any individual decision;

  6. Machine learning tasks:
    a. Supervised learning: classification and regression. Classification predicts the type, category, or label of a new observation from a number of explanatory variables; regression predicts the value of a continuous variable;
    b. Unsupervised learning: clustering and dimensionality reduction. Clustering groups observations into classes according to some similarity measure; dimensionality reduction is the process of discovering the explanatory variables that most affect the response variable (see the sketch after this list);
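    To illustrate the two unsupervised tasks, here is a minimal sketch; the synthetic blob data, the number of clusters, and the number of components are all assumptions chosen for demonstration:

      from sklearn.cluster import KMeans
      from sklearn.datasets import make_blobs
      from sklearn.decomposition import PCA

      # Synthetic data: 200 observations in 5 dimensions around 3 centers.
      X, _ = make_blobs(n_samples=200, n_features=5, centers=3, random_state=0)

      # Clustering: group observations by similarity into 3 clusters.
      labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)

      # Dimensionality reduction: keep the 2 directions of greatest variance.
      X_2d = PCA(n_components=2).fit_transform(X)

      print(labels[:10])  # cluster assignments of the first 10 observations
      print(X_2d.shape)   # (200, 2)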

  7. Training data and test data
    a. Training set: the collection of examples that makes up the experience for supervised learning;
    b. Test set: the collection of examples used to evaluate the program's performance;
    c. Validation set: the collection of examples used to tune hyperparameters, the variables that control how the model learns;
    d. A common split of the observations in supervised learning: training set (50%), test set (25%), validation set (25%);
    e. Over-fitting vs. under-fitting: over-fitting is when a hypothesis fits the training set well but fails to fit data outside the training set; common causes are noise in the data or too little training data. Regularization can reduce the degree of over-fitting;
    f. "Garbage in, garbage out": supervised learning must be trained on a representative, correctly labeled data set; a large amount of poor data does not necessarily train better than a small amount of good data.
    g. Cross-validation: training and testing the algorithm several times on the same data; useful when the training data are scarce. The data are partitioned into N blocks; the algorithm is trained on N-1 blocks and tested on the remaining block, rotating so that each block serves as the test set exactly once (see the sketch after this list);
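    A minimal cross-validation sketch; the iris data and the logistic regression estimator are assumptions chosen for demonstration:

      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score

      X, y = load_iris(return_X_y=True)

      # 5-fold cross-validation: split the data into 5 blocks, train on 4,
      # test on the remaining one, and rotate until every block has been tested.
      scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
      print(scores)         # one accuracy score per fold
      print(scores.mean())  # average performance across the folds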

  8. Performance evaluation: bias and variance
    a. In supervised learning, two fundamental measures characterize prediction error: bias and variance. High variance indicates over-fitting of the training data; high bias indicates under-fitting;
    b. The bias-variance trade-off: in practice the two pull in opposite directions; lowering one tends to raise the other;
    c. Unsupervised learning: there is no prediction error to measure; instead, some attributes of the discovered data structure are evaluated, with methods specific to the task at hand;
    d. A supervised evaluation example: predicting malignant tumors, where each prediction is a true positive (TP), true negative (TN), false positive (FP), or false negative (FN):
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    In this scenario recall matters more than the other metrics, because a false negative (an undetected malignant tumor) is the costliest error; see the sketch below.
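    A minimal sketch of these metrics with scikit-learn; the label vectors are made-up examples where 1 marks a malignant tumor:

      from sklearn.metrics import accuracy_score, precision_score, recall_score

      # Hypothetical ground truth and predictions: 1 = malignant, 0 = benign.
      y_true = [1, 1, 1, 0, 0, 0, 1, 0]
      y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

      print("accuracy: ", accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+TN+FP+FN)
      print("precision:", precision_score(y_true, y_pred))  # TP/(TP+FP)
      print("recall:   ", recall_score(y_true, y_pred))     # TP/(TP+FN)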
