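Kaggle Learn: Learn Machine Learning 1. How Models Work

This article introduces the decision tree model in machine learning, including its basic principles and workflow. A simple example shows how to use a decision tree to predict house prices and how to improve the model to make its predictions more accurate.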


1. How Models Work




Notebook

This is the first step in Kaggle's Learn Machine Learning series.

Introduction

We'll start with an overview of how machine learning models work and how they are used. This may feel basic if you've done statistical modeling or machine learning before. Don't worry, we will progress to building powerful models soon.

 

The course will have you build models for the following scenario:

 

Your cousin has made millions of dollars speculating on real estate. He's offered to become business partners with you because of your interest in data science. He'll supply the money, and you'll supply models that predict how much various houses are worth.

 

You ask your cousin how he's predicted real estate values in the past, and he says it is just intuition. But more questioning reveals that he's identified price patterns from houses he has seen in the past, and he uses those patterns to make predictions for new houses he is considering.

 

Machine learning works the same way. We'll start with a model called the Decision Tree. There are fancier models that give more accurate predictions. But decision trees are easy to understand, and they are the basic building block for some of the best models in data science.

 

For simplicity, we'll start with the simplest possible decision tree.

 


 


 

It divides houses into only two categories. You predict the price of a new house by finding out which category it's in, and the prediction is the historical average price from that category.
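
The two-category idea can be written out by hand. The sketch below is only an illustration: the sales figures, the choice of bedrooms as the splitting feature, and the threshold of 2 are all assumptions made up for the example, not values from the course.

```python
from statistics import mean

# Hypothetical historical sales: (number of bedrooms, sale price in dollars).
sales = [(1, 150_000), (2, 170_000), (2, 190_000), (3, 250_000), (4, 310_000)]

# Assumed split: "2 bedrooms or fewer" versus "more than 2 bedrooms".
THRESHOLD = 2

# Historical average price of each of the two categories.
small_avg = mean(price for beds, price in sales if beds <= THRESHOLD)
large_avg = mean(price for beds, price in sales if beds > THRESHOLD)


def predict_price(bedrooms):
    """Return the historical average price of the category the house falls in."""
    return small_avg if bedrooms <= THRESHOLD else large_avg
```

Calling `predict_price(3)` returns the average of the "more than 2 bedrooms" group, regardless of anything else about the house.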

 

This captures the relationship between house size and price. We use data to decide how to break the houses into two groups, and then again to determine the predicted price in each group. This step of capturing patterns from data is called fitting or training the model. The data used to fit the model is called the training data.
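
To make "fitting" concrete, here is a minimal sketch of how a split could be chosen from training data. It assumes one common criterion (pick the threshold that minimizes the total squared error around each group's mean); the course defers the real details, so treat this as one plausible approach rather than the actual algorithm. The function name `fit_stump` and the data are hypothetical.

```python
def fit_stump(sizes, prices):
    """Choose a threshold on `sizes` and a mean price for each side.

    Assumed criterion: minimize the summed squared error of prices
    around the mean of their own group.
    """
    best = None
    # The largest value is skipped because it would leave the right group empty.
    for threshold in sorted(set(sizes))[:-1]:
        left = [p for s, p in zip(sizes, prices) if s <= threshold]
        right = [p for s, p in zip(sizes, prices) if s > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((p - left_mean) ** 2 for p in left)
                 + sum((p - right_mean) ** 2 for p in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


# Hypothetical training data: bedrooms and sale prices of past houses.
bedrooms = [1, 2, 2, 3, 4]
prices = [150_000, 170_000, 190_000, 250_000, 310_000]

threshold, low_price, high_price = fit_stump(bedrooms, prices)
```

The data is used twice, as the text describes: once to decide where to split, and again to compute the predicted price on each side of the split.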

 

The details of how the model is fit (e.g. how to split up the data) are complex enough that we will save them for later. After the model has been fit, you can apply it to new data to predict prices of additional homes.
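
The fit-then-predict flow might look like this with scikit-learn's `DecisionTreeRegressor`. The library choice, the toy data, and `max_depth=1` (which limits the tree to a single split) are assumptions for illustration; this lesson itself does not prescribe them.

```python
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training data: one feature (bedrooms) per house, plus sale prices.
X_train = [[1], [2], [2], [3], [4]]
y_train = [150_000, 170_000, 190_000, 250_000, 310_000]

# max_depth=1 restricts the tree to one split, i.e. two groups of houses.
model = DecisionTreeRegressor(max_depth=1, random_state=0)
model.fit(X_train, y_train)  # fitting: the split and leaf values come from the data

# Apply the fitted model to houses that were not in the training data.
new_houses = [[2], [5]]
predicted = model.predict(new_houses)
print(predicted)
```

Each prediction is the average training price of the group the new house falls into.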

Example:


Assuming your decision tree works in a sensible way, which of the two trees shown here do you think you might get from fitting this especially simple decision tree?

 

Improving the Decision Tree

The decision tree on the left (Decision Tree 1) probably makes more sense, because it captures the reality that houses with more bedrooms tend to sell at higher prices than houses with fewer bedrooms. The biggest shortcoming of this model is that it doesn't capture most factors affecting home price, like number of bathrooms, lot size, location, etc.

 

You can capture more factors using a tree that has more "splits." These are called "deeper" trees. A decision tree that also considers the total size of each house's lot might look like this:


You predict the price of any house by tracing through the decision tree, always picking the path corresponding to that house's characteristics. The predicted price for the house is at the bottom of the tree. The point at the bottom where we make a prediction is called a leaf.
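
Tracing a house down to a leaf can be sketched with a small nested structure. Every threshold and price below is a made-up placeholder, and the dictionary layout is just one convenient way to represent a tree, not how any particular library stores it.

```python
# Internal nodes hold a feature name, a threshold, and two subtrees.
# Leaves hold only a predicted price. All numbers are hypothetical.
tree = {
    "feature": "bedrooms", "threshold": 2,
    "no": {
        "feature": "lot_size", "threshold": 8_000,
        "no": {"price": 150_000},
        "yes": {"price": 190_000},
    },
    "yes": {
        "feature": "lot_size", "threshold": 11_000,
        "no": {"price": 170_000},
        "yes": {"price": 240_000},
    },
}


def predict(node, house):
    """Follow the branch matching the house's features until a leaf is reached."""
    while "price" not in node:
        exceeds = house[node["feature"]] > node["threshold"]
        node = node["yes"] if exceeds else node["no"]
    return node["price"]


house = {"bedrooms": 3, "lot_size": 12_000}
print(predict(tree, house))
```

At each node the house answers one question ("is this feature above the threshold?"), and the answer selects the next node; the loop ends when it lands on a leaf.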

 

The splits and values at the leaves will be determined by the data, so it's time for you to check out the data you will be working with.

 

Continue

You will write code as part of an ongoing data science project for the rest of the tutorials. Click here to get started.

   

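PS: This is only a translation of the Kaggle course; readers who are able to should read the original on Kaggle. (Difficult points I run into will also be recorded below as study notes.)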



 

### Kaggle Machine Learning Datasets and Tutorials

Kaggle is a platform that provides an extensive collection of datasets, kernels (notebooks), and competitions to help individuals learn about data science and machine learning[^1]. The following sections outline the resources available on Kaggle related to machine learning.

#### Datasets

Kaggle hosts numerous datasets covering domains such as healthcare, finance, and social media analysis. These datasets are curated by both organizations and individual contributors. Users can download them directly from the website or use the API provided by Kaggle for programmatic access[^2].

For example, one popular dataset in beginner-level projects is Titanic: Machine Learning from Disaster, where participants predict survival outcomes from passenger information such as age, gender, class, and fare paid.

#### Tutorials & Kernels

Tutorials fall into two categories: guided courses, offered in partnership with experts, which require registration but offer certification upon completion; and community-contributed notebooks known as "kernels".

Guided courses cover topics ranging from introductory Python programming all the way up to advanced neural networks, while community notebooks provide practical examples demonstrating how specific algorithms work on real-world problems, with code written primarily in either R or Python depending on user preference.

Here is a simple illustration of logistic regression in a Jupyter Notebook environment, using the scikit-learn library on the Iris flower classification problem:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Load the Iris dataset (150 flowers, 4 features, 3 species).
data = load_iris()

# Randomly split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)

# Fit the classifier; a higher max_iter gives the solver room to converge.
clf = LogisticRegression(random_state=0, max_iter=200).fit(X_train, y_train)

# Report accuracy on the held-out test set as a percentage.
print(f'Accuracy Score: {np.round(clf.score(X_test, y_test) * 100)}%')
```

This script loads the Iris sample set into memory, splits it randomly into training and testing groups, fits a logistic regression classifier (a multiclass one here, since Iris has three species), and prints the accuracy achieved on test cases the model did not see during training. Once the necessary packages are installed, you can run it on your own computer and start experimenting.