Regression, Classification, Clustering

This article surveys common machine learning algorithms for regression, classification, and clustering tasks. It covers classic algorithms such as linear regression, decision trees, and support vector machines along with their strengths and weaknesses, and discusses typical application scenarios for clustering methods such as K-Means.

1. Regression

Regression is the supervised learning task for modeling and predicting continuous, numeric variables. Examples include predicting real-estate prices, stock price movements, or student test scores.

Regression tasks are characterized by labeled datasets that have a numeric target variable. In other words, you have some "ground truth" value for each observation that you can use to supervise your algorithm.


1.1. (Regularized) Linear Regression

Linear regression is one of the most common algorithms for the regression task. In its simplest form, it attempts to fit a straight hyperplane to your dataset (i.e. a straight line when you only have 2 variables). As you might guess, it works well when there are linear relationships between the variables in your dataset.

In practice, simple linear regression is often outclassed by its regularized counterparts (LASSO, Ridge, and Elastic-Net). Regularization is a technique for penalizing large coefficients in order to avoid overfitting, and the strength of the penalty should be tuned.

  • Strengths: Linear regression is straightforward to understand and explain, and can be regularized to avoid overfitting. In addition, linear models can be updated easily with new data using stochastic gradient descent.
  • Weaknesses: Linear regression performs poorly when there are non-linear relationships. Linear models are not naturally flexible enough to capture more complex patterns, and adding the right interaction terms or polynomials can be tricky and time-consuming.
  • Implementations: Python / R
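
To make the regularization point above concrete, here is a minimal sketch. It assumes scikit-learn and uses synthetic data; the alpha grid and model settings are illustrative choices, not recommendations. Each *CV estimator tunes the penalty strength (alpha) by cross-validation, and the chosen value ends up in its alpha_ attribute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV

# Synthetic data standing in for a real labeled regression dataset.
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# Candidate penalty strengths; each *CV estimator picks one by cross-validation.
alphas = np.logspace(-3, 3, 13)
models = {
    "ridge": RidgeCV(alphas=alphas),
    "lasso": LassoCV(alphas=alphas, random_state=0),
    "elastic-net": ElasticNetCV(alphas=alphas, l1_ratio=0.5, random_state=0),
}

for name, model in models.items():
    model.fit(X, y)
    print(name, "chosen alpha:", model.alpha_)
```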
1.2. Regression Tree (Ensembles)

Regression trees (a.k.a. decision trees) learn in a hierarchical fashion by repeatedly splitting your dataset into separate branches that maximize the information gain of each split. This branching structure allows regression trees to naturally learn non-linear relationships.

Ensemble methods, such as Random Forests (RF) and Gradient Boosted Trees (GBM), combine predictions from many individual trees. We won't go into their underlying mechanics here, but in practice, RF's often perform very well out-of-the-box while GBM's are harder to tune but tend to have higher performance ceilings.

  • Strengths: Decision trees can learn non-linear relationships, and are fairly robust to outliers. Ensembles perform very well in practice, winning many classical (i.e. non-deep-learning) machine learning competitions.
  • Weaknesses: Unconstrained, individual trees are prone to overfitting because they can keep branching until they memorize the training data. However, this can be alleviated by using ensembles.
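
To make the ensemble comparison above concrete, here is a minimal sketch (assuming scikit-learn; the synthetic data and hyperparameter values are illustrative, not tuned) that fits a random forest and a gradient-boosted regressor on the same data and compares their held-out R² scores.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic regression data; a real dataset would replace this.
X, y = make_regression(n_samples=1000, n_features=10, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random forests usually do well out of the box.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
# Gradient boosting exposes more knobs (learning_rate, n_estimators, max_depth, ...).
gbm = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=3,
                                random_state=0)

for name, model in {"random forest": rf, "gradient boosting": gbm}.items():
    model.fit(X_train, y_train)
    print(name, "test R^2:", round(r2_score(y_test, model.predict(X_test)), 3))
```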
2. Clustering

In machine learning, regression, classification, and clustering are three core task types. Clustering differs from the other two as follows.

Clustering is an unsupervised learning method. Its goal is to partition the samples in a dataset into groups (called "clusters") so that samples within the same group are highly similar while samples in different groups are not.

Unlike classification and regression:

  • Classification is supervised: each observation has a discrete label.
  • Regression is also supervised: the target is a continuous value.
  • Clustering is unsupervised: there are no labels at all.

The most commonly used clustering algorithms are summarized below. In the code snippets, X denotes the feature matrix to be clustered.

2.1. K-Means

  • The most classic clustering algorithm.
  • Requires the number of clusters k to be specified in advance.
  • Optimizes by iteratively minimizing the distance between each sample and the center of the cluster it is assigned to.

```python
from sklearn.cluster import KMeans

# X is the (n_samples, n_features) feature matrix to cluster
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
labels = kmeans.predict(X)
```

2.2. Hierarchical Clustering

  • Builds a tree structure that represents nested clusterings of the data.
  • Comes in agglomerative (bottom-up) and divisive (top-down) variants.

```python
from sklearn.cluster import AgglomerativeClustering

cluster = AgglomerativeClustering(n_clusters=3)
labels = cluster.fit_predict(X)
```

2.3. DBSCAN (density-based spatial clustering)

  • Does not require the number of clusters to be specified in advance.
  • Can discover clusters of arbitrary shape and identify noise points.

```python
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)
```

2.4. Mean Shift

  • A non-parametric clustering method based on density estimation.
  • Determines the number of clusters automatically.

```python
from sklearn.cluster import MeanShift

ms = MeanShift()
labels = ms.fit_predict(X)
```

2.5. Gaussian Mixture Models (GMM)

  • Models the data with a probabilistic model.
  • Each cluster is assumed to follow a Gaussian distribution.

```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3)
gmm.fit(X)
labels = gmm.predict(X)
```

2.6. Spectral Clustering

  • Uses graph-theoretic ideas to cluster the data; well suited to data with complex structure.

```python
from sklearn.cluster import SpectralClustering

sc = SpectralClustering(n_clusters=3)
labels = sc.fit_predict(X)
```

Summary comparison:

| Algorithm | Needs cluster count specified | Handles noise | Finds arbitrarily shaped clusters | Typical uses |
|-----------|-------------------------------|---------------|-----------------------------------|--------------|
| K-Means | Yes | No | No | Fast clustering, image compression |
| Hierarchical clustering | Yes | No | No | Tree-structure analysis |
| DBSCAN | No | Yes | Yes | Anomaly detection, spatial clustering |
| Mean Shift | No | Yes | Yes | Image segmentation |
| GMM | Yes | No | No | Probabilistic modeling |
| Spectral clustering | Yes | No | Yes | Social network analysis |
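
To tie the comparison table back to running code, here is a minimal end-to-end sketch. It assumes scikit-learn, generates synthetic blobs with make_blobs, and uses the silhouette score as an internal quality measure, since clustering has no ground-truth labels to score against; the choice of k=3 and the DBSCAN eps value are illustrative only.

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups.
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.8, random_state=0)
X = StandardScaler().fit_transform(X)  # most clustering algorithms are scale-sensitive

for name, model in {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "dbscan": DBSCAN(eps=0.3, min_samples=5),
}.items():
    labels = model.fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # DBSCAN marks noise as -1
    msg = f"{name}: {n_clusters} clusters"
    if len(set(labels)) > 1:  # silhouette needs at least two distinct labels
        msg += f", silhouette={silhouette_score(X, labels):.3f}"
    print(msg)
```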