1. Regression
Regression is the supervised learning task for modeling and predicting continuous, numeric variables. Examples include predicting real-estate prices, stock price movements, or student test scores.
Regression tasks are characterized by labeled datasets that have a numeric target variable. In other words, you have some "ground truth" value for each observation that you can use to supervise your algorithm.

1.1. (Regularized) Linear Regression
Linear regression is one of the most common algorithms for the regression task. In its simplest form, it attempts to fit a straight hyperplane to your dataset (i.e. a straight line when you only have 2 variables). As you might guess, it works well when there are linear relationships between the variables in your dataset.
In practice, simple linear regression is often outclassed by its regularized counterparts (LASSO, Ridge, and Elastic-Net). Regularization is a technique for penalizing large coefficients in order to avoid overfitting, and the strength of the penalty should be tuned.
- Strengths: Linear regression is straightforward to understand and explain, and can be regularized to avoid overfitting. In addition, linear models can be updated easily with new data using stochastic gradient descent.
- Weaknesses: Linear regression performs poorly when there are non-linear relationships. Linear models are not naturally flexible enough to capture more complex patterns, and adding the right interaction terms or polynomials can be tricky and time-consuming.
- Implementations: Python / R
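
To make this concrete, here is a minimal scikit-learn sketch that fits plain linear regression alongside its LASSO, Ridge, and Elastic-Net counterparts, tuning the penalty strength by cross-validation. The synthetic dataset and the alpha grid are illustrative assumptions, not prescriptions:

```python
# A minimal sketch of (regularized) linear regression in scikit-learn.
# The synthetic dataset and alpha grid below are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

alphas = np.logspace(-3, 2, 20)  # candidate penalty strengths to tune
models = {
    "ols": LinearRegression(),
    "lasso": LassoCV(alphas=alphas),             # L1 penalty, drives some coefficients to zero
    "ridge": RidgeCV(alphas=alphas),             # L2 penalty, shrinks all coefficients
    "elastic-net": ElasticNetCV(alphas=alphas),  # mix of L1 and L2 penalties
}

for name, model in models.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name:>11}: test MSE = {mse:.1f}")
```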
1.2. Regression Tree (Ensembles)
Regression trees (a.k.a. decision trees) learn in a hierarchical fashion by repeatedly splitting your dataset into separate branches that maximize the information gain of each split. This branching structure allows regression trees to naturally learn non-linear relationships.
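
As a quick illustration of that non-linear flexibility, the sketch below fits a single regression tree to a noisy sine wave, a shape no straight line could capture. The dataset and the max_depth cap are illustrative assumptions:

```python
# A minimal sketch: a single regression tree fitting a non-linear (sine) signal.
# The dataset and the max_depth value are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(200, 1), axis=0)     # one feature on [0, 5)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)  # noisy sine target

tree = DecisionTreeRegressor(max_depth=4)  # depth cap limits how far branching can go
tree.fit(X, y)

X_new = np.linspace(0, 5, 5).reshape(-1, 1)
print(tree.predict(X_new))  # piecewise-constant predictions track the sine curve
```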
Ensemble methods, such as Random Forests (RF) and Gradient Boosted Trees (GBM), combine predictions from many individual trees. We won't go into their underlying mechanics here, but in practice, RFs often perform very well out-of-the-box while GBMs are harder to tune but tend to have higher performance ceilings (a comparison sketch follows the list below).
- Strengths: Decision trees can learn non-linear relationships, and are fairly robust to outliers. Ensembles perform very well in practice, winning many classical (i.e. non-deep-learning) machine learning competitions.
- Weaknesses: Unconstrained, individual trees are prone to overfitting because they can keep branching until they memorize the training data. However, this can be alleviated by using ensembles.
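
Here is a hedged comparison sketch using scikit-learn's RandomForestRegressor and GradientBoostingRegressor; the synthetic dataset and hyperparameter values are illustrative assumptions rather than tuned recommendations:

```python
# A minimal sketch comparing Random Forest and Gradient Boosted Trees.
# The dataset and hyperparameters are illustrative assumptions only.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=20, noise=15.0, random_state=0)

# RF: averages many decorrelated trees; strong results with default settings.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
# GBM: builds shallow trees sequentially; learning_rate and depth usually need tuning.
gbm = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3,
                                random_state=0)

for name, model in [("random forest", rf), ("gradient boosting", gbm)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean CV R^2 = {scores.mean():.3f}")
```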