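Machine learning is a fairly broad subject. This post briefly introduces the concept of machine learning and its main categories. Machine learning overlaps with data mining in many places; for example, both use the same algorithms. The difference is that machine learning focuses on prediction, based on properties learned from the training data, whereas data mining focuses on discovering previously unknown properties in the data.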
1. Supervised Learning
Supervised learning (SL) is typically used to classify data by learning from labeled examples. Suppose we have a labeled training set T = {(l, c) ∈ I x C}, where I is the set of instances and C = {c1, …, ck} is the set of predefined classes. The task is to find a mapping m from instances to classes such that m(l) = c for the labeled pairs (l, c). After training the model, we can present new instances to m and compute their classes.
For instance, suppose you are an oncologist who sees many patients each day, and you want a model that predicts whether a tumor is benign or malignant. You can first use past cancer records to build a model from their characteristics, and then use that model to pre-screen each new patient based on the description of their condition. For example, the model might classify patients by age and tumor size: if the patient is older than 55 and the tumor is larger than 10 mm, the tumor is classified as malignant.
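To make the idea of learning a mapping m from a labeled set T concrete, here is a minimal sketch using a 1-nearest-neighbor rule. The choice of nearest neighbor, the helper name `train_1nn`, and the toy records are all assumptions made for illustration; they are not real patient data and not a method prescribed by the text.

```python
import math

# Toy labeled set T: ((age in years, tumor size in mm), class).
# These records are invented for illustration only.
T = [
    ((35, 4.0), "benign"),
    ((42, 6.5), "benign"),
    ((50, 8.0), "benign"),
    ((58, 12.0), "malignant"),
    ((63, 15.5), "malignant"),
    ((70, 11.0), "malignant"),
]

def train_1nn(training_set):
    """Return a mapping m that gives an instance the class of its
    nearest training instance (Euclidean distance)."""
    def m(instance):
        _, nearest_class = min(
            training_set, key=lambda pair: math.dist(pair[0], instance)
        )
        return nearest_class
    return m

m = train_1nn(T)

# Present new, unseen instances to m and compute their classes.
for patient in [(60, 13.0), (38, 5.0)]:
    print(patient, "->", m(patient))
```

In practice the two features would usually be rescaled first, since age and tumor size are measured in different units and the raw distance mixes them.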
2. Unsupervised Learning
Unsupervised learning (USL) is used for clustering; for instance, we can build different clusters for different topics. In contrast to SL, there are no predefined groups any longer; all that is given are the instances themselves. The problem we face is how to build the clusters, in other words, what the criterion for clustering should be: similarity or distance between instances? Formally, given a dataset I, our task is to group it into a set of clusters C, g(I) = C, and to find a mapping m between I and C, so that every instance l from I is assigned a cluster m(l) = c.
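As one possible answer to the question of the clustering criterion, the sketch below groups instances by Euclidean distance using a basic k-means loop. The function name `kmeans`, the naive initialization with the first k instances, the fixed iteration count, and the toy points are assumptions for illustration, not part of the original description.

```python
import math

def kmeans(instances, k, iterations=10):
    """Group instances into k clusters by Euclidean distance.
    Returns (assignment, centers), where assignment[i] is the
    cluster index of instances[i]."""
    # Assumption: naive initialization with the first k instances.
    centers = [instances[i] for i in range(k)]
    assignment = [0] * len(instances)
    for _ in range(iterations):
        # Assignment step: each instance joins its nearest center.
        assignment = [
            min(range(k), key=lambda j: math.dist(x, centers[j]))
            for x in instances
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [x for x, a in zip(instances, assignment) if a == j]
            if members:
                centers[j] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return assignment, centers

# Toy unlabeled dataset I (invented for illustration).
I = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
assignment, centers = kmeans(I, k=2)

def m(instance):
    """Mapping from an instance to the index of its nearest cluster center."""
    return min(range(len(centers)), key=lambda j: math.dist(instance, centers[j]))

print(assignment)
print(m((7.0, 7.5)))
```

Note that the number of clusters k has to be chosen up front here, and the result can depend on the initialization; other criteria, such as a similarity threshold, would lead to different groupings.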
3. Semi-supervised Learning
Semi-supervised learning lies between SL and USL: the training data consists of some labeled data and some unlabeled data, and the task is either classification or clustering. This learning method arises because unlabeled data is cheap while labeled data is hard to obtain. For example, human experts sometimes give wrong labels to instances, or special methods or devices are needed to label the data. In this setting, the amount of unlabeled data is much larger than the amount of labeled data.
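One simple way to use a small labeled set together with a larger unlabeled set is self-training: repeatedly give the most confidently guessed unlabeled instance a pseudo-label and then treat it as labeled. The sketch below uses distance to the nearest labeled instance as the confidence measure. This particular technique, the function name `self_train`, and the toy points are assumptions chosen for illustration; the text does not prescribe a specific algorithm.

```python
import math

def self_train(labeled, unlabeled):
    """Iteratively pseudo-label unlabeled instances with the class of
    their nearest already-labeled instance. Returns the grown labeled list."""
    labeled = list(labeled)      # copy so the caller's data is untouched
    remaining = list(unlabeled)
    while remaining:
        # Pick the (unlabeled, labeled) pair with the smallest distance,
        # i.e. the guess we are most confident about.
        x, (_, c) = min(
            ((u, pair) for u in remaining for pair in labeled),
            key=lambda item: math.dist(item[0], item[1][0]),
        )
        labeled.append((x, c))
        remaining.remove(x)
    return labeled

# Two labeled instances and six unlabeled ones (invented for illustration).
labeled = [((1.0, 1.0), "A"), ((8.0, 8.0), "B")]
unlabeled = [(1.5, 1.2), (2.5, 2.0), (7.5, 7.8), (6.5, 7.0), (3.5, 3.0), (5.8, 6.0)]

result = dict(self_train(labeled, unlabeled))
for instance, label in result.items():
    print(instance, "->", label)
```

A wrong pseudo-label early on can propagate to later instances, so real systems usually add a confidence threshold rather than labeling everything.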
These are the basic categories of machine learning.