Machine Learning and AI have been a growing and expanding field of interest and work for several years, attracting many undergraduates and working professionals eager to join and explore. So, if you are a beginner who needs help with Decision Trees and their family of ensemble methods, this story is for you.
Introduction
The aim is to correctly classify the type of glass based on the amounts of the elements it contains (such as Ca and Mg) and its refractive index.

As you can see above, we have 214 rows and 10 columns. The first nine columns are the features (independent variables), and the last column, ‘Type’, is our target variable: it describes the kind (class) of the glass.
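As a concrete illustration, here is how the data might be loaded and split into features and target with pandas. The inline rows are a small hypothetical sample in the layout of the UCI Glass Identification data set; in practice you would read your actual file, e.g. `pd.read_csv("glass.csv")`.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the glass data's column layout;
# the real data set has 214 rows and would come from a CSV on disk.
csv_data = io.StringIO(
    "RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type\n"
    "1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.00,1\n"
    "1.51761,13.89,3.60,1.36,72.73,0.48,7.83,0.00,0.00,1\n"
    "1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.00,0.00,2\n"
)
df = pd.read_csv(csv_data)

X = df.drop(columns="Type")  # first nine columns: the features
y = df["Type"]               # last column: the target class
print(df.shape)              # (3, 10) here; (214, 10) for the full data
```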

We have six classes in our data set. There is also a high class imbalance: the number of examples per class differs widely. This can cost our model some accuracy, because it may become biased toward the majority classes.
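A quick way to see the imbalance is `value_counts()` on the target column. The counts below are illustrative stand-ins in the spirit of the glass data; with the real data you would simply call `df["Type"].value_counts()`.

```python
import pandas as pd

# Illustrative per-class counts mimicking a heavily imbalanced target;
# replace this Series with df["Type"] from the loaded glass data.
y = pd.Series([1] * 70 + [2] * 76 + [3] * 17 + [5] * 13 + [6] * 9 + [7] * 29)

counts = y.value_counts()
print(counts)

# The largest class has many times as many examples as the smallest, so
# plain accuracy can look good while rare classes are predicted poorly.
imbalance_ratio = counts.max() / counts.min()
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```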
In cases where we have more than two classes, it is better to use a classifier with a non-linear decision boundary, which can give more accurate predictions. Some examples of non-linear classification algorithms are kernelized Support Vector Machines, Decision Trees, and even a Logistic Regression model with polynomial features.
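To make the last point concrete, here is a small sketch on a toy non-linear problem (two concentric circles, a standard scikit-learn synthetic data set, not the glass data): a plain logistic regression cannot separate the classes, while the same model on degree-2 polynomial features can.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy problem: two concentric circles that no straight line can separate.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

# Linear decision boundary vs. a quadratic one built from polynomial features.
linear = LogisticRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression()).fit(X, y)

print(f"linear boundary accuracy:     {linear.score(X, y):.2f}")
print(f"polynomial boundary accuracy: {poly.score(X, y):.2f}")
```

The degree-2 features include x², y², and xy, which let the logistic model learn a circular boundary it could never express with raw coordinates alone.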
After some data preprocessing and a train-test split, we will create our models, using the same training and test sets for each of them.
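The split step might look like the sketch below. The features and labels here are synthetic stand-ins (the article does not show its exact preprocessing); with the glass data, `X` and `y` would come from `df.drop(columns="Type")` and `df["Type"]`. `stratify=y` is worth noting because of the imbalance discussed above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in data shaped like the glass set: 214 samples, 9 features.
X, y = make_classification(n_samples=214, n_features=9, n_informative=5,
                           n_classes=3, random_state=42)

# stratify=y keeps each class's proportion similar in train and test,
# which matters with imbalanced classes; random_state makes the split
# reproducible so every model is evaluated on the same data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)
```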
Part I: Decision Trees
Decision Trees, as described in Stanford's CS-229 class, are a greedy, top-down, recursive partitioning algorithm. Their advantages include non-linearity, support fo