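I have recently been auditing a data mining course again, and the first classification algorithm it covers is the decision tree.
In short, a decision tree is a tree of if-then rules: each internal node tests one attribute, each branch corresponds to one value of that attribute, and each leaf gives a class label.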
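To make the "tree of if-then rules" idea concrete, here is a minimal hand-written sketch. The attribute names (outlook, humidity, windy), their values, and the function name classify are assumptions chosen for illustration; they are not taken from the course or from the training file used later.

```python
# Hypothetical hand-written decision tree for a weather-style dataset.
# Attribute names and values are assumptions for illustration only.
def classify(sample):
    """Return 'yes' or 'no' by walking nested if-then rules."""
    if sample["outlook"] == "sunny":        # internal node: test "outlook"
        if sample["humidity"] == "high":    # internal node: test "humidity"
            return "no"                     # leaf
        return "yes"                        # leaf
    elif sample["outlook"] == "overcast":
        return "yes"                        # leaf
    else:                                   # any other outlook, e.g. "rainy"
        if sample["windy"] == "true":       # internal node: test "windy"
            return "no"                     # leaf
        return "yes"                        # leaf


label = classify({"outlook": "sunny", "humidity": "high", "windy": "false"})
```

A learning algorithm such as ID3 builds this kind of nested structure automatically from training data instead of having it written by hand.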
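Input: training set D = {(x1,y1),(x2,y2),…,(xn,yn)};
       attribute set A = {a1,a2,…,ad}
Procedure: function DecisionTree(D, A)
    if all samples in D belong to the same class, mark the node as a leaf of that class; return
    if A is empty, mark the node as a leaf labeled with the majority class in D; return
    choose the best splitting attribute a* (here by computing the information gain, as in ID3) and create a node for it;
    for each value a*v of a* do
        create a branch, and let Dv be the subset of D whose value on a* is a*v;
        if Dv is empty then
            mark the branch as a leaf labeled with the majority class in D, then continue with the next value
        else
            use DecisionTree(Dv, A \ {a*}) as the branch node (each time a branch is created, the attribute already used is removed and the subtree is built recursively, until the attribute set is exhausted)
    end
Output: a decision tree.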
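The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the DecisionTree module used below: the function names (entropy, information_gain, majority_class, build_tree), the row-as-dict representation, the domains argument listing every possible value of each attribute, and the toy data are all hypothetical. The tree is returned as a nested dict of the form {attribute: {value: subtree_or_label}}.

```python
from collections import Counter
from math import log


def entropy(rows, target):
    """Shannon entropy (base 2) of the target column over rows."""
    counts = Counter(row[target] for row in rows)
    total = float(len(rows))
    return -sum((c / total) * log(c / total, 2) for c in counts.values())


def information_gain(rows, attr, target):
    """Entropy of rows minus the weighted entropy after splitting on attr."""
    total = float(len(rows))
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [row for row in rows if row[attr] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(rows, target) - remainder


def majority_class(rows, target):
    """Most frequent target value in rows."""
    return Counter(row[target] for row in rows).most_common(1)[0][0]


def build_tree(rows, attrs, target, domains):
    """Return a class label (leaf) or a nested dict {attr: {value: subtree}}."""
    classes = set(row[target] for row in rows)
    if len(classes) == 1:          # all samples share one class
        return classes.pop()
    if not attrs:                  # no attributes left to split on
        return majority_class(rows, target)
    # Pick the attribute with the highest information gain (ID3 criterion).
    best = max(attrs, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attrs if a != best]
    branches = {}
    for value in domains[best]:    # every possible value of the chosen attribute
        subset = [row for row in rows if row[best] == value]
        if not subset:             # empty branch: use the parent's majority class
            branches[value] = majority_class(rows, target)
        else:
            branches[value] = build_tree(subset, remaining, target, domains)
    return {best: branches}


# Toy data, assumed for illustration only.
rows = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]
domains = {"outlook": ["sunny", "overcast", "rainy"], "windy": ["false", "true"]}
tree = build_tree(rows, ["outlook", "windy"], "play", domains)
```

On this toy data, splitting on outlook yields a higher information gain than splitting on windy, so outlook becomes the root; only the rainy branch needs a further split on windy.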
Environment: macOS 10.12.3, PyCharm Community Edition
Code:
main.py
# Note: this listing uses Python 2 syntax (print statement, dict.keys()[0]).
import DecisionTree


def main():
    # Insert input file
    """
    IMPORTANT: Change this file path to change training data
    """
    file = open('WeatherTraining.csv')
    """
    IMPORTANT: Change this variable to change target attribute
    """
    target = "play"
    # Read the CSV: the first row holds the attribute names, the rest are samples
    data = [[]]
    for line in file:
        line = line.strip("\r\n")
        data.append(line.split(','))
    data.remove([])
    attributes = data[0]
    data.remove(attributes)
    # Run ID3
    tree = DecisionTree.makeTree(data, attributes, target, 0)
    print "generated decision tree"
    # Generate program
    file = open('program.py', 'w')
    file.write("import Node\n\n")
    # open input file
    file.write("data = [[]]\n")
    """
    IMPORTANT: Change this file path to change testing data
    """
    file.write("f = open('Soybean.csv')\n")
    # gather data
    file.write("for line in f:\n\tline = line.strip(\"\\r\\n\")\n\tdata.append(line.split(','))\n")
    file.write("data.remove([])\n")
    # input dictionary tree
    file.write("tree = %s\n" % str(tree))
    file.write("attributes = %s\n" % str(attributes))
    file.write("count = 0\n")
    file.write("for entry in data:\n")
    file.write("\tcount += 1\n")
    # copy dictionary
    file.write("\ttempDict = tree.copy()\n")
    file.write("\tresult = \"\"\n")
    # generate actual tree
    file.write("\twhile(isinstance(tempDict, dict)):\n")
    file.write("\t\troot = Node.Node(tempDict.keys()[0], tempDict[tempDict.keys()[0]])\n")
    file.write("\t\ttempDict = tempDict[tempDict.keys()[0]]\n")
    # this must be attribute