GBDT vs. deep learning: a comparison and debate

This post discusses boosted decision trees beating deep belief networks on artificially hardened versions of the MNIST dataset, and whether that result qualifies them as a deep learning algorithm. The comment thread also takes up how to define the depth of different learning algorithms.


Boosted Decision Trees for Deep Learning

Tags: Deep, Machine Learning, Supervised — jl @ 11:18 am

About 4 years ago, I speculated that decision trees qualify as a deep learning algorithm because they can make decisions which are substantially nonlinear in the input representation. Ping Li has demonstrated this empirically at UAI, showing that boosted decision trees can beat deep belief networks on versions of MNIST which are artificially hardened so as to make them solvable only by deep learning algorithms.

This is an important point, because the ability to solve these sorts of problems is probably the best objective definition of a deep learning algorithm we have. I’m not that surprised. In my experience, if you can accept the computational drawbacks of a boosted decision tree, they can achieve pretty good performance.
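As a rough illustration of the kind of model under discussion, here is a minimal sketch that fits boosted decision trees to the small scikit-learn digits dataset. This is not Ping Li's setup (he used his own boosting variants on the artificially hardened MNIST sets); it only shows how little effort a boosted-tree baseline requires.

    # Minimal sketch: boosted decision trees on a small digits task.
    # Not Ping Li's hardened-MNIST experiments; just an illustrative baseline.
    from sklearn.datasets import load_digits
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)  # 8x8 digit images, 10 classes
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Each boosting stage fits shallow regression trees to the gradient of the
    # multinomial loss; many stages give a strongly nonlinear classifier.
    clf = GradientBoostingClassifier(
        n_estimators=100, max_depth=3, learning_rate=0.1, random_state=0)
    clf.fit(X_train, y_train)

    print("test accuracy:", clf.score(X_test, y_test))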

Geoff Hinton once told me that the great thing about deep belief networks is that they work. I understand that Ping had very substantial difficulty in getting this published, so I hope some reviewers step up to the standard of valuing what works.

10 Comments to “Boosted Decision Trees for Deep Learning”
  1. After seeing Ping’s talk about this work in February, I spent some time reproducing his results. Getting the same results as Ping turned out to be fairly easy. It certainly required a lot less tweaking than reproducing the results of DBNs and the like on the same data sets.

    Having said that, I am not convinced that deep learning does a good job on all hardened MNIST data sets. For instance, on the data set that has natural images as background for the digits, today’s deep learners almost have to do a bad job: the unsupervised pre-training algorithms are wasting a lot of modeling power on trying to model the background images, the structure of which is completely irrelevant for the classification task. The subsequent discriminative fine-tuning then has to fix this, but can only do so much. So on some of the data sets considered in the paper, I don’t think it is too surprising that powerful discriminative learners such as boosted regression trees can do a better job than deep learning.

  2. Itman  says:

    Quite interesting. I know that the Russian search engine Yandex uses boosted decision trees (similar to TreeRank) to implement a learning to rank approach.

  3. Yoshua Bengio  says:

    Decision trees actually are shallow learners (2 levels, e.g. like SVMs and ordinary MLPs), as discussed in a paper soon to appear (but already available online here: http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/436). Basically a decision tree can be written as a sum-of-products, where each product (over the arcs from the root to a leaf) selects a leaf. The paper discusses how they can only generalize locally (the training examples within each region defined by a leaf only provide information about test examples in the same region). On the other hand, boosted decision trees add one level of depth to the weak learners, which makes them reasonably deep (considering that many deep learning results are often with just 3 or 4 levels).
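    To make that sum-of-products reading concrete, here is one way to write it out (the notation below is an illustrative sketch, not taken verbatim from the paper): a tree with leaf set L, leaf values w_l, and axis-aligned splits computes

        f(x) = \sum_{l \in L} w_l \prod_{(j,\, t,\, s) \in \mathrm{path}(l)} \mathbf{1}\!\left[\, s\,(x_j - t) > 0 \,\right],

    where each triple (j, t, s) records the feature index, threshold, and branch direction of one internal node on the path from the root to leaf l. Exactly one product equals 1 for any input, so the outer sum simply selects that leaf's value: a layer of products under a single sum, hence depth 2 in this convention.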

  4. hal  says:

    @Yoshua: I’m afraid I don’t quite follow something. You’re saying that trees are sums of products (eq 1 in the paper), which I totally buy. So those are two layers, and when you add boosting, you get three. This is stated in the paper as:

    “What is the architectural depth of decision trees and decision forests? It depends on what elementary units of computation are allowed on each level. By analogy with the disjunctive normal form (which is usually assigned an architectural depth of two) one would assign an architectural depth of two to a decision tree, and of three to decision forests or boosted trees.” (page 10)

    I guess I don’t completely understand the reasoning behind saying that DNF has depth 2. Is there a citation I could look at that argues this point?

    To me, it seems like you could make this claim about almost any learning algorithm. The argument for trees reads to me like “trees are just sums of [things]”, where [things] = a bunch of products over features. But couldn’t I say this about other algorithms? For instance, take a depth 1000 neural network. Why couldn’t I say that this is just two layers because the final layer is a sum of [things], where in this case [things] is another non-linear function over the features? Why do products not get to count as layers, but thresholded sums do get to count as layers?

    (I’m not trying to be cantankerous — I’m really trying to understand!)

    • Yoshua Bengio  says:

      Hal,

      Yann LeCun and I wrote a paper on depth (Scaling learning algorithms toward AI: http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/4) in which we discuss and define depth and give many examples. Defining the depth of an architecture also requires defining what the elementary computational units of the graph are (the depth is then the longest path from input to output node). If we change the definition of what these units can do, we get a different value for depth, but such redefinitions typically only change depth by a multiplicative constant (e.g. we can replace a formal neuron by the composition of three kinds of things, sums, products and thresholds, i.e., multiply the depth of a neural net by 3 by changing what we consider to be the elementary units of computation).

      In computational complexity papers, the typical elementary units that have been considered are summing units, product units, linear threshold units (formal neurons), and logic gates. Hence in the above paper, a linear classifier is typically considered to have depth 1, whereas a finite polynomial (sum of products) is considered to have depth 2, a traditional MLP (with one hidden layer) is considered to have depth 2, and an SVM or RBF network or kernel-based predictor is considered to have depth 2 (either the feature function phi(x) or the kernel function K(u,v)=phi(u).phi(v) can be considered in the set of computational elements).

      Nicely, we find that with all these kinds of computational element sets, depth 1 is insufficient to represent most functions, whereas depth 2 provides a universal approximator. Note how that also works for decision trees (which can basically be decomposed as a sum of products), which according to this cultural convention also have depth 2 and are universal approximators. Boosting on top of decision trees adds a level of depth (it is a weighted sum of the hard decisions produced by the decision trees). That is why I wrote that boosted decision trees have depth 3.
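      Under that convention the three levels of a boosted forest are visible directly in the formula (again a sketch in my own notation rather than the paper's):

          F(x) = \sum_{m=1}^{M} \alpha_m T_m(x), \qquad T_m(x) = \sum_{l \in L_m} w_{m,l} \prod_{(j,\, t,\, s) \in \mathrm{path}_m(l)} \mathbf{1}\!\left[\, s\,(x_j - t) > 0 \,\right],

      with the indicator products as level 1, the per-tree sums that select a leaf as level 2, and the weighted sum over the M trees as level 3.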

      In the paper on Scaling learning algorithms toward AI, we also introduce the notion that for learned architectures, we typically would like each level in the architecture to be adapted. Hence a depth 1000 neural net is a weighted sum over the outputs of the 999-th layer, but all 1000 layers are adaptive, and composed of the same kind of computational units, so it makes more sense to talk about a depth 1000 architecture in that case. Note how when we learn a polynomial we typically can adapt the coefficients (2nd layer parameters) and choose which products enter the polynomial (1st layer parameters), although of course one can consider all the possible products and then the 1st layer is not adaptive anymore, only the second. But practically, we generally can’t afford the coefficients of all possible products to be non-zero, so we need to do some kind of ‘feature selection’, which amounts to learning the first layer.

      Let me know if my explanations answer your questions and satisfy you. In conclusion, there is a formal definition of depth which can give different answers in practice because it depends on the choice of computational units, and there is a softer, fuzzier definition of depth that represents a kind of cultural convention (about the computational elements) under which depth-2 architectures are typically universal approximators (with enough units in the first layer). It is under this convention that decision trees would reasonably be seen as depth-2 and boosted decision trees as depth-3 architectures.
