GBDT vs. deep learning: a comparison and debate

This post discusses boosted decision trees beating deep belief networks on artificially hardened versions of the MNIST dataset, and whether that result qualifies them as a deep learning algorithm. The comment thread also takes up how to define the depth of different learning algorithms.


Boosted Decision Trees for Deep Learning

Tags: Deep, Machine Learning, Supervised — jl @ 11:18 am

About 4 years ago, I speculated that decision trees qualify as a deep learning algorithm because they can make decisions which are substantially nonlinear in the input representation. Ping Li has demonstrated this empirically at UAI, showing that boosted decision trees can beat deep belief networks on versions of MNIST which are artificially hardened so as to make them solvable only by deep learning algorithms.

This is an important point, because the ability to solve these sorts of problems is probably the best objective definition of a deep learning algorithm we have. I’m not that surprised. In my experience, if you can accept the computational drawbacks of a boosted decision tree, they can achieve pretty good performance.
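As a rough illustration of the kind of model under discussion, here is a minimal sketch that fits boosted decision trees to the small scikit-learn digits dataset. This is not Ping Li's setup (he used his own boosting variants on the artificially hardened MNIST sets); it only shows how little effort a boosted-tree baseline requires.

    # Minimal sketch: boosted decision trees on a small digits task.
    # Not Ping Li's hardened-MNIST experiments; just an illustrative baseline.
    from sklearn.datasets import load_digits
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)  # 8x8 digit images, 10 classes
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Each boosting stage fits shallow regression trees to the gradient of the
    # multinomial loss; many stages give a strongly nonlinear classifier.
    clf = GradientBoostingClassifier(
        n_estimators=100, max_depth=3, learning_rate=0.1, random_state=0)
    clf.fit(X_train, y_train)

    print("test accuracy:", clf.score(X_test, y_test))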

Geoff Hinton once told me that the great thing about deep belief networks is that they work. I understand that Ping had very substantial difficulty in getting this published, so I hope some reviewers step up to the standard of valuing what works.

10 Comments to “Boosted Decision Trees for Deep Learning”
  1. After seeing Ping’s talk about this work in February, I spent some time reproducing his results. Getting the same results as Ping turned out to be fairly easy. It certainly required a lot less tweaking than reproducing the results of DBNs and the like on the same data sets.

    Having said that, I am not convinced that deep learning does a good job on all hardened MNIST data sets. For instance, on the data set that has natural images as background for the digits, today’s deep learners almost have to do a bad job: the unsupervised pre-training algorithms are wasting a lot of modeling power on trying to model the background images, the structure of which is completely irrelevant for the classification task. The subsequent discriminative fine-tuning then has to fix this, but can only do so much. So on some of the data sets considered in the paper, I don’t think it is too surprising that powerful discriminative learners such as boosted regression trees can do a better job than deep learning.

  2. Itman  says:

    Quite interesting. I know that the Russian search engine Yandex uses boosted decision trees (similar to TreeRank) to implement a learning to rank approach.

  3. Yoshua Bengio  says:

    Decision trees actually are shallow learners (2 levels, e.g. like SVMs and ordinary MLPs), as discussed in a paper soon to appear (but already available online here: http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/436). Basically a decision tree can be written as a sum-of-products, where each product (over the arcs from the root to a leaf) selects a leaf. The paper discusses how they can only generalize locally (the training examples within each region defined by a leaf only provide information about test examples in the same region). On the other hand, boosted decision trees add one level of depth to the weak learners, which makes them reasonably deep (considering that many deep learning results are often with just 3 or 4 levels).
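    To make that sum-of-products reading concrete, here is one way to write it out (the notation below is an illustrative sketch, not taken verbatim from the paper): a tree with leaf set L, leaf values w_l, and axis-aligned splits computes

        f(x) = \sum_{l \in L} w_l \prod_{(j,\, t,\, s) \in \mathrm{path}(l)} \mathbf{1}\!\left[\, s\,(x_j - t) > 0 \,\right],

    where each triple (j, t, s) records the feature index, threshold, and branch direction of one internal node on the path from the root to leaf l. Exactly one product equals 1 for any input, so the outer sum simply selects that leaf's value: a layer of products under a single sum, hence depth 2 in this convention.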

  4. hal  says:

    @Yoshua: I’m afraid I don’t quite follow something. You’re saying that trees are sums of products (eq 1 in the paper), which I totally buy. So those are two layers, and when you add boosting, you get three. This is stated in the paper as:

    “What is the architectural depth of decision trees and decision forests? It depends on what elementary units of computation are allowed on each level. By analogy with the disjunctive normal form (which is usually assigned an architectural depth of two) one would assign an architectural depth of two to a decision tree, and of three to decision forests or boosted trees.” (page 10)

    I guess I don’t completely understand the reasoning behind saying that DNF has depth 2. Is there a citation I could look at that argues this point?

    To me, it seems like you could make this claim about almost any learning algorithm. The argument for trees reads to me like “trees are just sums of [things]”, where [things] = a bunch of products over features. But couldn’t I say this about other algorithms? For instance, take a depth 1000 neural network. Why couldn’t I say that this is just two layers because the final layer is a sum of [things], where in this case [things] is another non-linear function over the features? Why do products not get to count as layers, but thresholded sums do get to count as layers?

    (I’m not trying to be cantankerous — I’m really trying to understand!)

    • Yoshua Bengio  says:

      Hal,

      Yann LeCun and I wrote a paper on depth (Scaling learning algorithms toward AI: http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/4) in which we discuss and define depth and give many examples. Defining the depth of an architecture also requires defining what the elementary computational units of the graph are (the depth is then the longest path from input to output node). If we change the definition of what these units can do, we get a different value for depth, but such redefinitions typically only change depth by a multiplicative constant (e.g. we can replace a formal neuron by the composition of three kinds of things, sums, products and thresholds, i.e., multiply the depth of a neural net by 3 by changing what we consider to be the elementary units of computation).

      In computational complexity papers, the typical elementary units that have been considered are summing units, product units, linear threshold units (formal neurons), and logic gates. Hence in the above paper, a linear classifier is typically considered to have depth 1, whereas a finite polynomial (sum of products) is considered to have depth 2, a traditional MLP (with one hidden layer) is considered to have depth 2, and an SVM or RBF network or kernel-based predictor is considered to have depth 2 (either the feature function phi(x) or the kernel function K(u,v)=phi(u).phi(v) can be considered in the set of computational elements).

      Nicely, we find that with all these kinds of computational element sets, depth 1 is insufficient to represent most functions, whereas depth 2 provides a universal approximator. Note how that also works for decision trees (which can basically be decomposed as a sum of products), which according to this cultural convention also have depth 2 and are universal approximators. Boosting on top of decision trees adds a level of depth (it is a weighted sum of the hard decisions produced by the decision trees). That is why I wrote that boosted decision trees have depth 3.
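      Under that convention the three levels of a boosted forest are visible directly in the formula (again a sketch in my own notation rather than the paper's):

          F(x) = \sum_{m=1}^{M} \alpha_m T_m(x), \qquad T_m(x) = \sum_{l \in L_m} w_{m,l} \prod_{(j,\, t,\, s) \in \mathrm{path}_m(l)} \mathbf{1}\!\left[\, s\,(x_j - t) > 0 \,\right],

      with the indicator products as level 1, the per-tree sums that select a leaf as level 2, and the weighted sum over the M trees as level 3.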

      In the paper on Scaling learning algorithms toward AI, we also introduce the notion that for learned architectures, we typically would like each level in the architecture to be adapted. Hence a depth 1000 neural net is a weighted sum over the outputs of the 999-th layer, but all 1000 layers are adaptive, and composed of the same kind of computational units, so it makes more sense to talk about a depth 1000 architecture in that case. Note how when we learn a polynomial we typically can adapt the coefficients (2nd layer parameters) and choose which products enter the polynomial (1st layer parameters), although of course one can consider all the possible products and then the 1st layer is not adaptive anymore, only the second. But practically, we generally can’t afford the coefficients of all possible products to be non-zero, so we need to do some kind of ‘feature selection’, which amounts to learning the first layer.

      Let me know if my explanations answer your questions and satisfy you. In conclusion, there is a formal definition of depth which can give different answers in practice because it depends on the choice of computational units, and there is a softer, fuzzier definition of depth that represents a kind of cultural convention (about the computational elements) under which depth-2 architectures are typically universal approximators (with enough units in the first layer). It is under this convention that decision trees would reasonably be seen as depth-2 and boosted decision trees as depth-3 architectures.
