Multi-task Learning

This post introduces the basic concepts of multi-task learning, presents its mathematical formulation, and discusses several key regularisation approaches, such as variance regularisation and structured sparsity, as well as the connection between multi-task learning and kernel methods.


Based on Supervised Learning Lecture 8

Multi-task learning

  • Multi-task learning (MTL) is an approach to machine learning that learns a problem together with other related problems at the same time, using a shared representation.
  • The goal of MTL is to improve the performance of learning algorithms by learning classifiers for multiple tasks jointly.
  • Typical scenario: many tasks but only a few examples per task. If n < d we do not have enough data to learn the tasks one by one. However, if the tasks are related and the set S or the associated regulariser captures such relationships in a simple way, learning the tasks jointly greatly improves over independent task learning (ITL).
  • When problems (tasks) are closely related, learning in parallel can be more efficient than learning tasks independently. Also, this often leads to a better model for the main task, because it allows the learner to use the commonality among the tasks.
  • Applications: learning a set of linear classifiers for related objects
    (cars, lorries, bicycles), user modelling, multiple object detection in scenes, affective computing, bioinformatics, health informatics, marketing science, neuroimaging, NLP, speech…
  • Further categorisation is possible, e.g. hierarchical models, clustering of tasks.
  • The ideas can be extended to non-linear cases through RKHS.

Mathematical formulation

  • Fix probability measures $\mu_1, \dots, \mu_T$ on $\mathbb{R}^d \times \mathbb{R}$
    – $T$ tasks
    – Each task is a probability measure, e.g. $\mu_t(x, y) = P(x)\,\delta(\langle w_t, x\rangle - y)$, where $\delta$ is a point mass, interpreted as a deterministic conditional probability, and $w_t$ is an underlying parameter vector
    – $\mathbb{R}^d$ can also be a Hilbert space
  • Draw data: $(x_{t1}, y_{t1}), \dots, (x_{tn}, y_{tn}) \sim \mu_t$, $t = 1, \dots, T$, where each $x_{ti}$ is a vector and each $y_{ti}$ a scalar (in practice n may vary with t)
  • Learning method:

    $$\min_{(f_1, \dots, f_T) \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} \ell(y_{ti}, f_t(x_{ti}))$$

    where $\mathcal{F}$ is a set of vector-valued functions. A standard choice is a ball in an RKHS, which models interactions between the tasks in the sense that functions with small norm have strongly related components.
  • Goal is to minimise the multi-task error

    $$\mathcal{R}(f_1, \dots, f_T) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(x,y) \sim \mu_t}\, \ell(y, f_t(x))$$
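As a concrete illustration, the averaged empirical objective above can be written out numerically. The following is a minimal sketch with linear tasks $f_t(x) = \langle w_t, x\rangle$ and the squared loss; all sizes and names here are illustrative assumptions, not part of the lecture.

```python
# Sketch of the averaged empirical multi-task risk
# (1/T) sum_t (1/n) sum_i l(y_ti, f_t(x_ti)) for T linear tasks.
import numpy as np

rng = np.random.default_rng(0)
T, n, d = 3, 10, 5                       # tasks, examples per task, input dimension
X = rng.normal(size=(T, n, d))           # X[t, i] is the example x_ti
W = rng.normal(size=(T, d))              # W[t] is the weight vector w_t of task t
Y = np.einsum("tnd,td->tn", X, W)        # noiseless targets y_ti = <w_t, x_ti>

def multitask_empirical_risk(W, X, Y):
    """(1/T) sum_t (1/n) sum_i (y_ti - <w_t, x_ti>)^2 with the squared loss."""
    preds = np.einsum("tnd,td->tn", X, W)
    return np.mean((Y - preds) ** 2)     # mean over both tasks and examples
```

At the true weights the risk is zero (the targets are noiseless), and it is strictly positive for any other weight matrix, which is the quantity the learning method drives down.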

Linear MTL

  • “task” = “linear model”
    – Regression: $y_{ti} = \langle w_t, x_{ti}\rangle + \epsilon_{ti}$
    – Binary classification: $y_{ti} = \mathrm{sign}(\langle w_t, x_{ti}\rangle)\,\epsilon_{ti}$, where $\epsilon_{ti} \in \{-1, +1\}$ is label noise
  • Learning method: $\min_{(w_1, \dots, w_T) \in S} \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} \ell(y_{ti}, \langle w_t, x_{ti}\rangle)$. Here S incorporates the prior knowledge about the regression vectors and encourages “common structure” among tasks, e.g. the ball of a matrix norm or another regulariser.
  • The multitask error of $W = [w_1, \dots, w_T]$ is $\mathcal{R}(W) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(x,y) \sim \mu_t}\, \ell(y, \langle w_t, x\rangle)$
  • It is possible to give bounds on the uniform deviation

    $$\sup_{W \in S} \left\{ \mathcal{R}(W) - \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} \ell(y_{ti}, \langle w_t, x_{ti}\rangle) \right\}$$

    and derive bounds for the excess error
    $$\mathcal{R}(\hat{W}) - \min_{W \in S} \mathcal{R}(W)$$

Regularisers for linear MTL

Often we drop the constraint $W \in S$ and instead consider the penalty method

$$\min_{w_1, \dots, w_T} \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} \ell(y_{ti}, \langle w_t, x_{ti}\rangle) + \lambda\, \Omega(w_1, \dots, w_T)$$

Different regularisers encourage different types of commonalities between the tasks:

  • Variance (or other convex quadratic regularisers) encourages closeness to the mean:
    $$\Omega_{\mathrm{var}} = \frac{1}{T} \sum_{t=1}^{T} \|w_t\|^2 + \frac{1-\gamma}{\gamma}\, \mathrm{Var}(w_1, \dots, w_T)$$
  • Joint sparsity (or other structured sparsity regularisers) encourages few shared variables:
    $$\|W\|_{2,1} := \sum_{j=1}^{d} \sqrt{\sum_{t=1}^{T} w_{tj}^2}$$
  • Trace norm (or other spectral regularisers which promote low-rank solutions) encourages few shared features:
    $$\|[w_1, \dots, w_T]\|_{\mathrm{tr}}$$

    – an extension of joint sparsity which allows a rotation of the initial data representation
    – the $\ell_1$ norm of the singular values of the matrix is bounded, which favours low-rank solutions (i.e. a common low-dimensional subspace)
  • More sophisticated regularisers which combine the above, promote clustering of tasks, etc.
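As an illustration, the three penalties above can be evaluated directly with NumPy on a task matrix $W = [w_1, \dots, w_T]$ whose columns are the task weight vectors; the sizes and the value of $\gamma$ below are assumptions made for the example.

```python
import numpy as np

T, d, gamma = 4, 6, 0.5
rng = np.random.default_rng(1)
W = rng.normal(size=(d, T))              # column t is the weight vector w_t

# Variance regulariser:
# (1/T) sum_t ||w_t||^2 + ((1-gamma)/gamma) * (1/T) sum_t ||w_t - w_bar||^2
w_bar = W.mean(axis=1, keepdims=True)
omega_var = (np.sum(W ** 2) + (1 - gamma) / gamma * np.sum((W - w_bar) ** 2)) / T

# Joint sparsity: ||W||_{2,1} = sum_j sqrt(sum_t w_tj^2), the sum of row norms
omega_21 = np.sum(np.linalg.norm(W, axis=1))

# Trace norm: the sum of the singular values of W
omega_tr = np.sum(np.linalg.svd(W, compute_uv=False))
```

Both $\|W\|_{2,1}$ and the trace norm dominate the Frobenius norm of $W$, which is one way to see that they are genuinely stronger penalties than plain ridge regularisation.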

Quadratic regulariser

  • General quadratic regulariser:
    $$\Omega(w_1, \dots, w_T) = \sum_{s,t=1}^{T} \langle w_s, E_{st} w_t \rangle$$

    where the matrix $E = (E_{st})_{s,t=1}^{T} \in \mathbb{R}^{dT \times dT}$ is positive definite.
  • Variance regulariser:
    Let $\gamma \in (0, 1]$ and
    $$\Omega_{\mathrm{var}} = \frac{1}{T} \sum_{t=1}^{T} \|w_t\|^2 + \frac{1-\gamma}{\gamma}\, \mathrm{Var}(w_1, \dots, w_T) = \frac{1}{T} \sum_{t=1}^{T} \|w_t\|^2 + \frac{1-\gamma}{\gamma} \cdot \frac{1}{T} \sum_{t=1}^{T} \|w_t - \bar{w}\|_2^2$$

    – $\gamma = 1$: independent tasks; $\gamma \to 0$: identical tasks
    – the regulariser favours weight vectors which are close to their mean
    – for an SVM with the hinge loss, the objective is a compromise between maximising the individual margins and minimising the variance (i.e. keeping the tasks close to each other)
  • Link to kernel methods (quadratic regulariser):
    The problem
    $$\min_{w_1, \dots, w_T} \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} \ell(y_{ti}, \langle w_t, x_{ti}\rangle) + \lambda \sum_{s,t=1}^{T} \langle w_s, E_{st} w_t \rangle$$

    is equivalent to

    $$\min_{v} \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} \ell(y_{ti}, \langle v, B_t x_{ti}\rangle) + \lambda \langle v, v \rangle \tag{1}$$

    where the $B_t$ are $p \times d$ matrices linked to $E$ by $E = (B^\top B)^{-1}$, with $B = [B_1, \dots, B_T] \in \mathbb{R}^{p \times dT}$ (the $B_t$ concatenated by columns, with $p \ge dT$ so that $B^\top B$ is invertible), and $w_t = B_t^\top v$.
    Interpretation:
    – We learn a single function $(x, t) \mapsto f_t(x)$ using the feature map $(x, t) \mapsto B_t x$ and the corresponding multitask kernel $K((x_1, t_1), (x_2, t_2)) = \langle B_{t_1} x_1, B_{t_2} x_2 \rangle$
    – Writing $\langle v, B_t x \rangle = \langle B_t^\top v, x \rangle$, we interpret this as having a single regression vector $v$ which is transformed by the matrix $B_t^\top$ to obtain the task-specific weight vector.
  • Link to kernel methods (variance regulariser):
    The problem
    $$\min_{w_1, \dots, w_T} \frac{1}{Tn} \sum_{t,i} \ell(y_{ti}, \langle w_t, x_{ti}\rangle) + \lambda \left( \frac{1}{T} \sum_{t=1}^{T} \|w_t\|^2 + \frac{1-\gamma}{\gamma}\, \mathrm{Var}(w_1, \dots, w_T) \right)$$

    is equivalent to
    $$\min_{w_0, u_1, \dots, u_T} \frac{1}{Tn} \sum_{t,i} \ell(y_{ti}, \langle w_0 + u_t, x_{ti}\rangle) + \lambda \left( \frac{1}{\gamma T} \sum_{t=1}^{T} \|u_t\|^2 + \frac{1}{1-\gamma} \|w_0\|^2 \right) \tag{2}$$

    by setting $w_t = w_0 + u_t$ and minimising over $w_0$.
    It is of the form (1) with
    $$v = \left( (1-\gamma)^{-1/2} w_0,\; (\gamma T)^{-1/2} u_1,\; \dots,\; (\gamma T)^{-1/2} u_T \right) \in \mathbb{R}^{(T+1)d}$$
    $$B_t^\top = \Big[ \sqrt{1-\gamma}\, I_{d \times d},\; \underbrace{0_{d \times d}, \dots, 0_{d \times d}}_{t-1},\; \sqrt{\gamma T}\, I_{d \times d},\; \underbrace{0_{d \times d}, \dots, 0_{d \times d}}_{T-t} \Big]$$

    and the corresponding kernel $K((x_1, t_1), (x_2, t_2)) = (1 - \gamma + \gamma T\, \delta_{t_1 t_2}) \langle x_1, x_2 \rangle$.
    By writing (2) as follows, it is more apparent that we regularise around a common vector $w_0$:
    $$\min_{w_0} \frac{1}{T} \sum_{t=1}^{T} \min_{w} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell(y_{ti}, \langle w, x_{ti}\rangle) + \frac{\lambda}{\gamma} \|w - w_0\|^2 \right\} + \frac{\lambda}{1-\gamma} \|w_0\|^2$$
  • More multitask kernels
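The feature-map identity behind the variance-regulariser kernel can be checked numerically. The sketch below builds each $B_t$ as a block matrix, a shared $\sqrt{1-\gamma}\,I$ block plus one task-specific $\sqrt{\gamma T}\,I$ block, and verifies $\langle B_s x, B_t z\rangle = (1-\gamma+\gamma T\,\delta_{st})\langle x, z\rangle$; the sizes and $\gamma$ are assumptions made for the example.

```python
import numpy as np

T, d, gamma = 3, 4, 0.3

def B(t):
    """B_t as a ((T+1)d x d) matrix: a shared block plus one task-specific block."""
    blocks = [np.sqrt(1 - gamma) * np.eye(d)]
    for s in range(1, T + 1):
        blocks.append(np.sqrt(gamma * T) * np.eye(d) if s == t else np.zeros((d, d)))
    return np.vstack(blocks)

rng = np.random.default_rng(2)
x, z = rng.normal(size=d), rng.normal(size=d)
for s in range(1, T + 1):
    for t in range(1, T + 1):
        lhs = (B(s) @ x) @ (B(t) @ z)                        # <B_s x, B_t z>
        rhs = (1 - gamma + gamma * T * (s == t)) * (x @ z)   # multitask kernel
        assert np.isclose(lhs, rhs)
```

The shared block contributes the $(1-\gamma)\langle x, z\rangle$ term for every pair of tasks, while the task-specific blocks overlap only when $s = t$, producing the $\gamma T\,\delta_{st}$ term.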

Structured sparsity

  • General joint-sparsity regulariser:
    $$\|W\|_{2,1} := \sum_{j=1}^{d} \sqrt{\sum_{t=1}^{T} w_{tj}^2}$$

    – the sum of the $\ell_2$ norms of the rows of the matrix $W$
    – encourages a matrix with only a few non-zero rows
    – the regression vectors are sparse and, moreover, their sparsity patterns are contained in a common set of small cardinality
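The reason this penalty selects a few shared variables is visible in its proximal operator, which shrinks each row of $W$ toward zero as a group. The sketch below is not from the lecture; it is a standard group soft-thresholding step under the same row-grouping convention as above.

```python
import numpy as np

def prox_l21(W, lam):
    """Row-wise group soft-thresholding:
    argmin_V 0.5 * ||V - W||_F^2 + lam * ||V||_{2,1}."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)        # one norm per row
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return scale * W                                        # small rows vanish entirely

W = np.array([[3.0, 4.0],     # row norm 5.0: kept, shrunk toward zero
              [0.3, 0.4]])    # row norm 0.5: below lam, zeroed out as a group
V = prox_l21(W, 1.0)
```

Rows whose norm falls below the threshold are set to zero across all tasks simultaneously, which is exactly the "few shared variables" behaviour.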

Clustered MTL

Further topics

Transferring to new tasks

  • Having found a feature map h , to test it on the environment we
    1) draw a task μE
    2) draw a sample zμn
    3) run the algorithm to obtain a(h)z=f^h,zh
    4) measure the loss of a(h)z on a random pair (x,y)μ
  • The error associated with the algorithm a(h) is
    Rn(h)=EμEEzμnE(x,y)μ[(a(h)z(x),y)]
  • The best value for a representation h given complete knowledge of the environment is then
    minhHRn(h)
  • Compare to the very best we can do:

    $$\mathcal{R}^* = \min_{h \in \mathcal{H}} \mathbb{E}_{\mu \sim \mathcal{E}} \left[ \min_{f \in \mathcal{F}} \mathbb{E}_{(x,y) \sim \mu}\, \ell(f(h(x)), y) \right]$$

  • The excess error associated with $h$ is then $\mathcal{R}_n(h) - \mathcal{R}^*$

Case of the variance regulariser

  • Training
    $$\min_{w_0} \frac{1}{T} \sum_{t=1}^{T} \min_{w} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell(y_{ti}, \langle w, x_{ti}\rangle) + \frac{\lambda}{\gamma} \|w - w_0\|^2 \right\} + \frac{\lambda}{1-\gamma} \|w_0\|^2$$
  • Testing
    $$\min_{w} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle w, x_i \rangle) + \frac{\lambda}{\gamma} \|w - w_0\|^2$$
  • Error
    $$\mathcal{R}_n(w_0) = \mathbb{E}_{\mu \sim \mathcal{E}}\, \mathbb{E}_{z \sim \mu^n}\, \mathbb{E}_{(x,y) \sim \mu}\, \ell(y, \langle w_0 + w_z, x \rangle)$$
    where $w_z$ is the task-specific offset learned from the sample $z$
  • Best we can do
    $$\mathcal{R}^* = \min_{w_0} \mathbb{E}_{\mu \sim \mathcal{E}} \left[ \min_{w} \mathbb{E}_{(x,y) \sim \mu}\, \ell(y, \langle w_0 + w, x \rangle) \right]$$
  • Excess error of $w_0$: $\mathcal{R}_n(w_0) - \mathcal{R}^*$
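With the squared loss, the testing step above (ridge regression biased toward the common vector $w_0$) has a closed-form solution: setting the gradient of $\frac{1}{n}\|Xw - y\|^2 + \frac{\lambda}{\gamma}\|w - w_0\|^2$ to zero gives $w = (X^\top X/n + \tfrac{\lambda}{\gamma} I)^{-1}(X^\top y/n + \tfrac{\lambda}{\gamma} w_0)$. The sketch below uses illustrative sizes and a synthetic new task close to $w_0$.

```python
import numpy as np

def transfer_ridge(X, y, w0, lam_over_gamma):
    """Closed-form minimiser of (1/n)||Xw - y||^2 + lam_over_gamma * ||w - w0||^2."""
    n, d = X.shape
    A = X.T @ X / n + lam_over_gamma * np.eye(d)
    b = X.T @ y / n + lam_over_gamma * w0
    return np.linalg.solve(A, b)

rng = np.random.default_rng(3)
d, n = 5, 20
w0 = rng.normal(size=d)                  # common vector from the training tasks
w_true = w0 + 0.1 * rng.normal(size=d)   # the new task lies close to w0
X = rng.normal(size=(n, d))
y = X @ w_true                           # noiseless sample z from the new task
w_hat = transfer_ridge(X, y, w0, lam_over_gamma=1.0)
```

As the regularisation weight grows the solution is pulled all the way to $w_0$, and as it vanishes the solution approaches the unbiased least-squares fit, so $\lambda/\gamma$ interpolates between "reuse the training tasks" and "learn the new task from scratch".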

Informal reasoning

The feature map B learned from the training tasks can be used to learn a new task more quickly (a kind of bias learning heuristic).

  • Learn a new task by the method
    $$\min_{v} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle v, B x_i \rangle) + \frac{\lambda}{2} \|v\|_2^2 \right\}$$

    • Give more weight to important features. In particular, if some eigenvalues of $G = B^\top B$ are zero, the corresponding eigenvectors are discarded when learning a new task.
    • In the case of diagonal matrices, some diagonal elements may be zero, which results in a decreased number of parameters to learn.
    • A statistical justification of a similar approach, based on dictionary learning, can be given.
    • Take-home message

      • MTL objective function
      • regularisers
      • link to the kernel trick
