Multi-task Learning

This post introduces the basic concepts of multi-task learning, presents its mathematical formulation, and discusses several key regularisation approaches, such as variance regularisation and structured sparsity, as well as the connection between multi-task learning and kernel methods.


Based on Supervised Learning Lecture 8

Multi-task learning

  • Multi-task learning (MTL) is an approach to machine learning that learns a problem together with other related problems at the same time, using a shared representation.
  • The goal of MTL is to improve the performance of learning algorithms by learning classifiers for multiple tasks jointly.
  • Typical scenario: many tasks but only a few examples per task. If n < d we do not have enough data to learn the tasks one by one. However, if the tasks are related and the set S or the associated regulariser captures such relationships in a simple way, learning the tasks jointly greatly improves over independent task learning (ITL).
  • When problems (tasks) are closely related, learning in parallel can be more efficient than learning tasks independently. Also, this often leads to a better model for the main task, because it allows the learner to use the commonality among the tasks.
  • Applications: learning a set of linear classifiers for related objects
    (cars, lorries, bicycles), user modelling, multiple object detection in scenes, affective computing, bioinformatics, health informatics, marketing science, neuroimaging, NLP, speech…
  • Further categorisation is possible, e.g. hierarchical models, clustering of tasks.
  • The ideas can be extended to non-linear cases through RKHS.

Mathematical formulation

  • Fix probability measures $\mu_1, \dots, \mu_T$ on $\mathbb{R}^d \times \mathbb{R}$
    – $T$ tasks
    – Each task is a probability measure, e.g. $\mu_t(x, y) = P(x)\,\delta(\langle w_t, x\rangle - y)$, where $\delta$ is a point mass, interpreted as a deterministic conditional probability, and $w_t$ is an underlying parameter vector
    – $\mathbb{R}^d$ can also be a Hilbert space
  • Draw data: $(x_{t1}, y_{t1}), \dots, (x_{tn}, y_{tn}) \sim \mu_t$, $t = 1, \dots, T$, where each $x_{ti}$ is a vector and each $y_{ti}$ a scalar (in practice n may vary with t)
  • Learning method:

    $$\min_{(f_1, \dots, f_T) \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} \ell(y_{ti}, f_t(x_{ti}))$$

    where $\mathcal{F}$ is a set of vector-valued functions. A standard choice is a ball in an RKHS, which models interactions between the tasks in the sense that functions with small norm have strongly related components.
  • Goal is to minimise the multi-task error

    $$\mathcal{R}(f_1, \dots, f_T) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(x,y) \sim \mu_t}\, \ell(y, f_t(x))$$
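As a concrete illustration, the averaged empirical objective above can be written out numerically. The following is a minimal sketch with linear tasks $f_t(x) = \langle w_t, x\rangle$ and the squared loss; all sizes and names here are illustrative assumptions, not part of the lecture.

```python
# Sketch of the averaged empirical multi-task risk
# (1/T) sum_t (1/n) sum_i l(y_ti, f_t(x_ti)) for T linear tasks.
import numpy as np

rng = np.random.default_rng(0)
T, n, d = 3, 10, 5                       # tasks, examples per task, input dimension
X = rng.normal(size=(T, n, d))           # X[t, i] is the example x_ti
W = rng.normal(size=(T, d))              # W[t] is the weight vector w_t of task t
Y = np.einsum("tnd,td->tn", X, W)        # noiseless targets y_ti = <w_t, x_ti>

def multitask_empirical_risk(W, X, Y):
    """(1/T) sum_t (1/n) sum_i (y_ti - <w_t, x_ti>)^2 with the squared loss."""
    preds = np.einsum("tnd,td->tn", X, W)
    return np.mean((Y - preds) ** 2)     # mean over both tasks and examples
```

At the true weights the risk is zero (the targets are noiseless), and it is strictly positive for any other weight matrix, which is the quantity the learning method drives down.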

Linear MTL

  • “task” = “linear model”
    – Regression: $y_{ti} = \langle w_t, x_{ti}\rangle + \epsilon_{ti}$
    – Binary classification: $y_{ti} = \mathrm{sign}(\langle w_t, x_{ti}\rangle)\,\epsilon_{ti}$, where $\epsilon_{ti} \in \{-1, +1\}$ is label noise
  • Learning method: $\min_{(w_1, \dots, w_T) \in S} \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} \ell(y_{ti}, \langle w_t, x_{ti}\rangle)$. Here S incorporates the prior knowledge about the regression vectors and encourages “common structure” among tasks, e.g. the ball of a matrix norm or another regulariser.
  • The multitask error of $W = [w_1, \dots, w_T]$ is $\mathcal{R}(W) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(x,y) \sim \mu_t}\, \ell(y, \langle w_t, x\rangle)$
  • It is possible to give bounds on the uniform deviation

    $$\sup_{W \in S} \left\{ \mathcal{R}(W) - \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} \ell(y_{ti}, \langle w_t, x_{ti}\rangle) \right\}$$

    and derive bounds for the excess error
    $$\mathcal{R}(\hat{W}) - \min_{W \in S} \mathcal{R}(W)$$

Regularisers for linear MTL

Often we drop the constraint $W \in S$ and instead consider the penalty method

$$\min_{w_1, \dots, w_T} \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} \ell(y_{ti}, \langle w_t, x_{ti}\rangle) + \lambda\, \Omega(w_1, \dots, w_T)$$

Different regularisers encourage different types of commonalities between the tasks:

  • Variance (or other convex quadratic regularisers) encourages closeness to the mean:
    $$\Omega_{\mathrm{var}} = \frac{1}{T} \sum_{t=1}^{T} \|w_t\|^2 + \frac{1-\gamma}{\gamma}\, \mathrm{Var}(w_1, \dots, w_T)$$
  • Joint sparsity (or other structured sparsity regularisers) encourages few shared variables:
    $$\|W\|_{2,1} := \sum_{j=1}^{d} \sqrt{\sum_{t=1}^{T} w_{tj}^2}$$
  • Trace norm (or other spectral regularisers which promote low-rank solutions) encourages few shared features:
    $$\|[w_1, \dots, w_T]\|_{\mathrm{tr}}$$

    – an extension of joint sparsity which allows a rotation of the initial data representation
    – the $\ell_1$ norm of the singular values of the matrix is bounded, which favours low-rank solutions (i.e. a common low-dimensional subspace)
  • More sophisticated regularisers which combine the above, promote clustering of tasks, etc.
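As an illustration, the three penalties above can be evaluated directly with NumPy on a task matrix $W = [w_1, \dots, w_T]$ whose columns are the task weight vectors; the sizes and the value of $\gamma$ below are assumptions made for the example.

```python
import numpy as np

T, d, gamma = 4, 6, 0.5
rng = np.random.default_rng(1)
W = rng.normal(size=(d, T))              # column t is the weight vector w_t

# Variance regulariser:
# (1/T) sum_t ||w_t||^2 + ((1-gamma)/gamma) * (1/T) sum_t ||w_t - w_bar||^2
w_bar = W.mean(axis=1, keepdims=True)
omega_var = (np.sum(W ** 2) + (1 - gamma) / gamma * np.sum((W - w_bar) ** 2)) / T

# Joint sparsity: ||W||_{2,1} = sum_j sqrt(sum_t w_tj^2), the sum of row norms
omega_21 = np.sum(np.linalg.norm(W, axis=1))

# Trace norm: the sum of the singular values of W
omega_tr = np.sum(np.linalg.svd(W, compute_uv=False))
```

Both $\|W\|_{2,1}$ and the trace norm dominate the Frobenius norm of $W$, which is one way to see that they are genuinely stronger penalties than plain ridge regularisation.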

Quadratic regulariser

  • General quadratic regulariser:
    $$\Omega(w_1, \dots, w_T) = \sum_{s,t=1}^{T} \langle w_s, E_{st} w_t \rangle$$

    where the matrix $E = (E_{st})_{s,t=1}^{T} \in \mathbb{R}^{dT \times dT}$ is positive definite.
  • Variance regulariser:
    Let $\gamma \in (0, 1]$ and
    $$\Omega_{\mathrm{var}} = \frac{1}{T} \sum_{t=1}^{T} \|w_t\|^2 + \frac{1-\gamma}{\gamma}\, \mathrm{Var}(w_1, \dots, w_T) = \frac{1}{T} \sum_{t=1}^{T} \|w_t\|^2 + \frac{1-\gamma}{\gamma} \cdot \frac{1}{T} \sum_{t=1}^{T} \|w_t - \bar{w}\|_2^2$$

    – $\gamma = 1$: independent tasks; $\gamma \to 0$: identical tasks
    – the regulariser favours weight vectors which are close to their mean
    – for an SVM with the hinge loss, the objective is a compromise between maximising the individual margins and minimising the variance (i.e. keeping the tasks close to each other)
  • Link to kernel methods (quadratic regulariser):
    The problem
    $$\min_{w_1, \dots, w_T} \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} \ell(y_{ti}, \langle w_t, x_{ti}\rangle) + \lambda \sum_{s,t=1}^{T} \langle w_s, E_{st} w_t \rangle$$

    is equivalent to

    $$\min_{v} \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} \ell(y_{ti}, \langle v, B_t x_{ti}\rangle) + \lambda \langle v, v \rangle \tag{1}$$

    where the $B_t$ are $p \times d$ matrices linked to $E$ by $E = (B^\top B)^{-1}$, with $B = [B_1, \dots, B_T] \in \mathbb{R}^{p \times dT}$ (the $B_t$ concatenated by columns, with $p \ge dT$ so that $B^\top B$ is invertible), and $w_t = B_t^\top v$.
    Interpretation:
    – We learn a single function $(x, t) \mapsto f_t(x)$ using the feature map $(x, t) \mapsto B_t x$ and the corresponding multitask kernel $K((x_1, t_1), (x_2, t_2)) = \langle B_{t_1} x_1, B_{t_2} x_2 \rangle$
    – Writing $\langle v, B_t x \rangle = \langle B_t^\top v, x \rangle$, we interpret this as having a single regression vector $v$ which is transformed by the matrix $B_t^\top$ to obtain the task-specific weight vector.
  • Link to kernel methods (variance regulariser):
    The problem
    $$\min_{w_1, \dots, w_T} \frac{1}{Tn} \sum_{t,i} \ell(y_{ti}, \langle w_t, x_{ti}\rangle) + \lambda \left( \frac{1}{T} \sum_{t=1}^{T} \|w_t\|^2 + \frac{1-\gamma}{\gamma}\, \mathrm{Var}(w_1, \dots, w_T) \right)$$

    is equivalent to
    $$\min_{w_0, u_1, \dots, u_T} \frac{1}{Tn} \sum_{t,i} \ell(y_{ti}, \langle w_0 + u_t, x_{ti}\rangle) + \lambda \left( \frac{1}{\gamma T} \sum_{t=1}^{T} \|u_t\|^2 + \frac{1}{1-\gamma} \|w_0\|^2 \right) \tag{2}$$

    by setting $w_t = w_0 + u_t$ and minimising over $w_0$.
    It is of the form (1) with
    $$v = \left( (1-\gamma)^{-1/2} w_0,\; (\gamma T)^{-1/2} u_1,\; \dots,\; (\gamma T)^{-1/2} u_T \right) \in \mathbb{R}^{(T+1)d}$$
    $$B_t^\top = \Big[ \sqrt{1-\gamma}\, I_{d \times d},\; \underbrace{0_{d \times d}, \dots, 0_{d \times d}}_{t-1},\; \sqrt{\gamma T}\, I_{d \times d},\; \underbrace{0_{d \times d}, \dots, 0_{d \times d}}_{T-t} \Big]$$

    and the corresponding kernel $K((x_1, t_1), (x_2, t_2)) = (1 - \gamma + \gamma T\, \delta_{t_1 t_2}) \langle x_1, x_2 \rangle$.
    By writing (2) as follows, it is more apparent that we regularise around a common vector $w_0$:
    $$\min_{w_0} \frac{1}{T} \sum_{t=1}^{T} \min_{w} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell(y_{ti}, \langle w, x_{ti}\rangle) + \frac{\lambda}{\gamma} \|w - w_0\|^2 \right\} + \frac{\lambda}{1-\gamma} \|w_0\|^2$$
  • More multitask kernels
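The feature-map identity behind the variance-regulariser kernel can be checked numerically. The sketch below builds each $B_t$ as a block matrix, a shared $\sqrt{1-\gamma}\,I$ block plus one task-specific $\sqrt{\gamma T}\,I$ block, and verifies $\langle B_s x, B_t z\rangle = (1-\gamma+\gamma T\,\delta_{st})\langle x, z\rangle$; the sizes and $\gamma$ are assumptions made for the example.

```python
import numpy as np

T, d, gamma = 3, 4, 0.3

def B(t):
    """B_t as a ((T+1)d x d) matrix: a shared block plus one task-specific block."""
    blocks = [np.sqrt(1 - gamma) * np.eye(d)]
    for s in range(1, T + 1):
        blocks.append(np.sqrt(gamma * T) * np.eye(d) if s == t else np.zeros((d, d)))
    return np.vstack(blocks)

rng = np.random.default_rng(2)
x, z = rng.normal(size=d), rng.normal(size=d)
for s in range(1, T + 1):
    for t in range(1, T + 1):
        lhs = (B(s) @ x) @ (B(t) @ z)                        # <B_s x, B_t z>
        rhs = (1 - gamma + gamma * T * (s == t)) * (x @ z)   # multitask kernel
        assert np.isclose(lhs, rhs)
```

The shared block contributes the $(1-\gamma)\langle x, z\rangle$ term for every pair of tasks, while the task-specific blocks overlap only when $s = t$, producing the $\gamma T\,\delta_{st}$ term.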

Structured sparsity

  • General joint-sparsity regulariser:
    $$\|W\|_{2,1} := \sum_{j=1}^{d} \sqrt{\sum_{t=1}^{T} w_{tj}^2}$$

    – the sum of the $\ell_2$ norms of the rows of the matrix $W$
    – encourages a matrix with only a few non-zero rows
    – the regression vectors are sparse and, moreover, their sparsity patterns are contained in a common set of small cardinality
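The reason this penalty selects a few shared variables is visible in its proximal operator, which shrinks each row of $W$ toward zero as a group. The sketch below is not from the lecture; it is a standard group soft-thresholding step under the same row-grouping convention as above.

```python
import numpy as np

def prox_l21(W, lam):
    """Row-wise group soft-thresholding:
    argmin_V 0.5 * ||V - W||_F^2 + lam * ||V||_{2,1}."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)        # one norm per row
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return scale * W                                        # small rows vanish entirely

W = np.array([[3.0, 4.0],     # row norm 5.0: kept, shrunk toward zero
              [0.3, 0.4]])    # row norm 0.5: below lam, zeroed out as a group
V = prox_l21(W, 1.0)
```

Rows whose norm falls below the threshold are set to zero across all tasks simultaneously, which is exactly the "few shared variables" behaviour.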

Clustered MTL

Further topics

Transferring to new tasks

  • Having found a feature map h , to test it on the environment we
    1) draw a task μE
    2) draw a sample zμn
    3) run the algorithm to obtain a(h)z=f^h,zh
    4) measure the loss of a(h)z on a random pair (x,y)μ
  • The error associated with the algorithm a(h) is
    Rn(h)=EμEEzμnE(x,y)μ[(a(h)z(x),y)]
  • The best value for a representation h given complete knowledge of the environment is then
    minhHRn(h)
  • Compare to the very best we can do:

    $$\mathcal{R}^* = \min_{h \in \mathcal{H}} \mathbb{E}_{\mu \sim \mathcal{E}} \left[ \min_{f \in \mathcal{F}} \mathbb{E}_{(x,y) \sim \mu}\, \ell(f(h(x)), y) \right]$$

  • The excess error associated with $h$ is then $\mathcal{R}_n(h) - \mathcal{R}^*$

Case of the variance regulariser

  • Training
    $$\min_{w_0} \frac{1}{T} \sum_{t=1}^{T} \min_{w} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell(y_{ti}, \langle w, x_{ti}\rangle) + \frac{\lambda}{\gamma} \|w - w_0\|^2 \right\} + \frac{\lambda}{1-\gamma} \|w_0\|^2$$
  • Testing
    $$\min_{w} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle w, x_i \rangle) + \frac{\lambda}{\gamma} \|w - w_0\|^2$$
  • Error
    $$\mathcal{R}_n(w_0) = \mathbb{E}_{\mu \sim \mathcal{E}}\, \mathbb{E}_{z \sim \mu^n}\, \mathbb{E}_{(x,y) \sim \mu}\, \ell(y, \langle w_0 + w_z, x \rangle)$$
    where $w_z$ is the task-specific offset learned from the sample $z$
  • Best we can do
    $$\mathcal{R}^* = \min_{w_0} \mathbb{E}_{\mu \sim \mathcal{E}} \left[ \min_{w} \mathbb{E}_{(x,y) \sim \mu}\, \ell(y, \langle w_0 + w, x \rangle) \right]$$
  • Excess error of $w_0$: $\mathcal{R}_n(w_0) - \mathcal{R}^*$
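With the squared loss, the testing step above (ridge regression biased toward the common vector $w_0$) has a closed-form solution: setting the gradient of $\frac{1}{n}\|Xw - y\|^2 + \frac{\lambda}{\gamma}\|w - w_0\|^2$ to zero gives $w = (X^\top X/n + \tfrac{\lambda}{\gamma} I)^{-1}(X^\top y/n + \tfrac{\lambda}{\gamma} w_0)$. The sketch below uses illustrative sizes and a synthetic new task close to $w_0$.

```python
import numpy as np

def transfer_ridge(X, y, w0, lam_over_gamma):
    """Closed-form minimiser of (1/n)||Xw - y||^2 + lam_over_gamma * ||w - w0||^2."""
    n, d = X.shape
    A = X.T @ X / n + lam_over_gamma * np.eye(d)
    b = X.T @ y / n + lam_over_gamma * w0
    return np.linalg.solve(A, b)

rng = np.random.default_rng(3)
d, n = 5, 20
w0 = rng.normal(size=d)                  # common vector from the training tasks
w_true = w0 + 0.1 * rng.normal(size=d)   # the new task lies close to w0
X = rng.normal(size=(n, d))
y = X @ w_true                           # noiseless sample z from the new task
w_hat = transfer_ridge(X, y, w0, lam_over_gamma=1.0)
```

As the regularisation weight grows the solution is pulled all the way to $w_0$, and as it vanishes the solution approaches the unbiased least-squares fit, so $\lambda/\gamma$ interpolates between "reuse the training tasks" and "learn the new task from scratch".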

Informal reasoning

The feature map B learned from the training tasks can be used to learn a new task more quickly (a kind of bias learning heuristic).

  • Learn a new task by the method
    $$\min_{v} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle v, B x_i \rangle) + \frac{\lambda}{2} \|v\|_2^2 \right\}$$

    • Give more weight to important features. In particular, if some eigenvalues of $G = B^\top B$ are zero, the corresponding eigenvectors are discarded when learning a new task.
    • In the case of diagonal matrices, some diagonal elements may be zero, which results in a decreased number of parameters to learn.
    • A statistical justification of a similar approach, based on dictionary learning, can be given.
    • Take-home message

      • MTL objective function
      • regularisers
      • link to the kernel trick
