[TMI 2024] Disentangle Then Calibrate With Gradient Guidance: A Unified Framework for Common and Rare Disease Diagnosis

Paper link: Disentangle Then Calibrate With Gradient Guidance: A Unified Framework for Common and Rare Disease Diagnosis | IEEE Journals & Magazine | IEEE Xplore

The English is typed entirely by hand, summarizing and paraphrasing the original paper. Spelling and grammar mistakes are hard to avoid; if you spot any, corrections in the comments are welcome. This post leans toward personal notes, so read with caution.

Table of Contents

1. Thoughts

2. Section-by-Section Reading

2.1. Abstract

2.2. Introduction

2.3. Related Work

2.3.1. FSL Techniques

2.3.2. Rare Disease Diagnosis

2.3.3. Relationship With Multi-Task Learning

2.4. Method

2.4.1. Overview

2.4.2. GND Module

2.4.3. GFC Module

2.4.4. Summary

2.5. Experiments

2.5.1. Dataset

2.5.2. Implementation

2.5.3. Comparing to Existing Methods

2.5.4. Ablation Study

2.6. Discussion

2.6.1. Hyper-Parameters

2.6.2. Different Strategies for Feature Calibration

2.6.3. Visualization of Feature Embedding

2.6.4. Rare Disease Diagnosis With Different K Values

2.6.5. Rare Disease Simulation and Data Limitation

2.6.6. Future Work

2.7. Conclusion

1. Thoughts

(1) A quick look at few-shot learning.

(2) The figures are drawn in a very math/signal-processing style and look great; maximalists will rejoice.

(3) No comment on the novelty, since this is not directly my sub-field.

(4) At first I expected it to be hard to get into: at a glance it is formula-heavy with a complex main figure. In fact the writing is remarkably clear. Anyone with a bit of background in matrices/transport algorithms can get the idea; even without knowing exactly how each algorithm is implemented, you still know clearly what the authors did with which algorithm, and the reasoning looks very sound. The figures are also drawn very clearly.

(5) This is an extension of the authors' earlier MICCAI work, and the writing is understandable even without reading that earlier work.

2. Section-by-Section Reading

2.1. Abstract

        ①Limitations: a) scarce data for rare diseases; b) few-shot learning (FSL) only improves performance on rare diseases but cannot perform well on both rare and common diseases

        ②Therefore, they propose the Disentangle then Calibrate with Gradient Guidance (DCGG) framework

2.2. Introduction

        ①Existing FSL methods: self-supervised learning (SSL), meta-learning, or metric-learning techniques

        ②⭐The authors argue that the three existing approaches make the model more sensitive to rare diseases: e.g., SSL is trained on common diseases but fine-tuned on rare ones; meta-learning likewise learns from the few-shot rare diseases; metric learning classifies by comparing metrics (I have not studied this one; it feels a lot like prototype learning, learning exemplars?). The likely consequence: in a dataset with, say, 1000 healthy subjects, 1000 common-disease patients, and 50 rare-disease patients, the model ends up accurate on the rare diseases but not on the common ones? That is actually a very interesting research question. Could we augment the rare disease by copy-pasting it up to 1000 samples, hahaha, but that would be a bit too fake

2.3. Related Work

2.3.1. FSL Techniques

        ①Model-driven methods, i.e., meta-learning and metric learning, rely on the training strategy rather than on the data itself

        ②Data-driven methods, such as data augmentation, transfer features from majority classes to minority classes

2.3.2. Rare Disease Diagnosis

        ①Related methods:

2.3.3. Relationship With Multi-Task Learning

        ①Unlike multi-task learning (MTL), the authors focus on knowledge transfer within a single task

2.4. Method

2.4.1. Overview

        ①The overall framework of DCGG:

        ②The dataset consists of N image/label pairs \mathcal{D}=\{\mathbf{x}_i,\mathbf{y}_i\}_{i=1}^N, where \mathbf{x}_i is a medical image and \mathbf{y}_i its one-hot label

        ③The common-disease subset is denoted by \mathcal{D}_c and the rare-disease subset by \mathcal{D}_r

2.4.2. GND Module

        ①Train on a mini-batch of each common disease and obtain the per-disease gradient g_c^i=\nabla_\theta\mathcal{L}_c^i(\theta) (cross-entropy loss).

        ②Compute an average gradient over all the common diseases: g_c^\star=\nabla_\theta\mathcal{L}_c^\star(\theta)

        ③Project each disease's gradient onto the average gradient, channel by channel:

\begin{aligned} PL[j] & =\sum_{i=1}^{N_c}PL_{g_c^i[j]\to g_c^\star[j]} \\ & =\sum_{i=1}^{N_c}|g_c^i[j]|\cos\langle g_c^i[j],g_c^\star[j]\rangle \end{aligned}

where g_c^i[j] denotes the j-th channel of g_c^i. Channels with higher PL values indicate consistency across diseases, so the top-M channels of PL are defined as the disease-shared channels C^{sh} and the rest as the disease-specific channels C^{sp} (is this perspective so novel only because I rarely read papers in this area?)
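
As a concrete illustration, here is a minimal numpy sketch of this channel scoring and split (my own reading, not the authors' code; the gradient shapes and the name `split_channels` are assumptions):

```python
import numpy as np

def split_channels(grads, g_avg, M):
    """Score each channel by the summed projection length of every
    common-disease gradient onto the average gradient, then take the
    top-M channels as disease-shared and the rest as disease-specific.

    grads: (N_c, n_channels, d) per-disease gradients
    g_avg: (n_channels, d) average gradient over all common diseases
    M:     number of disease-shared channels to keep
    """
    N_c, n_channels, _ = grads.shape
    PL = np.zeros(n_channels)
    for j in range(n_channels):
        for i in range(N_c):
            g_ij, g_avg_j = grads[i, j], g_avg[j]
            # |g_c^i[j]| * cos<g_c^i[j], g_c*[j]>  ==  projection length
            cos = g_ij @ g_avg_j / (np.linalg.norm(g_ij) * np.linalg.norm(g_avg_j) + 1e-12)
            PL[j] += np.linalg.norm(g_ij) * cos
    order = np.argsort(PL)
    return order[-M:], order[:-M]   # C^{sh} (top-M), C^{sp} (the rest)
```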

        ④The gradients over the common-disease and rare-disease sets, g_c=\nabla_\theta\mathcal{L}_c(\theta,\mathcal{D}_c) and g_r=\nabla_\theta\mathcal{L}_r(\theta,\mathcal{D}_r), can be further decomposed into:

g_c=\{g_c^{sh},g_c^{sp}\},\quad g_r=\{g_r^{sh},g_r^{sp}\}

        ⑤The authors further optimize these two gradients to steer the decision: ⭐on the shared channels, they only want rare-disease detection to improve → reduce the rare-disease loss → without affecting common diseases. Here g_r^{sh}, the rare-disease gradient on the shared channels, needs to be optimized, and the solved gradient w^{sh} should point in the same direction as g_r^{sh}:

\min_{w^{sh}}\|w^{sh}-g_{r}^{sh}\|_{2}^{2},\quad\mathrm{s.t.}\ g_{c}^{T}w^{sh}>0
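
This constrained least-squares problem has a simple closed form under the KKT conditions; a minimal sketch (my own illustration, in the spirit of PCGrad-style conflict projection, with the strict inequality relaxed to the boundary case):

```python
import numpy as np

def shared_channel_gradient(g_r_sh, g_c_sh):
    """Solve min ||w - g_r_sh||^2  s.t.  g_c_sh @ w > 0 (closed form).

    If g_r_sh already has a positive inner product with the common-disease
    gradient, it is feasible and optimal as-is; otherwise project it onto
    the constraint boundary g_c_sh @ w = 0.
    """
    dot = g_c_sh @ g_r_sh
    if dot > 0:                  # constraint inactive: keep g_r^{sh}
        return g_r_sh
    # active constraint: remove the conflicting component along g_c_sh
    return g_r_sh - dot / (g_c_sh @ g_c_sh) * g_c_sh
```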

        ⑥The disease-specific gradients of the common diseases: g_c^{sp}=\{g_c^{sp\cdot1},\cdots,g_c^{sp\cdot N_c}\}. ⭐On the specific channels, optimizing the rare disease must not be influenced by the common diseases, so the update is made orthogonal to the common-disease specific gradients:

\min_{w^{sp}}\|w^{sp}-g_r^{sp}\|_2^2,\quad\mathrm{s.t.}\ {G^{sp}}^{T}w^{sp}=0

Both problems are solved via the Gram-Schmidt process and the Karush-Kuhn-Tucker (KKT) conditions (never studied these, but after looking them up they seem to be extensions of the linear algebra I learned before)
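
For the orthogonality constraint, the optimum is the projection of the gradient onto the orthogonal complement of the common-disease specific gradients. A minimal Gram-Schmidt sketch (my own illustration, assuming gradients flattened into float vectors):

```python
import numpy as np

def specific_channel_gradient(g_r_sp, G_sp):
    """Solve min ||w - g_r_sp||^2  s.t.  G_sp^T w = 0.

    G_sp: (d, N_c) common-disease specific gradients as columns.
    Gram-Schmidt builds an orthonormal basis of their span; subtracting
    all components in that span leaves the orthogonal projection.
    """
    basis = []
    for k in range(G_sp.shape[1]):        # Gram-Schmidt on the columns
        q = G_sp[:, k].astype(float).copy()
        for b in basis:
            q -= (b @ q) * b
        n = np.linalg.norm(q)
        if n > 1e-12:                     # skip (near-)dependent columns
            basis.append(q / n)
    w = g_r_sp.astype(float).copy()
    for b in basis:                       # remove components in the span
        w -= (b @ w) * b
    return w
```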

2.4.3. GFC Module

        ①They model P and Q as discrete uniform distributions over the N_c common diseases and N_r rare diseases (so this distribution is hand-picked? Why not a Gaussian or something?):

P=\sum_{i=1}^{N_c}\frac{1}{N_c}\delta_{M_c^i},\quad Q=\sum_{j=1}^{N_r}\frac{1}{N_r}\delta_{M_r^j},

        ②With M_c^i and M_r^j denoting the features of the i-th common disease and the j-th rare disease at the l-th disease-shared channel, the transfer from common to rare diseases can be cast as an optimal transport (OT) problem:

OT(P,Q)=\min_{T}\langle C,T\rangle,\quad\mathrm{s.t.}\ T\mathbf{1}=P,\; T^{T}\mathbf{1}=Q

where T\in\mathbb{R}_{\geq0}^{N_c\times N_r} is the transport plan to be solved, and C\in\mathbb{R}_{\geq0}^{N_c\times N_r} is the cost matrix specifying the cost of linking a common disease to a rare one

        ③They measure the cost by the squared Euclidean distance between gradients:

C_{ij}=\left\|g_c^i-g_r^j\right\|_2^2

        ④Assuming that the common-disease feature in each channel follows a Gaussian distribution, they utilize Sinkhorn to solve the OT problem and update the mean and standard deviation by:

\mu_{r}^{j^{\prime}}=\frac{N_{c}\sum_{i=1}^{N_{c}}T_{ij}\mu_{c}^{i}+\mu_{r}^{j}}{N_{c}+1},\quad\sigma_{r}^{j^{\prime}}=\frac{N_{c}\sum_{i=1}^{N_{c}}T_{ij}\sigma_{c}^{i}+\sigma_{r}^{j}}{N_{c}+1}
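
Putting ②–④ together, a minimal sketch of the cost matrix, a standard entropic Sinkhorn solver, and the calibration update (the regularization weight `eps` and iteration count are my assumptions; the paper only states that Sinkhorn is used):

```python
import numpy as np

def cost_matrix(g_c, g_r):
    """C_ij = ||g_c^i - g_r^j||_2^2;  g_c: (N_c, d), g_r: (N_r, d)."""
    return np.sum((g_c[:, None, :] - g_r[None, :, :]) ** 2, axis=-1)

def sinkhorn(C, p, q, eps=0.1, n_iters=200):
    """Entropic-regularized OT via Sinkhorn scaling; returns plan T (N_c, N_r)."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(n_iters):             # alternately match the two marginals
        v = q / (K.T @ u)
        u = p / (K @ v)
    return u[:, None] * K * v[None, :]

def calibrate(T, mu_c, sigma_c, mu_r, sigma_r):
    """Update rare-disease statistics with transported common-disease ones.
    mu_c/sigma_c: (N_c, d); mu_r/sigma_r: (N_r, d); T: (N_c, N_r)."""
    N_c = mu_c.shape[0]
    mu_new = (N_c * T.T @ mu_c + mu_r) / (N_c + 1)
    sigma_new = (N_c * T.T @ sigma_c + sigma_r) / (N_c + 1)
    return mu_new, sigma_new
```

Here the marginals from ① would be `p = np.full(N_c, 1 / N_c)` and `q = np.full(N_r, 1 / N_r)`.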

        ⑤The calibrated feature is then sampled as:

M_r^{j^{\prime}}=\mu_r^{j^{\prime}}+\sigma_r^{j^{\prime}}\cdot\varepsilon_t,\quad\varepsilon_t\sim\mathcal{N}(0,1)
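
Continuing the sketch above, the sampling step is just a reparameterized Gaussian draw from the calibrated statistics returned by `calibrate`:

```python
# Draw calibrated rare-disease features: mean plus noise scaled by the std
eps_t = np.random.randn(*mu_new.shape)    # epsilon_t ~ N(0, 1)
M_r_new = mu_new + sigma_new * eps_t
```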

2.4.4. Summary

        ①The algorithm:

2.5. Experiments

2.5.1. Dataset

        ①Statistics of datasets:

2.5.2. Implementation

        ①Backbone \mathcal{F}(\theta): WideResNet, with details:

        ②Optimizer: Adam with a learning rate of 0.001

        ③Input image size: 112 \times 112

        ④Batch size: 8

        ⑤Cross-validation: 5-fold for common-disease training, with 20% of the training samples randomly held out for validation. For rare diseases, K=1 or 5 samples were randomly selected for training and another 20% for validation; the remaining rare-disease samples form the test set. Sampling was repeated 4 times without repetition for each fold

        ⑥Epoch: 300

2.5.3. Comparing to Existing Methods

        ①Comparison table on 3 datasets:

2.5.4. Ablation Study

        ①Module ablation:

where I denotes training on common diseases and fine-tuning on rare diseases, II is GND only, and III is GFC only

2.6. Discussion

2.6.1. Hyper-Parameters

        ①Ablation of shared channel M:

2.6.2. Different Strategies for Feature Calibration

        ①Other transfer strategies (this chart can't have been drawn in Excel, can it? Mine look exactly like this, yikes):

2.6.3. Visualization of Feature Embedding

        ①t-SNE visualization:

2.6.4. Rare Disease Diagnosis With Different K Values

        ①Ablation on the number K of rare-disease training samples:

2.6.5. Rare Disease Simulation and Data Limitation

        ①Hmm, it really does train with only a handful of samples and can be tested on many rare diseases. Quite impressive

2.6.6. Future Work

        ①Enhance generalization

        ②Zero-shot scenario

2.7. Conclusion

        ~
