[NeurIPS 2023] Brain Diffusion for Visual Exploration: Cortical Discovery using Large Scale Generative Models

①They propose Brain Diffusion for Visual Exploration (BrainDiVE), which generates images guided by brain signals and shows that different ROIs respond differently to different kinds of visual content

2.2. Introduction

①The inference that higher visual cortex preferentially processes complex semantic categories comes from manually designed stimuli and specific scenes

②Images that maximally activate the voxels selective for different categories. All these images are generated by BrainDiVE:

2.3. Related work

①Designed stimuli differ from natural stimuli

②They list deep generative models such as variational autoencoders, generative adversarial networks, flows, and score/energy/diffusion models. ⭐But all of these are used for reconstruction rather than for predicting stimuli (I haven't read further yet, but how would one verify that an image predicted to stimulate a particular brain region really does stimulate it?)

③They focus on complex scenes

macaque (n.): a genus of Old World monkeys

2.4. Methods

①Steps of synthesizing images that activate a target brain region:

They only show the scene-selective regions (RSC/PPA/OPA) of the right hemisphere

2.4.1. Background on Diffusion Models

①To sample from the data distribution $p\left ( x \right )$ , start from ${x_{T}}\sim\mathcal{N}(0,\mathbb{I})$ and generate $x_{T-1},x_{T-2},x_{T-3}\ldots x_0$ step by step. The noised sample at step $t$ is $x_{t}=\sqrt{\alpha_{t}}x_{0}+\sqrt{1-\alpha_{t}}\epsilon$ with $\epsilon\sim\mathcal{N}(0,\mathbb{I})$

②The autoencoding (denoising) network $\epsilon_{\theta}(x_{t},t)$ is trained with a mean-squared error (MSE) loss to predict the noise $\epsilon$

③Autoencoder $D_{\Omega}(E_{\Phi}(\mathcal{I}))$ consists of encoder $E_\Phi$ and decoder $D_\Omega$

④Diffusion model: pretrained latent diffusion model (LDM)
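The forward noising step in ① and the MSE objective in ② can be illustrated with a minimal sketch in plain Python. The helper names `add_noise` and `mse` and the toy values are assumptions made for illustration; they are not from the paper, and a real model would replace the hand-written predictors with a trained network $\epsilon_{\theta}$.

```python
import math
import random

def add_noise(x0, alpha_t, eps):
    """Forward process: x_t = sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps."""
    a, s = math.sqrt(alpha_t), math.sqrt(1.0 - alpha_t)
    return [a * x + s * e for x, e in zip(x0, eps)]

def mse(pred, target):
    """Mean-squared error between predicted and true noise."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

random.seed(0)
x0 = [0.5, -1.0, 2.0]                       # toy "clean" sample (hypothetical values)
eps = [random.gauss(0.0, 1.0) for _ in x0]  # Gaussian noise
x_t = add_noise(x0, alpha_t=0.9, eps=eps)

# A perfect noise predictor would return eps exactly, giving zero loss.
loss_perfect = mse(eps, eps)
# A trivial predictor that always outputs zeros has loss mean(eps^2).
loss_zero = mse([0.0] * len(eps), eps)
```

Because the noise is known during training, the network only has to regress `eps` from `x_t`, which is why a plain MSE loss suffices.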

2.4.2. Brain-Encoding Model Construction

①Mapping learning:

$M_{\theta}(\mathcal{I})\Rightarrow B$

where $\mathcal{I}\in\mathbb{R}^{3\times H\times W}$ is the image, $M_\theta$ denotes the voxel-wise brain encoding model, and $B\in\mathbb{R}^{N}$ is the vector of fMRI $\beta$ values with $N$ elements

②Components of the encoder: the first is a CLIP-trained image encoder, which outputs a $K$-dimensional vector as the latent embedding; the second is a Euclidean normalization followed by a linear adaptation layer $W\in\mathbb{R}^{N\times K},b\in\mathbb{R}^{N}$

③The function of the whole encoding operations:

$B\approx M_\theta(\mathcal{I})=W\times\frac{\mathrm{CLIP}_{\mathrm{img}}(\mathcal{I})}{\|\mathrm{CLIP}_{\mathrm{img}}(\mathcal{I})\|_2}+b$
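As a rough illustration of the encoding equation above, the sketch below applies the L2 normalization and the linear layer to a made-up embedding. The function names and the tiny $K=2$, $N=3$ values are hypothetical; in the paper the embedding would come from the CLIP image encoder, which is not reproduced here.

```python
import math

def l2_normalize(v):
    """Divide a vector by its Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def predict_voxels(embedding, W, b):
    """B ~= W @ (e / ||e||_2) + b, with W of shape N x K and b of length N."""
    e = l2_normalize(embedding)
    return [sum(w * x for w, x in zip(row, e)) + bias for row, bias in zip(W, b)]

# Stand-in for CLIP_img(I): a K = 2 embedding with norm 5 (hypothetical values).
embedding = [3.0, 4.0]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3 voxels
b = [0.0, 0.1, -0.2]

beta = predict_voxels(embedding, W, b)  # one predicted beta value per voxel
```

Normalizing first means only the direction of the embedding matters, so each voxel's prediction is a bias plus a weighted sum over a unit vector.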

2.4.3. Brain-Guided Diffusion Model

①Conditioning is done in one of two ways in conventional text-conditioned diffusion models:

The first approach modifies the function $\epsilon _\theta$ to further accept a conditioning vector $c$ , resulting in $\epsilon_{\theta}(x_{t},t,c)$ .
The second approach uses a contrastive trained image-to-concept encoder, and seeks to maximize a similarity measure with a text-to-concept encoder.

②BrainDiVE conditions the diffusion model by maximizing the average activation of the voxel set $S$ predicted by $M_\theta$:

$\epsilon_{\theta}^{\prime}=\epsilon_{\theta}-\sqrt{1-\alpha_{t}}\nabla_{x_{t}}(\frac{\gamma}{|S|}\sum_{i\in S}M_{\theta}(D_{\Omega}(x_{t}^{\prime}))_{i})$

where $\gamma$ is a scale and $S\subseteq N$ is the set of voxels used for guidance (why decode first and then encode? I'm not very familiar with improvements to diffusion)

③An Euler approximation is used to obtain a less noisy image:

$\begin{aligned} & \hat{x}_{0}=\frac{1}{\sqrt{\alpha}}(x_{t}-\sqrt{1-\alpha}\epsilon_{t}) \\ & x_{t}^{\prime}=(\sqrt{1-\alpha})\hat{x}_{0}+(1-\sqrt{1-\alpha})x_{t} \end{aligned}$
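The two formulas in ③ and the guided noise in ② can be sketched together in plain Python. This is a toy under loudly stated assumptions: the encoder is taken to be linear, $M(x)=Wx+b$, and the decoder $D_{\Omega}$ is taken to be the identity, so the gradient can be written by hand. In the actual method the gradient would be obtained by automatic differentiation through the decoder and the CLIP-based encoder, evaluated at $x_{t}^{\prime}$. All function names and numbers here are hypothetical.

```python
import math

def euler_denoise(x_t, eps_t, alpha):
    """x0_hat = (x_t - sqrt(1 - alpha) * eps_t) / sqrt(alpha)."""
    a, s = math.sqrt(alpha), math.sqrt(1.0 - alpha)
    return [(x - s * e) / a for x, e in zip(x_t, eps_t)]

def blend(x0_hat, x_t, alpha):
    """x_t' = sqrt(1 - alpha) * x0_hat + (1 - sqrt(1 - alpha)) * x_t."""
    s = math.sqrt(1.0 - alpha)
    return [s * x0 + (1.0 - s) * x for x0, x in zip(x0_hat, x_t)]

def toy_guidance_gradient(W, S, gamma):
    """Gradient w.r.t. x of (gamma / |S|) * sum_{i in S} (W x + b)_i.

    Toy assumption: linear encoder and identity decoder, so the gradient
    is gamma times the mean of the rows of W indexed by S (independent of x).
    """
    K = len(W[0])
    return [gamma * sum(W[i][k] for i in S) / len(S) for k in range(K)]

def guided_noise(eps, grad, alpha_t):
    """eps' = eps - sqrt(1 - alpha_t) * grad."""
    s = math.sqrt(1.0 - alpha_t)
    return [e - s * g for e, g in zip(eps, grad)]

alpha = 0.75                      # so sqrt(1 - alpha) = 0.5
x0 = [1.0, -2.0]                  # toy clean latent
eps = [0.5, 0.5]                  # noise, known exactly in this toy
x_t = [math.sqrt(alpha) * x + 0.5 * e for x, e in zip(x0, eps)]

x0_hat = euler_denoise(x_t, eps, alpha)   # equals x0 because eps is exact
x_t_prime = blend(x0_hat, x_t, alpha)     # less noisy input for the encoder

W = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]  # toy encoder weights, N = 3, K = 2
S = [0, 1]                                # voxels used for guidance
grad = toy_guidance_gradient(W, S, gamma=2.0)
eps_guided = guided_noise(eps, grad, alpha)
```

The blend moves $x_t$ toward the one-step estimate $\hat{x}_0$, which gives the encoder an input closer to a clean image than the raw noisy latent.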

2.5. Results

2.5.1. Setup

①Dataset: the Natural Scenes Dataset (NSD)

②Subjects: they use 4 of the 8 NSD subjects (S1, S2, S5, and S7), because these four viewed all 10,000 natural scene images, each repeated three times

③Images in NSD: MS COCO

④Feature of fMRI: $\beta$ value calculated by GLMSingle

⑤Voxel normalization: $\mu=0$ and $\sigma=1$ per session

⑥fMRI responses are averaged across repeated presentations of the same image

⑦Data split: 9:1 for train/test

⑧What??? 1500 hours on a V100 for each subject??? 62 days???

⑨Diffusion base: stable-diffusion-2-1-base, which produces images of 512 × 512 resolution using ϵ-prediction

⑩Sampling uses the multi-step 2nd-order DPM-Solver++ with 50 steps and applies 0.75 SAG (no idea what this is)

⑪Step size hyperparameter: $\gamma =130.0$

⑫Brain encoder: ViT-B/16 with 224×224 input size

⑬Prompt: null prompt ""

⑭CLIP probes: CoCa ViT-L/14 (LAION-2B)
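The per-session normalization in ⑤ and the repeat averaging in ⑥ could look roughly like the sketch below for a single voxel. The data layout (a list of `(image_ids, betas)` pairs per session), the function names, and the use of the population standard deviation are assumptions for illustration; the notes do not specify these details.

```python
from statistics import mean, pstdev

def zscore_session(betas):
    """Normalize one session of one voxel to mean 0 and std 1.

    Assumption: population standard deviation (pstdev) is used.
    """
    mu, sigma = mean(betas), pstdev(betas)
    return [(b - mu) / sigma for b in betas]

def average_repeats(sessions):
    """Z-score each session, then average responses to the same image."""
    grouped = {}
    for image_ids, betas in sessions:
        for img, z in zip(image_ids, zscore_session(betas)):
            grouped.setdefault(img, []).append(z)
    return {img: mean(vals) for img, vals in grouped.items()}

# Two hypothetical sessions with very different raw scales.
sessions = [
    (["img1", "img2", "img3"], [1.0, 2.0, 3.0]),
    (["img1", "img2", "img3"], [10.0, 20.0, 30.0]),
]
avg = average_repeats(sessions)
```

In this toy, both sessions yield the same z-scores despite a tenfold difference in raw scale, which is the point of normalizing within a session before averaging.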

2.5.2. Broad Category-Selective Networks

①Category selective voxels:

②The top-5 images which have the highest average activation:

③Semantic specificity of images generated by BrainDiVE and of natural images:

2.5.3. Individual ROIs

①Generation ability of OFA and FFA:

②Performance:

2.5.4. Semantic Divisions within ROIs

①Clustering within the food ROI and within OPA:

②Visualization of generated images for the sub-classes in S1 and S2:

③Subsets of OPA:

④Performance: (the table title is clear enough, so I won't describe it)

2.6. Discussion

3. Reference

@article{luo2023brain,
  title={Brain Diffusion for Visual Exploration: Cortical Discovery using Large Scale Generative Models},
  author={Luo, Andrew F and Henderson, Margaret M and Wehbe, Leila and Tarr, Michael J},
  journal={arXiv preprint arXiv:2306.03089},
  year={2023}
}