[arXiv 2024] MindSemantix: Deciphering Brain Visual Experiences with a Brain-Language Model

Paper: [2405.18812] MindSemantix: Deciphering Brain Visual Experiences with a Brain-Language Model

Code: GitHub (the link currently returns "Page not found")

The English here is typed entirely by hand, summarizing and paraphrasing the original paper. Spelling and grammar mistakes are hard to avoid; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read it with that in mind.

Table of Contents

1. Takeaways

2. Section-by-Section Reading

2.1. Abstract

2.2. Introduction

2.3. MindSemantix

2.3.1. Pre-training: Self-supervised Brain-Encoder-Decoder (BED)

2.3.2. Training: End-to-end Brain-Language Model (BLM)

2.3.3. MindSemantix for Visual Reconstruction

2.4. Experimental Results

2.4.1. Dataset and Setting

2.4.2. Evaluation Metric

2.4.3. Captioning Results

2.4.4. MindSemantix for Reconstruction

2.4.5. Ablation Study

2.5. Discussion

2.6. Conclusion

1. Takeaways

(1) The code still seems to be in a "coming soon" state and the link does not open (at least not for me); the arXiv page also says the authors will release it later.

(2) Who keeps pushing me!!! I'm just a carefree blogger who reads papers.

2. Section-by-Section Reading

2.1. Abstract

        ①Challenge: brain activity decoding

Decipher (v.): to decode or make out something hard to read or interpret; to understand something mysterious or obscure

2.2. Introduction

        ①"captioning could facilitate visual decoding by approximatively simulating the reverse process of perceiving stimuli" 字幕预测为什么是视觉感知刺激逆向过程?感觉在默认人看到图片就一定会内心中有语义的解答,而且有些ADHD的看了不和没看一样(虽然一般不收录这种

        ②"low-level visual processing and high-level semantic processing"为什么看图片刺激是低级视觉刺激?还有高级语义?为什么不认可人的语言功能自动包含了低级语义(比如人已经会语法了)

        ③Recent works utilize ridge regression to extract fMRI features (how did this trend even start!! Ridge regression really is everywhere)

        ④Pipeline of MindSemantix:

(If the authors only care about high-level semantics, wouldn't the output text on the right end up being just keywords?)

2.3. MindSemantix

        ①Specific pre-training and training processes:

(The upper half looks very similar to BrainChat, haha, though this may just be a common design by now.) (Why does pre-training encode the masked fMRI while the green box below encodes the raw signal directly? Wouldn't that mismatch cause problems?!)

2.3.1. Pre-training: Self-supervised Brain-Encoder-Decoder (BED)

        ①Reason for masking and reconstructing the fMRI signal: spatial blurring caused by the hemodynamic response and spatial smoothing

        ②L2 loss for brain signal (patch-wise) reconstruction:

\mathcal{L}_{BED}=\sum_{i=1}^{N_{all}}\mathcal{L}_{2}(\mathbf{x}_{i}^{\prime},\mathbf{x}_{i})=\sum_{i=1}^{N_{all}}\|\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\|_{2}

where \mathbf{x}_{i} denotes the i-th real fMRI token, \mathbf{x}_{i}^{\prime} denotes the recovered one, and N_{all} is the total number of fMRI tokens (a minimal code sketch follows below)
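Below is a minimal, hypothetical PyTorch sketch of this patch-wise reconstruction objective. The module name `ToyBrainEncoderDecoder`, the layer sizes, and the omission of the random masking step are all assumptions for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

# Toy BED-style encoder-decoder over 1D fMRI patch tokens (hypothetical sizes;
# the random masking of tokens before encoding is omitted for brevity).
class ToyBrainEncoderDecoder(nn.Module):
    def __init__(self, patch_size=16, dim=256):
        super().__init__()
        self.embed = nn.Linear(patch_size, dim)             # tokenize fMRI patches
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(dim, patch_size)            # recover each patch

    def forward(self, x):                                    # x: (B, N_all, patch_size)
        return self.decoder(self.encoder(self.embed(x)))     # x': same shape as x

def bed_loss(x_rec, x):
    # L_BED = sum_i ||x_i - x'_i||_2 over all patch tokens, averaged over the batch
    return torch.linalg.vector_norm(x - x_rec, ord=2, dim=-1).sum(dim=1).mean()

fmri = torch.randn(2, 100, 16)                               # 2 scans, 100 patches of length 16
model = ToyBrainEncoderDecoder()
loss = bed_loss(model(fmri), fmri)
```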

2.3.2. Training: End-to-end Brain-Language Model (BLM)

        ①fMRI projector: a fully-connected (FC) layer and a 1D convolutional layer

        ②Text projector: the FC layer from BLIP-2, reusing the same weights

        ③Caption generation loss:

\mathcal{L}_{BLM}=\sum_{i=1}^{N}\mathcal{L}_{OPT}(\mathbf{c}_{i}^{\prime},\mathbf{c}_{i})=\sum_{i=1}^{N}\left[\sum_{j=1}^{M}\mathcal{L}_{OPT}(\mathrm{BLM}(\mathbf{x}_{i}),\mathbf{c}_{ij})\right]

where \mathbf{c}_{i} and \mathbf{c}_{i}^{\prime} are the real and predicted captions, \mathbf{c}_{ij} is the j-th ground-truth caption of image i, and M is the number of captions per image in the COCO dataset (M=5); a rough sketch of the projector and this loss follows below
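The sketch below assumes a small OPT checkpoint from Hugging Face ("facebook/opt-125m") and made-up dimensions; the fMRI prefix conditioning is omitted, so it only illustrates the shape of the computation, not the authors' actual BLM.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, OPTForCausalLM

# fMRI projector: an FC layer followed by a 1D convolution (dimensions are assumptions).
class FMRIProjector(nn.Module):
    def __init__(self, brain_dim=256, llm_dim=768):
        super().__init__()
        self.fc = nn.Linear(brain_dim, llm_dim)
        self.conv = nn.Conv1d(llm_dim, llm_dim, kernel_size=3, padding=1)

    def forward(self, tokens):                  # tokens: (B, N, brain_dim) from the BED encoder
        h = self.fc(tokens)                     # (B, N, llm_dim)
        h = self.conv(h.transpose(1, 2))        # mix along the token axis
        return h.transpose(1, 2)                # prefix embeddings for the frozen OPT

prefix = FMRIProjector()(torch.randn(1, 100, 256))    # (1, 100, 768)

# Caption loss summed over the M = 5 COCO captions of one image; the real model would
# prepend `prefix` to the caption embeddings, which is skipped here for brevity.
tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
lm = OPTForCausalLM.from_pretrained("facebook/opt-125m")
captions = ["a dog runs on the beach"] * 5            # stand-in for the 5 ground-truth captions
loss = 0.0
for c in captions:
    batch = tok(c, return_tensors="pt")
    out = lm(**batch, labels=batch["input_ids"])      # causal LM loss plays the role of L_OPT
    loss = loss + out.loss
```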

2.3.3. MindSemantix for Visual Reconstruction

        ①SD decoder: a ridge regressor that maps fMRI to a sketch containing low-level visual information (an illustrative code snippet follows below)

        ②Visual reconstruction: generated with Stable Diffusion
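A minimal sketch of the ridge-regression step, assuming the regression target is a flattened Stable Diffusion VAE latent (the exact target, the shapes, and the alpha value below are my guesses, not the paper's settings):

```python
import numpy as np
from sklearn.linear_model import Ridge

n_train, n_vox = 800, 15000                       # hypothetical training size and ROI voxel count
latent_shape = (4, 64, 64)                        # assumed SD VAE latent shape

X_train = np.random.randn(n_train, n_vox)         # fMRI betas (stand-in data)
Z_train = np.random.randn(n_train, int(np.prod(latent_shape)))  # flattened latents of the stimuli

reg = Ridge(alpha=1e4)                            # heavy regularization, common in fMRI decoding
reg.fit(X_train, Z_train)

z_test = reg.predict(np.random.randn(1, n_vox))   # predicted latent for one test scan
z_test = z_test.reshape(1, *latent_shape)         # handed to Stable Diffusion as the low-level "sketch"
```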

2.4. Experimental Results

2.4.1. Dataset and Setting

        ①Dataset: NSD

        ②Subjects: 1/2/3/7 out of the 8 in total

        ③ROI: nsdgeneral

        ④Patch size of fMRI: 16

        ⑤There are a lot of pre-training and training details (epochs, warm-up schedules, etc.) that I really don't feel like copying over here
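For ④ above, a small sketch of how an nsdgeneral voxel vector could be split into patch tokens of size 16 (the voxel count is just an example, and zero-padding to a multiple of 16 is my assumption):

```python
import torch
import torch.nn.functional as F

patch_size = 16
voxels = torch.randn(15724)                  # e.g. one subject's nsdgeneral ROI signal
pad = (-voxels.shape[0]) % patch_size        # pad so the length divides evenly into patches
tokens = F.pad(voxels, (0, pad)).view(-1, patch_size)   # (N_all, 16) fMRI patch tokens
```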

2.4.2. Evaluation Metric

        ①Low level metrics: Meteor, Rouge, CIDEr

        ②High level metrics: SPICE, CLIP, Sentence
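As an illustration of the embedding-based "Sentence" metric, here is a sentence-transformers sketch (the specific checkpoint is my choice, not necessarily the one used by the paper):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")          # assumed checkpoint
pred = "a man riding a surfboard on a wave"              # predicted caption (example)
gt = "a surfer rides a large wave in the ocean"          # ground-truth caption (example)

emb = model.encode([pred, gt], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()              # higher = closer in meaning
print(f"Sentence similarity: {score:.3f}")
```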

2.4.3. Captioning Results

        ①Performance of MindSemantix when Gaussian noise is added:

        ②Visualization of caption generation:

        ③Quantitative results:

2.4.4. MindSemantix for Reconstruction

        ①Example of image reconstruction:

        ②Comparison of image reconstruction:

        ③Comparison table of quantitative metrics:

2.4.5. Ablation Study

        ①Module ablation on subject 1:

        ②How caption assists image reconstruction:

2.5. Discussion

        ①The generated captions could be refined further, and better pre-trained models could also be used

2.6. Conclusion

        ~
