[ICASSP 2025] BrainChat: Interactive Semantic Information Decoding from fMRI Using Large-Scale Vision-Language Pretrained Models

Paper: BrainChat: Interactive Semantic Information Decoding from fMRI Using Large-Scale Vision-Language Pretrained Models | IEEE Conference Publication | IEEE Xplore

Code: GitHub - HuangWanqiu/BrainChat-Code: BrainChat: Decoding Semantic Information from fMRI using Vision-language Pretrained Models

The English here is typed entirely by hand, summarizing and paraphrasing the original paper. Spelling and grammar mistakes are hard to avoid; if you spot any, corrections in the comments are welcome! This post reads more like personal notes, so take it with a grain of salt.

Contents

1. Thoughts

2. Section-by-Section Reading

2.1. Abstract

2.2. Introduction

2.3. Method

2.4. Experiment

2.4.1. Implementation

2.4.2. fMRI Captioning Evaluation

2.4.3. fQA Evaluation

2.4.4. Adapting BrainChat without Image Data

2.5. Conclusion

3. Reference

1. Thoughts

(1) Are you the paper that was destined for me? I can't conclude that yet; I have real doubts about whether the repo's "key code" actually contains what I'm looking for.

2. Section-by-Section Reading

2.1. Abstract

        ①Task: decode semantic information from fMRI

2.2. Introduction

        ①Task of BrainChat:

        ②Integrated techniques: Contrastive Captioner (CoCa) and Masked Brain Modeling (MBM)

        ③Training mode: encoder/decoder pretraining and regression decoding

aphasia  n. loss of the ability to produce or understand speech

2.3. Method

        ①Framework of BrainChat:

        ②Pre-training stage: pretrain the encoder f_\theta and decoder with MBM. Patchify the fMRI data and set masked patches to 0, then reconstruct the masked patches under a mean squared error (MSE) loss.
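The MBM step above (patchify, mask 75% of the patches, MSE on the masked patches only) can be sketched as follows. This is a minimal illustration, not the authors' implementation; all function names are made up, and the encoder/decoder themselves are omitted.

```python
import numpy as np

def patchify(fmri, patch_size):
    """Split a 1-D voxel vector into non-overlapping patches of length patch_size."""
    n = len(fmri) // patch_size
    return fmri[: n * patch_size].reshape(n, patch_size)

def mask_patches(patches, mask_ratio=0.75, rng=None):
    """Zero out a random mask_ratio fraction of patches; return the masked copy and the mask."""
    rng = rng or np.random.default_rng(0)
    n = patches.shape[0]
    mask = np.zeros(n, dtype=bool)
    mask[rng.permutation(n)[: int(n * mask_ratio)]] = True
    masked = patches.copy()
    masked[mask] = 0.0  # the paper sets masked patches to 0
    return masked, mask

def mbm_loss(reconstruction, target, mask):
    """MSE computed only on the masked patches, MAE-style."""
    return float(((reconstruction[mask] - target[mask]) ** 2).mean())
```

The masked patches would be fed to the encoder/decoder, and `mbm_loss` compares the decoder's output against the original patches at the masked positions only.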

        ③The encoded fMRI features are projected to align with the image/text embeddings extracted by the frozen image encoder g_\theta and text encoder q_\theta from CoCa

        ④fMRI-image contrastive loss L_{fi}, fMRI-text contrastive loss L_{ft}:

L_{fi}=-\left(\underbrace{\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(p_{\theta}(f_{\theta}(b_{i}))^{T}g_{\theta}(v_{i})/\sigma)}{\sum_{j=1}^{N}\exp(p_{\theta}(f_{\theta}(b_{i}))^{T}g_{\theta}(v_{j})/\sigma)}}_{\text{fMRI-to-image}}+\underbrace{\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(g_{\theta}(v_{i})^{T}p_{\theta}(f_{\theta}(b_{i}))/\sigma)}{\sum_{j=1}^{N}\exp(g_{\theta}(v_{i})^{T}p_{\theta}(f_{\theta}(b_{j}))/\sigma)}}_{\text{image-to-fMRI}}\right)

L_{ft}=-\left(\underbrace{\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(p_{\theta}(f_{\theta}(b_{i}))^{T}q_{\theta}(t_{i})/\sigma)}{\sum_{j=1}^{N}\exp(p_{\theta}(f_{\theta}(b_{i}))^{T}q_{\theta}(t_{j})/\sigma)}}_{\text{fMRI-to-text}}+\underbrace{\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(q_{\theta}(t_{i})^{T}p_{\theta}(f_{\theta}(b_{i}))/\sigma)}{\sum_{j=1}^{N}\exp(q_{\theta}(t_{i})^{T}p_{\theta}(f_{\theta}(b_{j}))/\sigma)}}_{\text{text-to-fMRI}}\right)

where (b_{i},v_{i},t_{i}) denotes a paired fMRI–image–text triple; p_{\theta}(f_{\theta}(b_{i})), g_{\theta}(v_{i}), and q_{\theta}(t_{i}) are the embeddings of the fMRI, image, and text in the i-th pair; N is the batch size; and \sigma is the temperature parameter

        ⑤Caption loss L_{cap}:

L_{cap}=-\sum_{k=1}^{T}\log P_{\theta}\left(t_{k}\mid t_{<k},b\right)

where P_{\theta}(t_{k}\mid t_{<k},b) is the probability of generating token t_k conditioned on the tokens from previous time steps t_{<k} and the fMRI data b
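Given per-step vocabulary logits from the decoder, L_{cap} is the summed negative log-likelihood of the target tokens. A minimal sketch (the function name and array layout are illustrative assumptions):

```python
import numpy as np

def caption_loss(logits, tokens):
    """Autoregressive caption loss: sum of -log P(t_k | t_<k, b).

    logits: (T, V) decoder scores at each step, already conditioned on
            the previous tokens and the fMRI data.
    tokens: (T,) target token ids.
    """
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(tokens)), tokens].sum())
```

The total loss L_{BrainChat} then combines this with the two contrastive losses via the \lambda weights.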

        ⑥Total loss:

L_{BrainChat}=\lambda_{fi}L_{fi}+\lambda_{ft}L_{ft}+\lambda_{\mathrm{Cap}}L_{cap}

where the \lambda terms are weighting coefficients

        ⑦fMRI captioning task: feed the first k words of the caption and predict the word at time step k+1
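At inference time, this next-word scheme amounts to autoregressive decoding: feed the tokens generated so far, take the most likely next token, and repeat. A minimal greedy-decoding sketch (the `step_fn` callback standing in for the fMRI-conditioned decoder is a hypothetical placeholder):

```python
import numpy as np

def greedy_decode(step_fn, bos, eos, max_len=20):
    """Greedy autoregressive decoding.

    step_fn(tokens) should return a vocabulary score vector for the next
    token given the tokens generated so far (conditioned on the fMRI data
    inside the real model). Decoding stops at eos or after max_len steps.
    """
    tokens = [bos]
    for _ in range(max_len):
        nxt = int(np.argmax(step_fn(tokens)))
        tokens.append(nxt)
        if nxt == eos:
            break
    return tokens
```

Beam search or sampling could replace the argmax, but greedy decoding is the simplest instance of the "predict word k+1 from the first k words" loop.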

        ⑧fQA task: the question is fed to the text encoder as a prompt, e.g. "Question: What color is the water? Answer:"

        ⑨Caption generation (the image encoder and its corresponding loss are dropped):

where green text marks the generated captions and red text marks grammatical errors

2.4. Experiment

2.4.1. Implementation

        ①Dataset 1: subject 1 of the Natural Scenes Dataset (NSD), with 15,724 voxels and captions from COCO for fMRI captioning. NSD and VQA datasets are combined for the fQA task

        ②Dataset 2: HCP, used to predict masked fMRI in the pre-training stage

        ③Encoder and decoder: ViT

        ④Mask ratio: 0.75

        ⑤Hyperparameters during pre-training: learning rate 5e-10 with 0.05 weight decay; AdamW with \beta_1=0.9 and \beta_2=0.95, with NativeScaler gradient scaling

        ⑥Hyperparameters during brain decoder training: learning rate 1e-4 with 0.1 weight decay

        ⑦Loss weights: 20 for the caption loss and 1 for the contrastive losses

2.4.2. fMRI Captioning Evaluation

        ①Captioning performance:

        ②Quantitative evaluation of fMRI captioning:

2.4.3. fQA Evaluation

        ①fQA performance:

        ②Results of fQA:

where green text marks the answers

2.4.4. Adapting BrainChat without Image Data

        ①Performance remains acceptable even without image data

2.5. Conclusion

        ~

3. Reference

@INPROCEEDINGS{10889434,
  author={Huang, Wanqiu and Ma, Ke and Xie, Tingyu and Wang, Hongwei},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={BrainChat: Interactive Semantic Information Decoding from fMRI Using Large-Scale Vision-Language Pretrained Models}, 
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Training;Semantics;Functional magnetic resonance imaging;Signal processing;Brain modeling;Question answering (information retrieval);Decoding;Data mining;Speech processing;Software development management;fMRI question answering;fMRI captioning;fMRI decoding;large-scale vision-language model;human-computer interaction},
  doi={10.1109/ICASSP49660.2025.10889434}}
