回答参考qwen
你遇到的问题是:
> 在使用 Seurat v5 进行单细胞数据分析时,完成了 SCTransform 标准化、Harmony 批次校正、聚类和 UMAP 可视化后,在进行差异基因分析(`FindAllMarkers()`)时报错:
>
> ```
> 警告: No DE genes identified
> 警告: The following tests were not performed:
> 警告: When testing 0 versus all: Object contains multiple models with unequal library sizes. Run PrepSCTFindMarkers() before running FindMarkers()
> ```
>
> 尝试运行 `PrepSCTFindMarkers()` 后提示:
> ```
> minimum UMI unchanged. Skipped re-correct.
> ```
---
## 🔍 问题本质
这个错误的核心在于:**你的 Seurat 对象中存在多个独立的 SCT 模型(SCT model),它们是在每个样本分别做 `SCTransform` 时生成的,且这些模型之间的“平均 UMI 数”不一致。**
Seurat v5 的 `FindMarkers()` 函数在使用 `assay = "SCT"` 时要求:
✅ 必须对所有细胞统一使用一个 **全局共享的负二项回归模型(SCT model)**
❌ 如果存在多个不同 library size 假设下的模型 → 无法比较 → 报错跳过
而你目前的状态是:
- 每个样本单独做了 `SCTransform(..., sct = TRUE)` → 产生了多个独立的 `SCT` assay 中的 `model`
- 合并后没有重新拟合或整合这些模型
- 因此 `PrepSCTFindMarkers()` 检测到“已有模型”,但发现不能合并(因为 UMI 分布不同)→ 直接跳过
---
## ✅ 正确解决方案(关键步骤)
你需要 **放弃早期分样本的 SCT 模型**,改为在合并后 **统一执行一次全局 SCTransform**。
### ❌ 错误流程(你现在走的):
```r
for each sample:
CreateSeuratObject → SCTransform → DoubletFinder → merge
```
👉 导致多模型残留
### ✅ 正确流程(推荐做法):
```r
for each sample:
CreateSeuratObject → QC filtering
↓
merge all samples
↓
Run SCTransform ONCE on the merged object
↓
DoubletFinder (optional), PCA, Harmony, Clustering...
```
---
## ✅ 修改后的代码修正建议
以下是对你原始代码的关键修改部分(重点在第3步和第5步):
```r
# -----------------------------
# 1. 数据读取与元数据构建(不变)
# -----------------------------
data_dir <- "D:/GSEXXXX_RAW"
h5_files <- list.files(data_dir, pattern = "\\.h5$", full.names = TRUE)
meta_data <- data.frame(
file = h5_files,
sample_id = str_extract(basename(h5_files), "GSM\\d+"),
condition = str_extract(basename(h5_files), "(Ctrl|EcP|EcO)"),
stringsAsFactors = FALSE
)
# -----------------------------
# 2. 分样本创建Seurat对象及初步QC(只做QC,不做SCT!)
# -----------------------------
seurat_list <- lapply(seq_along(h5_files), function(i) {
counts <- Read10X_h5(meta_data$file[i])
obj <- CreateSeuratObject(
counts = counts,
project = meta_data$sample_id[i],
min.cells = 3,
min.features = 200
)
# 添加元数据
obj$condition <- meta_data$condition[i]
obj$sample_id <- meta_data$sample_id[i]
# 计算QC指标
obj[["percent.mt"]] <- PercentageFeatureSet(obj, pattern = "^MT-")
obj[["percent.rb"]] <- PercentageFeatureSet(obj, pattern = "^RP[SL]")
obj[["percent.hb"]] <- PercentageFeatureSet(obj, pattern = "^HB[AB]")
# 初步QC过滤
obj <- subset(obj,
subset = nFeature_RNA > 500 &
nFeature_RNA < 6000 &
percent.mt < 20 &
percent.rb < 50 &
percent.hb < 0.5
)
return(obj)
})
names(seurat_list) <- meta_data$sample_id
saveRDS(seurat_list, file = "seurat_list.qc.rds")
# -----------------------------
# 3. 合并后再统一做 SCTransform(核心修改点!)
# -----------------------------
merged_seurat <- merge(seurat_list[[1]], y = seurat_list[-1]) # 先合并
# 在合并后的对象上运行 SCTransform(仅一次)
merged_seurat <- SCTransform(
merged_seurat,
vars.to.regress = c("percent.mt"),
new.genes = TRUE, # 是否添加新基因到 variable features
verbose = TRUE
)
# 可选:再次过滤双细胞(现在可以用 SCT assay)
merged_seurat <- RunPCA(merged_seurat, assay = "SCT", npcs = 20)
sweep.res <- paramSweep_v3(merged_seurat, PCs = 1:20, sct = TRUE)
sweep.stats <- summarizeSweep(sweep.res)
bcmvn <- find.pK(sweep.stats)
optimal_pK <- as.numeric(bcmvn$pK[which.max(bcmvn$BCmetric)])
n_cells <- ncol(merged_seurat)
doublet_rate <- 0.008 * (n_cells / 1000)
nExp <- max(round(doublet_rate * n_cells), 50)
merged_seurat <- doubletFinder(
merged_seurat,
PCs = 1:20,
pN = 0.25,
pK = optimal_pK,
nExp = nExp,
sct = TRUE
)
# 过滤双细胞
df_col <- grep("DF.classifications", colnames(merged_seurat@meta.data), value = TRUE)
merged_seurat <- subset(merged_seurat, cells = colnames(merged_seurat)[merged_seurat[[df_col]] == "Singlet"])
# -----------------------------
# 4. 整合并降维聚类(Harmony + UMAP)
# -----------------------------
# 提取可变特征用于整合
var_features <- SelectIntegrationFeatures(object.list = SplitObject(merged_seurat, split.by = "sample_id"), nfeatures = 3000)
VariableFeatures(merged_seurat) <- var_features
# 使用 SCT assay 运行 PCA
merged_seurat <- RunPCA(merged_seurat, assay = "SCT", npcs = 20, verbose = FALSE)
# 运行 Harmony 批次校正
merged_seurat <- RunHarmony(
merged_seurat,
group.by.vars = "sample_id",
assay.use = "SCT",
project.dim = FALSE,
plot_convergence = TRUE
)
# 构建邻居图 & 聚类
merged_seurat <- FindNeighbors(
merged_seurat,
reduction = "harmony",
dims = 1:20,
assay = "SCT",
graph.name = "harmony_snn"
)
merged_seurat <- FindClusters(
merged_seurat,
graph.name = "harmony_snn",
resolution = seq(0.2, 1.2, 0.2),
algorithm = 1,
verbose = FALSE
)
# UMAP
merged_seurat <- RunUMAP(
merged_seurat,
reduction = "harmony",
dims = 1:20,
reduction.name = "umap_harmony"
)
# 绘图
p1 <- DimPlot(merged_seurat, reduction = "umap_harmony", group.by = "seurat_clusters", label = TRUE)
p2 <- DimPlot(merged_seurat, reduction = "umap_harmony", group.by = "sample_id")
p1 | p2
# -----------------------------
# 5. 差异基因分析(现在可以正常运行!)
# -----------------------------
DefaultAssay(merged_seurat) <- "SCT"
# ✅ 现在 PrepSCTFindMarkers 应该会成功
merged_seurat <- PrepSCTFindMarkers(
merged_seurat,
assay = "SCT",
verbose = TRUE
)
# 查找所有标记基因
markers <- FindAllMarkers(
merged_seurat,
only.pos = TRUE,
min.pct = 0.25,
logfc.threshold = 0.5,
test.use = "wilcox",
assay = "SCT"
)
# 提取每簇 top5
top_markers <- markers %>%
group_by(cluster) %>%
filter(p_val_adj < 0.05) %>%
slice_max(n = 5, order_by = avg_log2FC)
print(top_markers)
saveRDS(top_markers, file = "top_markers.rds")
# -----------------------------
# 6. 细胞注释(SingleR)
# -----------------------------
# 注意:SingleR 推荐使用 RNA assay 的 log-normalized data
# 所以我们要先从 SCT 模型获取表达值
# 方法一:提取 SCT 校正后的数据(log-normalized)
counts <- GetAssayData(merged_seurat, assay = "SCT", slot = "data") # log(x+1)
# 或方法二:切换回 RNA assay 并标准化
if (!"RNA" %in% names(merged_seurat@assays)) {
merged_seurat[['RNA']] <- CreateAssayObject(counts = GetAssayData(merged_seurat, assay = 'SCT', slot = 'counts'))
merged_seurat <- NormalizeData(merged_seurat, assay = 'RNA')
}
counts_rna <- GetAssayData(merged_seurat, assay = "RNA", slot = "data")
ref <- celldex::HumanPrimaryCellAtlasData()
pred <- SingleR(
test = counts, # 或用 counts_rna
ref = ref,
labels = ref$label.main,
clusters = merged_seurat$seurat_clusters
)
merged_seurat$singleR_labels <- pred$labels[match(merged_seurat$seurat_clusters, rownames(pred))]
# -----------------------------
# 7. 结果保存
# -----------------------------
saveRDS(merged_seurat, file = "integrated_seurat_final.rds")
write.csv(markers, "cluster_markers.csv", row.names = FALSE)
pdf("QC_plots.pdf")
VlnPlot(merged_seurat, features = c("nFeature_RNA", "percent.mt"), group.by = "sample_id")
DimPlot(merged_seurat, group.by = "singleR_labels", label = TRUE)
FeaturePlot(merged_seurat, features = c("CD3D", "CD19", "EPCAM"))
dev.off()
```
---
## 📌 关键要点总结
| 问题 | 解决方案 |
|------|----------|
| ❌ 多个 SCT 模型导致 `FindMarkers` 失败 | ✅ 合并后再统一只做一次 `SCTransform` |
| ⚠️ `PrepSCTFindMarkers()` 被跳过 | ✅ 因为已有模型未更新 → 改为后期统一建模 |
| 💡 是否还能用 Harmony? | ✅ 可以,只要在 `SCT` assay 上运行即可 |
| 🧪 双细胞检测怎么做? | ✅ 在 `SCT` + PCA 后进行,使用 `doubletFinder(sct = TRUE)` |
| 📊 SingleR 用什么数据? | ✅ 推荐用 `SCT` assay 的 `data` slot(log-normalized) |
---
## ✅ 额外建议
- 若你坚持想保留早期 SCT 模型(如资源有限无法重跑),可尝试:
```r
# 强制清除旧模型
merged_seurat[['SCT']]@misc$prepped.markers <- NULL
merged_seurat[['SCT']]@model <- NULL
```
然后再运行 `PrepSCTFindMarkers()`,但风险较高,不如重做可靠。
- 推荐阅读 Seurat 官方最新教程(v5):
https://satijalab.org/seurat/articles/integration_introduction.html
---
完成以上修改后,你的 `FindAllMarkers()` 将能正常运行并输出各簇的 marker genes。
---