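Scanpy is a Python package for analyzing single-cell data, covering preprocessing, visualization, clustering, pseudotime analysis, and differential expression analysis. This post is a translation of scanpy's official tutorial Preprocessing and clustering 3k PBMCs[1], which uses scanpy to reproduce most of Seurat's clustering tutorial[2].
0. Installing scanpy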
Anaconda
# scanpy
conda install -c bioconda scanpy
# Leiden clustering package
conda install -c conda-forge leidenalg
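Installing scanpy through conda failed with an error. I spent a long time on it without success, and rebuilding the environment did not help either.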
conda install -c bioconda scanpy
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed
UnsatisfiableError: The following specifications were found to be incompatible with each other:
Output in format: Requested package -> Available versions
PyPI
Installing directly with pip worked.
pip install scanpy[louvain]
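To confirm that the pip install actually landed in the active environment, a quick check along these lines may help. It is a minimal sketch that uses only the standard library and assumes Python 3.8 or newer (for `importlib.metadata`). The helper name `installed_version` is hypothetical and not part of scanpy, and the assumption that the `[louvain]` extra installs a distribution named `louvain` should be verified against your scanpy version.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as published on PyPI.
# `louvain` is assumed to be pulled in by the [louvain] extra.
for name in ("scanpy", "anndata", "louvain"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If scanpy is reported as not installed even though pip finished without errors, pip and the Python interpreter in use probably belong to different environments.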
Docker
docker pull fastgenomics/scanpy:1.4-p368-v1-stretch-slim
1. Loading the data
# Download the PBMC dataset
## This is the same example dataset used in the Seurat tutorial, so there is no need to download it again if you already have it
!mkdir data
!wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
!cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
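The shell commands above depend on `wget` and `tar`, which may not be available everywhere (for example on Windows). A rough standard-library equivalent is sketched below. `fetch_and_extract` is a hypothetical helper introduced for illustration, not a scanpy function.

```python
import tarfile
import urllib.request
from pathlib import Path


def fetch_and_extract(url, dest_dir="data"):
    """Download a .tar.gz archive into dest_dir (unless already there) and unpack it."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Name the local archive after the last path component of the URL.
    archive = dest / url.rsplit("/", 1)[-1]
    if not archive.exists():  # skip the download if the archive is already present
        urllib.request.urlretrieve(url, str(archive))
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)  # only unpack archives from a source you trust
    return dest
```

Calling `fetch_and_extract("http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz")` should leave the same `data/` layout as the shell version. Because an existing archive is reused, rerunning the call does not trigger a second download.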
import numpy as np
import pandas as pd
import scanpy as sc
/Users/baozhiwei/anaconda3/lib/python3.7/site-packages/anndata/core/anndata.py:17: FutureWarning: pandas.core.index is deprecated and will be removed in a future version. The public classes are available in the top-level namespace.
  from pandas.core.index import RangeIndex
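# verbosity controls how detailed the logging output is; the larger the number, the more detail
## errors (0), warnings (1), info (2), hints (3)
sc.settings.verbosity = 3
# Print version numbers
sc.logging.print_versions()
# set_figure_params sets the figure resolution/size and other style options
sc.settings.set_figure_params(dpi=80)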
scanpy==1.4.5.post3 anndata==0.6.22.post1 umap==0.3.10 numpy==1.18.1 scipy==1.4.1 pandas==1.0.1 scikit-learn==0.22.1 statsmodels==0.11.0 python-igraph==0.7.1
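# Path of the file that will store the analysis results
results_file = './pbmc3k.h5ad'
# Load the 10X data
adata = sc.read_10x_mtx(
    './data/filtered_gene_bc_matrices/hg19/',  # directory containing the `.mtx` file
    var_names='gene_symbols',                  # use gene symbols as variable names (variables-axis index)
    cache=True)                                # write a cache file to speed up subsequent reads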
... writing an h5ad cache file to speedup reading next time
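If `sc.read_10x_mtx` cannot find its input, a quick look at the directory contents may narrow the problem down. The sketch below assumes the Cell Ranger v2-style layout of this PBMC 3k archive (`matrix.mtx`, `genes.tsv`, `barcodes.tsv`). `missing_10x_files` is a hypothetical helper, not part of scanpy.

```python
from pathlib import Path

# File names assumed for Cell Ranger v2-style output such as this PBMC 3k archive;
# newer Cell Ranger versions write gzipped files with different names.
EXPECTED_FILES = ("matrix.mtx", "genes.tsv", "barcodes.tsv")


def missing_10x_files(directory):
    """Return the expected file names that are not present in `directory`."""
    folder = Path(directory)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


missing = missing_10x_files("./data/filtered_gene_bc_matrices/hg19/")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All expected 10x files found.")
```

An empty result only shows that the files exist. It does not validate their contents.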