TCGAbiolinks的使用

原创已于 2022-09-10 15:24:36 修改 · 1.2w 阅读

61 ·

CC 4.0 BY-SA版权

文章标签：

#java #linux #运维

于 2020-02-17 01:04:16 首次发布

R 专栏收录该内容

2 篇文章

订阅专栏

本文介绍TCGAbiolinks工具的使用方法，包括从GDC下载和预处理临床、基因表达及突变数据，以及如何利用这些数据进行生物信息学分析。

仅供学习交流

更新历史
- - created：2019
  - updated：2022
介绍
Demo Code

更新历史

created：2019

当初使用TCGAbiolinks由官方文档翻译过来，目前仍有相关专业初学者在使用此工具故此更新部分内容，注意查阅最新官方文档；

updated：2022

目前此工具本人已不再使用，补充下当时写的一段代码，内有本人注释，方便初学者直观理解；

介绍

一、官方文档介绍

TCGAbiolinks is able to access The National Cancer Institute (NCI) Genomic Data Commons (GDC) thorough itsGDC Application Programming Interface (API) to search, download and prepare relevant data for analysis in R.

二、安装

可以从bioconductor上下载安装稳定版或开发版的TCGA插件。
1.版本安装
1.1 稳定版

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("TCGAbiolinks")

1.2 开发版（如果稳定版使用存在问题，尝试使用开发版）

devtools::install_github('BioinformaticsFMRP/TCGAbiolinks')

2.1 可使用Docker image
如果想使用docker image（一个极度精简版的Linux程序运行环境），也是适用于TCGA。
For more information please check: https://docs.docker.com/ and https://www.bioconductor.org/help/docker/

3.1 需要的库

根据实际情况导入库，例如：
library(TCGAbiolinks)
library(dplyr)
library(DT)

4.1 Session info

version
packageVersion("TCGAbiolinks")

三、GDC clinical data的相关介绍

在GDC数据库里，可以从不同的源来获得clinical data。
1）Indexed clinical：从XML文件里提炼出来的clinical data（提炼指indexed数据是更新过的数据）；
2）XML：原数据（包含各种信息如radiation, drugs information, follow-ups, biospecimen, etc. ）；
3）BCR Biotab：从XML里面解析出来的tsv文件（格式化过的数据）；

1.1 从BCR Biotab里面抓取数据

query <- GDCquery(project = "TCGA-ACC", 
                  data.category = "Clinical",
                  data.type = "Clinical Supplement", 
                  data.format = "BCR Biotab")
GDCdownload(query)
clinical.BCRtab.all <- GDCprepare(query)
names(clinical.BCRtab.all)

query <- GDCquery(project = "TCGA-ACC", 
                  data.category = "Clinical",
                  data.type = "Clinical Supplement", `在这里插入代码片`
                  data.format = "BCR Biotab",
                   file.type = "radiation")
GDCdownload(query)
clinical.BCRtab.radiation <- GDCprepare(query)

clinical.BCRtab.all$clinical_drug_acc  %>% 
  head  %>% 
  datatable(options = list(scrollX = TRUE, keys = TRUE))

1.2 从indexed data抓取数据

clinical <- GDCquery_clinic(project = "TCGA-LUAD", type = "clinical")
clinical %>%
  datatable(filter = 'top', 
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),  
          rownames = FALSE)

1.3从XML clinical data中抓取数据
1）使用GDCquery 和 GDCDownload 函数来查找和下载biospecimen 或 clinical XML files ；
GDCquery参数：

?project	A list of valid project (see table below)
data.category	A valid project (see list with TCGAbiolinks:::getProjectSummary(project))
data.type	A data type to filter the files to download
workflow.type	GDC workflow type
legacy	Search in the legacy repository
access	Filter by access type. Possible values: controlled, open
platform	Example:
	CGH- 1x1M_G4447A	IlluminaGA_RNASeqV2
	AgilentG4502A_07	IlluminaGA_mRNA_DGE
	Human1MDuo	HumanMethylation450
	HG-CGH-415K_G4124A	IlluminaGA_miRNASeq
	HumanHap550	IlluminaHiSeq_miRNASeq
	ABI	H-miRNA_8x15K
	HG-CGH-244A	SOLiD_DNASeq
	IlluminaDNAMethylation_OMA003_CPI	IlluminaGA_DNASeq_automated
	IlluminaDNAMethylation_OMA002_CPI	HG-U133_Plus_2
	HuEx- 1_0-st-v2	Mixed_DNASeq
	H-miRNA_8x15Kv2	IlluminaGA_DNASeq_curated
	MDA_RPPA_Core	IlluminaHiSeq_TotalRNASeqV2
	HT_HG-U133A	IlluminaHiSeq_DNASeq_automated
	diagnostic_images	microsat_i
	IlluminaHiSeq_RNASeq	SOLiD_DNASeq_curated
	IlluminaHiSeq_DNASeqC	Mixed_DNASeq_curated
	IlluminaGA_RNASeq	IlluminaGA_DNASeq_Cont_automated
	IlluminaGA_DNASeq	IlluminaHiSeq_WGBS
	pathology_reports	IlluminaHiSeq_DNASeq_Cont_automated
	Genome_Wide_SNP_6	bio
	tissue_images	Mixed_DNASeq_automated
	HumanMethylation27	Mixed_DNASeq_Cont_curated
	IlluminaHiSeq_RNASeqV2	Mixed_DNASeq_Cont
file.type	To be used in the legacy database for some platforms, to define which file types to be used.
barcode	A list of barcodes to filter the files to download
experimental.strategy	Filter to experimental stratey. Harmonized: WXS, RNA-Seq, miRNA-Seq, Genotyping Array. Legacy: WXS, RNA-Seq, miRNA-Seq, Genotyping Array, DNA-Seq, Methylation array, Protein expression array, WXS,CGH array, VALIDATION, Gene expression array,WGS, MSI-Mono-Dinucleotide Assay, miRNA expression array, Mixed strategies, AMPLICON, Exon array, Total RNA-Seq, Capillary sequencing, Bisulfite-Seq
sample.type	A sample type to filter the files to download

2）使用 GDCprepare_clinic 函数来解析 XML files；
3）通过参数clinical.info来解析病例个人数据表中的某一个类别的信息。

data.category	clinical.info
Clinical	drug
Clinical	admin
Clinical	follow_up
Clinical	radiation
Clinical	patient
Clinical	stage_event
Clinical	new_tumor_event
Biospecimen	sample
Biospecimen	bio_patient
Biospecimen	analyte
Biospecimen	aliquot
Biospecimen	protocol
Biospecimen	portion
Biospecimen	slide
Other	msi

例如下面列举的抓取几种数据类别：

query <- GDCquery(project = "TCGA-COAD", 
                  data.category = "Clinical", 
                  file.type = "xml", 
                  barcode = c("TCGA-RU-A8FL","TCGA-AA-3972"))
GDCdownload(query)
clinical <- GDCprepare_clinic(query, clinical.info = "patient")

clinical %>% 
datatable(filter = 'top', 
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),  
          rownames = FALSE)

clinical.drug <- GDCprepare_clinic(query, clinical.info = "drug")

clinical.drug %>% 
datatable(filter = 'top', 
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),  
          rownames = FALSE)

clinical.radiation <- GDCprepare_clinic(query, clinical.info = "radiation")

clinical.radiation %>% 
datatable(filter = 'top', 
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),  
          rownames = FALSE)

clinical.admin <- GDCprepare_clinic(query, clinical.info = "admin")

clinical.admin %>% 
datatable(filter = 'top', 
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),  
          rownames = FALSE)

1.4 Microsatellite data
1.5 Tissue slide image (SVS format)
1.6 Diagnostic Slide (SVS format)
1.7 Pathology report (PDF)
1.8 Tissue slide image (SVS format)
1.9 Clinical Supplement (XML format)
1.10 Clinical data (Biotab format)

1.11 Filter functions

1）TCGAquery_SampleTypes
The function TCGAquery_SampleTypes will filter barcodes based on a type the argument typesample；

Argument	Description
barcode	is a list of samples as TCGA barcodes
typesample	a character vector indicating tissue type to query. Example:
	TP	PRIMARY SOLID TUMOR
	TR	RECURRENT SOLID TUMOR
	TB	Primary Blood Derived Cancer-Peripheral Blood
	TRBM	Recurrent Blood Derived Cancer-Bone Marrow
	TAP	Additional-New Primary
	TM	Metastatic
	TAM	Additional Metastatic
	THOC	Human Tumor Original Cells
	TBM	Primary Blood Derived Cancer-Bone Marrow
	NB	Blood Derived Normal
	NT	Solid Tissue Normal
	NBC	Buccal Cell Normal
	NEBV	EBV Immortalized Normal

例子如下：

bar <- c("TCGA-G9-6378-02A-11R-1789-07", "TCGA-CH-5767-04A-11R-1789-07",  
         "TCGA-G9-6332-60A-11R-1789-07", "TCGA-G9-6336-01A-11R-1789-07",
         "TCGA-G9-6336-11A-11R-1789-07", "TCGA-G9-7336-11A-11R-1789-07",
         "TCGA-G9-7336-04A-11R-1789-07", "TCGA-G9-7336-14A-11R-1789-07",
         "TCGA-G9-7036-04A-11R-1789-07", "TCGA-G9-7036-02A-11R-1789-07",
         "TCGA-G9-7036-11A-11R-1789-07", "TCGA-G9-7036-03A-11R-1789-07",
         "TCGA-G9-7036-10A-11R-1789-07", "TCGA-BH-A1ES-10A-11R-1789-07",
         "TCGA-BH-A1F0-10A-11R-1789-07", "TCGA-BH-A0BZ-02A-11R-1789-07",
         "TCGA-B6-A0WY-04A-11R-1789-07", "TCGA-BH-A1FG-04A-11R-1789-08",
         "TCGA-D8-A1JS-04A-11R-2089-08", "TCGA-AN-A0FN-11A-11R-8789-08",
         "TCGA-AR-A2LQ-12A-11R-8799-08", "TCGA-AR-A2LH-03A-11R-1789-07",
         "TCGA-BH-A1F8-04A-11R-5789-07", "TCGA-AR-A24T-04A-55R-1789-07",
         "TCGA-AO-A0J5-05A-11R-1789-07", "TCGA-BH-A0B4-11A-12R-1789-07",
         "TCGA-B6-A1KN-60A-13R-1789-07", "TCGA-AO-A0J5-01A-11R-1789-07",
         "TCGA-AO-A0J5-01A-11R-1789-07", "TCGA-G9-6336-11A-11R-1789-07",
         "TCGA-G9-6380-11A-11R-1789-07", "TCGA-G9-6380-01A-11R-1789-07",
         "TCGA-G9-6340-01A-11R-1789-07", "TCGA-G9-6340-11A-11R-1789-07")

S <- TCGAquery_SampleTypes(bar,"TP")
S2 <- TCGAquery_SampleTypes(bar,"NB")

# Retrieve multiple tissue types  NOT FROM THE SAME PATIENTS
SS <- TCGAquery_SampleTypes(bar,c("TP","NB"))

# Retrieve multiple tissue types  FROM THE SAME PATIENTS
SSS <- TCGAquery_MatchedCoupledSampleTypes(bar,c("NT","TP"))

四、如何使用TCGAbiolinks下载和预处理处理数据以便于分析

1. 下载方式：有两种方式从GDC下载
a. 客户端：通过创建MANIFEST文件建立任务，稳定但速度较api慢；

[GDC Data Transfer Tool]
(https://gdc.cancer.gov/access-data/gdc-data-transfer-tool)

b. API：通过创建MANIFEST文件建立任务，数据下载后打包成tar.gz ，如果数据过大容易出现打包错误中断需要重新执行，可以使用参数 files.per.chunk 将大文件拆分成块。

[GDC Application Programming Interface (API)]
(https://gdc.cancer.gov/developers/gdc-application-programming-interface-api)

2. 数据预处理

2.1 通过SummarizedExperiment对象可以获得三个主要矩阵；

1） Sample matrix：via colData(data): stores sample information

2） Assay matrix ： via assay(data): stores molecular data

3） Feature matrix ：via rowRanges(data): stores metadata about the features, including their genomic ranges

2.2 参数
1）GDCdownload

| Argument |： Description   
| query |：GDCquery 查询函数；
| token.file | ：文件标记后下载数据(仅适用于 method = "client") ；
| method | ：选择下载方式： "api"或"client"；
| directory | ：指定数据下载路径 目录/文件夹， 默认路径: GDCdata ；
| files.per.chunk | ：使用API方式下载时设定同时下载的数量(例如 files.per.chunk = 6) ，会减少下载数据量过大时出现问题的可能；

2）GDCprepare

| Argument | ：Description 
| query |：GDCquery 查询函数；
| save | ：是否作为RData对象保存；
| save.filename | ：指定数据保存的文件名，默认则自动创建；
| directory | ：指定数据下载路径 目录/文件夹， 默认路径: GDCdata ；
| summarizedExperiment | ：是否创建一个summarizedExperiment对象，默认 TRUE (if possible) ；
| remove.files.prepared | ：删除读取的文件，默认: FALSE （这个参数在| save | 参数设为TRUE时使用）；
| add.gistic2.mut |： If a list of genes (gene symbol) is given, columns with gistic2 results from GDAC firehose (hg19) and a column indicating if there is or not mutation in that gene (hg38) (TRUE or FALSE - use the MAF file for more information) will be added to the sample matrix in the summarized Experiment object. 
| mut.pipeline |： If add.gistic2.mut is not NULL this field will be taken in consideration. Four separate variant calling pipelines are implemented for GDC data harmonization. Options: muse, varscan2, somaticsniper, MuTect2. For more information: https://gdc-docs.nci.nih.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/ ；
| mutant_variant_classification | ：List of mutant_variant_classification that will be consider a sample mutant or not. Default: "Frame_Shift_Del", "Frame_Shift_Ins", "Missense_Mutation", "Nonsense_Mutation", "Splice_Site", "In_Frame_Del", "In_Frame_Ins", "Translation_Start_Site", "Nonstop_Mutation" ；

2.3 使用GDC api从legacy database遗产数据库中查找下载

例如从 harmonized database (data aligned against genome of reference hg38)下载gene expression quantification：

query <- GDCquery(project = "TCGA-GBM",
                           data.category = "Gene expression",
                           data.type = "Gene expression quantification",
                           platform = "Illumina HiSeq", 
                           file.type  = "normalized_results",
                           experimental.strategy = "RNA-Seq",
                           barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
                           legacy = TRUE)
GDCdownload(query, method = "api", files.per.chunk = 10)
data <- GDCprepare(query)

data <- gbm.exp.legacy

# Gene expression aligned against hg19.
datatable(as.data.frame(colData(data)), 
              options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
              rownames = FALSE)
# Only first 100 to make render faster
datatable(assay(data)[1:100,], 
              options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
              rownames = TRUE)

rowRanges(data)

2.4 GDCprepare: Outputs
这个函数正在开发中；
在这里插入图片描述
1）Harmonized data

Data.category	Data.type	Workflow Type	Status
Transcriptome Profiling	Gene Expression Quantification	HTSeq - Counts	Data frame or SE (losing 5% of information when mapping to genomic regions)
		HTSeq - FPKM-UQ	Returning only a (losing 5% of information when mapping to genomic regions)
		HTSeq - FPKM	Returning only a (losing 5% of information when mapping to genomic regions)
	Isoform Expression Quantification	Not needed
	miRNA Expression Quantification	Not needed	Returning only a dataframe for the moment
Copy number variation	Copy Number Segment		Returning only a dataframe for the moment
	Masked Copy Number Segment		Returning only a dataframe for the moment
	Gene Level Copy Number Scores		Returning only a dataframe for the moment
Simple Nucleotide Variation
Raw Sequencing Data
Biospecimen	Slide Image
Biospecimen	Biospecimen Supplement
Clinical

2）Legacy data

Data.category	Data.type	Platform	file.type	Status
Transcriptome Profiling
Copy number variation	-	Affymetrix SNP Array 6.0	nocnv_hg18.seg	Working
	-	Affymetrix SNP Array 6.0	hg18.seg	Working
	-	Affymetrix SNP Array 6.0	nocnv_hg19.seg	Working
	-	Affymetrix SNP Array 6.0	hg19.seg	Working
	-	Illumina HiSeq	Several	Working
Simple Nucleotide Variation	Simple somatic mutation
Raw Sequencing Data
Biospecimen
Clinical
Protein expression		MDA RPPA Core	-	Working
Gene expression	Gene expression quantification	Illumina HiSeq	normalized_results	Working
		Illumina HiSeq	results	Working
		HT_HG-U133A	-	Working
		AgilentG4502A_07_2	-	Data frame only
		AgilentG4502A_07_1	-	Data frame only
		HuEx-1_0-st-v2	FIRMA.txt	Not Preparing
			gene.txt	Not Preparing
	Isoform expression quantification
	miRNA gene quantification
	Exon junction quantification
	Exon quantification
	miRNA isoform quantification

DNA methylation		Illumina Human Methylation 450	Not used	Working
		Illumina Human Methylation 27	Not used	Working
		Illumina DNA Methylation OMA003 CPI	Not used	Working
		Illumina DNA Methylation OMA002 CPI	Not used	Working
		Illumina Hi Seq		Not working
Raw Microarray Data
Structural Rearrangement
Other

2.5 完整的例子（https://gist.github.com/tiagochst/a701bad3fa3800ade7063760755e0aad）

# ------------------------------------------------------------------
# Updating TCGAbiolinks to work with GDC data
# --------------------------------------------------------------------
# Install last version from the github (this is a development version)
devtools::install_github("BioinformaticsFMRP/TCGAbiolinks")
library(TCGAbiolinks)


####################### Working harmonized data ###########################
#                   Data.category: clinical and biospecimen
############################################################################
# Clinical information
# https://gdc.nci.nih.gov/about-data/data-harmonization-and-generation/clinical-data-harmonization
clin <- GDCquery_clinic("TCGA-ACC", type = "clinical", save.csv = TRUE)
clin <- GDCquery_clinic("TCGA-ACC", type = "biospecimen", save.csv = TRUE)

#-----------------------------------------------------------------------------
#                   Data.category: MAF files
#-----------------------------------------------------------------------------
mut <- GDCquery_Maf(tumor = "ACC")
clin <- GDCquery_clinic("TCGA-ACC","clinical")
clin <- clin[,c("bcr_patient_barcode","disease","gender","tumor_stage","race","vital_status")]
TCGAvisualize_oncoprint(mut = mut, genes = mut$Hugo_Symbol[1:20],
                        filename = "onco.pdf",
                        annotation = clin,
                        color=c("background"="#CCCCCC","DEL"="purple","INS"="yellow","SNP"="brown"),
                        rows.font.size=10,
                        heatmap.legend.side = "right",
                        dist.col = 0,
                        label.font.size = 10)


#-----------------------------------------------------------------------------
#                   Data.category: Copy number variation
#-----------------------------------------------------------------------------
query <- GDCquery(project = "TCGA-ACC",
                  data.category = "Copy Number Variation",
                  data.type = "Copy Number Segment",
                  barcode = c( "TCGA-OR-A5KU-01A-11D-A29H-01", "TCGA-OR-A5JK-01A-11D-A29H-01"))
GDCdownload(query)
data <- GDCprepare(query)

query <- GDCquery("TCGA-ACC",
                  "Copy Number Variation",
                  data.type = "Masked Copy Number Segment",
                  sample.type = c("Primary solid Tumor")) # query$results[[1]]$cases
GDCdownload(query)
data <- GDCprepare(query)

#-----------------------------------------------------------------------------
#                   Data.category: Transcriptome Profiling
#-----------------------------------------------------------------------------
workflow.type <- c("HTSeq - Counts", "HTSeq - FPKM","HTSeq - FPKM-UQ")
for(i in workflow.type){
    print(i)
    query <- GDCquery(project = "TARGET-AML",
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = i,
                      barcode = c("TARGET-20-PADZCG-04A-01R","TARGET-20-PARJCR-09A-01R"))
    GDCdownload(query)
    data <- GDCprepare(query)
}
#data.type <- c("miRNA Expression Quantification","Isoform Expression Quantification")
data.type <- c("miRNA Expression Quantification")
for(i in data.type){
    print(i)
    query <- GDCquery(project = "TARGET-AML",
                      data.category = "Transcriptome Profiling",
                      data.type = i,
                      workflow.type = "BCGSC miRNA Profiling",
                      barcode = c("TARGET-20-PARUDL-03A-01R","TARGET-20-PASRRB-03A-01R"))
    GDCdownload(query)
    data <- GDCprepare(query)
    print(head(data))
}


####################### Working with Legacy data ###########################
#                   Data.category: Copy number variation
############################################################################
query <- GDCquery(project = "TCGA-ACC",
                  data.category =  "Copy number variation",
                  legacy = TRUE,
                  file.type = "nocnv_hg19.seg",
                  barcode = c("TCGA-OR-A5LR-01A-11D-A29H-01", "TCGA-OR-A5LJ-10A-01D-A29K-01"))
GDCdownload(query)
z <- GDCprepare(query)

query <- GDCquery(project = "TCGA-ACC",
                  data.category =  "Copy number variation",
                  legacy = TRUE,
                  file.type = "nocnv_hg18.seg",
                  barcode = c("TCGA-OR-A5LR-01A-11D-A29H-01", "TCGA-OR-A5LJ-10A-01D-A29K-01"))
GDCdownload(query)
z <- GDCprepare(query)


query <- GDCquery(project = "TCGA-ACC",
                  data.category =  "Copy number variation",
                  legacy = TRUE,
                  file.type = "hg19.seg",
                  barcode = c("TCGA-OR-A5LR-01A-11D-A29H-01", "TCGA-OR-A5LJ-10A-01D-A29K-01"))
GDCdownload(query)
z <- GDCprepare(query)

query <- GDCquery(project = "TCGA-LGG",
                  barcode = c("TCGA-HT-7476-10A-01D-2022-02", "TCGA-FG-6689-01A-11D-1891-02"),
                  data.category =  "Copy number variation", platform = "Illumina HiSeq", legacy = TRUE)
GDCdownload(query)
z <- GDCprepare(query)

query <- GDCquery(project = "TCGA-ACC",
                  data.category =  "Copy number variation",
                  legacy = TRUE,
                  file.type = "hg18.seg",
                  barcode = c("TCGA-OR-A5LR-01A-11D-A29H-01", "TCGA-OR-A5LJ-10A-01D-A29K-01"))
GDCdownload(query)
z <- GDCprepare(query)
####################### Working with Legacy data ###########################
#                   Data.category: DNA methylation & Protein expression
############################################################################
# Function to get two samples to test the function
legacyPipeline <- function(project, data.category, platform){
    query <- GDCquery(project = project,
                      data.category = data.category,
                      platform = platform,
                      legacy = TRUE)
    cases <- query$results[[1]]$cases[1:2]
    query <- GDCquery(project = project,
                      data.category = data.category,
                      platform = platform,
                      legacy = TRUE,
                      barcode = cases)
    GDCdownload(query)
    data <- GDCprepare(query)
    return(data)
}
# DNA methylation
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina Human Methylation 27")
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina Human Methylation 450")
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina DNA Methylation OMA003 CPI")
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina DNA Methylation OMA002 CPI")
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina DNA Methylation OMA002 CPI")

# Protein expression
data <- legacyPipeline("TCGA-GBM","Protein expression","MDA_RPPA_Core")

五、如何使用TCGAbiolinks查找下载来可视化mutation data

1. 查找和下载
1.1 TCGAbiolinks 提供两种方式从GDC下载mutation data：
1）Use GDCquery_Maf which will download MAF (mutation annotation files) aligned against hg38；
2）Use GDCquery, GDCdownload and GDCpreprare to downoad MAF aligned against hg19；

1.2 Mutation data (hg38)

maf <- GDCquery_Maf("CHOL", pipelines = "muse")

maf <- chol_maf@data

# Only first 50 to make render faster
datatable(maf[1:20,],
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
          rownames = FALSE)

Pipelines options：muse, varscan2, somaticsniper, mutect.

1.3 Mutation data (hg38)

query.maf.hg19 <- GDCquery(project = "TCGA-CHOL", 
                           data.category = "Simple nucleotide variation", 
                           data.type = "Simple somatic mutation",
                           access = "open", 
                           legacy = TRUE)
                           
# Check maf availables
datatable(dplyr::select(getResults(query.maf.hg19),-contains("cases")),
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 10), 
          rownames = FALSE)
          
query.maf.hg19 <- GDCquery(project = "TCGA-CHOL", 
                           data.category = "Simple nucleotide variation", 
                           data.type = "Simple somatic mutation",
                           access = "open", 
                           file.type = "bcgsc.ca_CHOL.IlluminaHiSeq_DNASeq.1.somatic.maf",
                           legacy = TRUE)
GDCdownload(query.maf.hg19)
maf <- GDCprepare(query.maf.hg19)

data <- bcgsc.ca_CHOL.IlluminaHiSeq_DNASeq.1.somatic.maf

# Only first 50 to make render faster
datatable(maf[1:20,],
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
          rownames = FALSE)

1.4 可视化数据
使用Bioconductor包maftools；

library(maftools)
library(dplyr)
maf <- GDCquery_Maf("CHOL", pipelines = "muse") %>% read.maf

library(maftools)
library(dplyr)
maf <- chol_maf

datatable(getSampleSummary(maf),
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
          rownames = FALSE)
plotmafSummary(maf = maf, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE)

oncoplot(maf = maf, top = 10, removeNonMutated = TRUE)
titv = titv(maf = maf, plot = FALSE, useSyn = TRUE)
#plot titv summary
plotTiTv(res = titv)

Demo Code

R Code

#-----------------------------------------------------------------------------
#                  安装TCGAbiolinks
#-----------------------------------------------------------------------------
#安装稳定版biolinks
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(version = "3.10")
BiocManager::install("TCGAbiolinks")

#安装开发版biolinks
#devtools::install_github('BioinformaticsFMRP/TCGAbiolinks')

#-----------------------------------------------------------------------------
#                   手动安装示例区域（自己看着办）
#-----------------------------------------------------------------------------
#常规联网查找安装，
install.packages("example_name")
#通过bioconductor资源安装
BiocManager::install("example_name")
#to search through available packages programmatically
BiocManager::available("example_name")
#to flag packages that are either out-of-date or too new for your version of Bioconductor. 
BiocManager::valid()
#-----------------------------------------------------------------------------
#                   加载库
#-----------------------------------------------------------------------------
#biolinks使用
library(SummarizedExperiment)
library(TCGAbiolinks)

#分析绘图使用
BiocManager::install("edgeR")
install.packages("gplots")
BiocManager::install("ggplot2")
library(edgeR)
library(gplots)
library(ggplot2)
require(gplot2)
#-----------------------------------------------------------------------------
#                   创建工作目录
#-----------------------------------------------------------------------------
#创建新目录
dir.create("c:/TCGA");
dir.create("c:/TCGA/CESC");
#目录
work_dir <- "c:/TCGA/CESC"
#将当前目录指向工作目录,并返回显示
setwd(work_dir)
getwd()

#-----------------------------------------------------------------------------
#                   创建全局变量（根据实际填写，非全部必须）
#-----------------------------------------------------------------------------
##### GDCquery #####
project <- "TCGA-CESC"
data_category <- "Transcriptome Profiling"
data_type <- "Gene Expression Quantification"
workflow_type <- "HTSeq - Counts"
# bar <- c()
legacy <- FALSE

##### GDCdownload #####
# TokenFile <- 
method <- 'api'
download_dir <- paste0(work_dir,"/GDC/",gsub("-","_",project))
FilesChunk <- 6
  
##### GDCprepare #####
save <- TRUE
SaveName <- paste0(download_dir,"_","RNAsqe_HTSeq_Counts",".rda")
# summarizedExperiment <- FASLSE
  
#-----------------------------------------------------------------------------
#                   查询(根据实际情况调整参数)
#-----------------------------------------------------------------------------
query <- GDCquery(project = project,
                  data.category = data_category,
                  data.type = data_type,
                  workflow.type = workflow_type,
                  legacy = legacy 
                  )

#-----------------------------------------------------------------------------
#                   下载(根据实际情况调整参数)
#-----------------------------------------------------------------------------
GDCdownload(query = query,
            directory = download_dir,
            files.per.chunk = FilesChunk,
            method = method
            )

#-----------------------------------------------------------------------------
#                   预处理(根据实际情况调整参数)
#-----------------------------------------------------------------------------
data <- GDCprepare(query,
                   directory = download_dir,
                   save = save,
                   save.filename = SaveName
                   )

#-----------------------------------------------------------------------------
#                   提取
#-----------------------------------------------------------------------------
#获取表达矩阵
data_expr <- assay(data)

#查看表达矩阵的维度
dim(data_expr)

#创建表达矩阵文件路径expr_file
expr_file <- paste0(download_dir,"_","All_HTSeq_Counts",".txt")

#输出到表达矩阵文件中,file参数为""可以输出到console中
write.table(data_expr,file = expr_file,sep = "\t",row.names = T,quote = F)

#提取Gene表达量矩阵
gene_info = read.table("C:/Jagger/fannie/test/gene_info.txt",header = 1)
cat("Total Gene annotation:",dim(gene_info)[1])
gene_selected <- row.names(data_expr)%in%gene_info$Gene_id
gene_expr <- data_expr[gene_selected,]
cat("Total Gene Selected:",dim(gene_expr)[1])
gene_expr_file <- paste0(download_dir,"_","Gene_HTSeq_Counts",".txt")
write.table(gene_expr,file = gene_expr_file,sep="\t",row.names = T,quote = F)

#-----------------------------------------------------------------------------
#                   分析
#-----------------------------------------------------------------------------

##### Terminal模式中运行时使用此交互 #####
args <- commandArgs(TRUE)
working_dir <- args[1]
rawdata_file <- args[2]
sample_info_file <- args[3]

##### 编辑器中使用 #####
#工作目录
working_dir <- "C:/TCGA/Analysis"
#基因表达矩阵文件路径
rawdata_file <- "C:/TCGA/CESC/GDC/TCGA_CESC_Gene_HTSeq_Counts.txt"
#
#sample_info_file <- 
  
##### 设置工作目录并返回 #####
setwd(working_dir)
getwd()
  
#设置FDR和 fold change 阈值
fdr = 0.01
fold_change = 2

#读取基因表达矩阵
rawdata=read.table(rawdata_file,header = TRUE,stringsAsFactors = FALSE,row.names = 1)

#原始数据的基因数和样本量
dim(rawdata)

#针对基因的表达量进行过滤，过滤标准设置为：至少有25%的样本，基因的表达量大于2
quant <- apply(rawdata,1,quantile,0.75)
keep <- which((quant >= 2) == 1)
rawdata <- rawdata[keep,]
dim(rawdata)

#样分组信息，可以通过文件传入，也可以采用下面的简单方法，基于barcode判断是癌症还是正常组织
#sample_info =read.table(sample_info_file,header=TRUE,stringAsFactors=FALSE,row.names=1)

#基于样本的编号，根据第14位判断是癌症还是正常样本
#0为癌症样本，1为正常样本

#Barcode 实例： TCGA-W5-AA34-11A-11R-A41I-07

group <- factor(substr(colnames(rawdata),14,14))
summary(group)

#获得基因名称列表
genes=rownames(rawdata)

#制作DEGlist对象
y <- DGEList(counts=rawdata,genes=genes,group=group)
#计算有多少行
nrow(y)

#TMM标准化（基于count数据的标准化）
y <- calcNormFactors(y)

#分布估计（这步较慢）
y <- estimateCommonDisp(y,verbose=TRUE)
y <- estimateTagwiseDisp(y)

#差异分析检验（T检验）
et <- exactTest(y)

#针对FDR，fold change 对基因的显著性表达进行‘多重检验’
#对p value 进行BH检验
et$table$FDR <- p.adjust(et$table$PValue,method = "BH")
#summary显示共上调多少基因，下调多少基因（数据较大则重新调整FDR）
summary(de <- decideTestsDGE(et,adjust.method = "BH",p.value = fdr, lfc = log2(fold_change)))

#标识基因的上下调情况
et$table$regulate = as.factor(ifelse(et$table$FDR < fdr & abs(et$table$logFC) >=log2(fold_change),ifelse(et$table$logFC>log2(fold_change),'Up','Down'),'Normal'))

summary(et$table$regulate)

#绘制前调整右下角大小，过小会报错
plotMD(et)
print('et:')
et

#筛选出差异表达的基因
diff_expr_out <- et$table[et$table$regulate !='Normal',]

#输出结果

#保存结果
write.table(diff_expr_out,file = "DE_genes.txt",quote = FALSE, row.names = T,sep = "\t")

#绘制火山图
#设置FDR，Fold_Change 的辅助线
fdr_line = -log10(fdr)
fc_line = log2(fold_change)

#采用ggplot进行绘图
gp1 <- ggplot(et$table,aes(x=logFC,y=-log10(FDR),colour=regulate))+xlab("log2Fold Change")+ylab("-log10 FDR")+ylim(0,30)
#基于上下调关系，制定不同的颜色
gp2 <- gp1 + geom_point() +scale_color_manual(values = c("green","black","red"))
#增加两条辅助线
gp3 <- gp2 + geom_hline(yintercept = fdr_line,linetype=4)+geom_vline(xintercept = c(-fc_line,fc_line),linetype=4)

#run gp3可直接得到图
gp3

#保存图片
ggsave("DE_gene_Vocano.pdf",width = 16,height = 9)

##################
#绘制聚类热图
normData=y$pseudo.counts
heatmapData <- normData[rownames(diff_expr_out),]

#有些表达量为0，无法进行log转换，增加一个小背景：0.001
heatmapExp=log2(heatmapData+0.001)
heatmapMat=as.matrix(heatmapExp)

#输出的热图的储存途径
pdf(file = "DE_gene_heatmap.pdf",width = 60,height = 90)
par(oma=c(10,3,3,7))
heatmap.2(heatmapMat,col='greenred',trace='none')
dev.off()

#样品相关性绘图
pdf(file="gene_cor_cluster2.pdf",width = 60,height = 90)
par(oma=c(10,3,3,7))

#基于pearson相关性对样品进行聚类
heatmap(cor(normData))
dev.off()

#-----------------------------------------------------------------------------
#                   PPI
#-----------------------------------------------------------------------------
# 安装STRINGdb 软件包
BiocManager::install("STRINGdb")
BiocManager::install("rlist")
############################################################
library(STRINGdb)
library("rlist")
# 设置程序参数
work_dir <- "C:/TCGA/CESC/Analysis" 
deg_file <- "C:/TCGA/CESC/Analysis/DE_genes.txt"
setwd(work_dir)


# 获取物种的分类编号
# get_STRING_species(version="10", species_name=NULL) 
# NCBI数据库中9606 代表人类，700为可信度分值
#version已更新到11，但只有10不报错
string_db <- STRINGdb$new(version="10", species=9606,score_threshold=700, input_directory= work_dir)

# 读取差异表达的文件，获得差异表达基因列表
degs = read.table(deg_file,header=T,comment.char = "",check.names=F)
degs$gene <- rownames(degs)
head(degs)

# 查看有多少差异表达的基因需要分析 
cat("Total deg genes:", dim(degs)[1])

# 将基因的ID map 到string 数据库中， 不一定每个基因都能map上，很慢很慢
deg_mapped <- string_db$map( degs, "gene", removeUnmappedRows = TRUE )

# 查看有多少ID map 上了 
cat("Total String id mapped :", dim(deg_mapped)[1])

#-----------------------------------------------------------------------------
#                   GO KEGG
#-----------------------------------------------------------------------------
BiocManager::install("org.Hs.eg.db")
BiocManager::install("clusterProfiler")
BiocManager::install("pathview")
BiocManager::install("topGO")
install.packages("SparseM")
BiocManager::install("enrichMap")

library(org.Hs.eg.db)
library(clusterProfiler)
library(pathview)
require(pathview)
library(topGO)

#args <- commandArgs(TRUE)
#work_dir <-  args[1]
#deg_file <- args[2]


work_dir <- "C:/TCGA/Annotation"
deg_file <- "C:/TCGA/Analysis/DE_genes.txt"
setwd(work_dir)

# 读取差异表达的文件，获得差异表达基因列表
degs = read.table(deg_file,header=T,comment.char = "",check.names=F)
DEG_list <- rownames(degs)
head(DEG_list)
#GO 富集
# ont:  MF,BP,CC
ego <- enrichGO(gene          = DEG_list,
                keyType       = "ENSEMBL",
                OrgDb         = org.Hs.eg.db,
                ont           = "BP",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.05,
                qvalueCutoff  = 0.05,
                readable      = TRUE)

write.table(ego, file = "GO_enrichment_stat.txt",sep="\t", row.names =F, quote = F)


ego_results<-summary(ego)
ego_results
# 绘制相关图
pdf(file = "ego_barplot.pdf")
barplot(ego, showCategory=20, x = "GeneRatio")
dev.off()

# 点图
pdf(file = "ego_dotplot.pdf")
dotplot(ego,showCategory=20)
dev.off()

pdf(file = "enrich_map.pdf")
enrichMap(ego)
dev.off()

# 绘制topGO图
pdf(file = "topgo.pdf")
plotGOgraph(ego)
dev.off()


# KEGG pathway 富集
gene_ids <- bitr(DEG_list, fromType="ENSEMBL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
kegg <- enrichKEGG(gene         = gene_ids$ENTREZID,
                   organism     = 'hsa',
                   pvalueCutoff = 0.05)
head(kegg)
write.table(kegg, file = "KEGG_enrichment_stat.txt",sep="\t", row.names =F, quote = F)

# 点图
png(file = "kegg_dotplot.png")
dotplot(kegg,title="Enrichment KEGG_dot")
dev.off()

# 合并表达信息和基因信息
deg_info <- degs['logFC']
deg_info['ENSEMBL'] <- rownames(deg_info)
gene_ids_merge <- merge(gene_ids, deg_info,by='ENSEMBL')
# 去掉ENTREZID 重复的基因
index<-duplicated(gene_ids_merge$ENTREZID)
gene_ids_merge<-gene_ids_merge[!index,]

# 提取FC的值，在map上进行颜色标注
map_ids <- as.matrix(gene_ids_merge['logFC'])
rownames(map_ids) <- gene_ids_merge$ENTREZID

                
#  绘制富集的pathway 以前三个为例
pathway_id <- "hsa04110"
map <- pathview(gene.data  = map_ids[,1],
                pathway.id = pathway_id,
                species    = "hsa", kegg.native = TRUE)
pathway_id <- "hsa03030"
map <- pathview(gene.data  = map_ids[,1],
                pathway.id = pathway_id,
                species    = "hsa", kegg.native = TRUE)
pathway_id <- "hsa04022"
map <- pathview(gene.data  = map_ids[,1],
                pathway.id = pathway_id,
                species    = "hsa", kegg.native = TRUE)
pathway_id <- "hsa04115"
map <- pathview(gene.data  = map_ids[,1],
                pathway.id = pathway_id,
                species    = "hsa", kegg.native = TRUE)
pathway_id <- "hsa04151"
map <- pathview(gene.data  = map_ids[,1],
                pathway.id = pathway_id,
                species    = "hsa", kegg.native = TRUE)
pathway_id <- "hsa04010"
map <- pathview(gene.data  = map_ids[,1],
                pathway.id = pathway_id,
                species    = "hsa", kegg.native = TRUE)

###第二种：MF
ego <- enrichGO(gene          = DEG_list,
                keyType       = "ENSEMBL",
                OrgDb         = org.Hs.eg.db,
                ont           = "MF",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.05,
                qvalueCutoff  = 0.05,
                readable      = TRUE)

write.table(ego, file = "GO_enrichment_stat.txt",sep="\t", row.names =F, quote = F)
ego_results<-summary(ego)
ego_results
# 绘制相关图
pdf(file = "ego_barplot_MF.pdf")
barplot(ego, showCategory=20, x = "GeneRatio")
dev.off()
# 点图
pdf(file = "ego_dotplot_MF.pdf")
dotplot(ego,showCategory=20)
dev.off()
#效果不好
#pdf(file = "enrich_map.pdf")
#enrichMap(ego)
#dev.off()
# 绘制topGO图
pdf(file = "topgo_MF.pdf")
plotGOgraph(ego)
dev.off()

###第三种：CC
ego <- enrichGO(gene          = DEG_list,
                keyType       = "ENSEMBL",
                OrgDb         = org.Hs.eg.db,
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.05,
                qvalueCutoff  = 0.05,
                readable      = TRUE)

write.table(ego, file = "GO_enrichment_stat.txt",sep="\t", row.names =F, quote = F)
ego_results<-summary(ego)
ego_results
# 绘制相关图
pdf(file = "ego_barplot_CC.pdf")
barplot(ego, showCategory=20, x = "GeneRatio")
dev.off()
# 点图
pdf(file = "ego_dotplot_CC.pdf")
dotplot(ego,showCategory=20)
dev.off()
#效果不好
#pdf(file = "enrich_map.pdf")
#enrichMap(ego)
#dev.off()
# 绘制topGO图
pdf(file = "topgo_CC.pdf")
plotGOgraph(ego)
dev.off()

Perl Code

#!/usr/bin/perl -w
use strict;

## Perl Code
my $biotype_file = "biotype_classs.txt";
my $gtf = "gencode.v33.annotation.gtf";
my $biotype = "gene";


my %biotype_list;
open my $fh1, $biotype_file or die;
while (<$fh1>) {
    chomp;
    my @array = split /\t/, $_;
#    print "$array[2], $array[0]\n";
    if($array[2]eq $biotype){
        $biotype_list{$array[0]} = 1;
    }
}
close $fh1;


open my $out, ">${biotype}_info.txt" or die;
print $out "Gene_id\tGene_id_info\tgene_name\tbiotype\n";
open my $fh2, $gtf or die;
while (<$fh2>) {
    chomp;
    next if /^#/;
    my @array = split /\t/, $_;
    next unless ($array[2] eq "gene");
    $array[8] =~ /gene_id\s+"(\S+?)";.*gene_type\s+"(\S+?)";.*gene_name\s+"(\S+?)";/;
    my $geneid = $1;
    my $genebiotype = $2;
    my $genename = $3;
    my $gene_id_norm=(split("\\.",$geneid))[0];
#    print "$genebiotype\n";
    if ($biotype_list{$genebiotype}) {
        print $out "$gene_id_norm\t$geneid\t$genename\t$genebiotype\n";
    }
}
close $fh2;

my $biotype_file = "biotype_classs.txt";
my $gtf = "gencode.v33.annotation.gtf";
my $biotype = "test";

print ($biotype_file);

my %biotype_list;
open (my $fh1, "<$biotype_file") or die;
while (<$fh1>) {
    chomp;
    my @array = split /\t/, $_;
#    print "$array[2], $array[0]\n";
    if($array[2]eq $biotype){
        $biotype_list{$array[0]} = 1;
    }
}
close $fh1;


open my $out, ">${biotype}_info.txt" or die;
print $out "Gene_id\tGene_id_info\tgene_name\tbiotype\n";
open my $fh2, $gtf or die;
while (<$fh2>) {
    chomp;
    next if /^#/;
    my @array = split /\t/, $_;
    next unless ($array[2] eq "gene");
    $array[8] =~ /gene_id\s+"(\S+?)";.*gene_type\s+"(\S+?)";.*gene_name\s+"(\S+?)";/;
    my $geneid = $1;
    my $genebiotype = $2;
    my $genename = $3;
    my $gene_id_norm=(split("\\.",$geneid))[0];
#    print "$genebiotype\n";
    if ($biotype_list{$genebiotype}) {
        print $out "$gene_id_norm\t$geneid\t$genename\t$genebiotype\n";
    }
}
close $fh2;