whole-genome-sequencing Data Analysis 学习笔记4 变异位点注释数据库

最新推荐文章于 2024-06-02 09:42:15 发布

Cs_mary

最新推荐文章于 2024-06-02 09:42:15 发布

阅读量3.4k

点赞数

CC 4.0 BY-SA版权

分类专栏： BioInfo 文章标签：数据库

本文链接：https://blog.youkuaiyun.com/Cs_mary/article/details/58585283

BioInfo 专栏收录该内容

30 篇文章

订阅专栏

本文介绍了全基因组测序数据的分析，重点关注变异位点注释数据库的重要性和作用。内容涉及sftp命令的使用、数据库在疾病预测中的价值、VCF文件格式详解，以及如何利用TCGA、1000Genomes、ExAC、dbSNP、ESP6500等数据库下载和解析基因型数据。同时，文章提及了不同数据库的特点和用途，如ExAC数据库提供60,706个个体的序列数据，而dbSNP提供了大量的人类基因变异数据。" 131220528,19211868,Cookie API封装实践：快递API接口示例,"['Java', '开发语言', 'Web开发', 'HTTP']

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

linux下如何使用sftp命令
sftp> get /var/www/fuyatao/index.php /home/fuyatao/
这条语句将从远程主机的 /var/www/fuyatao/目录下将 index.php 下载到本地 /home/fuyatao/目录下。

sftp> put /home/fuyatao/downloads/Linuxgl.pdf /var/www/fuyatao/
这条语句将把本地 /home/fuyatao/downloads/目录下的 linuxgl.pdf文件上传至远程主机/var/www/fuyatao/ 目录下。

卸载软件：sudo apt-get autoremove –purge +软件名

变异数据库：对于找有意义的变异位点、疾病预测等方面有着重要作用

通常个体的全基因组测序数据可以挖掘到四百万个SNVs(跟参考基因组不一样的单碱基位点)
还有五十万的indels(insertions or deletions)
得到的数据通常是以vcf文件格式给出的

vcf格式
CHROM：表示变异位点是在哪个contig 里call出来的，如果是人类全基因组的话那就是chr1…chr22，chrX,Y,M。

POS：变异位点相对于参考基因组所在的位置，如果是indel，就是第一个碱基所在的位置。

ID：如果call出来的SNP存在于dbSNP数据库里，就会显示相应的dbSNP里的rs编号。

REF和REF：在这个变异位点处，参考基因组中所对应的碱基和研究对象基因组中所对应的碱基。

QUAL：可以理解为所call出来的变异位点的质量值。Q=-10lgP，Q表示质量值；P表示这个位点发生错误的概率。因此，如果想把错误率从控制在90%以上，P的阈值就是1/10，那lg（1/10）=-1，Q=（-10）*（-1）=10。同理，当Q=20时，错误率就控制在了0.01。

FILTER：理想情况下，QUAL这个值应该是用所有的错误模型算出来的，这个值就可以代表正确的变异位点了，但是事实是做不到的。因此，还需要对原始变异位点做进一步的过滤。无论你用什么方法对变异位点进行过滤，过滤完了之后，在FILTER一栏都会留下过滤记录，如果是通过了过滤标准，那么这些通过标准的好的变异位点的FILTER一栏就会注释一个PASS，如果没有通过过滤，就会在FILTER这一栏提示除了PASS的其他信息。如果这一栏是一个“.”的话，就说明没有进行过任何过滤。

#CHROM POS     ID        REF    ALT     QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.

chr1 873762
GT: AD: DP: GQ: PL
0/1: 173,141: 282: 99: 255,0,255

其中最后面两列是相对应的，每一个tag对应一个或者一组值，如：

chr1：873762，
GT对应0/1；AD对应173,141；DP对应282；GQ对应99；PL对应255,0,255。

GT：表示这个样本的基因型，对于一个二倍体生物，GT值表示的是这个样本在这个位点所携带的两个等位基因。0表示跟REF一样；1表示表示跟ALT一样；2表示第二个ALT。当只有一个ALT 等位基因的时候，0/0表示纯和且跟REF一致；0/1表示杂合，两个allele一个是ALT一个是REF；1/1表示纯和且都为ALT；

0 - reference call
1 - alternative call 1
2 - alternative call 2

AD：对应两个以逗号隔开的值，这两个值分别表示覆盖到REF和ALT碱基的reads数，相当于支持REF和支持ALT的测序深度。

DP： 覆盖到这个位点的总的reads数量，相当于这个位点的深度（并不是多有的reads数量，而是大概一定质量值要求的reads数）。

PL: 对应3个以逗号隔开的值，这三个值分别表示该位点基因型是0/0，0/1，1/1的没经过先验的标准化Phred-scaled似然值（L）。如果转换成支持该基因型概率（P）的话，由于L=-10lgP，那么P=10^（-L/10），因此，当L值为0时，P=10^0=1。因此，这个值越小，支持概率就越大，也就是说是这个基因型的可能性越大。

GQ：表示最可能的基因型的质量值。表示的意义同QUAL。

举个例子说明一下：

chr1 899282 rs28548431 C T [CLIPPED]

GT: AD: DP: GQ: PL
0/1: 1,3: 4: 25.92: 103,0,26（该位点的基因型0/0，0/1，1/1）

在这个位点，GT=0/1，也就是说这个位点的基因型是C/T；
GQ=25.92，质量值并不算太高，可能是因为cover到这个位点的reads数太少，DP=4，也就是说只有4条reads支持这个地方的变异；
AD=1,3，也就是说支持REF的read有一条，支持ALT的有3条；
在PL里，这个位点基因型的不确定性就表现的更突出了，
0/1的PL值为0，虽然支持0/1的概率很高；
但是1/1的PL值只有26，也就是说还有10^(-2.6)=0.25%的可能性是1/1；但几乎不可能是0/0，因为支持0/0的概率只有10^(-10.3)=5*10-11

TCGA数据库下载脚本解析：
wget https://gdc-docs.nci.nih.gov/Data/Release_Notes/Manifests/GDC_open_MAFs_manifest.txt
for i in cut -f 2 GDC_open_MAFs_manifest.txt
do
echo $i
adress=echo $i |cut -d'.' -f 4

***cut -d “=” -f 2中的cut 是列提取命令，-d是判断文件是否存在的命令
-f表示取第一个字段的值。
如：echo “a/b/c” |cut -d ‘/’ -f 1，执行结果是a。执行过程：先按/分段，分段后结果是：第一个字段是a，第2个字段是b，第3个字段是c，-f就是取第几个字段。*

filename=echo $i |cut -f 2 |cut -d'.' -f 1-3,5-7
echo $adress$ filename
wget -O “$filename” “”
done

for i in /*; do echo $i; find$ i | wc -l; done;
这是一个for循环语句。
首先声明一个变量i ，将/*即根目录下的所有文件的名称全部赋值给刚刚声明的变量i。这一段在编程中叫做循环头。接下来。do 和done之间的部分是循环体。

echo命令的功能是在显示器上显示一段文字，一般起到一个提示的作用。
echo $i的作用是将根目录下所有文件显示出来。

wc -l 的作用是显示文件的行数。
find $i是查找根目录下所有文件。那么这个循环在实际运作时是这样的。首先声明变量i，将/*即根目录下的所有文件的名称全部赋值给刚刚声明的变量i。 echo$ i会将在根目录下的第一个文件名称显示出来，然后执行find $i | wc -l，这样会把根目录下第一个文件的行数显示出来。此时，第一次循环结束了。接着进行第2次循环，此时，echo$ i会将在根目录下的第二个文件名称显示出来，然后执行find $i | wc -l，这样会把根目录下第二个文件的行数显示出来。就这样一次次的循环，直到把根目录下最后一个文件名称和它的行数显示出来。此时，既然已经是最后一个文件了，也就没有下一个了。所以循环的条件已经不能满足了。于是，就跳到done;的后面，for循环结束。

千人基因组计划包括：5个大人种，共25个小人种的基因型数据，将已有基因型文件比较得祖缘分析结果，
下载1000genomes
mkdir -p ~/annotation/variation/human/1000genomes
cd ~/annotation/variation/human/1000genomes
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
nohup wget -c -r -nd -np -k -L -p ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502 &

其他几个国家级的基因组计划，都是针对特定人种的

1.ExAC数据库
mkdir -p ~/annotation/variation/human/ExAC
cd ~/annotation/variation/human/ExAC

http://exac.broadinstitute.org/
ftp://ftp.broadinstitute.org/pub/ExAC_release/current
The Exome Aggregation Consortium (ExAC)外显子数据库 is a coalition of investigators seeking to aggregate and harmonize exome sequencing data from a wide variety of large-scale sequencing projects, and to make summary data available for the wider scientific community.

The data set provided on this website spans 60,706 unrelated individuals sequenced as part of various disease-specific 疾病特异性 and population genetic studies. The ExAC Principal Investigators and groups that have contributed data to the current release are listed here.

/pub/ExAC_release/release0.3.1/ 的索引

名称大小修改日期
[上级目录]
ExAC.r0.3.1.sites.vep.vcf.gz 4.0 GB 16/3/16 上午12:00:00
ExAC.r0.3.1.sites.vep.vcf.gz.tbi 863 kB 16/3/16 上午12:00:00
README.VEP_annotations 161 B 16/3/16 上午12:00:00
README.ftp_structure 4.4 kB 16/3/16 上午12:00:00
README.histogram_annotation 396 B 16/3/16 上午12:00:00
README.known_issues 2.2 kB 16/3/16 上午12:00:00
README.new_annotations 2.7 kB 16/3/16 上午12:00:00
README.population_annotations 5.1 kB 16/3/16 上午12:00:00
README.release0.3.1 715 B 16/3/16 上午12:00:00
cnv/ 16/8/23 上午12:00:00
coverage 0 B 16/3/16 上午12:00:00
freq_filter/ 16/9/2 下午8:48:00
functional_gene_constraint 0 B 16/3/16 上午12:00:00
manuscript_data/ 16/3/30 上午12:00:00
md5sum.txt 533 B 16/3/16 上午12:00:00
resources 0 B 16/3/16 上午12:00:00
subsets/ 16/3/16 上午12:00:00

wget ftp://ftp.broadinstitute.org/pub/ExAC_release/current/ExAC.r0.3.1.sites.vep.vcf.gz.tbi
nohup wget ftp://ftp.broadinstitute.org/pub/ExAC_release/current/ExAC.r0.3.1.sites.vep.vcf.gz &

/pub/ExAC_release/release0.3.1/cnv 的索引
名称大小修改日期
[上级目录]
README.cnv_bed 896 B 16/8/23 上午12:00:00
README.cnv_gene_scores 1.2 kB 16/8/23 上午12:00:00
exac-final-cnv.gene.scores071316 3.2 MB 16/8/23 上午12:00:00
exac-final.autosome-1pct-sq60-qc-prot-coding.cnv.bed 8.0 MB 16/8/23 上午12:00:00
md5sum.txt

wget ftp://ftp.broadinstitute.org/pub/ExAC_release/current/cnv/exac-final-cnv.gene.scores071316
wget ftp://ftp.broadinstitute.org/pub/ExAC_release/current/cnv/exac-final.autosome-1pct-sq60-qc-prot-coding.cnv.bed

2.dbSNP数据库

mkdir -p ~/annotation/variation/human/dbSNP
cd ~/annotation/variation/human/dbSNP

https://www.ncbi.nlm.nih.gov/projects/SNP/
List of 53 organisms with variant annotation 53种生物的的变异注释on their genomes available for web search and FTP download.
NCBI Variation Summary
Description:
Summary of variation data by organism available from dbSNP and dbVar.
Please contact dbSNP or dbVar if you have any problems or to submit variant data for your organism.
Report date: Friday, January 27, 2017
Total Variants:
SubSNP count: 1,682,955,638
RefSNP count: 788,704,808
Variant Call count: 36,835,190
Variant Region count: 6,458,655

ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh38p2/
ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/

nohup wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/All_20160601.vcf.gz &
wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/All_20160601.vcf.gz.tbi
这个暂时下载不了。。

3.ESP6500数据库
mkdir -p ~/annotation/variation/human/ESP6500
cd ~/annotation/variation/human/ESP6500

http://evs.gs.washington.edu/EVS/
The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms 新基因和机制 contributing to heart, lung and blood disorders 心脏，肺，血紊乱 by pioneering the application of next-generation sequencing of the protein coding regions 蛋白质编码区域 of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders.提高治疗

nohup wget http://evs.gs.washington.edu/evs_bulk_data/ESP6500SI-V2-SSA137.GRCh38-liftover.snps_indels.vcf.tar.gz &

The following downloadable files contain SNPs, INDELs and coverage data for the ESP 6500 exomes外显子 (chromosomes 1-22, X, and Y).
The bulk files of the ESP 6500 exome data below are still primarily主要在 in GRCh37 (or HG19), the GRCh38 lifted-over positions are added in an extra column in the text file, or in an extra attribute in the INFO field in the VCF file.
ESP6500SI-V2-SSA137.GRCh38-liftover.snps_indels.txt.tar.gz (variants and annotations in space-delimited text format)
ESP6500SI-V2-SSA137.GRCh38-liftover.snps_indels.vcf.tar.gz (variants and annotations in VCF format - same data as the *.snps_indels.txt.tar.gz listed above)
ESP6500SI-V2.GRCh38-liftover.coverage.all_sites.txt.tar.gz (sequencing coverage data for every site in our targets)
ESP6500SI-V2.coverage.seq_blocks.txt.tar.gz (sequencing coverage summary data for every squencing block in our targets)
agilent_nimblegen_exome_targets_esp_project.tar.gz (the original exome target files used in the ESP project)

4.
mkdir -p ~/annotation/variation/human/UK10K
cd ~/annotation/variation/human/UK10K

http://www.uk10k.org/

nohup wget ftp://ngs.sanger.ac.uk/production/uk10k/UK10K_COHORT/REL-2012-06-02/UK10K_COHORT.20160215.sites.vcf.gz &

mkdir -p ~/annotation/variation/human/gonl
cd ~/annotation/variation/human/gonl

http://www.nlgenome.nl/search/

https://molgenis26.target.rug.nl/downloads/gonl_public/variants/release5/

nohup wget -c -r -nd -np -k -L -p https://molgenis26.target.rug.nl/downloads/gonl_public/variants/release5 &

Singapore Sequencing Malay Project (SSMP)

mkdir -p ~/annotation/variation/human/SSMP
cd ~/annotation/variation/human/SSMP

http://www.statgen.nus.edu.sg/~SSMP/
http://www.statgen.nus.edu.sg/~SSMP/download/vcf/2012_05

不知道有没有下载
Singapore Sequencing Malay

About Our Project 马来人数据库

Whole-genome sequencing across multiple samples in a population provides an unprecedented opportunity to comprehensively characterize the polymorphic variants in the population. While the 1000 Genomes Project (1KGP) has offered brief insights into the value of population-level sequencing, the low coverage inadvertently compromised the ability to confidently detect rare and low-frequency variants.
In addition, the composition of populations in the 1KGP is not complete, despite the extension of the study design to more than 2,500 samples from more than 20 population groups. The Malays are one of the Austronesian groups predominantly present in Southeast Asia and Oceania, and the Singapore Sequencing Malay Project (SSMP) 新加坡测序马来计划aims to perform deep whole-genome sequencing of 100 healthy Malays100个健康人.
Sequencing at an average of 30-fold coverage30倍覆盖率, we illustrate the higher sensitivity at detecting low-frequency and rare variants低频和罕见变异, and the ability to investigate the presence of hotspots of functional mutations功能变异. The deeper coverage allows more functional variants to be identified for each person when compared to the low-pass sequencing in 1KGP. This set of whole-genome sequence data is expected to be the benchmark for evaluating the value of deep population-level sequencing versus low-pass sequencing, especially in populations that are poorly represented in population genetic studies. We also expect the high coverage will enable methodological and technological assessments of current strategies in sequence data analysis. 分析方法上也有进步

Sequencing Platform
Illumina HiSeq 2000
Target coverage of 30-fold
Paired-end sequencing 双端
100 base-pairs (bp) read length
Target insert size of between 300 and 400bp

SSMP Samples
100 subjects (50 males and 50 females) from the Singapore Population Health Study

Singapore Sequencing Indian Project (SSIP)
mkdir -p ~/annotation/variation/human/SSIP
cd ~/annotation/variation/human/SSIP

http://www.statgen.nus.edu.sg/~SSIP/
http://www.statgen.nus.edu.sg/~SSIP/download/vcf/dataFreeze_Feb2013

明早看