Variation calling and annotation

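By resequencing 302 wild and cultivated soybean accessions, this study identified genes associated with soybean domestication and improvement. Several bioinformatics tools, including SAMtools, Picard, and GATK, were used for variant calling, annotation, and related analyses.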

Resequencing 302 wild and cultivated accessions identifies genes related to domestication and improvement in soybean

This post is excerpted from "Resequencing 302 wild and cultivated accessions identifies genes related to domestication and improvement in soybean".

Variation calling and annotation.

Mapping.

SAMtools (Version: 0.1.18) software was used to convert mapping results into the BAM format and to filter the unmapped and non-unique reads.

Duplicated reads were filtered with the Picard package (picard.sourceforge.net, Version: 1.87).

The BEDtools (Version: 2.17.0) coverageBed program was used to compute the coverage of sequence alignments. (A sequence was defined as absent if coverage was lower than 90% and present if coverage was greater than 90%.)
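The presence/absence rule above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_presence` and its inputs (a covered-base count and a sequence length, such as those a coverage tool reports) are hypothetical. The text does not say how a sequence with exactly 90% coverage is treated; the sketch assumes it counts as absent.

```python
def classify_presence(covered_bases: int, length: int, threshold: float = 0.90) -> str:
    """Label a sequence 'present' or 'absent' from its covered fraction.

    Assumption: coverage exactly equal to the threshold is treated as
    'absent', because the source only defines the cases below and above 90%.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    fraction = covered_bases / length
    return "present" if fraction > threshold else "absent"


if __name__ == "__main__":
    # Hypothetical sequences: (covered bases, total length)
    for name, covered, length in [("seqA", 950, 1000), ("seqB", 400, 1000)]:
        print(name, classify_presence(covered, length))
```

Keeping the threshold as a parameter makes it easy to check how sensitive the presence/absence calls are to the 90% cut-off.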

SNP calling.

SNP detection was performed using the Genome Analysis Toolkit (GATK, version 2.4-7-g5e89f01) and SAMtools. Only the SNPs detected by both methods were analyzed further.
The detailed processes were as follows:
(1) After BWA alignment, the reads around indels were realigned.
Realignment was performed with GATK in two steps.
The first step used the RealignerTargetCreator tool to identify regions where realignment was needed;
the second step used IndelRealigner to realign the regions found in the first step, which produced a realigned BAM file for each accession.
(2) SNPs were called at the population level with GATK and SAMtools. For GATK, SNPs were required to have a confidence score greater than 30 (the parameter -stand_call_conf was set to 30). The same realigned BAM files were used for SNP calling with SAMtools mpileup.
(3) In the filtering step, we kept the sites identified by both GATK and SAMtools, using the SelectVariants tool; SNPs with allele frequencies lower than 1% in the population were discarded.
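The logic of step (3) can be illustrated with a small Python sketch. It is not the authors' pipeline, which used GATK SelectVariants on VCF files; here each call set is assumed to be already loaded as a set of `(chrom, pos, ref, alt)` tuples, and `allele_freq` is a hypothetical mapping from site to population allele frequency.

```python
from typing import Dict, List, Set, Tuple

Site = Tuple[str, int, str, str]  # (chromosome, position, ref allele, alt allele)


def consensus_snps(
    gatk_sites: Set[Site],
    samtools_sites: Set[Site],
    allele_freq: Dict[Site, float],
    min_af: float = 0.01,
) -> List[Site]:
    """Keep SNPs called by both tools whose allele frequency is at least min_af.

    Assumption: a site missing from allele_freq is treated as frequency 0
    and therefore discarded.
    """
    common = gatk_sites & samtools_sites  # sites detected by both callers
    return sorted(s for s in common if allele_freq.get(s, 0.0) >= min_af)


if __name__ == "__main__":
    # Hypothetical toy call sets
    gatk = {("Chr01", 100, "A", "G"), ("Chr01", 250, "C", "T"), ("Chr02", 75, "G", "A")}
    samtools = {("Chr01", 100, "A", "G"), ("Chr02", 75, "G", "A"), ("Chr03", 10, "T", "C")}
    freqs = {("Chr01", 100, "A", "G"): 0.12, ("Chr02", 75, "G", "A"): 0.004}
    print(consensus_snps(gatk, samtools, freqs))
```

Requiring agreement between two callers trades some sensitivity for a lower false-positive rate, and the frequency cut-off removes very rare alleles, which are the ones most easily confused with sequencing errors.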

Indel calling.

Indel calling was similar to SNP calling but with the UnifiedGenotyper parameter -glm INDEL for the indel report only. Only insertions and deletions shorter than or equal to 6 bp were taken into account.
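As a rough illustration of the length cut-off, the sketch below measures an indel as the difference in length between the reference and alternate alleles, assuming VCF-style alleles that share a leading anchor base. The helper names `indel_length` and `keep_indel` are hypothetical.

```python
def indel_length(ref: str, alt: str) -> int:
    """Number of inserted or deleted bases, assuming VCF-style ref/alt alleles."""
    return abs(len(ref) - len(alt))


def keep_indel(ref: str, alt: str, max_len: int = 6) -> bool:
    """True for insertions or deletions of 1..max_len bp; False otherwise."""
    n = indel_length(ref, alt)
    return 0 < n <= max_len


if __name__ == "__main__":
    # Hypothetical alleles: a 3 bp insertion and a 7 bp deletion
    print(keep_indel("A", "ATTT"))      # 3 bp insertion
    print(keep_indel("ACGTACGT", "A"))  # 7 bp deletion
```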

Annotation.

SNP annotation was performed against the reference genome annotation using ANNOVAR (Version: 2013-08-23).
Based on the genome annotation, SNPs were categorized in exonic regions (overlapping with a coding exon), splicing sites (within 2 bp of a splicing junction), 5′UTRs and 3′UTRs, intronic regions (overlapping with an intron), upstream and downstream regions (within a 1 kb region upstream or downstream from the transcription start site), and intergenic regions.
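The region categories can be made concrete with a toy classifier. This sketch is not ANNOVAR's implementation: the `Transcript` class and `classify_position` function are hypothetical, only a single forward-strand transcript is modelled, the precedence among categories is an assumption, and the downstream window is assumed to be measured from the transcript end.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transcript:
    """Toy forward-strand transcript; coordinates are 1-based and inclusive."""
    exons: List[Tuple[int, int]]  # exon intervals, sorted by start
    cds_start: int                # first coding base
    cds_end: int                  # last coding base

    @property
    def tx_start(self) -> int:
        return self.exons[0][0]

    @property
    def tx_end(self) -> int:
        return self.exons[-1][1]


def classify_position(pos: int, tx: Transcript, splice_window: int = 2, flank: int = 1000) -> str:
    """Assign one region category to a genomic position."""
    if pos < tx.tx_start:
        return "upstream" if tx.tx_start - pos <= flank else "intergenic"
    if pos > tx.tx_end:
        return "downstream" if pos - tx.tx_end <= flank else "intergenic"
    for start, end in tx.exons:
        if start <= pos <= end:
            if pos < tx.cds_start:
                return "UTR5"
            if pos > tx.cds_end:
                return "UTR3"
            return "exonic"
    # Inside the transcript but outside every exon: the position is intronic.
    for start, end in tx.exons:
        if 0 < start - pos <= splice_window or 0 < pos - end <= splice_window:
            return "splicing"
    return "intronic"


if __name__ == "__main__":
    tx = Transcript(exons=[(1001, 1200), (1501, 1800)], cds_start=1101, cds_end=1700)
    for p in (500, 1050, 1150, 1201, 1300, 1750, 2000, 3000):
        print(p, classify_position(p, tx))
```

A real annotator also has to handle reverse-strand genes, overlapping transcripts, and non-coding RNAs, which this sketch leaves out.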

SNPs in coding exons were further grouped into synonymous SNPs (did not cause amino acid changes) or nonsynonymous SNPs (caused amino acid changes; mutations causing stop gain and stop loss were also classified into this group).
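The grouping of coding SNPs can be sketched by translating the reference and mutated codons with the standard genetic code and comparing the results. The function name `classify_coding_snp` and its arguments are hypothetical; the codon is assumed to be given on the coding strand. Stop-gain and stop-loss changes fall into the nonsynonymous group because the translated symbol changes.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_coding_snp(codon: str, offset: int, alt_base: str) -> str:
    """Classify a single-base change at `offset` (0, 1 or 2) within `codon`."""
    if len(codon) != 3 or offset not in (0, 1, 2):
        raise ValueError("expected a 3-base codon and an offset of 0, 1 or 2")
    mutated = codon[:offset] + alt_base + codon[offset + 1:]
    ref_aa = CODON_TABLE[codon.upper()]
    alt_aa = CODON_TABLE[mutated.upper()]
    return "synonymous" if ref_aa == alt_aa else "nonsynonymous"


if __name__ == "__main__":
    print(classify_coding_snp("CTT", 2, "C"))  # Leu -> Leu
    print(classify_coding_snp("ATG", 2, "A"))  # Met -> Ile
    print(classify_coding_snp("TGG", 2, "A"))  # Trp -> stop (stop gain)
```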

Indels in the exonic regions were classified by whether they caused frame-shift mutations (insertions or deletions whose length is not a multiple of 3 bp).
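A minimal sketch of this classification, assuming the standard definition that an exonic indel shifts the reading frame when the number of inserted or deleted bases is not a multiple of 3. The function name `classify_exonic_indel` is hypothetical, and alleles are assumed to be VCF-style strings.

```python
def classify_exonic_indel(ref: str, alt: str) -> str:
    """Return 'frameshift' or 'non-frameshift' for an exonic indel."""
    n = abs(len(ref) - len(alt))  # number of inserted or deleted bases
    if n == 0:
        raise ValueError("ref and alt have equal length; this is not an indel")
    return "frameshift" if n % 3 != 0 else "non-frameshift"


if __name__ == "__main__":
    print(classify_exonic_indel("A", "AT"))    # 1 bp insertion
    print(classify_exonic_indel("ACGT", "A"))  # 3 bp deletion
```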

### Annotation and filtering methods for whole-exome sequencing (WES)

#### **1. Annotation tools and methods**

WES annotation adds functional, clinical, and population-database information to the detected variants (such as SNVs and indels) to help identify pathogenic variants. Commonly used tools include:

- **ANNOVAR**
  Integrates multiple databases, including gene annotation (e.g. RefSeq, Ensembl), functional prediction (e.g. SIFT, PolyPhen), population frequency (e.g. gnomAD), and clinical databases (e.g. ClinVar) [^2].
  Example command:
  ```bash
  # input.avinput: variants in ANNOVAR input format (e.g. converted from VCF with convert2annovar.pl)
  annotate_variation.pl -geneanno -buildver hg38 input.avinput humandb/
  ```
- **Variant Effect Predictor (VEP)**
  Based on the Ensembl database; predicts the effect of variants on genes (e.g. missense, nonsense) and provides conservation scores (PhyloP) and pathway annotation. Supports online and offline modes [^2].
- **SnpEff**
  Quickly annotates the functional impact of variants on genes (e.g. exons, intron boundaries) and integrates databases such as ClinVar and COSMIC.
  Example command:
  ```bash
  java -jar snpEff.jar -v GRCh38.99 input.vcf > annotated.vcf
  ```

---

#### **2. Filtering methods and tools**

The goal of filtering is to select disease-related candidate sites from a large number of variants. Key steps include:

- **Quality filtering**
  Remove low-quality variants (e.g. sequencing depth <20×, genotype quality GQ <20).
  Tools: `BCFtools`, `GATK VariantFiltration` [^3]
- **Population-frequency filtering**
  Exclude common variants (e.g. MAF >1% in gnomAD) and keep rare variants.
- **Functional-impact filtering**
  Keep high-impact variants (e.g. missense, frameshift), using the predicted labels from `SnpEff` or `VEP`.
- **Clinical database screening**
  Combine annotations from ClinVar (pathogenicity ratings), OMIM (disease associations), and HGMD (disease-causing mutations) [^2].
- **Custom filtering scripts**
  Use Python/Pandas or R to apply several conditions at once to the annotated variant table (example snippet, assuming the variants were exported to CSV):
  ```python
  import pandas as pd

  df = pd.read_csv('variants.csv')
  filtered = df[(df['gnomAD_MAF'] < 0.01) & (df['ClinVar_Significance'] == 'Pathogenic')]
  ```

---

#### **3. Pipeline integration tools**

- **GEMINI**
  Combines annotation, filtering, and pedigree analysis, and supports SQL queries on the variant table.
  ```bash
  gemini query --header -q "SELECT * FROM variants WHERE impact_severity='HIGH'" db.sqlite
  ```
- **VarSifter**
  A graphical tool that supports dynamic, interactive filtering (e.g. adjusting the MAF threshold with a slider).

---

#### **4. Database resources**

- **gnomAD**: population frequency reference
- **dbNSFP**: integrated functional prediction scores (e.g. CADD, MetaLR)
- **COSMIC**: database of somatic mutations in cancer
- **HGMD**: Human Gene Mutation Database (subscription required)