Notes on ATAC-seq data quality assessment
The ENCODE data standards for ATAC-seq.
Uniform Processing Pipeline Restrictions
- The read length prior to any trimming should be a minimum of 45 base pairs.
- Sequencing may be paired- or single-ended, as long as the sequencing type is specified and paired sequences are indicated.
- All Illumina platforms are supported for use in the uniform pipeline, though data from different platforms should be processed separately; colorspace (SOLiD) reads are not supported.
- Barcodes, if present in the fastq, must be indicated.
- Library insert size range must be indicated.
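The first restriction is easy to check before submitting data. The following is a minimal sketch, assuming an uncompressed four-line-per-record FASTQ; the helper names `read_lengths` and `meets_minimum_length` are made up for illustration and are not part of the ENCODE pipeline.

```python
import io

MIN_READ_LENGTH = 45  # minimum untrimmed read length stated in the restriction above


def read_lengths(fastq_handle):
    """Yield the sequence length of each record in a 4-line-per-record FASTQ."""
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 == 1:  # the second line of every record is the sequence
            yield len(line.strip())


def meets_minimum_length(fastq_handle, minimum=MIN_READ_LENGTH):
    """Return True if every read in the file is at least `minimum` bases long."""
    return all(length >= minimum for length in read_lengths(fastq_handle))


# Toy input: one 50 bp read and one 40 bp read (the second one is too short).
toy_fastq = (
    "@r1\n" + "A" * 50 + "\n+\n" + "I" * 50 + "\n"
    "@r2\n" + "C" * 40 + "\n+\n" + "I" * 40 + "\n"
)
print(list(read_lengths(io.StringIO(toy_fastq))))
print(meets_minimum_length(io.StringIO(toy_fastq)))
```

For a real file, pass an open file handle (or a `gzip.open(path, "rt")` handle for compressed input) instead of the `io.StringIO` toy data.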
Current Standards
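- There must be two or more biological replicates (rare samples must still have two technical replicates).
- Each replicate must have 25 million non-redundant, non-mitochondrial, aligned fragments (25 million reads for single-end; 50 million reads = 25 million fragments for paired-end).
- Alignment rate > 95%; > 80% is acceptable.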
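- Replicate concordance is assessed with IDR (Irreproducible Discovery Rate); the rescue and self-consistency ratios should both be < 2.
- The effect of PCR amplification on library complexity is controlled with the Non-Redundant Fraction (NRF) and PCR Bottlenecking Coefficients 1 and 2 (PBC1 and PBC2): NRF > 0.9, PBC1 > 0.9, PBC2 > 3.
- Peak files must meet the following requirements: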
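  - More than 150,000 peaks per replicate; > 100,000 is acceptable (the ATAC-seq peak files provided by ENCODE are not usable).
  - More than 70,000 IDR peaks; > 50,000 is acceptable.
  - A nucleosome-free region (NFR) must be present.
  - A mononucleosome peak must be present. Good ATAC-seq data should contain nucleosome signal, so that both open chromatin and nucleosomes can be examined.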
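- The fraction of reads in called peak regions (FRiP score) should be > 0.3; > 0.2 is acceptable. For rare samples FRiP is not required, but TSS enrichment is still used as a key measure of signal to noise.
- The TSS enrichment score threshold depends on the reference genome.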
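The complexity and FRiP metrics in the list above can be sketched in a few lines. The notes give only the thresholds; the formulas used here are the usual ENCODE definitions as I understand them (NRF = distinct positions / total reads, PBC1 = M1 / M_distinct, PBC2 = M1 / M2, where M1 and M2 are the numbers of positions covered by exactly one and exactly two reads). The function names and the toy data are hypothetical, and a real pipeline would compute these from a filtered BAM file rather than from Python lists.

```python
from collections import Counter


def library_complexity(positions):
    """Compute NRF, PBC1 and PBC2 from one alignment position per mapped read.

    `positions` is an iterable of hashable keys, e.g. (chrom, start, end).
    """
    counts = Counter(positions)
    total = sum(counts.values())                      # all mapped reads
    distinct = len(counts)                            # distinct positions
    m1 = sum(1 for c in counts.values() if c == 1)    # positions seen exactly once
    m2 = sum(1 for c in counts.values() if c == 2)    # positions seen exactly twice
    return {
        "NRF": distinct / total,
        "PBC1": m1 / distinct,
        "PBC2": m1 / m2 if m2 else float("inf"),
    }


def passes_complexity_thresholds(metrics):
    """Apply the thresholds quoted above: NRF > 0.9, PBC1 > 0.9, PBC2 > 3."""
    return metrics["NRF"] > 0.9 and metrics["PBC1"] > 0.9 and metrics["PBC2"] > 3


def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside any peak.

    `read_positions` holds (chrom, pos); `peaks` holds (chrom, start, end),
    treated as half-open intervals. The linear scan is fine for a toy example only.
    """
    in_peaks = sum(
        1
        for chrom, pos in read_positions
        if any(c == chrom and start <= pos < end for c, start, end in peaks)
    )
    return in_peaks / len(read_positions)


# Toy data: 10 reads at 8 distinct positions, two positions seen twice.
toy_positions = ["a", "a", "b", "c", "d", "e", "e", "f", "g", "h"]
toy_metrics = library_complexity(toy_positions)
print(toy_metrics)
print(passes_complexity_thresholds(toy_metrics))

toy_reads = [("chr1", 100), ("chr1", 150), ("chr1", 500), ("chr2", 100)]
toy_peaks = [("chr1", 90, 200)]
print(frip(toy_reads, toy_peaks))
```

In the toy data the library fails the complexity check because NRF is 0.8 and PBC1 is 0.75, both below 0.9, and PBC2 is exactly 3 rather than above it.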
ATAC-seq core analysis workflow
References:
1. Article: https://peerj.com/articles/4040/
2. ChIP-seq course: https://github.com/hbctraining/In-depth-NGS-Data-Analysis-Course/tree/master/sessionV/lessons
3. Harvard FAS Informatics - ATAC-seq Guidelines: https://informatics.fas.harvard.edu/atac-seq-guidelines.html
Harvard FAS Informatics - ATAC-seq Guidelines
Experimental design
- Replicates
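- Controls: usually not included. Their value is limited and they add cost. A control would be a sample sequenced without transposase treatment.
- PCR amplification: use as few PCR cycles as possible when amplifying the sample, to reduce PCR-related interference.
- Sequencing depth: the optimal sequencing depth depends on the size of the reference genome and the expected degree of open chromatin. For human samples, more than 50 million reads per sample are recommended.
- Sequencing mode: paired-end sequencing is recommended for ATAC-seq, for three reasons. (1) Paired-end sequencing helps to reduce alignment ambiguities. (2) We are interested in knowing both ends of the DNA fragments generated by the assay, since the ends indicate where the transposase inserted. (3) PCR duplicates are identified more accurately. PCR duplicates are artifacts of the procedure, and they should be removed as part of the analysis pipeline. Computational programs that remove PCR duplicates typically identify duplicates by comparing the ends of aligned reads. With single-end reads, there is only one position to compare, so any reads whose 5' ends match are considered duplicates. Thus, many false positives may result, and perfectly good reads are removed from further analysis. With paired-end sequencing, both ends of the original DNA fragments are defined. To be declared a duplicate, both ends of one fragment need to match both ends of another fragment, which is far less likely to occur by chance. Therefore, paired-end sequencing leads to fewer false positives.
- Mitochondria: ATAC-seq data are well known to contain a large proportion of reads derived from mitochondrial DNA (mitochondrial DNA is naked, so Tn5 can also recognize and cut it; the same applies to chloroplasts in plants). The mitochondrial genome contains no peaks of interest for ATAC-seq, so these reads are discarded during analysis and the sequencing spent on them is wasted. Mitochondria can be removed from the samples with a detergent before sequencing.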
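The duplicate-detection argument under "Sequencing mode" can be made concrete with a toy example. This is only a sketch of the idea of comparing alignment ends; it is not how any particular duplicate-removal tool is implemented, and the function name and coordinates are made up for illustration.

```python
def flag_duplicates(keys):
    """Return the indices of items whose key has already been seen.

    The first occurrence of each key is kept; later occurrences are flagged.
    """
    seen = set()
    flagged = []
    for index, key in enumerate(keys):
        if key in seen:
            flagged.append(index)
        else:
            seen.add(key)
    return flagged


# Three fragments on chr1 as (chrom, start, end).
# Fragment 2 is a true PCR duplicate of fragment 0.
# Fragment 1 merely shares its 5' end with them and is a distinct molecule.
fragments = [
    ("chr1", 100, 250),
    ("chr1", 100, 400),
    ("chr1", 100, 250),
]

# Single-end view: only the 5' end of the read is known.
single_end_keys = [(chrom, start) for chrom, start, _end in fragments]
# Paired-end view: both ends of the fragment are known.
paired_end_keys = fragments

print(flag_duplicates(single_end_keys))  # fragment 1 is a false positive here
print(flag_duplicates(paired_end_keys))  # only the true duplicate is flagged
```

With only the 5' end available, the distinct fragment at index 1 is thrown away along with the true duplicate; with both ends, only index 2 is flagged.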
Quality control
FastQC
FastQC processes a file of 20 million reads in about 5 minutes using less than 250 MB of memory. Its report covers quality scores, GC levels, PCR duplicates, and adapter content.
Adapter removal
For reads derived from short DNA fragments, the 3' ends may contain portions of the Illumina sequencing adapter.
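To illustrate why short fragments pick up adapter sequence, here is a simplified sketch of 3' adapter trimming with exact matching. It is not the algorithm used by Cutadapt or NGmerge (real tools tolerate mismatches and use base qualities). The function name and the `min_overlap` parameter are hypothetical, and the adapter string is the commonly cited Nextera transposase sequence, which should be treated as an assumption and checked against your library kit.

```python
# Assumed adapter: the commonly cited Nextera transposase sequence.
NEXTERA_ADAPTER = "CTGTCTCTTATACACATCT"


def trim_3prime_adapter(read, adapter=NEXTERA_ADAPTER, min_overlap=3):
    """Remove an exact adapter match from the 3' end of a read.

    First look for the full adapter anywhere in the read; otherwise look for
    the longest adapter prefix (at least `min_overlap` bases) at the read's end.
    """
    full_match = read.find(adapter)
    if full_match != -1:
        return read[:full_match]
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


# A 10 bp fragment sequenced with 20 cycles: the last 10 bases are adapter.
fragment = "ACGTACGTAC"
read_with_adapter = fragment + NEXTERA_ADAPTER[:10]
print(trim_3prime_adapter(read_with_adapter))

# A read from a long fragment contains no adapter and is left unchanged.
clean_read = "ACGTACGTACGGGGGGGGGG"
print(trim_3prime_adapter(clean_read))
```

A small `min_overlap` removes more true adapter remnants but also trims genomic bases that match the adapter start by chance, which is the trade-off behind minimum-match-length parameters in adapter trimmers.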
Cutadapt
NGmerge
Unlike cutadapt, NGmerge does not require that the adapter sequences be provided, nor does it require a parameter for the minimum length of adapter to match (in

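This article describes the quality assessment standards for ATAC-seq data in detail, including the number of biological and technical replicates, non-redundant fragments, control of PCR amplification effects, and peak file requirements. It also covers experimental design, data preprocessing steps, the use of the main analysis tools such as Genrich and MACS2, and downstream visualization, comparison, and annotation.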



