Compiled by 探序基因肿瘤研究院
Removing duplicates from a BAM file with sambamba:
sambamba-1.0.1-linux-amd64-static markdup -t 16 xxx.sort.bam xxx.sort.rmdup.bam
It fails with:
sambamba-markdup: Cannot open file `/tmp/sambamba-pid178526-markdup-ekoz/PairedEndsInfodiyl16' in mode `w+' (Too many open files)
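"Too many open files" means the process hit the per-process limit on open file descriptors. A minimal sketch for inspecting that limit and raising it for the current shell session before rerunning; it assumes the hard limit is higher than the soft limit, which may not hold on every system. sambamba markdup also has an --overflow-list-size option whose help text says that increasing it reduces the number of temporary files, so that is another thing to try. Note too that markdup only flags duplicates unless -r (--remove-duplicates) is given.

```shell
# Inspect the per-process limit on open file descriptors.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "open files: soft=$soft hard=$hard"

# Try to raise the soft limit to the hard limit (affects this shell session only).
if ulimit -Sn "$hard" 2>/dev/null; then
  echo "soft limit is now $(ulimit -Sn)"
else
  echo "could not raise the soft limit; the hard limit may need to be raised by an administrator"
fi

# Then rerun in the same shell (file names as in the command above):
# sambamba-1.0.1-linux-amd64-static markdup -t 16 xxx.sort.bam xxx.sort.rmdup.bam
```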
So I tried picard instead.
Installation:
git clone https://github.com/broadinstitute/picard.git
cd picard
./gradlew shadowJar
The version used here was picard-3.1.1, and the build failed with:
* What went wrong:
A problem occurred evaluating root project 'picard'.
> A Java 17 compatible (Java 17 or later) version is required to build GATK, but 11 was found. See https://github.com/broadinstitute/picard/blob/master/README.md#building-picard for information on how to build picard.
Point the environment variables at a Java 17 JDK and build again:
export JAVA_HOME=/datadisk/software/jdk-17.0.12
export PATH=$JAVA_HOME/bin:$PATH
Alternatively, skip the build and download the prebuilt jar:
wget https://github.com/broadinstitute/picard/releases/latest/download/picard.jar
Mind the Java version: if it is too old, picard reports an error. Downloading a Java environment that meets the requirement is enough.
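A small sketch for checking the Java version before running picard. The helper name java_major is hypothetical, and the threshold of 17 is taken from the build error above, on the assumption that the released jar has the same requirement.

```shell
# Hypothetical helper: extract the major version from `java -version` output.
# 'openjdk version "17.0.12" ...' gives 17; 'java version "1.8.0_301"' gives 8.
java_major() {
  ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  case "$ver" in
    1.*) printf '%s\n' "$ver" | cut -d. -f2 ;;   # legacy 1.x numbering
    *)   printf '%s\n' "$ver" | cut -d. -f1 ;;
  esac
}

if command -v java >/dev/null 2>&1; then
  # `java -version` writes to stderr, hence the 2>&1
  major=$(java_major "$(java -version 2>&1)")
  if [ "${major:-0}" -ge 17 ]; then
    echo "Java $major found: OK"
  else
    echo "Java ${major:-unknown} found: too old, set JAVA_HOME to a JDK 17 or later"
  fi
else
  echo "java not found on PATH"
fi
```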
Running java -jar picard.jar MarkDuplicates prints the help text:
Required Arguments:
--INPUT,-I <String> One or more input SAM, BAM or CRAM files to analyze. Must be coordinate sorted. This
argument must be specified at least once. Required.
--METRICS_FILE,-M <File> File to write duplication metrics to Required.
--OUTPUT,-O <File> The output file to write marked records to Required.
Optional arguments include:
--TMP_DIR <File>
--DUPLEX_UMI <Boolean> Treat UMIs as being duplex stranded. This option requires that the UMI consist of two
equal length strings that are separated by a hyphen (e.g. 'ATC-GTC'). Reads are considered
duplicates if, in addition to standard definition, have identical normalized UMIs. A UMI
from the 'bottom' strand is normalized by swapping its content around the hyphen (eg.
ATC-GTC becomes GTC-ATC). A UMI from the 'top' strand is already normalized as it is.
Both reads from a read pair considered top strand if the read 1 unclipped 5' coordinate is
less than the read 2 unclipped 5' coordinate. All chimeric reads and read fragments are
treated as having come from the top strand. With this option is it required that the
BARCODE_TAG hold non-normalized UMIs. Default false. Default value: false. Possible
values: {true, false}
Presumably this option is what gets used in ctDNA analysis?
There does not seem to be a thread-count parameter?
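If MarkDuplicates really has no thread option, one way to use several cores is to run several samples at the same time, one process per sample. A minimal sketch under stated assumptions: the wrapper markdup_one, the variables PICARD_JAR and DRY_RUN, the sample names, and the file naming scheme are all hypothetical. Each JVM needs its own memory, so the number of concurrent jobs should match what the machine can hold.

```shell
# Assumed location of the jar; override via the environment if needed.
PICARD_JAR=${PICARD_JAR:-picard.jar}
# 1 = only print the commands; set to 0 to actually run them.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical wrapper: MarkDuplicates for one sample, using only the
# three required arguments shown in the help text above.
markdup_one() {
  sample=$1
  set -- java -jar "$PICARD_JAR" MarkDuplicates \
    -I "${sample}.sort.bam" \
    -O "${sample}.sort.markdup.bam" \
    -M "${sample}.markdup_metrics.txt"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# One background job per sample (sample names are placeholders),
# then wait until all of them have finished.
for s in sampleA sampleB; do
  markdup_one "$s" &
done
wait
```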
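Example run:
# I: input BAM file; O: output BAM file; M: duplication metrics file
# REMOVE_DUPLICATES=true: remove duplicate reads instead of only flagging them
# MAX_FILE_HANDLES=800: maximum number of open file handles
# VALIDATION_STRINGENCY=LENIENT: how strictly the input data is validated
# stderr is appended to the log file (the log/ directory must already exist); the job runs in the background
picard MarkDuplicates \
I=sample_sorted.bam \
O=sample_rmDup.bam \
M=metrics.txt \
REMOVE_DUPLICATES=true \
MAX_FILE_HANDLES=800 \
VALIDATION_STRINGENCY=LENIENT \
2>>log/picard_MarkDuplicates_log.txt &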
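Once the job has finished, the duplication rate can be read from the metrics file. A sketch assuming the usual picard metrics layout: a "## METRICS CLASS" line, then a tab-separated header row that includes a PERCENT_DUPLICATION column, then one row per library, ended by a blank line. The helper name percent_duplication is hypothetical.

```shell
# Hypothetical helper: print "<library><TAB><PERCENT_DUPLICATION>" for each
# library row of a MarkDuplicates metrics file.
percent_duplication() {
  awk -F '\t' '
    /^## METRICS CLASS/ { want_header = 1; next }   # header row comes next
    want_header {
      # locate the column by name rather than by position
      for (i = 1; i <= NF; i++) if ($i == "PERCENT_DUPLICATION") col = i
      want_header = 0; in_table = 1; next
    }
    in_table && NF == 0 { exit }                    # blank line ends the table
    in_table && col { print $1 "\t" $col }
  ' "$1"
}

# Usage, with the file name from the example run:
# percent_duplication metrics.txt
```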
References: