Problems encountered when deduplicating with sambamba

Compiled by 探序基因肿瘤研究院

To deduplicate a BAM file, I ran sambamba:

sambamba-1.0.1-linux-amd64-static markdup -t 16 xxx.sort.bam xxx.sort.rmdup.bam

It failed with:

sambamba-markdup: Cannot open file `/tmp/sambamba-pid178526-markdup-ekoz/PairedEndsInfodiyl16' in mode `w+' (Too many open files)
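The message says the process hit its limit on open file descriptors while creating temporary files. A minimal sketch for inspecting that limit and raising the soft limit up to the hard limit for the current shell session; this is a general Linux step, not something the run above tried, and whether it is enough depends on the data set:

```shell
# Show the current soft and hard limits on open file descriptors.
soft_limit=$(ulimit -Sn)
hard_limit=$(ulimit -Hn)
echo "soft=${soft_limit} hard=${hard_limit}"

# Raise the soft limit to the hard limit for this shell session only.
# No root is needed as long as the new value does not exceed the hard limit.
ulimit -Sn "${hard_limit}"
echo "new soft=$(ulimit -Sn)"
```

Rerun the sambamba command from the same shell so that it inherits the new limit. sambamba markdup also has an --overflow-list-size option; its help text states that increasing the value reduces the number of temporary files created.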

Rather than chase the file-handle limit, I tried Picard instead.

Installation:

git clone https://github.com/broadinstitute/picard.git

cd picard

./gradlew shadowJar

This time I used picard-3.1.1, and the build failed with:

* What went wrong:
A problem occurred evaluating root project 'picard'.
> A Java 17 compatible (Java 17 or later) version is required to build GATK, but 11 was found. See https://github.com/broadinstitute/picard/blob/master/README.md#building-picard for information on how to build picard.

Point the environment at a Java 17 installation and rerun the build:

export JAVA_HOME=/datadisk/software/jdk-17.0.12
export PATH=$JAVA_HOME/bin:$PATH
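Before rerunning the build, it can help to confirm which Java the shell actually picks up. A small sketch, assuming the usual `java -version` banner format; java_major is a helper name made up for this example:

```shell
# Extract the major version from a `java -version` banner line.
# Handles both the legacy "1.8.0_392" scheme and the modern "17.0.12" scheme.
java_major() {
    ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    case "$ver" in
        1.*) ver=${ver#1.} ;;   # "1.8.0_392" -> "8.0_392"
    esac
    printf '%s\n' "${ver%%[._-]*}"   # keep only the leading number
}

# Check the java found on PATH, if any.
if command -v java >/dev/null 2>&1; then
    banner=$(java -version 2>&1 | head -n 1)
    major=$(java_major "$banner")
    if [ "${major:-0}" -ge 17 ]; then
        echo "Java ${major} found: OK for this Picard build"
    else
        echo "Java ${major:-unknown} found: Java 17 or later is required"
    fi
else
    echo "java not found on PATH"
fi
```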

Alternatively, skip the build and download the prebuilt jar:

wget https://github.com/broadinstitute/picard/releases/latest/download/picard.jar

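Mind the Java version: if it does not meet the requirement, Picard fails with an error. Downloading a Java runtime that does meet it is enough.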

Running java -jar picard.jar MarkDuplicates prints the help text:

Required Arguments:

--INPUT,-I <String>           One or more input SAM, BAM or CRAM files to analyze. Must be coordinate sorted.  This
                              argument must be specified at least once. Required. 

--METRICS_FILE,-M <File>      File to write duplication metrics to  Required. 

--OUTPUT,-O <File>            The output file to write marked records to  Required. 

Optional arguments include:

--TMP_DIR <File>

--DUPLEX_UMI <Boolean>        Treat UMIs as being duplex stranded.  This option requires that the UMI consist of two
                              equal length strings that are separated by a hyphen (e.g. 'ATC-GTC'). Reads are considered
                              duplicates if, in addition to standard definition, have identical normalized UMIs.  A UMI
                              from the 'bottom' strand is normalized by swapping its content around the hyphen (eg.
                              ATC-GTC becomes GTC-ATC).  A UMI from the 'top' strand is already normalized as it is.
                              Both reads from a read pair considered top strand if the read 1 unclipped 5' coordinate is
                              less than the read 2 unclipped 5' coordinate. All chimeric reads and read fragments are
                              treated as having come from the top strand. With this option is it required that the
                              BARCODE_TAG hold non-normalized UMIs. Default false.  Default value: false. Possible
                              values: {true, false}

This one is presumably used for ctDNA analysis?

There does not seem to be a thread-count argument?
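If each MarkDuplicates run really is limited to one thread (an assumption based only on the help text listing no thread option), one way to use several cores is to process several samples side by side. A dry-run sketch that only prints the commands; the sample names and the metrics file name are hypothetical, and the BAM naming follows the sambamba example:

```shell
# Build the MarkDuplicates command line for one sample (dry run: only prints it).
# File names follow the <sample>.sort.bam pattern; <sample>.metrics.txt is an assumption.
markdup_cmd() {
    printf 'java -jar picard.jar MarkDuplicates -I %s.sort.bam -O %s.sort.rmdup.bam -M %s.metrics.txt\n' \
        "$1" "$1" "$1"
}

# Hypothetical sample names; replace with your own.
for sample in sampleA sampleB; do
    markdup_cmd "$sample"
done
```

Piping the printed lines into xargs -P 2 -I CMD sh -c CMD would run two of them at a time. Each JVM needs its own memory, so the degree of parallelism is bounded by RAM as well as by cores.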

Example run:

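# I: input BAM file; O: output BAM file; M: duplication metrics file
# REMOVE_DUPLICATES=true: remove duplicate reads instead of only flagging them
# MAX_FILE_HANDLES=800: maximum number of file handles; VALIDATION_STRINGENCY=LENIENT: validation strictness
# stderr is appended to the log file and the job runs in the background
picard MarkDuplicates \
    I=sample_sorted.bam \
    O=sample_rmDup.bam \
    M=metrics.txt \
    REMOVE_DUPLICATES=true \
    MAX_FILE_HANDLES=800 \
    VALIDATION_STRINGENCY=LENIENT \
    2>>log/picard_MarkDuplicates_log.txt &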
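Once the job finishes, the metrics file gives a quick check on how much was flagged as duplicate. A sketch that pulls the PERCENT_DUPLICATION column out of it, assuming the usual Picard metrics layout (a "## METRICS CLASS" line, then a tab-separated header row, then one row per library); percent_duplication is a helper name made up for this example:

```shell
# Print "<library><TAB><PERCENT_DUPLICATION>" for each library in a Picard metrics file.
percent_duplication() {
    awk -F '\t' '
        /^## METRICS CLASS/ { state = 1; next }   # the next line is the column header
        state == 1 {
            for (i = 1; i <= NF; i++) if ($i == "PERCENT_DUPLICATION") col = i
            state = 2; next
        }
        state == 2 && NF == 0 { exit }            # a blank line ends the metrics table
        state == 2 && col { print $1 "\t" $col }
    ' "$1"
}

# Only run it if the metrics file from the example above exists.
if [ -f metrics.txt ]; then
    percent_duplication metrics.txt
fi
```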

Reference:

Jianshu (简书): picard sambamba linux安装使用 (202406)
