Statistic Data From file via Java

本文介绍了一种通过程序自动统计和分析多个文件中数据的方法,实现了数据合并、去重、排序及按前四位数字进行分类计数的功能。

由于一个朋友工作的需要,需要统计很多文件里的数据,由于手工统计的工作量比较大,于是求助于我通过程序来统计。
需求:将所有数据从小到大合并成一个文件,重复数据只保留一个!并分别统计出前四位相同数据的个数!


数据片段:

19610618,
19610619,
19610622,
19610623,
19610718,
19980814,
19980820,
19980822,
19980831,

20040804,
20040809,
20040810,
20050405,
20050429,
20050619,

 

数据分散在不同的文件里面

/**
 * @author Seave
 * @since 2006-06-25
 * @version 1.0
 * 文件数据统计
 * 将所有数据从小到大合并成一个文件,重复数据只保留一个!
 * 并分别统计出前四位相同数据的个数!
 */
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;

import java.io.IOException;

public class DataStatistic
{
 //原始数据文件目录
 private final static String directoryPath = "d://chengqiao//";
 
 //统计后生成的文件
 private final static String geneteratePath = "d://statisticResult.txt";
 /**
  * @author Seave
  * @since 2006-06-25
  * read file
  * @return
  */
 private static ArrayList readFile()
 {
  ArrayList al = new ArrayList();
  File file = new File(directoryPath);
  String fileList[] = file.list();
  
  for(int i = 0;i < fileList.length; i++)
  {
   //得到文件名,放在list数组里
   al.add(fileList[i]);
  }
  return al;
 }
 
 /**
  * @author Seave
  * @since 2005-06-25
  * combin file,同时去掉重复数据
  *
  */
 private static ArrayList combinData()
 {
  ArrayList alLine = new ArrayList();
  try
  {
   ArrayList al = readFile();
   for(int i = 0;i < al.size();i++)
   {
    String fileName = (String)al.get(i);
    FileInputStream fin = new FileInputStream(directoryPath+fileName);
    BufferedReader bRead = new BufferedReader(new InputStreamReader(fin));
          String line = "";
    while ((line = bRead.readLine())!= null)
          {
     if(alLine.contains(line))
     {
      //已经存入,则不add进去了
      continue;
     }
     alLine.add(line);
          }
    fin.close();
    bRead.close();
   }
  }
  catch(IOException ioe)
  {
   ioe.printStackTrace();
  }
  return alLine;
 }
 
 
 /**
  * @author Seave
  * @since 2005-06-25
  * 排序,从小到大排序
  *
  */
 private static ArrayList sortData()
 {
  //存放排序后的数据
  ArrayList alSort = new ArrayList();
  //需要排序的数据
  ArrayList al = combinData();
  //clone ArrayList Data
  ArrayList alClone = (ArrayList)al.clone();
  
  while(al.size() > 0)
  {
   String minData = (String)al.get(0);
   int tempIndex = 0;  
   if(alSort.contains(minData))
   {
    //如果此数据已经排序,则删除此数据,同时重新开始特循环
    al.remove(minData);
    continue;
   }
   for(int j = 0;j < alClone.size();j++)
   { 
    String tempData = (String)alClone.get(j);
    if(minData.compareTo(tempData) > 0)
    {
     //当前数据同克隆数组里的数据相比较
     minData = tempData;
     tempIndex = j;
    }
   }
   
//   System.out.println("minData result = " + minData);
//   System.out.println("================================"); 
   alSort.add(minData);
   //当得到最小的数据时,把此次循环得到的最小数据移出克隆数组
   alClone.remove(tempIndex);   
  }
  return alSort;
 }
 
 /**
  * @author Seave
  * @since 2005-06-25
  * 分别统计出前四位相同数据的个数
  *
  */
 private static ArrayList statisticData()
 {
  ArrayList fourAl = new ArrayList();
  ArrayList fourAndCountAl = new ArrayList();
  ArrayList al = sortData();
  for(int i = 0;i < al.size();i++)
  {
   String alStr = (String)al.get(i);
   //获取前四位
   String fourStartStr = alStr.substring(0,4);
   if(fourAl.contains(fourStartStr))
    continue;
   fourAl.add(fourStartStr);
  }
  for(int i = 0;i < fourAl.size();i++)
  {
   String year = (String)fourAl.get(i);
   int count = 0;
   for(int j = 0;j < al.size();j++)
   {
    String alStr = (String)al.get(j);
    String fourStartStr = alStr.substring(0,4);
    if(fourStartStr.equals(year))
    {
     //计算相同的年份的数量
     count++;
    }
   }
   
   fourAndCountAl.add(year+":" + count+"条数据");
  }
  return fourAndCountAl;
 }
 
 /**
  * @author Seave
  * @since 2005-06-25
  * 写文件
  *
  */
 private static void writeFile(ArrayList al)
 {
  try
  {
   File f = new File(geneteratePath);
   if(!f.exists())
    f.createNewFile();
   
   FileOutputStream fout = new FileOutputStream(geneteratePath);
   
   for(int i = 0; i < al.size();i++)
   {
    byte b[] = new byte[al.size()];
    b = new String((String)al.get(i)+"/n").getBytes();
    //写入文件
    fout.write(b);
   }
   fout.flush();
   //close FileOutputStream
   fout.close();
  }
  catch(IOException ioe)
  {
   ioe.printStackTrace();
  }
 }
 
 
 public static void main(String[] args)
 {
  ArrayList alSort = sortData();
  ArrayList alStatic = statisticData();
  for(int i = 0;i < alSort.size();i++)
  {
   //组织要同一个list里
   alStatic.add(alSort.get(i));
  }
  writeFile(alStatic);
 }
}

 

最后的统计结果片段:

1961:66条数据
1962:61条数据
1963:62条数据
1964:64条数据

1998:52条数据
1999:34条数据
2000:42条数据
2001:42条数据
2002:40条数据
2003:44条数据
2004:45条数据
2005:48条数据

19610618,
19610619,
19610622,
19610623,

20040827,
20041215,
20050310,
20050311,

 

 

注意” yahs -h Usage: yahs [options] <contigs.fa> <hic.bed>|<hic.bam>|<hic.pa5>|<hic.bin> Options: -a FILE AGP file (for rescaffolding) [none] -r INT[,INT,...] list of resolutions in ascending order [automate] -R INT rounds to run at each resoultion level [1] -e STR restriction enzyme cutting sites [none] -l INT minimum length of a contig to scaffold [0] -q INT minimum mapping quality [10] --no-contig-ec do not do contig error correction --no-scaffold-ec do not do scaffold error correction --no-mem-check do not do memory check at runtime --file-type STR input file type BED|BAM|PA5|BIN, file name extension is ignored if set --read-length read length (required for PA5 format input) [150] --telo-motif STR telomeric sequence motif -o STR prefix of output files [yahs.out] -v INT verbose level [0] -? print long help with extra option list --version show version number (Hic) [scb3201@ln137%bscc-a6 Hic]$ which juicer_tools ~/anaconda3/envs/Hic/bin/juicer_tools (Hic) [scb3201@ln137%bscc-a6 Hic]$ juicer_tools -h WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance. WARN [2025-12-05T22:53:25,675] [Globals.java:138] [main] Development mode is enabled Juicer Tools Version 2.20.00 Usage: dump <observed/oe> <NONE/VC/VC_SQRT/KR> <hicFile(s)> <chr1>[:x1:x2] <chr2>[:y1:y2] <BP/FRAG> <binsize> [outfile] dump <norm/expected> <NONE/VC/VC_SQRT/KR> <hicFile(s)> <chr> <BP/FRAG> <binsize> [outfile] dump <loops/domains> <hicFile URL> [outfile] pre [options] <infile> <outfile> <genomeID> addNorm <input_HiC_file> [input_vector_file] pearsons [-p] <NONE/VC/VC_SQRT/KR> <hicFile(s)> <chr> <BP/FRAG> <binsize> [outfile] eigenvector -p <NONE/VC/VC_SQRT/KR> <hicFile(s)> <chr> <BP/FRAG> <binsize> [outfile] apa <hicFile(s)> <PeaksFile> <SaveFolder> arrowhead <hicFile(s)> <output_file> hiccups <hicFile> <outputDirectory> hiccupsdiff <firstHicFile> <secondHicFile> <firstLoopList> <secondLoopList> <outputDirectory> validate <hicFile> -h, --help print help -v, --verbose verbose mode -V, --version print version Type juicer_tools <commandName> for more detailed usage instructions (Hic) [scb3201@ln137%bscc-a6 Hic]$ bedtools -h bedtools is a powerful toolset for genome arithmetic. Version: v2.31.1 About: developed in the quinlanlab.org and by many contributors worldwide. Docs: http://bedtools.readthedocs.io/ Code: https://github.com/arq5x/bedtools2 Mail: https://groups.google.com/forum/#!forum/bedtools-discuss Usage: bedtools <subcommand> [options] The bedtools sub-commands include: [ Genome arithmetic ] intersect Find overlapping intervals in various ways. window Find overlapping intervals within a window around an interval. closest Find the closest, potentially non-overlapping interval. coverage Compute the coverage over defined intervals. map Apply a function to a column for each overlapping interval. genomecov Compute the coverage over an entire genome. merge Combine overlapping/nearby intervals into a single interval. cluster Cluster (but don't merge) overlapping/nearby intervals. complement Extract intervals _not_ represented by an interval file. shift Adjust the position of intervals. subtract Remove intervals based on overlaps b/w two files. slop Adjust the size of intervals. flank Create new intervals from the flanks of existing intervals. sort Order the intervals in a file. random Generate random intervals in a genome. shuffle Randomly redistribute intervals in a genome. sample Sample random records from file using reservoir sampling. spacing Report the gap lengths between intervals in a file. annotate Annotate coverage of features from multiple files. [ Multi-way file comparisons ] multiinter Identifies common intervals among multiple interval files. unionbedg Combines coverage intervals from multiple BEDGRAPH files. [ Paired-end manipulation ] pairtobed Find pairs that overlap intervals in various ways. pairtopair Find pairs that overlap other pairs in various ways. [ Format conversion ] bamtobed Convert BAM alignments to BED (& other) formats. bedtobam Convert intervals to BAM records. bamtofastq Convert BAM records to FASTQ records. bedpetobam Convert BEDPE intervals to BAM records. bed12tobed6 Breaks BED12 intervals into discrete BED6 intervals. [ Fasta manipulation ] getfasta Use intervals to extract sequences from a FASTA file. maskfasta Use intervals to mask sequences from a FASTA file. nuc Profile the nucleotide content of intervals in a FASTA file. [ BAM focused tools ] multicov Counts coverage from multiple BAMs at specific intervals. tag Tag BAM alignments based on overlaps with interval files. [ Statistical relationships ] jaccard Calculate the Jaccard statistic b/w two sets of intervals. reldist Calculate the distribution of relative distances b/w two files. fisher Calculate Fisher statistic b/w two feature files. [ Miscellaneous tools ] overlap Computes the amount of overlap from two intervals. igv Create an IGV snapshot batch script. links Create a HTML page of links to UCSC locations. makewindows Make interval "windows" across a genome. groupby Group by common cols. & summarize oth. cols. (~ SQL "groupBy") expand Replicate lines based on lists of values in columns. split Split a file into multiple files with equal records or base pairs. summary Statistical summary of intervals in a file. [ General Parameters ] --cram-ref Reference used by a CRAM input [ General help ] --help Print this help menu. --version What version of bedtools are you using?. --contact Feature requests, bugs, mailing lists, etc. (Hic) [scb3201@ln137%bscc-a6 Hic]$ bwa -h [main] unrecognized command '-h' (Hic) [scb3201@ln137%bscc-a6 Hic]$ bwa Program: bwa (alignment via Burrows-Wheeler transformation) Version: 0.7.19-r1273 Contact: Heng Li <hli@ds.dfci.harvard.edu> Usage: bwa <command> [options] Command: index index sequences in the FASTA format mem BWA-MEM algorithm fastmap identify super-maximal exact matches pemerge merge overlapping paired ends (EXPERIMENTAL) aln gapped/ungapped alignment samse generate alignment (single ended) sampe generate alignment (paired ended) bwasw BWA-SW for long queries (DEPRECATED) shm manage indices in shared memory fa2pac convert FASTA to PAC format pac2bwt generate BWT from PAC pac2bwtgen alternative algorithm for generating BWT bwtupdate update .bwt to the new format bwt2sa generate SA from BWT and Occ Note: To use BWA, you need to first index the genome with `bwa index'. There are three alignment algorithms in BWA: `mem', `bwasw', and `aln/samse/sampe'. If you are not sure which to use, try `bwa mem' first. Please `man ./bwa.1' for the manual. (Hic) [scb3201@ln137%bscc-a6 Hic]$ assembly-stats usage: stats [options] <list of fasta/q files> Reports sequence length statistics from fasta and/or fastq files options: -l <int> Minimum length cutoff for each sequence. Sequences shorter than the cutoff will be ignored [1] -s Print 'grep friendly' output -t Print tab-delimited output -u Print tab-delimited output with no header line -v Print version and exit“
最新发布
12-06
<think>我们注意到用户之前尝试使用yahs进行Hi-C scaffolding,并试图将结果转换为juicer格式时遇到了一系列问题。根据提供的帮助信息,我们重新设计一个完整的流程,从原始Hi-C数据开始,到最终生成juicer的.hic文件。 步骤: 1. 准备参考基因组索引和染色体大小文件。 2. 使用bwa mem对Hi-C reads进行比对。 3. 使用samtools将比对结果转换为BAM并排序。 4. 使用pairtools对BAM文件进行配对和去重。 5. 使用yahs进行scaffolding。 6. 将yahs的输出转换为juicer格式。 7. 使用juicer_tools生成.hic文件。 由于用户环境中有bwa, samtools, bedtools, pairtools, yahs, juicer_tools等工具,我们编写一个完整的脚本。 注意:我们假设用户有两个fastq文件(read1和read2)和一个参考基因组fasta文件。 我们将脚本分为几个部分,并详细注释。 </think> # 完整的Hi-C分析流程:基于Yahs和Juicer Tools 根据您提供的工具帮助信息,我设计了一个完整的Hi-C分析流程,从原始数据到最终的可视化文件。以下是完整的脚本和详细说明: ```bash #!/bin/bash #SBATCH -p amd_256 #SBATCH -N 1 #SBATCH -n 32 #SBATCH -t 120:00:00 #SBATCH --mem=256G #SBATCH -J hic_analysis #SBATCH -o /public1/home/scb3201/rawdata/Hic/logs/%j_hic_out.log #SBATCH -e /public1/home/scb3201/rawdata/Hic/logs/%j_hic_err.log ########################################################## # 1. 环境设置与路径定义 ########################################################## echo "[INFO] 开始Hi-C分析流程... $(date)" # 激活Conda环境 source ~/.bashrc conda activate Hic || { echo "[ERROR] Hic环境激活失败! $(date)" exit 1 } # 核心路径 BASE_DIR="/public1/home/scb3201/rawdata/Hic" RESULTS_DIR="/public1/home/scb3201/results/genome/Hic" REF_GENOME="/public1/home/scb3201/results/genome/Medaka/genome.nextpolish.medaka.fasta" READ1="${BASE_DIR}/reads/hic_R1.fastq.gz" READ2="${BASE_DIR}/reads/hic_R2.fastq.gz" # 子目录 ALIGN_DIR="${RESULTS_DIR}/alignment" SCAFFOLD_DIR="${RESULTS_DIR}/scaffold" JUICEBOX_DIR="${RESULTS_DIR}/juicebox" LOG_DIR="${BASE_DIR}/logs" TMP_DIR="${BASE_DIR}/tmp" # 创建目录 mkdir -p ${ALIGN_DIR} ${SCAFFOLD_DIR} ${JUICEBOX_DIR} ${LOG_DIR} ${TMP_DIR} # 设置临时目录 export TMPDIR=${TMP_DIR} export _JAVA_OPTIONS="-Xmx200G -Xms64G" ########################################################## # 2. 参考基因组准备 ########################################################## echo "[STEP] 准备参考基因组... $(date)" # 生成BWA索引 if [[ ! -f ${REF_GENOME}.bwt ]]; then echo "[INFO] 创建BWA索引..." bwa index ${REF_GENOME} 2> ${LOG_DIR}/bwa_index.log || { echo "[ERROR] BWA索引创建失败! $(date)" exit 1 } fi # 生成染色体长度文件 if [[ ! -f ${REF_GENOME}.fai ]]; then echo "[INFO] 创建参考基因组索引..." samtools faidx ${REF_GENOME} 2> ${LOG_DIR}/samtools_faidx.log || { echo "[ERROR] 参考基因组索引创建失败! $(date)" exit 1 } fi # 创建染色体大小文件 cut -f1,2 ${REF_GENOME}.fai > ${JUICEBOX_DIR}/chrom.sizes 2> ${LOG_DIR}/chrom_sizes.log || { echo "[ERROR] 染色体大小文件生成失败! $(date)" exit 1 } ########################################################## # 3. Hi-C数据比对 ########################################################## echo "[STEP] 比对Hi-C reads... $(date)" BAM_OUT="${ALIGN_DIR}/aligned.bam" # 使用BWA MEM进行比对 bwa mem -t ${SLURM_NTASKS} -5SP ${REF_GENOME} ${READ1} ${READ2} 2> ${LOG_DIR}/bwa_mem.log | \ samtools view -h -b -F 2304 - > ${BAM_OUT} 2> ${LOG_DIR}/samtools_view.log || { echo "[ERROR] BWA比对失败! $(date)" exit 1 } # 排序BAM文件 echo "[INFO] 排序比对结果..." samtools sort -@ ${SLURM_NTASKS} -m 4G -T ${TMP_DIR}/sort_temp -o ${ALIGN_DIR}/aligned.sorted.bam ${BAM_OUT} 2> ${LOG_DIR}/samtools_sort.log || { echo "[ERROR] BAM排序失败! $(date)" exit 1 } # 索引BAM文件 samtools index ${ALIGN_DIR}/aligned.sorted.bam 2> ${LOG_DIR}/samtools_index.log || { echo "[WARNING] BAM索引创建失败,继续执行... $(date)" } ########################################################## # 4. 使用Yahs进行Scaffolding ########################################################## echo "[STEP] Yahs scaffolding... $(date)" # 转换BAM为Yahs输入格式 echo "[INFO] 准备Yahs输入文件..." samtools view -h ${ALIGN_DIR}/aligned.sorted.bam | \ awk 'BEGIN{OFS="\t"} { if($1 ~ /^@/) {print; next} if($7 != "=") next # 只保留正常配对的reads if($6 ~ /^[0-9]+S/) next # 过滤软裁剪 print }' | samtools view -b - > ${SCAFFOLD_DIR}/yahs_input.bam 2> ${LOG_DIR}/yahs_prep.log || { echo "[ERROR] Yahs输入准备失败! $(date)" exit 1 } # 运行Yahs echo "[INFO] 运行Yahs scaffolding..." yahs \ -o ${SCAFFOLD_DIR}/yahs_out \ ${REF_GENOME} \ ${SCAFFOLD_DIR}/yahs_input.bam 2> ${LOG_DIR}/yahs_run.log || { echo "[ERROR] Yahs运行失败! $(date)" exit 1 } ########################################################## # 5. 转换为Juicer格式 ########################################################## echo "[STEP] 转换为Juicer格式... $(date)" # 方法1: 使用Yahs内置功能 echo "[METHOD] 尝试Yahs内置转换..." yahs juicer \ ${SCAFFOLD_DIR}/yahs_out.bin \ ${JUICEBOX_DIR}/chrom.sizes \ ${SCAFFOLD_DIR}/yahs_out_scaffolds_final.agp \ ${REF_GENOME} \ ${JUICEBOX_DIR}/juicer 2> ${LOG_DIR}/yahs_juicer.log if [[ $? -ne 0 ]]; then # 方法2: 手动转换 echo "[METHOD] Yahs内置转换失败,使用备用转换方法..." # 步骤1: 提取有效交互 echo "[SUBSTEP] 提取有效交互..." samtools view -h ${SCAFFOLD_DIR}/yahs_out.bin | \ awk 'BEGIN{OFS="\t"} { if($1 ~ /^@/) next if($7 != "=") next # 只保留正常配对的reads split($1, a, "/"); read_id = a[1] if(read_id != prev_id) { prev_id = read_id next } print $3, $4, $4+length($10), $7, $8, $8+length($10) }' > ${JUICEBOX_DIR}/interactions.bedpe 2> ${LOG_DIR}/interactions_extract.log || { echo "[ERROR] 交互提取失败! $(date)" exit 1 } # 步骤2: 转换为Juicer格式 echo "[SUBSTEP] 转换为Juicer矩阵..." awk 'BEGIN{OFS="\t"} { # Juicer格式: <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> # 计算中点位置 pos1 = int(($2+$3)/2) pos2 = int(($5+$6)/2) # 输出交互矩阵 print 0, $1, pos1, 0, 0, $4, pos2, 1 }' ${JUICEBOX_DIR}/interactions.bedpe > ${JUICEBOX_DIR}/juicer.2.txt 2> ${LOG_DIR}/juicer_convert.log || { echo "[ERROR] Juicer格式转换失败! $(date)" exit 1 } # 步骤3: 创建片段文件 echo "[SUBSTEP] 创建片段文件..." awk 'BEGIN{OFS="\t"} { print $1, 0, $2, 1 }' ${JUICEBOX_DIR}/chrom.sizes > ${JUICEBOX_DIR}/juicer.3.txt 2> ${LOG_DIR}/juicer_fragments.log || { echo "[ERROR] 片段文件创建失败! $(date)" exit 1 } # 复制染色体大小文件 cp ${JUICEBOX_DIR}/chrom.sizes ${JUICEBOX_DIR}/juicer.chromsizes fi ########################################################## # 6. 生成.hic文件 ########################################################## echo "[STEP] 生成.hic文件... $(date)" # 排序Juicer文件 echo "[INFO] 排序Juicer输入文件..." LC_ALL=C sort -k2,2d -k6,6d -k1,1d -k5,5d -T ${TMP_DIR} --parallel=${SLURM_NTASKS} \ ${JUICEBOX_DIR}/juicer.2.txt > ${JUICEBOX_DIR}/juicer.2.sorted.txt 2> ${LOG_DIR}/juicer_sort.log || { echo "[WARNING] 排序失败,使用原始文件... $(date)" mv ${JUICEBOX_DIR}/juicer.2.txt ${JUICEBOX_DIR}/juicer.2.sorted.txt } # 使用juicer_tools创建.hic文件 echo "[INFO] 运行juicer_tools..." juicer_tools pre \ -j ${SLURM_NTASKS} \ -t ${TMP_DIR} \ ${JUICEBOX_DIR}/juicer.2.sorted.txt \ ${JUICEBOX_DIR}/aligned.hic \ ${JUICEBOX_DIR}/juicer.chromsizes 2> ${LOG_DIR}/juicer_pre.log || { echo "[ERROR] .hic文件生成失败! $(date)" # 备选方法:直接使用Java调用 echo "[FIX] 尝试Java直接调用..." JUICER_JAR=$(find ${CONDA_PREFIX}/bin -name 'juicer_tools*.jar' | head -1) if [[ -f ${JUICER_JAR} ]]; then java -jar ${JUICER_JAR} pre \ -j ${SLURM_NTASKS} \ -t ${TMP_DIR} \ ${JUICEBOX_DIR}/juicer.2.sorted.txt \ ${JUICEBOX_DIR}/aligned.hic \ ${JUICEBOX_DIR}/juicer.chromsizes 2> ${LOG_DIR}/juicer_java.log || { echo "[CRITICAL] Java调用失败! $(date)" exit 1 } else echo "[CRITICAL] 找不到Juicer Tools JAR文件! $(date)" exit 1 fi } ########################################################## # 7. 结果验证与报告 ########################################################## echo "[STEP] 验证结果... $(date)" # 验证.hic文件 juicer_tools validate ${JUICEBOX_DIR}/aligned.hic > ${LOG_DIR}/validate.log 2>&1 VALIDATION_EXIT=$? # 生成统计报告 echo "==== Hi-C分析结果报告 ====" > ${LOG_DIR}/final_report.txt echo "生成时间: $(date)" >> ${LOG_DIR}/final_report.txt echo "运行时长: $SECONDS 秒" >> ${LOG_DIR}/final_report.txt echo ".hic文件大小: $(du -sh ${JUICEBOX_DIR}/aligned.hic | cut -f1)" >> ${LOG_DIR}/final_report.txt echo "验证结果: $(grep -i 'valid' ${LOG_DIR}/validate.log | head -1)" >> ${LOG_DIR}/final_report.txt echo "Yahs输出文件:" >> ${LOG_DIR}/final_report.txt ls -lh ${SCAFFOLD_DIR}/yahs_out* >> ${LOG_DIR}/final_report.txt # 染色体交互统计 if [[ ${VALIDATION_EXIT} -eq 0 ]]; then echo "染色体交互统计:" >> ${LOG_DIR}/final_report.txt juicer_tools stats ${JUICEBOX_DIR}/aligned.hic >> ${LOG_DIR}/final_report.txt 2> ${LOG_DIR}/stats.log fi ########################################################## # 8. 清理与完成 ########################################################## echo "[INFO] 清理临时文件... $(date)" rm -rf ${TMP_DIR} ${ALIGN_DIR}/aligned.bam ${JUICEBOX_DIR}/interactions.bedpe ${JUICEBOX_DIR}/juicer.2.txt echo "[SUCCESS] Hi-C分析完成! $(date)" echo "结果文件: ${JUICEBOX_DIR}/aligned.hic" echo "报告文件: ${LOG_DIR}/final_report.txt" ``` ## 关键工具使用说明 ### 1. BWA MEM比对 ```bash bwa mem -t 32 -5SP ${REF_GENOME} ${READ1} ${READ2} ``` - `-t 32`: 使用32个线程 - `-5SP`: 特殊参数,适用于Hi-C数据比对 - `-5`: 对于包含/1和/2的read名称使用特殊处理 - `-S`: 跳过配对对齐(适用于长reads) - `-P`: 标记次要对齐位置(用于下游处理) ### 2. Samtools处理 ```bash samtools view -h -b -F 2304 ``` - `-F 2304`: 过滤标志位 - 2304 = 512 (PCR重复) + 1024 (光学重复) + 768 (未映射/未配对) - 保留正常的配对比对结果 ### 3. Yahs scaffolding ```bash yahs -o yahs_out ${REF_GENOME} yahs_input.bam ``` - `-o`: 指定输出前缀 - 输入:参考基因组 + 过滤后的BAM文件 - 输出:`.bin` (压缩的交互矩阵), `.agp` (scaffold信息)等 ### 4. Juicer Tools转换 ```bash juicer_tools pre -j 32 -t tmp juicer.2.sorted.txt aligned.hic chrom.sizes ``` - `pre`: 预处理命令 - `-j 32`: 使用32个线程 - `-t tmp`: 指定临时目录 - 输入:排序后的交互矩阵 + 染色体大小文件 - 输出:`.hic` 二进制文件(用于可视化) ## 高级功能扩展 ### 1. 多分辨率分析 ```bash # 使用不同分辨率生成多个.hic文件 RESOLUTIONS=(1000000 500000 100000 50000 10000) for res in ${RESOLUTIONS[@]}; do juicer_tools pre \ -r ${res} \ -j ${SLURM_NTASKS} \ ${JUICEBOX_DIR}/juicer.2.sorted.txt \ ${JUICEBOX_DIR}/aligned_${res}.hic \ ${JUICEBOX_DIR}/juicer.chromsizes done ``` ### 2. 交互矩阵可视化 ```bash # 生成特定区域的交互矩阵 juicer_tools dump observed KR \ ${JUICEBOX_DIR}/aligned.hic \ chr1 chr1 \ BP 100000 \ ${JUICEBOX_DIR}/chr1_matrix.txt # 转换为热图格式 awk 'BEGIN{OFS="\t"} { print $1, $2, $3, $7 # 位置1, 位置2, 计数 }' ${JUICEBOX_DIR}/chr1_matrix.txt > ${JUICEBOX_DIR}/chr1_heatmap.txt ``` ### 3. Scaffolding质量评估 ```bash # 计算N50统计 assembly-stats -t ${SCAFFOLD_DIR}/yahs_out_scaffolds_final.fa > ${LOG_DIR}/scaffold_stats.txt # 比较原始和scaffolded基因组 diamond blastx \ -d nr \ -q ${REF_GENOME} \ -o ${SCAFFOLD_DIR}/original_blast.txt diamond blastx \ -d nr \ -q ${SCAFFOLD_DIR}/yahs_out_scaffolds_final.fa \ -o ${SCAFFOLD_DIR}/scaffolded_blast.txt ``` ## 错误处理与优化建议 ### 1. BWA比对优化 ```bash # 增加内存限制 bwa mem -t 32 -K 100000000 -Y ... ``` - `-K 100000000`: 每次处理100,000,000个reads - `-Y`: 使用soft-clipping进行补充对齐 ### 2. Yahs参数调优 ```bash yahs -r 10000,50000,100000 -R 3 ... ``` - `-r`: 指定分辨率列表(从低到高) - `-R`: 每个分辨率的迭代次数 ### 3. Juicer内存优化 ```bash # 增加Java内存 export _JAVA_OPTIONS="-Xmx200G -Xms64G" # 使用磁盘排序 juicer_tools pre --disk ... ```
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值