About seq_file operations

This article takes a close look at the kernel implementation of the sequence file (seq_file) interface: how Linux kernel data structures can be exported through the proc filesystem, a set of macros that simplify the programming, and examples showing how to export doubly linked circular lists, arrays, hash tables, and red-black trees.

Overview

This article describes the kernel implementation of the sequence file (seq_file) interface and how to use it to export the data structures commonly used inside the Linux kernel to user space through files (focusing on proc files). It then defines a few macros that make the programming easier and reduce duplicated code. While analyzing the implementation of the seq_file interface, it also touches on some usage pitfalls and how to avoid them.

The seq_file interface

In the UNIX world the file is the most ordinary of concepts, so using files as the interface for passing data between the kernel and user space is equally ordinary, and such an interface is also quite shell-friendly, which makes it convenient for administrators to manage the system directly from the shell. The proc pseudo-filesystem, however, has serious limitations when handling large data structures (data larger than one page): programming for that case is particularly awkward and easily leads to bugs. The sequence file interface was invented to fix this; it provides a friendlier interface for the programmer. The name "sequence file" was presumably chosen because the interface is mainly used to export data as a sequence of records.

To give a concrete picture, let us first look at some example code that uses the seq_file interface to export a kernel doubly linked circular list through a proc file:


#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>

static struct mutex lock;
static struct list_head head;
struct my_data {
        struct list_head list;
        int value;
};

static void add_one(void)
{
        /* Adds one node; "value" is left uninitialized, which is why the
         * sample output below shows arbitrary numbers. */
        struct my_data *data;

        mutex_lock(&lock);
        data = kmalloc(sizeof(*data), GFP_KERNEL);
        if (data != NULL)
                list_add(&data->list, &head);
        mutex_unlock(&lock);
}

static ssize_t _seq_write(struct file *file, const char __user * buffer,
                       size_t count, loff_t *ppos)
{
        add_one();
        return count;
}

static int _seq_show(struct seq_file *m, void *p)
{
        /* p is the struct list_head returned by _seq_start()/_seq_next(). */
        struct my_data *data = list_entry(p, struct my_data, list);

        seq_printf(m, "value: %d\n", data->value);
        return 0;
}

static void *_seq_start(struct seq_file *m, loff_t *pos)
{
        /* Take the lock here; it is released in _seq_stop(). */
        mutex_lock(&lock);
        return seq_list_start(&head, *pos);
}

static void *_seq_next(struct seq_file *m, void *p, loff_t *pos)
{
        return seq_list_next(p, &head, pos);
}

static void _seq_stop(struct seq_file *m, void *p)
{
        mutex_unlock(&lock);
}

static struct seq_operations _seq_ops = {
        .start = _seq_start,
        .next = _seq_next,
        .stop = _seq_stop,
        .show = _seq_show
};

static int _seq_open(struct inode *inode, struct file *file)
{
        return seq_open(file, &_seq_ops);
}

static struct file_operations _seq_fops = {
        .open = _seq_open,
        .read = seq_read,
        .write = _seq_write,
        .llseek = seq_lseek,
        .release = seq_release
};

static void clean_all(struct list_head *head)
{
        struct my_data *data;

        while (!list_empty(head)) {
                data = list_entry(head->next, struct my_data, list);
                list_del(&data->list);
                kfree(data);
        }
}

static int __init init(void)
{
        struct proc_dir_entry *entry;

        mutex_init(&lock);
        INIT_LIST_HEAD(&head);

        add_one();
        add_one();
        add_one();

        entry = create_proc_entry("my_data", S_IWUSR | S_IRUGO, NULL);
        if (entry == NULL) {
                clean_all(&head);
                return -ENOMEM;
        }
        entry->proc_fops = &_seq_fops;

        return 0;
}

static void __exit fini(void)
{
        remove_proc_entry("my_data", NULL);
        clean_all(&head);
}

module_init(init);
module_exit(fini);


Note: the code above requires kernel version 2.6.23 or later; on older kernels you can lift the functions seq_list_start and seq_list_next from the 2.6.23 kernel source.

The program above builds a doubly linked circular list (head) of struct my_data and exports it to user space through the proc file my_data. A user-space program can read my_data to obtain the value field of each structure in the list, and it can also use the write system call to add a my_data structure with a random (uninitialized) value to the head of the list.

gentux kernel # cat /proc/my_data
value: 474615376
value: 474615632
value: 474615568
gentux kernel # echo 0 > /proc/my_data
gentux kernel # cat /proc/my_data
value: 758528120
value: 474615376
value: 474615632
value: 474615568

Although the code above is not short, everything related to the output except the function _seq_show is boilerplate, so the convenience is obvious. For exporting a doubly linked circular list, readers should be able to write their own code by following this pattern without further explanation.

The seq_file interface revolves around two important data structures: the seq_file structure and the seq_operations table ("table" follows the terminology of ULK3; it is really just a structure containing nothing but function pointers). First, the fields of struct seq_file:
  • buf: the data buffer of the sequence file; the data to be exported is first printed into this buffer and only then copied into the buffer supplied by user space.
  • size: the size of the buffer, one memory page by default; it grows dynamically by powers of two as needed, e.g. 4k, 8k, 16k...
  • from: the offset within buf of the data that has not yet been copied to user space.
  • count: the number of bytes in buf that have not yet been copied to user space.
  • index: the index of the current data item; it corresponds to the pos argument of the start and next operations in the seq_operations table described below.
  • version: the version of the file.
  • lock: serializes concurrent operations on this file.
  • op: points to the seq_operations table described below.
  • private: points to the file's private data; this is how a seq_file is specialized.
The function-pointer members of the seq_operations table are:
  • start: begin reading data items; this is usually where a lock is taken to prevent concurrent access to the data.
  • stop: stop reading data items; the counterpart of start, and usually where the lock is released.
  • next: find the next data item to process.
  • show: print a data item into the temporary buffer.
The values returned by start and next are passed to stop and show as their second argument. When *pos is 0, start may also return SEQ_START_TOKEN; when that value reaches show, show typically prints a table header.
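
Here is a minimal, hypothetical sketch of that pattern, reusing lock, head and struct my_data from the example above (the _hdr_* names are made up for illustration); seq_list_start_head wraps up the same idea by returning the list head itself as the token:

static void *_hdr_seq_start(struct seq_file *m, loff_t *pos)
{
        mutex_lock(&lock);
        if (*pos == 0)
                return SEQ_START_TOKEN;          /* ask show for a header */
        return seq_list_start(&head, *pos - 1);  /* records start at *pos == 1 */
}

static void *_hdr_seq_next(struct seq_file *m, void *p, loff_t *pos)
{
        if (p == SEQ_START_TOKEN) {
                ++*pos;
                return seq_list_start(&head, 0);
        }
        return seq_list_next(p, &head, pos);
}

static void _hdr_seq_stop(struct seq_file *m, void *p)
{
        mutex_unlock(&lock);
}

static int _hdr_seq_show(struct seq_file *m, void *p)
{
        if (p == SEQ_START_TOKEN) {
                seq_puts(m, "value\n");          /* the table header */
                return 0;
        }
        seq_printf(m, "value: %d\n",
                   list_entry(p, struct my_data, list)->value);
        return 0;
}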

Some useful global functions:
  • seq_open: usually called when the file is opened; it creates the seq_file structure, taking its second argument as the seq_operations table.
  • seq_read, seq_lseek and seq_release: they map directly to the read, llseek and release members of the file operations table.
  • seq_escape: prints a string to the seq_file, with the (byte-wide) characters that need escaping printed in octal.
  • seq_putc, seq_puts, seq_printf: the counterparts of putc, puts and printf in C.
  • seq_path: used to output a path name.
  • single_open, single_release: open and release a file that contains a single record.
  • seq_open_private, __seq_open_private, seq_release_private: like seq_open, but a small block of file-private data is allocated when the seq_file is opened.
In 2.6.23, three helper functions were added for doubly linked circular lists:
  • seq_list_start: returns a particular entry of the list.
  • seq_list_start_head: used when a table header needs to be printed.
  • seq_list_next: returns the next entry of the list.
Their usage can be seen in the example above, so I will not repeat it here.

Kernel implementation of the seq_file interface

The implementation of the seq_file interface is fairly simple and easy to follow. For completeness, let me walk through the core function seq_read:
  1. On the first read of a sequence file the kernel data buffer pointer (buf) is NULL, so a whole page of memory is allocated as the buffer.
  2. If the kernel buffer still holds data, that data is used first to fill the user buffer.
  3. If the user buffer still has room, start and show are called in turn to fill the kernel buffer. If the current kernel buffer cannot hold even one record, its size is doubled and the attempt is repeated until at least one record fits. Then next and show are called repeatedly, and once the kernel buffer has no room left for another whole record, stop is called to finish.
As you can see, the seq_file interface works in units of whole records, but that is all it guarantees: it cannot guarantee the overall consistency of the exported data structure. Keep this in mind, or it will eventually cause bugs.

Because the seq_file buffer expands automatically, the interface is also convenient for exporting data structures larger than one page, and this is exactly where the single_open/single_release functions come into play.
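
For instance, a single-record file needs nothing beyond a show function. The sketch below is hypothetical (the whole_* names are invented here), but it shows the typical single_open wiring; the record printed by show may be far larger than one page, and seq_read keeps doubling the buffer and retrying until it fits.

#include <linux/proc_fs.h>
#include <linux/seq_file.h>

/* One show call prints the whole (possibly multi-page) report. */
static int whole_show(struct seq_file *m, void *unused)
{
        int i;

        for (i = 0; i < 10000; i++)
                seq_printf(m, "line %d\n", i);
        return 0;
}

static int whole_open(struct inode *inode, struct file *file)
{
        return single_open(file, whole_show, NULL);
}

static struct file_operations whole_fops = {
        .open = whole_open,
        .read = seq_read,
        .llseek = seq_lseek,
        .release = single_release
};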

Exporting common kernel data structures

Exporting the usual record-oriented data structures through the seq_file interface is normally very easy. Below I outline how the data structures commonly used in the kernel can be exported with it.

Arrays

An array can be accessed randomly through its index, so all we have to do is treat the pos argument of start and next as that index and return the address of the corresponding element.
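
A hedged sketch of this idea, assuming a made-up fixed-size array samples protected by a mutex sample_lock:

#include <linux/kernel.h>
#include <linux/mutex.h>
#include <linux/seq_file.h>

static int samples[16];
static DEFINE_MUTEX(sample_lock);

static void *arr_seq_start(struct seq_file *m, loff_t *pos)
{
        mutex_lock(&sample_lock);
        return *pos < ARRAY_SIZE(samples) ? &samples[*pos] : NULL;
}

static void *arr_seq_next(struct seq_file *m, void *p, loff_t *pos)
{
        ++*pos;
        return *pos < ARRAY_SIZE(samples) ? &samples[*pos] : NULL;
}

static void arr_seq_stop(struct seq_file *m, void *p)
{
        mutex_unlock(&sample_lock);
}

static int arr_seq_show(struct seq_file *m, void *p)
{
        seq_printf(m, "%d\n", *(int *)p);        /* p points into samples[] */
        return 0;
}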

Doubly linked circular lists

  • start: skip the first pos elements and return the one you land on.
  • next: increment pos and return the next element.

Hash tables

A hash table can be exported in two ways:
  1. Treat each hash bucket as one item; the hash table then degenerates into an array.
  2. Treat each hash node as one item; the hash table then degenerates into a list chained together bucket by bucket.
With the first method, because the number of nodes per bucket is unbounded and the seq_file buffer must hold a whole record (here an entire bucket), more than one page of contiguous memory may be required, which can be unrealistic on a system that has been running for a long time and whose memory is full of external fragmentation. The second method is therefore usually the more practical one; a minimal sketch of it follows.
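
The sketch below illustrates the iterator needed by the second method; the table my_hash, its size and the mutex are all assumptions made up for the illustration, and the generic helper macro later in this article (_DEFINE_HLIST_SEQ_FOPS) implements the same idea.

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/seq_file.h>
#include <linux/slab.h>

#define MY_HASH_SIZE 64
static struct hlist_head my_hash[MY_HASH_SIZE];
static DEFINE_MUTEX(my_hash_lock);

/* The iterator remembers which bucket and which node the walk stopped at. */
struct my_hash_iter {
        unsigned int bucket;
        struct hlist_node *node;
};

static void *hash_seq_start(struct seq_file *m, loff_t *pos)
{
        struct my_hash_iter *iter;
        loff_t off = *pos;

        mutex_lock(&my_hash_lock);
        iter = kmalloc(sizeof(*iter), GFP_KERNEL);
        if (iter == NULL)
                return NULL;
        for (iter->bucket = 0; iter->bucket < MY_HASH_SIZE; iter->bucket++)
                hlist_for_each(iter->node, &my_hash[iter->bucket])
                        if (off-- == 0)
                                return iter;
        kfree(iter);
        return NULL;
}

static void *hash_seq_next(struct seq_file *m, void *p, loff_t *pos)
{
        struct my_hash_iter *iter = p;

        ++*pos;
        if (iter->node->next != NULL) {          /* more nodes in this bucket */
                iter->node = iter->node->next;
                return iter;
        }
        for (iter->bucket++; iter->bucket < MY_HASH_SIZE; iter->bucket++) {
                if (!hlist_empty(&my_hash[iter->bucket])) {
                        iter->node = my_hash[iter->bucket].first;
                        return iter;
                }
        }
        kfree(iter);                             /* exhausted the whole table */
        return NULL;
}

static void hash_seq_stop(struct seq_file *m, void *p)
{
        kfree(p);                                /* kfree(NULL) is a no-op */
        mutex_unlock(&my_hash_lock);
}

The corresponding show would take iter->node and use hlist_entry() to reach the containing structure.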

Red-black trees

Because the Linux kernel provides a way to walk a red-black tree much as if it were a list (rb_first/rb_next), exporting one is similar to exporting a doubly linked list.
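
A short hedged sketch, assuming a made-up tree my_tree protected by my_tree_lock:

#include <linux/mutex.h>
#include <linux/rbtree.h>
#include <linux/seq_file.h>

static struct rb_root my_tree = RB_ROOT;
static DEFINE_MUTEX(my_tree_lock);

static void *rb_seq_start(struct seq_file *m, loff_t *pos)
{
        struct rb_node *node;
        loff_t off = *pos;

        mutex_lock(&my_tree_lock);
        /* Skip the first *pos nodes, exactly as for a list. */
        for (node = rb_first(&my_tree); node != NULL && off > 0; off--)
                node = rb_next(node);
        return node;
}

static void *rb_seq_next(struct seq_file *m, void *p, loff_t *pos)
{
        ++*pos;
        return rb_next(p);
}

static void rb_seq_stop(struct seq_file *m, void *p)
{
        mutex_unlock(&my_tree_lock);
}

The corresponding show would use rb_entry() on the rb_node it receives.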

Helper macros for the seq_file interface

In practice, programming against the seq_file interface still involves a lot of repeated code. To cut down on that annoying duplication, and to reduce the bugs that copy-and-paste can introduce, I wrote the following set of macros to simplify the seq_file calls in the most common cases. The macro code is as follows:


#ifndef _SEQ_FILE_HELPER_H_
#define _SEQ_FILE_HELPER_H_

#include <linux/version.h>
#include <linux/list.h>
#include <linux/rbtree.h>
#include <linux/seq_file.h>

#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 23)
static inline struct list_head *seq_list_start(struct list_head *head,
                loff_t pos)
{
        struct list_head *lh;

        list_for_each(lh, head)
                if (pos-- == 0)
                        return lh;

        return NULL;
}

static inline struct list_head *seq_list_next(void *v, struct list_head *head,
                loff_t *ppos)
{
        struct list_head *lh;

        lh = ((struct list_head *)v)->next;
        ++*ppos;
        return lh == head ? NULL : lh;
}
#endif

/* common macro */
#define __DEFINE_FOO_SEQ_FOPS(prefix, foo, imp_show, imp_write) \
static struct seq_operations _##foo##_seq_ops = { \
        .start = _##foo##_seq_start, \
        .next = _##foo##_seq_next, \
        .stop = _##foo##_seq_stop, \
        .show = imp_show \
}; \
 \
static int _##foo##_seq_open(struct inode *inode, struct file *file) \
{ \
        return seq_open(file, &_##foo##_seq_ops); \
} \
 \
prefix struct file_operations foo##_seq_fops = { \
        .open = _##foo##_seq_open, \
        .read = seq_read, \
        .write = imp_write, \
        .llseek = seq_lseek, \
        .release = seq_release, \
        .owner = THIS_MODULE \
};

/* helper for array */
#define _DEFINE_ARRAY_SEQ_FOPS(prefix, arr, n, lock, unlock, imp_show, \
                imp_write) \
static void* _##arr##_seq_start(struct seq_file *m, loff_t *pos) \
{ \
        lock; \
        if (*pos >= n) \
                return NULL; \
        return &arr[*pos]; \
} \
 \
static void* _##arr##_seq_next(struct seq_file *m, void *p, loff_t *pos) \
{ \
        if (++*pos >= n) \
                return NULL; \
        return &arr[*pos]; \
} \
\
static void _##arr##_seq_stop(struct seq_file *m, void *p) \
{ \
        unlock; \
} \
 \
__DEFINE_FOO_SEQ_FOPS(prefix, arr, imp_show, imp_write)

#define DEFINE_ARRAY_SEQ_FOPS(arr, n, lock, unlock, imp_show, imp_write) \
        _DEFINE_ARRAY_SEQ_FOPS(, arr, n, lock, unlock, imp_show, imp_write)

#define DEFINE_STATIC_ARRAY_SEQ_FOPS(arr, n, lock, unlock, imp_show, \
                imp_write) \
        _DEFINE_ARRAY_SEQ_FOPS(static, arr, n, lock, unlock, imp_show, \
                        imp_write)

/* helper for list */
#define _DEFINE_LIST_SEQ_FOPS(prefix, list, lock, unlock, imp_show, imp_write) \
static void* _##list##_seq_start(struct seq_file *m, loff_t *pos) \
{ \
        lock; \
        return seq_list_start(&list, *pos); \
} \
 \
static void* _##list##_seq_next(struct seq_file *m, void *p, loff_t *pos) \
{ \
        return seq_list_next(p, &list, pos); \
} \
\
static void _##list##_seq_stop(struct seq_file *m, void *p) \
{ \
        unlock; \
} \
 \
__DEFINE_FOO_SEQ_FOPS(prefix, list, imp_show, imp_write)

#define DEFINE_LIST_SEQ_FOPS(list, lock, unlock, imp_show, imp_write) \
        _DEFINE_LIST_SEQ_FOPS(, list, lock, unlock, imp_show, imp_write)

#define DEFINE_STATIC_LIST_SEQ_FOPS(list, lock, unlock, imp_show, imp_write) \
        _DEFINE_LIST_SEQ_FOPS(static, list, lock, unlock, imp_show, imp_write)

/* helper for hlist */
#ifndef hlist_for_each_continue
#define hlist_for_each_continue(pos) \
        for (pos = pos->next; pos && ({ prefetch(pos->next); 1; }); \
                        pos = pos->next)
#endif

#define _DEFINE_HLIST_SEQ_FOPS(prefix, head, n, lock, unlock, imp_show, \
                imp_write) \
struct _##head##_seq_iter { \
        unsigned int idx; \
        struct hlist_node *node; \
}; \
 \
static void* _##head##_seq_start(struct seq_file *m, loff_t *pos) \
{ \
        struct hlist_node *node; \
        unsigned int idx; \
        loff_t off = *pos; \
 \
        lock; \
        for (idx = 0; idx < n; idx++) { \
                hlist_for_each(node, &head[idx]) { \
                        if (off-- == 0) { \
                                struct _##head##_seq_iter *iter; \
                                iter = kmalloc(sizeof(*iter), GFP_KERNEL); \
                                if (iter == NULL) \
                                        break; \
                                iter->idx = idx; \
                                iter->node = node; \
                                return iter; \
                        } \
                } \
        } \
 \
        return NULL; \
} \
 \
static void* _##head##_seq_next(struct seq_file *m, void *p, loff_t *pos) \
{ \
        struct _##head##_seq_iter *iter = (struct _##head##_seq_iter*)p; \
 \
        ++*pos; \
        hlist_for_each_continue(iter->node) \
                return iter; \
        for (iter->idx++; iter->idx < n; iter->idx++) { \
                hlist_for_each(iter->node, &head[iter->idx]) \
                        return iter; \
        } \
        kfree(iter); \
 \
        return NULL; \
} \
\
static void _##head##_seq_stop(struct seq_file *m, void *p) \
{ \
        kfree(p); \
        unlock; \
} \
 \
static int _##head##_seq_show(struct seq_file *m, void *p) \
{ \
        struct _##head##_seq_iter *iter = (struct _##head##_seq_iter*)p; \
 \
        return imp_show(m, iter->node); \
} \
 \
__DEFINE_FOO_SEQ_FOPS(prefix, head, _##head##_seq_show, imp_write)

#define DEFINE_HLIST_SEQ_FOPS(head, n, lock, unlock, imp_show, imp_write) \
        _DEFINE_HLIST_SEQ_FOPS(, head, n, lock, unlock, imp_show, imp_write)

#define DEFINE_STATIC_HLIST_SEQ_FOPS(head, n, lock, unlock, imp_show, \
                imp_write) \
        _DEFINE_HLIST_SEQ_FOPS(static, head, n, lock, unlock, imp_show, \
                        imp_write)

/* helper for rbtree */
#define _DEFINE_RBTREE_SEQ_FOPS(prefix, tree, lock, unlock, imp_show, \
                imp_write) \
static void* _##tree##_seq_start(struct seq_file *m, loff_t *pos) \
{ \
        struct rb_node *node; \
        loff_t off = *pos; \
 \
        lock; \
        node = rb_first(&tree); \
        while (off-- > 0 && node != NULL) \
                node = rb_next(node); \
 \
        return node; \
} \
 \
static void* _##tree##_seq_next(struct seq_file *m, void *p, loff_t *pos) \
{ \
        ++*pos; \
        return rb_next((struct rb_node*)p); \
} \
\
static void _##tree##_seq_stop(struct seq_file *m, void *p) \
{ \
        unlock; \
} \
 \
__DEFINE_FOO_SEQ_FOPS(prefix, tree, imp_show, imp_write)

#define DEFINE_RBTREE_SEQ_FOPS(tree, lock, unlock, imp_show, imp_write) \
        _DEFINE_RBTREE_SEQ_FOPS(, tree, lock, unlock, imp_show, imp_write)

#define DEFINE_STATIC_RBTREE_SEQ_FOPS(tree, lock, unlock, imp_show, \
                imp_write) \
        _DEFINE_RBTREE_SEQ_FOPS(static, tree, lock, unlock, imp_show, \
                        imp_write)

#endif /* _SEQ_FILE_HELPER_H_ */


Using them is quite simple. Here is the doubly linked list export code from above, rewritten with these macros:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/proc_fs.h>

#include "seq_file_helper.h"

static struct mutex lock;
static struct list_head head;
struct my_data {
        struct list_head list;
        int value;
};

static void add_one(void)
{
        struct my_data *data;

        mutex_lock(&lock);
        data = kmalloc(sizeof(*data), GFP_KERNEL);
        if (data != NULL)
                list_add(&data->list, &head);
        mutex_unlock(&lock);
}

static ssize_t seq_write(struct file *file, const char __user * buffer,
                       size_t count, loff_t *ppos)
{
        add_one();

        return count;
}

static int seq_show(struct seq_file *m, void *p)
{
        struct my_data *data = list_entry(p, struct my_data, list);
        seq_printf(m, "value: %d\n", data->value);

        return 0;
}

DEFINE_STATIC_LIST_SEQ_FOPS(head, mutex_lock(&lock), mutex_unlock(&lock),
                seq_show, seq_write);

static void clean_all(struct list_head *head)
{
        struct my_data *data;

        while (!list_empty(head)) {
                data = list_entry(head->next, struct my_data, list);
                list_del(&data->list);
                kfree(data);
        }
}

static int __init init(void)
{
        struct proc_dir_entry *entry;

        mutex_init(&lock);
        INIT_LIST_HEAD(&head);

        add_one();
        add_one();
        add_one();

        entry = create_proc_entry("my_data", S_IWUSR | S_IRUGO, NULL);
        if (entry == NULL) {
                clean_all(&head);
                return -ENOMEM;
        }
        entry->proc_fops = &head_seq_fops;

        return 0;
}

static void __exit fini(void)
{
        remove_proc_entry("my_data", NULL);
        clean_all(&head);
}

module_init(init);
module_exit(fini);


Quite a bit simpler, isn't it? Example code for the other data structures can be found in the attachment at the end of the article.
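
Since the attachment is not reproduced here, a hypothetical sketch for an array (reusing the invented names samples, sample_lock and samples_show from the earlier sketch) would take little more than a show function and one macro invocation; the macro generates samples_seq_fops, which is hooked into the proc entry just like head_seq_fops above:

#include <linux/kernel.h>
#include <linux/mutex.h>
#include <linux/seq_file.h>

#include "seq_file_helper.h"

static int samples[16];
static DEFINE_MUTEX(sample_lock);

static int samples_show(struct seq_file *m, void *p)
{
        seq_printf(m, "%d\n", *(int *)p);
        return 0;
}

DEFINE_STATIC_ARRAY_SEQ_FOPS(samples, ARRAY_SIZE(samples),
                mutex_lock(&sample_lock), mutex_unlock(&sample_lock),
                samples_show, NULL);

/* ... entry->proc_fops = &samples_seq_fops; */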

References:
  1. http://www.ibm.com/developerworks/cn/linux/l-kerns-usrs2/#N100DA
  2. The Linux kernel source code

Source: http://www.4ucode.com/Study/Topic/1554981

