RAID5 Source Code in the Linux Kernel Explained: Managing stripe_head


小表弟皮卡丘


Article link: https://blog.youkuaiyun.com/chenyouxu/article/details/47040281


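The previous articles introduced the overall architecture of the system and the stripe_head structure, the basic processing unit of RAID5 in the kernel, so we now have a general picture of the RAID5 module in the Linux kernel. Today we look at how RAID5 manages stripe_head structures (below I sometimes say "stripe", which simply means a stripe_head). Without further ado, let's go.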

Managing stripe_head

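We already know that RAID5 in the kernel handles requests with the stripe_head structure as its basic unit. The in-memory layout of a stripe_head can be depicted as in the following figure: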

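The figure represents a RAID5 array in 3+1 mode. Apart from some metadata, the rest of the structure consists of the buffers for the dev members; the structure definitions can be found in raid5.h. Note the sector field in the metadata: it is the offset of this stripe_head within RAID5, and it is also what aligns the 4 device buffers that follow. In other words, the buffer of each of the 4 devices sits at offset sector on its own device, and that offset is fixed.
Earlier posts covered the state flags of stripe_head, the meaning of each state, and the RAID5 global configuration structure r5conf; see my post "Linux内核中RAID5的基本架构与数据结构解析" (the basic architecture and data structures of RAID5 in the Linux kernel). Next, let's look at how the kernel manages stripe_head structures.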

In one sentence: a stripe_head is placed on a different list depending on its state. The lists are handle_list, hold_list and delayed_list in r5conf, plus inactive_list and temp_inactive_list. Their meanings are:

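handle_list: stripe_heads that need to be handled
hold_list: stripe_heads whose preread state is ready
delayed_list: stripe_heads whose handling must be delayed because a precondition is missing
inactive_list: inactive stripe_heads
temp_inactive_list: a staging cache for inactive_list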

Next, we follow a stripe_head from birth to death and see, step by step, how the kernel manages it.

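When a request arrives, the RAID5 entry point is make_request(). Since this post is about stripe_head management, the other parts are only sketched here; their details will be covered in a later post. make_request() contains the following fragment: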

/*make_request()*/
    for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
        int previous;
        int seq;

        do_prepare = false;
    retry:
        seq = read_seqcount_begin(&conf->gen_lock);
        previous = 0;
        if (do_prepare)
            prepare_to_wait(&conf->wait_for_overlap, &w,
                TASK_UNINTERRUPTIBLE);
        if (unlikely(conf->reshape_progress != MaxSector)) {
            /* spinlock is needed as reshape_progress may be
             * 64bit on a 32bit platform, and so it might be
             * possible to see a half-updated value
             * Of course reshape_progress could change after
             * the lock is dropped, so once we get a reference
             * to the stripe that we think it is, we will have
             * to check again.
             */
            spin_lock_irq(&conf->device_lock);
            if (mddev->reshape_backwards
                ? logical_sector < conf->reshape_progress
                : logical_sector >= conf->reshape_progress) {
                previous = 1;
            } else {
                if (mddev->reshape_backwards
                    ? logical_sector < conf->reshape_safe
                    : logical_sector >= conf->reshape_safe) {
                    spin_unlock_irq(&conf->device_lock);
                    schedule();
                    do_prepare = true;
                    goto retry;
                }
            }
            spin_unlock_irq(&conf->device_lock);
        }

        new_sector = raid5_compute_sector(conf, logical_sector,
                          previous,
                          &dd_idx, NULL); /* compute the on-disk offset */
        pr_debug("raid456: make_request, sector %llu logical %llu\n",
            (unsigned long long)new_sector,
            (unsigned long long)logical_sector);

        sh = get_active_stripe(conf, new_sector, previous,
                       (bi->bi_rw&RWA_MASK), 0); /* get the stripe */

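Notice the for loop at the top of the fragment. It slices the bio request into units that one stripe_head can handle. By default that unit is one page, because the unit a stripe_head handles on each disk is one page, 4KB by default. Ignore the middle part for now. raid5_compute_sector() computes the on-disk offset of this page-sized slice of the bio; how it does so is not needed here and will be explained separately. The result is new_sector, the offset on the current disk. As noted earlier, all buffers of a stripe_head share the same on-disk offset, so this slice of the bio, together with the data at offset new_sector on the other disks, makes up the dev buffers of one stripe_head.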
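The slicing performed by that loop can be sketched in user space. This is only an illustration under stated assumptions: 512-byte sectors and 4KB pages (so a stripe unit is 8 sectors), and a start sector that is rounded down to a stripe-unit boundary before the loop begins. The rounding step is not part of the excerpt above, and all names prefixed with sketch_ are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages and 512-byte sectors, so one stripe unit is 8 sectors. */
#define SKETCH_STRIPE_SECTORS 8u

/*
 * Record the starting sector of every stripe-sized slice touched by the
 * request [start, start + len). Returns the number of slices; at most
 * `cap` entries are written to `out`.
 */
static size_t sketch_slice_request(uint64_t start, uint64_t len,
                                   uint64_t *out, size_t cap)
{
    /* Assumed step: round the start down to a stripe-unit boundary. */
    uint64_t logical = start & ~(uint64_t)(SKETCH_STRIPE_SECTORS - 1);
    uint64_t last = start + len;    /* first sector past the request */
    size_t n = 0;

    /* Same shape as the for loop in make_request(). */
    for (; logical < last; logical += SKETCH_STRIPE_SECTORS) {
        if (n < cap)
            out[n] = logical;
        n++;
    }
    return n;
}
```

With these assumptions, a request that starts at sector 3 and is 16 sectors long touches three stripe units, starting at sectors 0, 8 and 16, so it would lead to three calls to get_active_stripe().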

Obtaining a stripe_head

Next, the stripe_head is obtained from the new_sector offset. That is the job of get_active_stripe(); stepping into it:

static struct stripe_head *
get_active_stripe(struct r5conf *conf, sector_t sector,
          int previous, int noblock, int noquiesce)
{
    struct stripe_head *sh;
    int hash = stripe_hash_locks_hash(sector);

    pr_debug("get_stripe, sector %llu\n", (unsigned long long)sector);

    spin_lock_irq(conf->hash_locks + hash);

    do {
        wait_event_lock_irq(conf->wait_for_stripe,
                    conf->quiesce == 0 || noquiesce,
                    *(conf->hash_locks + hash));
        sh = __find_stripe(conf, sector, conf->generation - previous);
        if (!sh) {
            if (!conf->inactive_blocked)
                sh = get_free_stripe(conf, hash);
            if (noblock && sh == NULL)
                break;
            if (!sh) {
                conf->inactive_blocked = 1;
                wait_event_lock_irq(
                    conf->wait_for_stripe,
                    !list_empty(conf->inactive_list + hash) &&
                    (atomic_read(&conf->active_stripes)
                     < (conf->max_nr_stripes * 3 / 4)
                     || !conf->inactive_blocked),
                    *(conf->hash_locks + hash));
                conf->inactive_blocked = 0;
            } else {
                init_stripe(sh, sector, previous);
                atomic_inc(&sh->count);
            }
        } else if (!atomic_inc_not_zero(&sh->count)) {
            spin_lock(&conf->device_lock);
            if (!atomic_read(&sh->count)) {
                if (!test_bit(STRIPE_HANDLE, &sh->state))
                    atomic_inc(&conf->active_stripes);
                BUG_ON(list_empty(&sh->lru) &&
                       !test_bit(STRIPE_EXPANDING, &sh->state));
                list_del_init(&sh->lru);
                if (sh->group) {
                    sh->group->stripes_cnt--;
                    sh->group = NULL;
                }
            }
            atomic_inc(&sh->count);
            spin_unlock(&conf->device_lock);
        }
    } while (sh == NULL);

    spin_unlock_irq(conf->hash_locks + hash);
    return sh;
}

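As mentioned before, RAID5 has 256 stripe_heads by default. They are told apart by their sector field, so stripe_heads are reused cyclically. To manage them efficiently, RAID5 uses hash chains and LRU lists: a hash value is computed from each sector (the new_sector value computed above), and the stripe_head is then looked up in the hash chain or the LRU list.
Looking at the code above: it first computes the hash for sector and then enters a big do-while loop that ends only when sh != NULL. The one exception is the noblock case, which breaks out and returns NULL when no free stripe is available. Otherwise the function insists on getting a stripe_head and will not leave without one. Let's see how it gets it:
1. __find_stripe() first searches the hash chain for the stripe_head. Stepping into __find_stripe():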

static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
                     short generation)
{
    struct stripe_head *sh;

    pr_debug("__find_stripe, sector %llu\n", (unsigned long long)sector);
    hlist_for_each_entry(sh, stripe_hash(conf, sector), hash)
        if (sh->sector == sector && sh->generation == generation)
            return sh;
    pr_debug("__stripe %llu not in cache\n", (unsigned long long)sector);
    return NULL;
}

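This is clearly a walk over the hash chain that returns as soon as it finds an entry with the same sector (and the same generation). A hit means a stripe_head at offset sector is already in use, so there is no need to take an empty stripe_head; this is the uniqueness of sector.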
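The lookup can be mimicked with a small chained hash table. The sketch below is a user-space illustration, not the kernel's code: the bucket count, the shift of 3 (8 sectors per stripe unit) and every sketch_ name are assumptions chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_HASH 16u       /* hypothetical bucket count (power of two) */
#define SKETCH_STRIPE_SHIFT 3u   /* assumption: 8 sectors per stripe unit */

struct sketch_stripe {
    uint64_t sector;
    short generation;
    struct sketch_stripe *next;  /* singly linked hash chain */
};

struct sketch_cache {
    struct sketch_stripe *bucket[SKETCH_NR_HASH];
};

/* Map a sector to a bucket index. */
static unsigned sketch_hash(uint64_t sector)
{
    return (unsigned)((sector >> SKETCH_STRIPE_SHIFT) & (SKETCH_NR_HASH - 1));
}

/* Add a stripe at the front of its bucket's chain. */
static void sketch_insert(struct sketch_cache *c, struct sketch_stripe *sh)
{
    unsigned h = sketch_hash(sh->sector);

    sh->next = c->bucket[h];
    c->bucket[h] = sh;
}

/* Walk one chain; both sector and generation must match, as in __find_stripe(). */
static struct sketch_stripe *sketch_find(struct sketch_cache *c,
                                         uint64_t sector, short generation)
{
    struct sketch_stripe *sh;

    for (sh = c->bucket[sketch_hash(sector)]; sh; sh = sh->next)
        if (sh->sector == sector && sh->generation == generation)
            return sh;
    return NULL;
}
```

Two stripes whose sectors collide in the same bucket are still distinguished by the sector comparison inside the loop, and a stripe with the right sector but a different generation is treated as a miss.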
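2. Back in get_active_stripe(): if __find_stripe() fails, no stripe_head with offset sector is currently in use, and get_free_stripe() is called to obtain an empty one. The first time through there are no active stripe_heads at all, so get_free_stripe() is certainly entered. Stepping into get_free_stripe():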

/* find an idle stripe, make sure it is unhashed, and return it. */
static struct stripe_head *get_free_stripe(struct r5conf *conf, int hash)
{
    struct stripe_head *sh = NULL;
    struct list_head *first;

    if (list_empty(conf->inactive_list + hash))
        goto out;
    first = (conf->inactive_list + hash)->next; /* first entry on this inactive list */
    sh = list_entry(first, struct stripe_head, lru); /* recover the stripe_head from its lru link */
    list_del_init(first); /* unlink it from the inactive list */
    remove_hash(sh); /* remove sh from the hash chain */
    atomic_inc(&conf->active_stripes); /* one more active stripe_head */
    BUG_ON(hash != sh->hash_lock_index);
    if (list_empty(conf->inactive_list + hash)) /* this list is now empty: bump empty_inactive_list_nr */
        atomic_inc(&conf->empty_inactive_list_nr);
out:
    return sh;
}

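The function starts with an if check that uses the hash value to pick the corresponding inactive_list. Because __find_stripe() returned NULL before we got here, a usable stripe_head should normally be on inactive_list. If the if condition holds, that is, this inactive_list is empty, the code jumps to out: and returns NULL; otherwise it takes the first stripe_head and removes it from both the LRU list and the hash chain.
3. With a stripe_head in hand we return to get_active_stripe(). The structure is still blank and must be initialized, so init_stripe() is called to fill in its metadata. Stepping into init_stripe():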

static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
{
    struct r5conf *conf = sh->raid_conf;
    int i, seq;

    BUG_ON(atomic_read(&sh->count) != 0);
    BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
    BUG_ON(stripe_operations_active(sh));

    pr_debug("init_stripe called, stripe %llu\n",
        (unsigned long long)sector);
retry:
    seq = read_seqcount_begin(&conf->gen_lock);
    sh->generation = conf->generation - previous;
    sh->disks = previous ? conf->previous_raid_disks : conf->raid_disks; /* number of devices (disks) in the stripe */
    sh->sector = sector; /* offset of the stripe */
    stripe_set_idx(sector, conf, previous, sh); /* set the parity disk index and related fields */
    sh->state = 0; /* no state flags set */

    for (i = sh->disks; i--; ) { /* visit every device buffer */
        struct r5dev *dev = &sh->dev[i];

        /* a blank stripe must have empty request lists and must not be locked */
        if (dev->toread || dev->read || dev->towrite || dev->written ||
            test_bit(R5_LOCKED, &dev->flags)) {
            printk(KERN_ERR "sector=%llx i=%d %p %p %p %p %d\n",
                   (unsigned long long)sh->sector, i, dev->toread,
                   dev->read, dev->towrite, dev->written,
                   test_bit(R5_LOCKED, &dev->flags));
            WARN_ON(1);
        }
        dev->flags = 0;
        /* set the offset for each buffer; the buffers of one stripe share the same
         * offset on their disks but differ in the RAID5 address space */
        raid5_build_block(sh, i, previous);
    }
    if (read_seqcount_retry(&conf->gen_lock, seq))
        goto retry;
    insert_hash(conf, sh); /* put the stripe back on the hash chain */
    sh->cpu = smp_processor_id();
}

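The comments explain each step: the function sets up the stripe's metadata.
4. Back in get_active_stripe(), a little more bookkeeping is done (the reference count sh->count is incremented) and the function is finished. At this point the stripe_head has been obtained and its metadata has been set.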
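Steps 1 to 4 amount to a "look up, else recycle the oldest idle entry" cache. The following toy model tries to capture that lifecycle in user space. It is a sketch under heavy simplification: no locking, no waiting, a linear scan in place of the hash chain, a pool of two entries, and a single int standing in for the cached buffers. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_POOL 2   /* tiny pool so that recycling is easy to observe */

struct sketch_sh {
    uint64_t sector;
    int hashed;               /* findable by sector */
    int count;                /* reference count; 0 means inactive */
    int data;                 /* stands in for cached buffer contents */
    struct sketch_sh *next;   /* link on the inactive list */
};

struct sketch_pool {
    struct sketch_sh sh[SKETCH_POOL];
    struct sketch_sh *head, *tail;   /* inactive list, oldest first */
};

static void push_inactive(struct sketch_pool *p, struct sketch_sh *sh)
{
    sh->next = NULL;
    if (p->tail)
        p->tail->next = sh;
    else
        p->head = sh;
    p->tail = sh;
}

static void remove_inactive(struct sketch_pool *p, struct sketch_sh *sh)
{
    struct sketch_sh **pp = &p->head, *prev = NULL;

    while (*pp && *pp != sh) {
        prev = *pp;
        pp = &(*pp)->next;
    }
    if (!*pp)
        return;               /* not on the list */
    *pp = sh->next;
    if (p->tail == sh)
        p->tail = prev;
    sh->next = NULL;
}

static void sketch_pool_init(struct sketch_pool *p)
{
    int i;

    p->head = p->tail = NULL;
    for (i = 0; i < SKETCH_POOL; i++) {
        p->sh[i] = (struct sketch_sh){0};
        push_inactive(p, &p->sh[i]);
    }
}

static struct sketch_sh *sketch_get_active(struct sketch_pool *p, uint64_t sector)
{
    struct sketch_sh *sh;
    int i;

    for (i = 0; i < SKETCH_POOL; i++) {   /* step 1: lookup by sector */
        sh = &p->sh[i];
        if (sh->hashed && sh->sector == sector) {
            if (sh->count == 0)
                remove_inactive(p, sh);   /* idle but still cached: reactivate */
            sh->count++;
            return sh;
        }
    }
    sh = p->head;                         /* step 2: take the oldest idle entry */
    if (!sh)
        return NULL;                      /* nothing free, like the noblock case */
    remove_inactive(p, sh);
    sh->sector = sector;                  /* step 3: re-initialise for the new sector */
    sh->hashed = 1;
    sh->data = -1;                        /* old contents no longer valid */
    sh->count = 1;                        /* step 4: caller holds one reference */
    return sh;
}

static void sketch_release(struct sketch_pool *p, struct sketch_sh *sh)
{
    if (--sh->count == 0)
        push_inactive(p, sh);             /* stays findable, data kept */
}
```

In this model a released entry remains findable by its sector until the pool runs dry and it is recycled for another sector, which is why a second request for the same sector can reuse the buffered data.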

State transitions of stripe_head

Once a stripe_head (call it sh) has been obtained, it is manipulated mainly by setting and clearing bits in sh->state with set_bit() and clear_bit(). After get_active_stripe() returns, make_request() continues with the following region:

if (test_bit(STRIPE_EXPANDING, &sh->state) ||
    !add_stripe_bio(sh, bi, dd_idx, rw)) { /* add_stripe_bio: attach the bio to this stripe_head */
    /* Stripe is busy expanding or
     * add failed due to overlap.  Flush everything
     * and wait a while
     */
    md_wakeup_thread(mddev->thread);
    release_stripe(sh);
    schedule();
    do_prepare = true;
    goto retry;
}
set_bit(STRIPE_HANDLE, &sh->state); /* mark the stripe as needing handling */
clear_bit(STRIPE_DELAYED, &sh->state); /* clear the delayed flag */
if ((bi->bi_rw & REQ_SYNC) &&
    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
    atomic_inc(&conf->preread_active_stripes);
release_stripe_plug(mddev, sh); /* queue the stripe on the appropriate list for handling */
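The `!test_and_set_bit(...)` idiom in the fragment above makes sure the preread_active_stripes counter is incremented only by whoever flips the bit from 0 to 1. A user-space sketch with C11 atomics shows the same pattern; the bit numbers and the sketch_ helpers are stand-ins chosen for this example, not the kernel's definitions.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers, for illustration only. */
enum { SKETCH_HANDLE, SKETCH_DELAYED, SKETCH_PREREAD_ACTIVE };

static void sketch_set_bit(int nr, atomic_ulong *state)
{
    atomic_fetch_or(state, 1UL << nr);
}

static void sketch_clear_bit(int nr, atomic_ulong *state)
{
    atomic_fetch_and(state, ~(1UL << nr));
}

static bool sketch_test_bit(int nr, atomic_ulong *state)
{
    return (atomic_load(state) >> nr) & 1UL;
}

/* Set the bit and report whether it was already set beforehand. */
static bool sketch_test_and_set_bit(int nr, atomic_ulong *state)
{
    unsigned long old = atomic_fetch_or(state, 1UL << nr);

    return (old >> nr) & 1UL;
}

/* Mirrors the tail of the make_request() fragment for one stripe. */
static void sketch_mark_for_handling(atomic_ulong *state, bool sync,
                                     atomic_int *preread_active)
{
    sketch_set_bit(SKETCH_HANDLE, state);
    sketch_clear_bit(SKETCH_DELAYED, state);
    /* Count the stripe only on the 0 -> 1 transition of the bit. */
    if (sync && !sketch_test_and_set_bit(SKETCH_PREREAD_ACTIVE, state))
        atomic_fetch_add(preread_active, 1);
}
```

Calling sketch_mark_for_handling() twice for the same stripe with sync set leaves the counter at 1, because the second call finds the bit already set.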

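Since we are studying stripe_head management, all we need to know about add_stripe_bio() is that it attaches the bio to the stripe. The place where the stripe_head is really dispatched is release_stripe_plug(). That concludes the handling of stripe_head in make_request(); we now move the battlefield to release_stripe_plug(). Stepping in: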

static void release_stripe_plug(struct mddev *mddev,
                struct stripe_head *sh)
{
    struct blk_plug_cb *blk_cb = blk_check_plugged(
        raid5_unplug, mddev,
        sizeof(struct raid5_plug_cb));
    struct raid5_plug_cb *cb;

    if (!blk_cb) {
        release_stripe(sh);
        return;
    }

    cb = container_of(blk_cb, struct raid5_plug_cb, cb);

    if (cb->list.next == NULL) {
        int i;
        INIT_LIST_HEAD(&cb->list);
        for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
            INIT_LIST_HEAD(cb->temp_inactive_list + i);
    }

    if (!test_and_set_bit(STRIPE_ON_UNPLUG_LIST, &sh->state))
        list_add_tail(&sh->lru, &cb->list); /* queue on the plug callback list through the lru link */
    else
        release_stripe(sh);
}

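A close look at the code shows that the exits lead to release_stripe() (stripes queued on the plug list are released later, through the raid5_unplug callback registered above). release_stripe() is the real interface for dispatching a stripe_head. Stepping into release_stripe():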

static void release_stripe(struct stripe_head *sh)
{
    struct r5conf *conf = sh->raid_conf;
    unsigned long flags;
    struct list_head list;
    int hash;
    bool wakeup;

    /* Avoid release_list until the last reference.
     */
    if (atomic_add_unless(&sh->count, -1, 1))
        return;

    if (unlikely(!conf->mddev->thread) ||
        test_and_set_bit(STRIPE_ON_RELEASE_LIST, &sh->state))
        goto slow_path;
    wakeup = llist_add(&sh->release_list, &conf->released_stripes);
    if (wakeup)
        md_wakeup_thread(conf->mddev->thread);
    return;
slow_path:
    local_irq_save(flags);
    /* we are ok here if STRIPE_ON_RELEASE_LIST is set or not */
    if (atomic_dec_and_lock(&sh->count, &conf->device_lock)) {
        INIT_LIST_HEAD(&list);
        hash = sh->hash_lock_index;
        do_release_stripe(conf, sh, &list); /* put sh on a list chosen by its state */
        spin_unlock(&conf->device_lock);
        release_inactive_stripe_list(conf, &list, hash); /* move the collected stripes to the inactive list */
    }
    local_irq_restore(flags);
}
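The early return at the top of release_stripe() relies on atomic_add_unless(&sh->count, -1, 1): the reference count is decremented unless it is exactly 1, so only the holder of the last reference goes on to the release paths. A minimal user-space stand-in built on C11 atomics (the sketch_ names are hypothetical) behaves as follows:

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * User-space stand-in for atomic_add_unless(v, a, u):
 * add `a` to *v unless *v == u; return true if the add happened.
 */
static bool sketch_add_unless(atomic_int *v, int a, int u)
{
    int cur = atomic_load(v);

    while (cur != u) {
        /* On failure, cur is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(v, &cur, cur + a))
            return true;
    }
    return false;
}

/* True when a non-final reference was dropped (the fast "return" case). */
static bool sketch_put_unless_last(atomic_int *count)
{
    return sketch_add_unless(count, -1, 1);
}
```

Starting from a count of 3, the first two calls drop a reference and return true; the third call sees a count of 1, leaves it untouched and returns false, which corresponds to falling through to the release-list or slow path in release_stripe().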

Focus on what matters: the key here is do_release_stripe(), whose purpose the comment already states. Stepping in:

static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
                  struct list_head *temp_inactive_list)
{
    BUG_ON(!list_empty(&sh->lru));
    BUG_ON(atomic_read(&conf->active_stripes)==0);
    if (test_bit(STRIPE_HANDLE, &sh->state)) { /* sh needs handling */
        if (test_bit(STRIPE_DELAYED, &sh->state) &&
            !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
            list_add_tail(&sh->lru, &conf->delayed_list); /* delayed: goes on delayed_list */
        else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
               sh->bm_seq - conf->seq_write > 0)
            list_add_tail(&sh->lru, &conf->bitmap_list);
        else {
            clear_bit(STRIPE_DELAYED, &sh->state); /* clear the delayed state */
            clear_bit(STRIPE_BIT_DELAY, &sh->state); /* clear the waiting-for-bitmap state */
            if (conf->worker_cnt_per_group == 0) {
                list_add_tail(&sh->lru, &conf->handle_list); /* goes on handle_list */
            } else {
                raid5_wakeup_stripe_thread(sh);
                return;
            }
        }
        md_wakeup_thread(conf->mddev->thread);
    } else { /* no handling needed: recycle */
        BUG_ON(stripe_operations_active(sh));
        if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
            if (atomic_dec_return(&conf->preread_active_stripes)
                < IO_THRESHOLD)
                md_wakeup_thread(conf->mddev->thread); /* wake up the daemon thread */
        atomic_dec(&conf->active_stripes); /* one fewer active stripe */
        if (!test_bit(STRIPE_EXPANDING, &sh->state))
            list_add_tail(&sh->lru, temp_inactive_list); /* goes on the temporary inactive list */
    }
}

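The main comments are in place: the stripe is added to a different list depending on its state. do_release_stripe() also performs the recycling of a stripe_head. So where is the entry point for recycling?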
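The decision tree in do_release_stripe() can be condensed into a pure function that maps a set of state flags to a destination. The sketch below is a simplified model for reading along, not kernel code: the two boolean fields that replace the `sh->bm_seq - conf->seq_write > 0` and `conf->worker_cnt_per_group` tests are assumptions made to keep it self-contained, and all names are hypothetical.

```c
#include <stdbool.h>

enum sketch_list {
    SKETCH_DELAYED_LIST,
    SKETCH_BITMAP_LIST,
    SKETCH_HANDLE_LIST,
    SKETCH_WORKER_GROUP,    /* handed to a worker group instead of handle_list */
    SKETCH_INACTIVE_LIST,
    SKETCH_NO_LIST          /* expanding stripe: not queued anywhere */
};

struct sketch_state {
    bool handle, delayed, preread_active, bit_delay, expanding;
    bool bitmap_batch_pending;  /* stands in for sh->bm_seq - conf->seq_write > 0 */
    bool has_worker_groups;     /* stands in for conf->worker_cnt_per_group != 0 */
};

/* Same order of tests as do_release_stripe(). */
static enum sketch_list sketch_pick_list(const struct sketch_state *s)
{
    if (s->handle) {
        if (s->delayed && !s->preread_active)
            return SKETCH_DELAYED_LIST;
        if (s->bit_delay && s->bitmap_batch_pending)
            return SKETCH_BITMAP_LIST;
        return s->has_worker_groups ? SKETCH_WORKER_GROUP
                                    : SKETCH_HANDLE_LIST;
    }
    /* Not marked for handling: recycle unless the stripe is expanding. */
    return s->expanding ? SKETCH_NO_LIST : SKETCH_INACTIVE_LIST;
}
```

Note how a stripe that is both delayed and preread-active skips delayed_list: the first test fails, and the stripe falls through to the handle path.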

Recycling stripe_head

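After a stripe_head has been handled it must be recycled, because the whole RAID5 has only 256 stripe_heads by default. Where is the entry point for recycling? Since do_release_stripe() is where recycling happens, a global search for its callers turns up __release_stripe(); that is the entry point. The recycling steps are described in the comments above; they take only a few lines and are simple. Note that recycling does not clear the dev buffers in the stripe_head: as long as the buffered data is still there, a later read request does not need to read it from disk again, which improves performance.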

Summary

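To understand stripe_head management, all you need to know is that the state field of a stripe_head determines which list it sits on and what is done with it. A diagram may make this easier to grasp.
[Figure: stripe_head state and list transitions]
The attentive reader will notice that hold_list was never discussed. It is simple: think of it as an intermediary between delayed_list and handle_list. When a stripe_head on delayed_list gets the conditions it was waiting for, it is first moved to hold_list; later, when stripes are picked for handling and handle_list does not have enough, they are replenished from hold_list. I hope this helps you understand how RAID5 in the kernel manages stripe_head.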
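The role of hold_list as a middle stage can be sketched with three counters. This is a deliberately crude model of the description above: stripes are promoted from the delayed queue to the hold queue in one step, and the picker falls back to the hold queue only when the handle queue is empty. The kernel applies further conditions when choosing a stripe that this sketch ignores, and all names are hypothetical.

```c
#include <stddef.h>

struct sketch_queues {
    size_t delayed;   /* stripes waiting for a missing precondition */
    size_t hold;      /* stripes promoted from delayed */
    size_t handle;    /* stripes ready to be handled */
};

enum sketch_src { SKETCH_NONE, SKETCH_FROM_HANDLE, SKETCH_FROM_HOLD };

/* Promote everything on the delayed queue to the hold queue. */
static void sketch_activate_delayed(struct sketch_queues *q)
{
    q->hold += q->delayed;
    q->delayed = 0;
}

/* Take one stripe: handle queue first, hold queue only as a fallback. */
static enum sketch_src sketch_pick(struct sketch_queues *q)
{
    if (q->handle) {
        q->handle--;
        return SKETCH_FROM_HANDLE;
    }
    if (q->hold) {
        q->hold--;
        return SKETCH_FROM_HOLD;
    }
    return SKETCH_NONE;
}
```

Stripes on the delayed queue are never picked directly in this model; they become eligible only after sketch_activate_delayed() has moved them to the hold queue.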