Ceph source code analysis: the onode LRU

The onode is BlueStore's per-object metadata. Since BlueStore writes directly to a raw block device, it relies on onodes to manage objects. This article looks at how the onode cache works.

BlueStore ships two cache implementations, LRUCache and TwoQCache, but in both of them the onode metadata is cached with a plain LRU algorithm.
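
The onode LRU itself is a boost::intrusive::list of Onode: each Onode embeds its own list hook, which is why the code below can call onode_lru.iterator_to(*o) in O(1) and why linking or unlinking never allocates. Here is a minimal, self-contained sketch of that pattern; FakeOnode and the object names are illustrative, not BlueStore's actual types:

#include <boost/intrusive/list.hpp>
#include <iostream>
#include <string>

namespace bi = boost::intrusive;

// Illustrative stand-in for an onode: the object embeds the list hook,
// so linking and unlinking never allocate.
struct FakeOnode : public bi::list_base_hook<> {
  std::string oid;
  explicit FakeOnode(std::string o) : oid(std::move(o)) {}
};

using OnodeLRU = bi::list<FakeOnode>;

int main() {
  FakeOnode a("obj-a"), b("obj-b"), c("obj-c");
  OnodeLRU lru;
  lru.push_front(a);            // like _add_onode with level > 0
  lru.push_front(b);
  lru.push_front(c);            // list is now: c b a

  // Like _touch_onode: move an already-linked element back to the front.
  auto p = lru.iterator_to(a);  // O(1): 'a' carries its own list node
  lru.erase(p);
  lru.push_front(a);            // list is now: a c b

  for (auto& o : lru)
    std::cout << o.oid << " ";
  std::cout << std::endl;       // prints: a c b
  lru.clear();                  // unlink before the stack objects are destroyed
}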

1. How does an element already on the LRU move to the front when it is accessed?

This starts with get_onode, which in turn calls

BlueStore::OnodeRef BlueStore::OnodeSpace::lookup(const ghobject_t& oid)

First, get_onode itself:

BlueStore::OnodeRef BlueStore::Collection::get_onode(
  const ghobject_t& oid,
  bool create)
{
  assert(create ? lock.is_wlocked() : lock.is_locked());

  spg_t pgid;
  if (cid.is_pg(&pgid)) {
    if (!oid.match(cnode.bits, pgid.ps())) {
      lderr(store->cct) << __func__ << " oid " << oid << " not part of "
            << pgid << " bits " << cnode.bits << dendl;
      ceph_abort();
    }
  }

  OnodeRef o = onode_map.lookup(oid);
  if (o)
    return o;

  mempool::bluestore_cache_other::string key;
  get_object_key(store->cct, oid, &key);

  ldout(store->cct, 20) << __func__ << " oid " << oid << " key "
            << pretty_binary_string(key) << dendl;

  bufferlist v;
  int r = store->db->get(PREFIX_OBJ, key.c_str(), key.size(), &v);
  ldout(store->cct, 20) << " r " << r << " v.len " << v.length() << dendl;
  Onode *on;
  if (v.length() == 0) {
    assert(r == -ENOENT);
    if (!store->cct->_conf->bluestore_debug_misc &&
    !create)
      return OnodeRef();

    // new object, new onode
    on = new Onode(this, oid, key);
  } else {
    // loaded
    assert(r >= 0);
    on = new Onode(this, oid, key);
    on->exists = true;
    bufferptr::iterator p = v.front().begin_deep();
    on->onode.decode(p);
    for (auto& i : on->onode.attrs) {
      i.second.reassign_to_mempool(mempool::mempool_bluestore_cache_other);
    }

    // initialize extent_map
    on->extent_map.decode_spanning_blobs(p);
    if (on->onode.extent_map_shards.empty()) {
      denc(on->extent_map.inline_bl, p);
      on->extent_map.decode_some(on->extent_map.inline_bl);
      on->extent_map.inline_bl.reassign_to_mempool(
    mempool::mempool_bluestore_cache_other);
    } else {
      on->extent_map.init_shards(false, false);
    }
  }
  o.reset(on);
  return onode_map.add(oid, o);
}

Now look at lookup: when the cache is hit, it calls the cache's
void BlueStore::TwoQCache::_touch_onode(OnodeRef& o)

BlueStore::OnodeRef BlueStore::OnodeSpace::lookup(const ghobject_t& oid)
{
  ldout(cache->cct, 30) << __func__ << dendl;
  OnodeRef o;
  bool hit = false;

  {
    std::lock_guard<std::recursive_mutex> l(cache->lock);
    ceph::unordered_map<ghobject_t,OnodeRef>::iterator p = onode_map.find(oid);
    if (p == onode_map.end()) {
      ldout(cache->cct, 30) << __func__ << " " << oid << " miss" << dendl;
    } else {
      ldout(cache->cct, 30) << __func__ << " " << oid << " hit " << p->second
                << dendl;
      cache->_touch_onode(p->second);
      hit = true;
      o = p->second;
    }
  }

  if (hit) {
    cache->logger->inc(l_bluestore_onode_hits);
  } else {
    cache->logger->inc(l_bluestore_onode_misses);
  }
  return o;
}

So in void BlueStore::TwoQCache::_touch_onode(OnodeRef& o), the onode that was just hit is removed from the list and pushed back onto the front:

void BlueStore::TwoQCache::_touch_onode(OnodeRef& o)
{
  auto p = onode_lru.iterator_to(*o);
  onode_lru.erase(p);
  onode_lru.push_front(*o);
}

2. How are elements added to the LRU?

Back to get_onode, which takes two parameters. When the onode is not found in onode_map, the function falls back to the KV store. If the key does not exist there either and create is false, it simply returns an empty OnodeRef. Otherwise it constructs a new Onode object (decoding the stored state if the key was found) and calls
BlueStore::OnodeRef BlueStore::OnodeSpace::add(const ghobject_t& oid, OnodeRef o)
which inserts the onode into onode_map and then calls the cache method
_add_onode(OnodeRef& o, int level) to put it onto the LRU:

BlueStore::OnodeRef BlueStore::OnodeSpace::add(const ghobject_t& oid, OnodeRef o)
{
  std::lock_guard<std::recursive_mutex> l(cache->lock);
  auto p = onode_map.find(oid);
  if (p != onode_map.end()) {
    ldout(cache->cct, 30) << __func__ << " " << oid << " " << o
              << " raced, returning existing " << p->second
              << dendl;
    return p->second;
  }
  ldout(cache->cct, 30) << __func__ << " " << oid << " " << o << dendl;
  onode_map[oid] = o;
  cache->_add_onode(o, 1);
  return o;
}

The onode is added to the LRU; every call site in the source passes level = 1, so new onodes always go to the front of the list:

void _add_onode(OnodeRef& o, int level) override {
  if (level > 0)
    onode_lru.push_front(*o);
  else
    onode_lru.push_back(*o);
}

3. When is the LRU trimmed?

Look at void *BlueStore::MempoolThread::entry(): trimming is kicked off periodically, at an interval of bluestore_cache_trim_interval (0.2 s by default). The code also shows that an OSD does not keep just one onode LRU but one per cache shard; by default an HDD OSD has 5 shards and an SSD OSD has 8.

void *BlueStore::MempoolThread::entry()
{
  Mutex::Locker l(lock);
  while (!stop) {
    uint64_t meta_bytes =
      mempool::bluestore_cache_other::allocated_bytes() +
      mempool::bluestore_cache_onode::allocated_bytes();
    uint64_t onode_num =
      mempool::bluestore_cache_onode::allocated_items();

    if (onode_num < 2) {
      onode_num = 2;
    }

    float bytes_per_onode = (float)meta_bytes / (float)onode_num;
    size_t num_shards = store->cache_shards.size();
    float target_ratio = store->cache_meta_ratio + store->cache_data_ratio;
    // A little sloppy but should be close enough
    uint64_t shard_target = target_ratio * (store->cache_size / num_shards);

    for (auto i : store->cache_shards) {
      i->trim(shard_target,
          store->cache_meta_ratio,
          store->cache_data_ratio,
          bytes_per_onode);
    }

    store->_update_cache_logger();

    utime_t wait;
    wait += store->cct->_conf->bluestore_cache_trim_interval;
    cond.WaitInterval(lock, wait);
  }
  stop = false;
  return NULL;
}

Looking at the trim function, the cache's _trim is only called when current > target_bytes:

void BlueStore::Cache::trim(
  uint64_t target_bytes,
  float target_meta_ratio,
  float target_data_ratio,
  float bytes_per_onode)
{
  std::lock_guard<std::recursive_mutex> l(lock);
  uint64_t current_meta = _get_num_onodes() * bytes_per_onode;
  uint64_t current_buffer = _get_buffer_bytes();
  uint64_t current = current_meta + current_buffer;

  uint64_t target_meta = target_bytes * target_meta_ratio;
  uint64_t target_buffer = target_bytes * target_data_ratio;

  // correct for overflow or float imprecision
  target_meta = min(target_bytes, target_meta);
  target_buffer = min(target_bytes - target_meta, target_buffer);

  if (current <= target_bytes) {
    dout(10) << __func__
         << " shard target " << pretty_si_t(target_bytes)
         << " meta/data ratios " << target_meta_ratio
         << " + " << target_data_ratio << " ("
         << pretty_si_t(target_meta) << " + "
         << pretty_si_t(target_buffer) << "), "
         << " current " << pretty_si_t(current) << " ("
         << pretty_si_t(current_meta) << " + "
         << pretty_si_t(current_buffer) << ")"
         << dendl;
    return;
  }

  uint64_t need_to_free = current - target_bytes;
  uint64_t free_buffer = 0;
  uint64_t free_meta = 0;
  if (current_buffer > target_buffer) {
    free_buffer = current_buffer - target_buffer;
    if (free_buffer > need_to_free) {
      free_buffer = need_to_free;
    }
  }
  free_meta = need_to_free - free_buffer;

  // start bounds at what we have now
  uint64_t max_buffer = current_buffer - free_buffer;
  uint64_t max_meta = current_meta - free_meta;
  uint64_t max_onodes = max_meta / bytes_per_onode;

  dout(10) << __func__
       << " shard target " << pretty_si_t(target_bytes)
       << " ratio " << target_meta_ratio << " ("
       << pretty_si_t(target_meta) << " + "
       << pretty_si_t(target_buffer) << "), "
       << " current " << pretty_si_t(current) << " ("
       << pretty_si_t(current_meta) << " + "
       << pretty_si_t(current_buffer) << "),"
       << " need_to_free " << pretty_si_t(need_to_free) << " ("
       << pretty_si_t(free_meta) << " + "
       << pretty_si_t(free_buffer) << ")"
       << " -> max " << max_onodes << " onodes + "
       << max_buffer << " buffer"
       << dendl;
  _trim(max_onodes, max_buffer);
}
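
To make the arithmetic concrete, here is a standalone sketch that re-runs the per-shard target computed in entry() and the budget split done in trim() with made-up inputs. The cache size, ratios and bytes_per_onode below are purely illustrative, not Ceph defaults; in the real code they come from the bluestore_cache_* options and the mempool accounting.

#include <algorithm>
#include <cstdint>
#include <cstdio>

// Re-runs the budget math from MempoolThread::entry() and Cache::trim()
// with illustrative inputs; only the formulas mirror the source above.
int main() {
  // entry(): per-shard target
  uint64_t cache_size   = 1ull << 30;     // 1 GiB total cache (illustrative)
  size_t   num_shards   = 5;              // e.g. an HDD OSD
  float    meta_ratio   = 0.4f;           // illustrative, not the Ceph default
  float    data_ratio   = 0.5f;
  uint64_t shard_target = (meta_ratio + data_ratio) * (cache_size / num_shards);
  // ~193 MB (~184 MiB) per shard with these numbers

  // trim(): split the shard target into meta and buffer budgets
  float    bytes_per_onode = 4096.0f;         // meta_bytes / onode_num, measured
  uint64_t current_meta    = 60000ull * 4096; // ~234 MiB of onode metadata
  uint64_t current_buffer  = 50ull << 20;     // 50 MiB of cached data
  uint64_t current         = current_meta + current_buffer;

  uint64_t target_meta   = shard_target * meta_ratio;
  uint64_t target_buffer = shard_target * data_ratio;
  target_meta   = std::min(shard_target, target_meta);
  target_buffer = std::min(shard_target - target_meta, target_buffer);

  if (current <= shard_target) return 0;      // under budget: nothing to trim

  uint64_t need_to_free = current - shard_target;
  uint64_t free_buffer  = 0;
  if (current_buffer > target_buffer)         // false here: buffers are under budget
    free_buffer = std::min(current_buffer - target_buffer, need_to_free);
  uint64_t free_meta  = need_to_free - free_buffer;
  uint64_t max_onodes = (current_meta - free_meta) / (uint64_t)bytes_per_onode;

  std::printf("shard_target=%llu need_to_free=%llu max_onodes=%llu\n",
              (unsigned long long)shard_target,
              (unsigned long long)need_to_free,
              (unsigned long long)max_onodes);
  return 0;
}

With these numbers every byte that needs to be freed comes out of the metadata budget, so the shard would shrink from 60000 cached onodes to roughly 34000, while the 50 MiB of buffers stay untouched because they are already under their budget.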

The first half of _trim deals with the data (buffer) cache; that part is not our focus and is elided here:

void BlueStore::TwoQCache::_trim(uint64_t onode_max, uint64_t buffer_max)
{
  dout(20) << __func__ << " onodes " << onode_lru.size() << " / " << onode_max
       << " buffers " << buffer_bytes << " / " << buffer_max
       << dendl;

  _audit("trim start");

  ···

  // onodes
  // trim only when the LRU holds more onodes than onode_max
  int num = onode_lru.size() - onode_max;
  if (num <= 0)
    return; // don't even try
  // trim from the tail, since the least recently used onodes sit at the back
  auto p = onode_lru.end();
  assert(p != onode_lru.begin());
  --p;
  int skipped = 0;
  int max_skipped = g_conf->bluestore_cache_trim_max_skip_pinned;
  while (num > 0) {
    Onode *o = &*p;
    dout(20) << __func__ << " considering " << o << dendl;
    int refs = o->nref.load();
    // check whether anyone else is still using it
    if (refs > 1) {
      dout(20) << __func__ << "  " << o->oid << " has " << refs
           << " refs; skipping" << dendl;
      // stop trimming once the maximum number of skipped (pinned) onodes is reached.
      // I think the point is to finish the trim quickly and avoid impacting the main
      // IO path, because both the IO path and this trim take the same lock,
      // std::lock_guard<std::recursive_mutex> l(lock), when touching the LRU and related data.
      if (++skipped >= max_skipped) {
        dout(20) << __func__ << " maximum skip pinned reached; stopping with "
                 << num << " left to trim" << dendl;
        break;
      }

      if (p == onode_lru.begin()) {
        break;
      } else {
        p--;
        num--;
        continue;
      }
    }
    dout(30) << __func__ << " " << o->oid << " num=" << num <<" lru size="<<onode_lru.size()<< dendl;
    // unlink it from the LRU
    if (p != onode_lru.begin()) {
      onode_lru.erase(p--);
    } else {
      onode_lru.erase(p);
      assert(num == 1);
    }
    o->get();  // paranoia
    // remove it from onode_map
    o->c->onode_map.remove(o->oid);
    o->put();
    --num;
  }
}
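
Finally, to see the eviction loop in isolation, here is a compact sketch of the tail-first trim with the pinned-skip limit, using the same intrusive-list pattern as the earlier sketch. FakeOnode, its nref field and the numbers are made up; note that, as in the source above, skipping a pinned onode still counts against num.

#include <boost/intrusive/list.hpp>
#include <iostream>
#include <string>

namespace bi = boost::intrusive;

struct FakeOnode : public bi::list_base_hook<> {
  std::string oid;
  int nref;                        // >1 means someone else still holds a reference
  FakeOnode(std::string o, int r) : oid(std::move(o)), nref(r) {}
};
using OnodeLRU = bi::list<FakeOnode>;

// Evict from the tail until at most onode_max elements remain, giving up
// after skipping max_skipped pinned entries -- the same shape as _trim above.
void trim(OnodeLRU& lru, size_t onode_max, int max_skipped) {
  int num = (int)lru.size() - (int)onode_max;
  if (num <= 0)
    return;
  auto p = lru.end();
  --p;                             // start at the least recently used element
  int skipped = 0;
  while (num > 0) {
    FakeOnode* o = &*p;
    if (o->nref > 1) {             // pinned: skip, but only a bounded number of times
      if (++skipped >= max_skipped)
        break;
      if (p == lru.begin())
        break;
      --p;
      --num;
      continue;
    }
    bool at_head = (p == lru.begin());
    if (!at_head)
      lru.erase(p--);              // unlink and step toward the head
    else
      lru.erase(p);
    std::cout << "evicted " << o->oid << "\n";
    --num;
    if (at_head)
      break;
  }
}

int main() {
  FakeOnode a("a", 1), b("b", 2), c("c", 1), d("d", 1);
  OnodeLRU lru;
  lru.push_front(a);
  lru.push_front(b);
  lru.push_front(c);
  lru.push_front(d);               // most recently used first: d c b a
  trim(lru, 1, 4);                 // evicts a, skips pinned b (still counted), evicts c
  for (auto& o : lru)
    std::cout << o.oid << " ";     // prints: d b
  std::cout << std::endl;
  lru.clear();                     // unlink before the stack objects are destroyed
}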