JVM G1 Source Code Analysis: Refine Threads

学海_无涯_苦作舟

Last modified 2023-09-23 20:30:48

Category: JVM. Tags: jvm, java, programming languages

First published 2023-09-22 20:13:18

Article link: https://blog.youkuaiyun.com/qq_16500963/article/details/133184450

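The Refine threads are a concurrent thread pool newly introduced by G1. By default the pool holds G1ConcRefinementThreads+1 threads, and it has two main jobs:

Sampling the young-generation regions and, subject to the pause-time goal, updating the number of young heap regions (YHR). One thread usually handles this.
Managing the RSet, which is the primary job of the Refine threads. RSet updates are not performed synchronously: G1 first puts every reference change into a queue called the dirty card queue (DCQ), and threads then consume that queue to complete the updates. Normally G1ConcRefinementThreads threads do this work, but besides the Refine threads, GC threads and mutators may also update the RSet. DCQs are managed through the Dirty Card Queue Set (DCQS). To allow concurrent processing, each Refine thread is responsible for only some of the DCQs in the DCQS.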

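For the Refine threads that process dirty cards there are two points of interest: how a mutator puts references into the DCQS for the Refine threads to process, and how mutators help out when the Refine threads are too busy. We first cover the relatively independent sampling thread and then the ordinary Refine threads.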

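The sampling thread

The last thread in the Refine thread pool is the sampling thread. Its main job is to set the number of young-generation regions so that G1 meets the predicted garbage collection pause time. The sampling thread's code is in run_young_rs_sampling, shown below: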

void ConcurrentG1RefineThread::run_young_rs_sampling() {
  DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();
  _vtime_start = os::elapsedVTime();
  while(!_should_terminate) {
    sample_young_list_rs_lengths();

    if (os::supports_vtime()) {
      _vtime_accum = (os::elapsedVTime() - _vtime_start);
    } else {
      _vtime_accum = 0.0;
    }

    MutexLockerEx x(_monitor, Mutex::_no_safepoint_check_flag);
    if (_should_terminate) {
      break;
    }
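    /* The G1ConcRefinementServiceIntervalMillis parameter controls how often the
       sampling thread runs. In production, decrease it if sampling is too sparse;
       if the system is stable and meets its pause-time prediction, increase it
       to sample less often. */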
    _monitor->wait(Mutex::_no_safepoint_check_flag, G1ConcRefinementServiceIntervalMillis);
  }
}

hotspot/src/share/vm/gc_implementation/g1/concurrentG1RefineThread.cpp
void ConcurrentG1RefineThread::sample_young_list_rs_lengths() {
  SuspendibleThreadSetJoiner sts;
  G1CollectedHeap* g1h = G1CollectedHeap::heap();
  G1CollectorPolicy* g1p = g1h->g1_policy();
  if (g1p->adaptive_young_list_length()) {
    int regions_visited = 0;
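    g1h->young_list()->rs_length_sampling_init();
    // young_list is a linked list of all young-generation regions
    while (g1h->young_list()->rs_length_sampling_more()) {
      /* The key call is rs_length_sampling_next. The idea is simple: take the
         RSet size of the current region (its sparse, fine-grained and
         coarse-grained entries) and add it to the running total that a young
         collection would have to process. This also shows that the pause time
         means the time spent collecting the young regions, which includes the
         handling of references between regions. */
      g1h->young_list()->rs_length_sampling_next();
      ++regions_visited;
      // After every 10 regions, offer to give up the CPU so that the VMThread
      // can reach a safepoint promptly when a GC occurs; see Chapter 10 for
      // details on entering a safepoint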
      if (regions_visited == 10) {
        if (sts.should_yield()) {
          sts.yield();
          break;
        }
        regions_visited = 0;
      }
    }
    // Use the sampled data above to update the number of young-generation regions
    g1p->revise_young_list_target_length_if_necessary();
  }
}
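The shape of the sampling loop (accumulate per-region RSet lengths, poll for a yield request once every 10 regions, abandon the pass if a yield is requested) can be sketched outside HotSpot. This is a minimal sketch, not the HotSpot implementation: `SampleResult`, `sample_rs_lengths` and the `should_yield` callback are hypothetical names, and a plain vector stands in for the young list.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Result of one sampling pass over the young regions.
struct SampleResult {
  std::size_t sampled_rs_lengths;  // sum of the RSet lengths visited so far
  bool completed;                  // false if the pass was abandoned by a yield
};

// Hypothetical stand-in for the sampling loop: sums per-region RSet lengths
// and polls should_yield once every `batch` regions.
SampleResult sample_rs_lengths(const std::vector<std::size_t>& region_rs_lengths,
                               const std::function<bool()>& should_yield,
                               int batch = 10) {
  SampleResult r{0, true};
  int visited = 0;
  for (std::size_t len : region_rs_lengths) {
    r.sampled_rs_lengths += len;
    if (++visited == batch) {
      if (should_yield()) {
        r.completed = false;  // abandon the iteration, keeping the partial sum
        break;
      }
      visited = 0;
    }
  }
  return r;
}
```

With 25 regions of length 4 and a callback that always requests a yield, the pass stops after the first 10 regions with a partial sum of 40. As in the listing above, the partial sum is still usable by the caller.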

The code that revises the number of young-generation regions is shown below:

src\share\vm\gc_implementation\g1\g1CollectorPolicy.cpp

void G1CollectorPolicy::revise_young_list_target_length_if_necessary() {
  guarantee( adaptive_young_list_length(), "should not call this otherwise" );

  size_t rs_lengths = _g1->young_list()->sampled_rs_lengths();
  if (rs_lengths > _rs_lengths_prediction) {
    // add 10% to avoid having to recalculate often
    size_t rs_lengths_prediction = rs_lengths * 1100 / 1000;
    update_young_list_target_length(rs_lengths_prediction);
  }
}
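The effect of the 10% margin is easiest to see with numbers. The sketch below reproduces only the integer arithmetic and the comparison from the function above; `revise_prediction` is a hypothetical helper, and storing the padded value directly into `prediction` is a simplification made for this example.

```cpp
#include <cassert>
#include <cstddef>

// Returns true and stores a padded prediction when the sampled RSet length
// exceeds the current prediction; otherwise leaves the prediction untouched.
bool revise_prediction(std::size_t sampled, std::size_t& prediction) {
  if (sampled <= prediction) return false;
  prediction = sampled * 1100 / 1000;  // integer arithmetic: +10%, rounded down
  return true;
}
```

Starting from a prediction of 900, a sample of 1000 raises the prediction to 1100. A later sample of 1100 then triggers no recalculation, which is the point of the margin.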

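The actual calculation is in update_young_list_target_length, and the argument passed in is the RSet length obtained from sampling. The prediction also has to respect the lower and upper bounds on the number of young regions. The logic is not complicated and is easy to follow once you understand the idea behind the pause prediction model. The source code is as follows: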

src\share\vm\gc_implementation\g1\g1CollectorPolicy.cpp

void G1CollectorPolicy::update_young_list_target_length(size_t rs_lengths) {
  if (rs_lengths == (size_t) -1) {
    // if it's set to the default value (-1), we should predict it;
    // otherwise, use the given value.
    rs_lengths = (size_t) get_new_prediction(_rs_lengths_seq);
  }

  // Calculate the absolute and desired min bounds.

  // This is how many young regions we already have (currently: the survivors).
  uint base_min_length = recorded_survivor_regions();
  // This is the absolute minimum young length, which ensures that we
  // can allocate one eden region in the worst-case.
  uint absolute_min_length = base_min_length + 1;
  uint desired_min_length =
                     calculate_young_list_desired_min_length(base_min_length);
  if (desired_min_length < absolute_min_length) {
    desired_min_length = absolute_min_length;
  }

  // Calculate the absolute and desired max bounds.

  // We will try our best not to "eat" into the reserve.
  uint absolute_max_length = 0;
  if (_free_regions_at_end_of_collection > _reserve_regions) {
    absolute_max_length = _free_regions_at_end_of_collection - _reserve_regions;
  }
  uint desired_max_length = calculate_young_list_desired_max_length();
  if (desired_max_length > absolute_max_length) {
    desired_max_length = absolute_max_length;
  }

  uint young_list_target_length = 0;
  if (adaptive_young_list_length()) {
    if (gcs_are_young()) {
      young_list_target_length =
                        calculate_young_list_target_length(rs_lengths,
                                                           base_min_length,
                                                           desired_min_length,
                                                           desired_max_length);
      _rs_lengths_prediction = rs_lengths;
    } else {
      // Don't calculate anything and let the code below bound it to
      // the desired_min_length, i.e., do the next GC as soon as
      // possible to maximize how many old regions we can add to it.
    }
  } else {
    // The user asked for a fixed young gen so we'll fix the young gen
    // whether the next GC is young or mixed.
    young_list_target_length = _young_list_fixed_length;
  }

  // Make sure we don't go over the desired max length, nor under the
  // desired min length. In case they clash, desired_min_length wins
  // which is why that test is second.
  if (young_list_target_length > desired_max_length) {
    young_list_target_length = desired_max_length;
  }
  if (young_list_target_length < desired_min_length) {
    young_list_target_length = desired_min_length;
  }

  assert(young_list_target_length > recorded_survivor_regions(),
         "we should be able to allocate at least one eden region");
  assert(young_list_target_length >= absolute_min_length, "post-condition");
  _young_list_target_length = young_list_target_length;

  update_max_gc_locker_expansion();
}
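The bounding logic at the end of the function can be isolated into two small steps: derive the effective min and max, then clamp the target, with the min winning when the two clash. This is a sketch under stated assumptions: `YoungBounds`, `young_bounds` and `clamp_young_target` are hypothetical names, and the desired min/max values that HotSpot computes with its calculate_young_list_desired_* helpers are passed in here as plain numbers.

```cpp
#include <cassert>

struct YoungBounds {
  unsigned desired_min;
  unsigned desired_max;
};

// Tightens the desired bounds with the absolute ones: at least one eden
// region beyond the survivors, and no eating into the reserve.
YoungBounds young_bounds(unsigned survivors, unsigned desired_min,
                         unsigned desired_max, unsigned free_regions,
                         unsigned reserve_regions) {
  unsigned absolute_min = survivors + 1;
  if (desired_min < absolute_min) desired_min = absolute_min;
  unsigned absolute_max =
      free_regions > reserve_regions ? free_regions - reserve_regions : 0;
  if (desired_max > absolute_max) desired_max = absolute_max;
  return {desired_min, desired_max};
}

// Applies the max test first and the min test second, so the min wins
// whenever the two bounds contradict each other.
unsigned clamp_young_target(unsigned target, const YoungBounds& b) {
  if (target > b.desired_max) target = b.desired_max;
  if (target < b.desired_min) target = b.desired_min;
  return target;
}
```

For example, with 4 survivor regions, 12 free regions and a reserve of 10, the max bound is 2 but the min bound is 5, so any target ends up at 5.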

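Managing the RSet

Earlier we said that the RSet records reference relationships between objects, but not how those relationships are maintained. G1 uses the Refine threads to maintain them asynchronously. Because the work is asynchronous, a data structure is needed to hold the references that still have to be processed. The JVM declares a global static DirtyCardQueueSet (DCQS) that holds DCQs. For performance, all threads that process reference relationships share a single DCQS, and every mutator thread is associated with it when the thread is initialized.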

src\share\vm\gc_implementation\g1\dirtyCardQueue.hpp

// A ptrQueue whose elements are "oops", pointers to object heads.
class DirtyCardQueue: public PtrQueue {
public:
  DirtyCardQueue(PtrQueueSet* qset_, bool perm = false) :
    // Dirty card queues are always active, so we create them with their
    // active field set to true.
    PtrQueue(qset_, perm, true /* active */) { }

  // Flush before destroying; queue may be used to capture pending work while
  // doing something else, with auto-flush on completion.
  ~DirtyCardQueue() { if (!is_permanent()) flush(); }

  // Process queue entries and release resources.
  void flush() { flush_impl(); }

  // Apply the closure to all elements, and reset the index to make the
  // buffer empty.  If a closure application returns "false", return
  // "false" immediately, halting the iteration.  If "consume" is true,
  // deletes processed entries from logs.
  bool apply_closure(CardTableEntryClosure* cl,
                     bool consume = true,
                     uint worker_i = 0);

  // Apply the closure to all elements of "buf", down to "index"
  // (inclusive.)  If returns "false", then a closure application returned
  // "false", and we return immediately.  If "consume" is true, entries are
  // set to NULL as they are processed, so they will not be processed again
  // later.
  static bool apply_closure_to_buffer(CardTableEntryClosure* cl,
                                      void** buf, size_t index, size_t sz,
                                      bool consume = true,
                                      uint worker_i = 0);
  void **get_buf() { return _buf;}
  void set_buf(void **buf) {_buf = buf;}
  size_t get_index() { return _index;}
  void reinitialize() { _buf = 0; _sz = 0; _index = 0;}
};

bool DirtyCardQueue::apply_closure(CardTableEntryClosure* cl,
                                   bool consume,
                                   uint worker_i) {
  bool res = true;
  if (_buf != NULL) {
    res = apply_closure_to_buffer(cl, _buf, _index, _sz,
                                  consume,
                                  worker_i);
    if (res && consume) _index = _sz;
  }
  return res;
}

bool DirtyCardQueue::apply_closure_to_buffer(CardTableEntryClosure* cl,
                                             void** buf,
                                             size_t index, size_t sz,
                                             bool consume,
                                             uint worker_i) {
  if (cl == NULL) return true;
  for (size_t i = index; i < sz; i += oopSize) {
    int ind = byte_index_to_index((int)i);
    jbyte* card_ptr = (jbyte*)buf[ind];
    if (card_ptr != NULL) {
      // Set the entry to null, so we don't do it again (via the test
      // above) if we reconsider this buffer.
      if (consume) buf[ind] = NULL;
      if (!cl->do_card_ptr(card_ptr, worker_i)) return false;
    }
  }
  return true;
}
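The "consume" behaviour of apply_closure_to_buffer matters when a pass is interrupted: entries are nulled before they are visited, so a second pass over the same buffer only sees what is left. The sketch below illustrates that idea with element indices instead of HotSpot's byte indices; `apply_to_buffer` and the `int*` entries are hypothetical simplifications.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Visits the live entries buf[index..size). Returns false as soon as the
// visitor asks to stop. With consume == true each entry is nulled before it
// is visited, so a later pass over the same buffer skips it.
bool apply_to_buffer(std::vector<int*>& buf, std::size_t index,
                     const std::function<bool(int*)>& visit,
                     bool consume = true) {
  for (std::size_t i = index; i < buf.size(); ++i) {
    int* entry = buf[i];
    if (entry == nullptr) continue;
    if (consume) buf[i] = nullptr;
    if (!visit(entry)) return false;
  }
  return true;
}
```

If the visitor stops after two of three entries, a second call with the same arguments visits only the third entry and returns true.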

src\share\vm\gc_implementation\g1\dirtyCardQueue.hpp

class DirtyCardQueueSet: public PtrQueueSet {
  // The closure used in mut_process_buffer().
  CardTableEntryClosure* _mut_process_closure;

  DirtyCardQueue _shared_dirty_card_queue;

  // Override.
  bool mut_process_buffer(void** buf);

  // Protected by the _cbl_mon.
  FreeIdSet* _free_ids;

  // The number of completed buffers processed by mutator and rs thread,
  // respectively.
  jint _processed_buffers_mut;
  jint _processed_buffers_rs_thread;

  // Current buffer node used for parallel iteration.
  BufferNode* volatile _cur_par_buffer_node;
public:
  DirtyCardQueueSet(bool notify_when_complete = true);

  void initialize(CardTableEntryClosure* cl, Monitor* cbl_mon, Mutex* fl_lock,
                  int process_completed_threshold,
                  int max_completed_queue,
                  Mutex* lock, PtrQueueSet* fl_owner = NULL);

  // The number of parallel ids that can be claimed to allow collector or
  // mutator threads to do card-processing work.
  static uint num_par_ids();

  static void handle_zero_index_for_thread(JavaThread* t);

  // Apply the given closure to all entries in all currently-active buffers.
  // This should only be applied at a safepoint. (Currently must not be called
  // in parallel; this should change in the future.)  If "consume" is true,
  // processed entries are discarded.
  void iterate_closure_all_threads(CardTableEntryClosure* cl,
                                   bool consume = true,
                                   uint worker_i = 0);

  // If there exists some completed buffer, pop it, then apply the
  // specified closure to all its elements, nulling out those elements
  // processed.  If all elements are processed, returns "true".  If no
  // completed buffers exist, returns false.  If a completed buffer exists,
  // but is only partially completed before a "yield" happens, the
  // partially completed buffer (with its processed elements set to NULL)
  // is returned to the completed buffer set, and this call returns false.
  bool apply_closure_to_completed_buffer(CardTableEntryClosure* cl,
                                         uint worker_i = 0,
                                         int stop_at = 0,
                                         bool during_pause = false);

  // Helper routine for the above.
  bool apply_closure_to_completed_buffer_helper(CardTableEntryClosure* cl,
                                                uint worker_i,
                                                BufferNode* nd);

  BufferNode* get_completed_buffer(int stop_at);

  // Applies the current closure to all completed buffers,
  // non-consumptively.
  void apply_closure_to_all_completed_buffers(CardTableEntryClosure* cl);

  void reset_for_par_iteration() { _cur_par_buffer_node = _completed_buffers_head; }
  // Applies the current closure to all completed buffers, non-consumptively.
  // Parallel version.
  void par_apply_closure_to_all_completed_buffers(CardTableEntryClosure* cl);

  DirtyCardQueue* shared_dirty_card_queue() {
    return &_shared_dirty_card_queue;
  }

  // Deallocate any completed log buffers
  void clear();

  // If a full collection is happening, reset partial logs, and ignore
  // completed ones: the full collection will make them all irrelevant.
  void abandon_logs();

  // If any threads have partial logs, add them to the global list of logs.
  void concatenate_logs();
  void clear_n_completed_buffers() { _n_completed_buffers = 0;}

  jint processed_buffers_mut() {
    return _processed_buffers_mut;
  }
  jint processed_buffers_rs_thread() {
    return _processed_buffers_rs_thread;
  }

};


DirtyCardQueueSet::DirtyCardQueueSet(bool notify_when_complete) :
  PtrQueueSet(notify_when_complete),
  _mut_process_closure(NULL),
  _shared_dirty_card_queue(this, true /*perm*/),
  _free_ids(NULL),
  _processed_buffers_mut(0), _processed_buffers_rs_thread(0)
{
  _all_active = true;
}

// Determines how many mutator threads can process the buffers in parallel.
uint DirtyCardQueueSet::num_par_ids() {
  return (uint)os::initial_active_processor_count();
}

void DirtyCardQueueSet::initialize(CardTableEntryClosure* cl, Monitor* cbl_mon, Mutex* fl_lock,
                                   int process_completed_threshold,
                                   int max_completed_queue,
                                   Mutex* lock, PtrQueueSet* fl_owner) {
  _mut_process_closure = cl;
  PtrQueueSet::initialize(cbl_mon, fl_lock, process_completed_threshold,
                          max_completed_queue, fl_owner);
  set_buffer_size(G1UpdateBufferSize);
  _shared_dirty_card_queue.set_lock(lock);
  _free_ids = new FreeIdSet((int) num_par_ids(), _cbl_mon);
}

void DirtyCardQueueSet::handle_zero_index_for_thread(JavaThread* t) {
  t->dirty_card_queue().handle_zero_index();
}

void DirtyCardQueueSet::iterate_closure_all_threads(CardTableEntryClosure* cl,
                                                    bool consume,
                                                    uint worker_i) {
  assert(SafepointSynchronize::is_at_safepoint(), "Must be at safepoint.");
  for(JavaThread* t = Threads::first(); t; t = t->next()) {
    bool b = t->dirty_card_queue().apply_closure(cl, consume);
    guarantee(b, "Should not be interrupted.");
  }
  bool b = shared_dirty_card_queue()->apply_closure(cl,
                                                    consume,
                                                    worker_i);
  guarantee(b, "Should not be interrupted.");
}

bool DirtyCardQueueSet::mut_process_buffer(void** buf) {

  // Used to determine if we had already claimed a par_id
  // before entering this method.
  bool already_claimed = false;

  // We grab the current JavaThread.
  JavaThread* thread = JavaThread::current();

  // We get the number of any par_id that this thread
  // might have already claimed.
  uint worker_i = thread->get_claimed_par_id();

  // If worker_i is not UINT_MAX then the thread has already claimed
  // a par_id. We make note of it using the already_claimed value
  if (worker_i != UINT_MAX) {
    already_claimed = true;
  } else {

    // Otherwise we need to claim a par id
    worker_i = _free_ids->claim_par_id();

    // And store the par_id value in the thread
    thread->set_claimed_par_id(worker_i);
  }

  bool b = false;
  if (worker_i != UINT_MAX) {
    b = DirtyCardQueue::apply_closure_to_buffer(_mut_process_closure, buf, 0,
                                                _sz, true, worker_i);
    if (b) Atomic::inc(&_processed_buffers_mut);

    // If we had not claimed an id before entering the method
    // then we must release the id.
    if (!already_claimed) {

      // we release the id
      _free_ids->release_par_id(worker_i);

      // and set the claimed_id in the thread to UINT_MAX
      thread->set_claimed_par_id(UINT_MAX);
    }
  }
  return b;
}


BufferNode*
DirtyCardQueueSet::get_completed_buffer(int stop_at) {
  BufferNode* nd = NULL;
  MutexLockerEx x(_cbl_mon, Mutex::_no_safepoint_check_flag);

  if ((int)_n_completed_buffers <= stop_at) {
    _process_completed = false;
    return NULL;
  }

  if (_completed_buffers_head != NULL) {
    nd = _completed_buffers_head;
    _completed_buffers_head = nd->next();
    if (_completed_buffers_head == NULL)
      _completed_buffers_tail = NULL;
    _n_completed_buffers--;
    assert(_n_completed_buffers >= 0, "Invariant");
  }
  debug_only(assert_completed_buffer_list_len_correct_locked());
  return nd;
}

bool DirtyCardQueueSet::
apply_closure_to_completed_buffer_helper(CardTableEntryClosure* cl,
                                         uint worker_i,
                                         BufferNode* nd) {
  if (nd != NULL) {
    void **buf = BufferNode::make_buffer_from_node(nd);
    size_t index = nd->index();
    bool b =
      DirtyCardQueue::apply_closure_to_buffer(cl, buf,
                                              index, _sz,
                                              true, worker_i);
    if (b) {
      deallocate_buffer(buf);
      return true;  // In normal case, go on to next buffer.
    } else {
      enqueue_complete_buffer(buf, index);
      return false;
    }
  } else {
    return false;
  }
}

bool DirtyCardQueueSet::apply_closure_to_completed_buffer(CardTableEntryClosure* cl,
                                                          uint worker_i,
                                                          int stop_at,
                                                          bool during_pause) {
  assert(!during_pause || stop_at == 0, "Should not leave any completed buffers during a pause");
  BufferNode* nd = get_completed_buffer(stop_at);
  bool res = apply_closure_to_completed_buffer_helper(cl, worker_i, nd);
  if (res) Atomic::inc(&_processed_buffers_rs_thread);
  return res;
}

void DirtyCardQueueSet::apply_closure_to_all_completed_buffers(CardTableEntryClosure* cl) {
  BufferNode* nd = _completed_buffers_head;
  while (nd != NULL) {
    bool b =
      DirtyCardQueue::apply_closure_to_buffer(cl,
                                              BufferNode::make_buffer_from_node(nd),
                                              0, _sz, false);
    guarantee(b, "Should not stop early.");
    nd = nd->next();
  }
}

void DirtyCardQueueSet::par_apply_closure_to_all_completed_buffers(CardTableEntryClosure* cl) {
  BufferNode* nd = _cur_par_buffer_node;
  while (nd != NULL) {
    BufferNode* next = (BufferNode*)nd->next();
    BufferNode* actual = (BufferNode*)Atomic::cmpxchg_ptr((void*)next, (volatile void*)&_cur_par_buffer_node, (void*)nd);
    if (actual == nd) {
      bool b =
        DirtyCardQueue::apply_closure_to_buffer(cl,
                                                BufferNode::make_buffer_from_node(actual),
                                                0, _sz, false);
      guarantee(b, "Should not stop early.");
      nd = next;
    } else {
      nd = actual;
    }
  }
}
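The parallel variant above hands out list nodes with a compare-and-swap on a shared cursor: whoever moves the cursor from a node to its successor owns that node. The same claiming pattern can be written with `std::atomic`. This is an illustrative sketch, with `Node` and `par_apply` as hypothetical names, not a copy of the HotSpot code.

```cpp
#include <atomic>
#include <cassert>

struct Node {
  int value;
  Node* next;
};

// Claims nodes one at a time by advancing the shared cursor with a CAS and
// calls visit on every node this caller won.
template <typename Visit>
void par_apply(std::atomic<Node*>& cursor, Visit visit) {
  Node* nd = cursor.load();
  while (nd != nullptr) {
    Node* next = nd->next;
    Node* expected = nd;
    if (cursor.compare_exchange_strong(expected, next)) {
      visit(*nd);     // the CAS succeeded, so this caller owns nd
      nd = next;
    } else {
      nd = expected;  // another worker moved the cursor; continue from there
    }
  }
}
```

Several workers can call `par_apply` on the same cursor and each node is visited by exactly one of them. Once the cursor reaches the end of the list, further calls visit nothing until the cursor is reset, which is the role reset_for_par_iteration plays in the class above.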

// Deallocates any completed log buffers
void DirtyCardQueueSet::clear() {
  BufferNode* buffers_to_delete = NULL;
  {
    MutexLockerEx x(_cbl_mon, Mutex::_no_safepoint_check_flag);
    while (_completed_buffers_head != NULL) {
      BufferNode* nd = _completed_buffers_head;
      _completed_buffers_head = nd->next();
      nd->set_next(buffers_to_delete);
      buffers_to_delete = nd;
    }
    _n_completed_buffers = 0;
    _completed_buffers_tail = NULL;
    debug_only(assert_completed_buffer_list_len_correct_locked());
  }
  while (buffers_to_delete != NULL) {
    BufferNode* nd = buffers_to_delete;
    buffers_to_delete = nd->next();
    deallocate_buffer(BufferNode::make_buffer_from_node(nd));
  }

}

void DirtyCardQueueSet::abandon_logs() {
  assert(SafepointSynchronize::is_at_safepoint(), "Must be at safepoint.");
  clear();
  // Since abandon is done only at safepoints, we can safely manipulate
  // these queues.
  for (JavaThread* t = Threads::first(); t; t = t->next()) {
    t->dirty_card_queue().reset();
  }
  shared_dirty_card_queue()->reset();
}


void DirtyCardQueueSet::concatenate_logs() {
  // Iterate over all the threads, if we find a partial log add it to
  // the global list of logs.  Temporarily turn off the limit on the number
  // of outstanding buffers.
  int save_max_completed_queue = _max_completed_queue;
  _max_completed_queue = max_jint;
  assert(SafepointSynchronize::is_at_safepoint(), "Must be at safepoint.");
  for (JavaThread* t = Threads::first(); t; t = t->next()) {
    DirtyCardQueue& dcq = t->dirty_card_queue();
    if (dcq.size() != 0) {
      void **buf = t->dirty_card_queue().get_buf();
      // We must NULL out the unused entries, then enqueue.
      for (size_t i = 0; i < t->dirty_card_queue().get_index(); i += oopSize) {
        buf[PtrQueue::byte_index_to_index((int)i)] = NULL;
      }
      enqueue_complete_buffer(dcq.get_buf(), dcq.get_index());
      dcq.reinitialize();
    }
  }
  if (_shared_dirty_card_queue.size() != 0) {
    enqueue_complete_buffer(_shared_dirty_card_queue.get_buf(),
                            _shared_dirty_card_queue.get_index());
    _shared_dirty_card_queue.reinitialize();
  }
  // Restore the completed buffer queue limit.
  _max_completed_queue = save_max_completed_queue;
}

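Every mutator has a private queue whose maximum length is set by G1UpdateBufferSize (default 256), so it holds at most 256 reference entries. When the thread creates a new object reference, the referrer is put into its DCQ. Once the queue holds 256 entries it is handed to the DCQS (the DCQS is shared by all threads, so this step takes a lock). A thread can also submit its queue manually before it is full, in which case it must say how many entries the queue contains. DCQs are then processed by the Refine threads. The DCQS initialization code is as follows: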

src\share\vm\gc_implementation\g1\g1CollectedHeap.cpp

  JavaThread::satb_mark_queue_set().initialize(SATB_Q_CBL_mon,
                                               SATB_Q_FL_lock,
                                               G1SATBProcessCompletedThreshold,
                                               Shared_SATB_Q_lock);

  JavaThread::dirty_card_queue_set().initialize(_refine_cte_cl,
                                                DirtyCardQ_CBL_mon,
                                                DirtyCardQ_FL_lock,
                                                concurrent_g1_refine()->yellow_zone(),
                                                concurrent_g1_refine()->red_zone(),
                                                Shared_DirtyCardQ_lock);

  dirty_card_queue_set().initialize(NULL, // Should never be called by the Java code
                                    DirtyCardQ_CBL_mon,
                                    DirtyCardQ_FL_lock,
                                    -1, // never trigger processing
                                    -1, // no limit on length
                                    Shared_DirtyCardQ_lock,
                                    &JavaThread::dirty_card_queue_set());

  // Initialize the card queue set used to hold cards containing
  // references into the collection set.
  _into_cset_dirty_card_queue_set.initialize(NULL, // Should never be called by the Java code
                                             DirtyCardQ_CBL_mon,
                                             DirtyCardQ_FL_lock,
                                             -1, // never trigger processing
                                             -1, // no limit on length
                                             Shared_DirtyCardQ_lock,
                                             &JavaThread::dirty_card_queue_set());
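The per-mutator queue behaviour described before the initialization code (a private buffer of fixed capacity, handed to a shared, lock-protected set when full or when flushed manually) can be sketched as follows. `CompletedBufferSet` and `LocalQueue` are hypothetical classes for illustration; the real DCQ fills its buffer from the end downward and may be asked to process the buffer itself, which this sketch leaves out.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

using Buffer = std::vector<const void*>;

// Shared set of completed buffers (the role the DCQS plays). It is shared
// by all threads, so every access takes the lock.
class CompletedBufferSet {
 public:
  void enqueue(Buffer buf) {
    std::lock_guard<std::mutex> guard(lock_);
    completed_.push_back(std::move(buf));
  }
  std::size_t size() {
    std::lock_guard<std::mutex> guard(lock_);
    return completed_.size();
  }
 private:
  std::mutex lock_;
  std::vector<Buffer> completed_;
};

// Thread-private queue (the role a DCQ plays). No lock is needed until the
// buffer is handed over.
class LocalQueue {
 public:
  LocalQueue(CompletedBufferSet& set, std::size_t capacity)
      : set_(set), capacity_(capacity) {}
  void enqueue(const void* card) {
    buf_.push_back(card);
    if (buf_.size() == capacity_) flush();
  }
  // Hands over the buffer even if it is only partially filled.
  void flush() {
    if (buf_.empty()) return;
    set_.enqueue(std::move(buf_));
    buf_.clear();  // put the moved-from vector back into a known empty state
  }
  std::size_t pending() const { return buf_.size(); }
 private:
  CompletedBufferSet& set_;
  std::size_t capacity_;
  Buffer buf_;
};
```

With a capacity of 256, enqueueing 600 entries produces two completed buffers and leaves 88 entries pending; a manual `flush()` hands over the partial buffer as a third one.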

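There is a global Monitor here, DirtyCardQ_CBL_mon. What is it for? Any mutator can reach the DCQS, a static member, through a static method of JavaThread, and whenever a DCQ fills up it is added to the DCQS. When a DCQ has been added and a certain condition holds (the number of DCQs in the DCQS reaches a threshold, which is related to the Green Zone discussed later), the caller uses this Monitor to send a notify that wakes Refine thread 0. Because any mutator may have to notify Refine thread 0, this Monitor is a global variable that every mutator can access.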

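The method that adds a DCQ to the DCQS is enqueue_complete_buffer. It is defined in PtrQueueSet, the parent class of DirtyCardQueueSet, and is invoked from process_or_enqueue_complete_buffer. If a mutator finds in process_or_enqueue_complete_buffer that the DCQS is already full, it stops adding to the DCQS. This means references are changing too fast and the Refine threads are overloaded, so the mutator suspends its other work and updates the RSet in place of the Refine threads. The code that adds an object to a DCQ is shown below: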

hotspot/src/share/vm/gc_implementation/g1/ptrQueue.hpp
// Put an object into the DCQ; a DCQ is really just a buffer
void PtrQueue::enqueue(void* ptr) {
  if (!_active) return;
  else enqueue_known_active(ptr);
}

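enqueue_known_active above checks whether the current DCQ still has room. If so, the entry is added directly; if not, it calls handle_zero_index, which in turn calls process_or_enqueue_complete_buffer and decides from the return value whether to allocate a new DCQ. The code is as follows: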

hotspot/src/share/vm/gc_implementation/g1/ptrQueue.cpp
void PtrQueue::enqueue_known_active(void* ptr) {
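  // An index of 0 means the DCQ is full: it must be added to the DCQS and a
  // new DCQ obtained
  while (_index == 0) {
    handle_zero_index();
  }
  // At this point a usable DCQ is always available, because a full DCQ is
  // replaced by a new one. Store the entry directly
  _index -= oopSize;
  _buf[byte_index_to_index((int)_index)] = ptr;
}
// The code below handles the case where the DCQ is full
void PtrQueue::handle_zero_index() {
  // Check again first, so that the same thread does not enter allocation
  // more than once while the DCQ is full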
  if (_buf != NULL) {
    if (!should_enqueue_buffer()) {
      return;
    }
    if (_lock) {
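      /* Reaching here means the shared (global) DCQ is in use, so multiple
         threads have to be considered. In short: hand the global DCQ to the
         DCQS, then allocate new space for it. The local variable buf exists
         to cope with races between threads. */
      void** buf = _buf;   // local pointer to completed buffer
      _buf = NULL;         // clear shared _buf field
      /* locking_enqueue_completed_buffer is almost identical to the
         enqueue_complete_buffer shown below; the only difference is the lock
         handling, because the global DCQ requires locking and unlocking. */
      locking_enqueue_completed_buffer(buf);
      // If _buf is no longer NULL, another thread has already allocated
      // space for the global DCQ, so return directly
      if (_buf != NULL) return;
    } else {
      // Ordinary DCQ handling
      if (qset()->process_or_enqueue_complete_buffer(_buf)) {
        // A true return value means the mutator paused application code and
        // processed the DCQ itself, so the DCQ can be reused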
        _sz = qset()->buffer_size();
        _index = _sz;
        return;
      }
    }
  }
  // Allocate new space for the DCQ
  _buf = qset()->allocate_buffer();
  _sz = qset()->buffer_size();
  _index = _sz;
}
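// Process the DCQ and decide whether the mutator has to step in
bool PtrQueueSet::process_or_enqueue_complete_buffer(void** buf) {
  if (Thread::current()->is_Java_thread()) {
    // If the condition holds, the mutator has to step in. No lock is taken
    // here and some racing is tolerated, because the worst outcome of a
    // stale read is that the mutator processes the buffer itself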
    if (_max_completed_queue == 0 || _max_completed_queue > 0 &&
        _n_completed_buffers >= _max_completed_queue + _completed_queue_padding) {
      bool b = mut_process_buffer(buf);
      if (b)         return true;
    }
  }
  // Add the buffer to the DCQS. After this the caller allocates a new buffer;
  // whether it does so depends on the return value: false means a new buffer is needed
  enqueue_complete_buffer(buf);
  return false;
}
// This function is simple too: the DCQs are linked into a list
void PtrQueueSet::enqueue_complete_buffer(void** buf, size_t index) {
  MutexLockerEx x(_cbl_mon, Mutex::_no_safepoint_check_flag);
  BufferNode* cbn = BufferNode::new_from_buffer(buf);
  cbn->set_index(index);
  if (_completed_buffers_tail == NULL) {
    assert(_completed_buffers_head == NULL, "Well-formedness");
    _completed_buffers_head = cbn;
    _completed_buffers_tail = cbn;
  } else {
    _completed_buffers_tail->set_next(cbn);
    _completed_buffers_tail = cbn;
  }
  _n_completed_buffers++;
  // Decide whether a Refine thread needs to start working; if none is
  // working yet, wake one through notify
  if (!_process_completed && _process_completed_threshold >= 0 &&
      _n_completed_buffers >= _process_completed_threshold) {
    _process_completed = true;
    if (_notify_when_complete)
      // This in effect notifies Refine thread 0
      _cbl_mon->notify();
  }
}
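Two thresholds drive the functions above: one decides when worker 0 is woken, the other decides when a Java thread must process its own buffer. The sketch below isolates both conditions. `QueueSetState` is a hypothetical struct whose fields mirror `_n_completed_buffers`, `_process_completed_threshold`, `_max_completed_queue` and `_completed_queue_padding`; the `notifications` counter stands in for the call to `_cbl_mon->notify()`.

```cpp
#include <cassert>

struct QueueSetState {
  int n_completed = 0;             // completed buffers currently queued
  int process_threshold = 0;       // wake worker 0 at this count (< 0 disables)
  int max_completed = 0;           // mutators help at this count (< 0: no limit)
  int padding = 0;                 // extra slack added to max_completed
  bool process_completed = false;  // true once worker 0 has been signalled
  int notifications = 0;           // stands in for _cbl_mon->notify()
};

// True when a Java thread should process its own buffer instead of queueing it.
bool mutator_must_help(const QueueSetState& s) {
  return s.max_completed == 0 ||
         (s.max_completed > 0 &&
          s.n_completed >= s.max_completed + s.padding);
}

// Adds one completed buffer and signals worker 0 the first time the
// processing threshold is reached.
void enqueue_completed(QueueSetState& s) {
  ++s.n_completed;
  if (!s.process_completed && s.process_threshold >= 0 &&
      s.n_completed >= s.process_threshold) {
    s.process_completed = true;
    ++s.notifications;
  }
}
```

With a processing threshold of 2 and a maximum of 4, the second buffer triggers exactly one notification, and once four buffers are queued the next mutator that fills a buffer has to help. A maximum of -1, as used for the two auxiliary queue sets in the initialization code, never makes a mutator help.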

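We mentioned that when the Refine threads cannot keep up, G1 has mutators help process reference changes. The number of Refine threads can be set by the user, but as the data structures above show, references may still change so fast that the Refine threads fall behind. When a mutator processes reference changes, application work pauses. If this happens, either references are being modified too often or too few Refine threads are configured. The G1SummarizeRSetStats option turns on logging for RSet processing, which shows information about the threads doing the work. Next we look at how a mutator processes a DCQ.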

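How a mutator processes a DCQ

The maximum length of the queue set depends on the number of Refine threads and is capped at the Red Zone value (the Red Zone is covered in the next section; for now treat it as a number). When the number of queues in the set exceeds the Red Zone value, a mutator submitting a queue cannot add it to the set and instead processes the references in that queue directly. The code is as follows: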

bool DirtyCardQueueSet::mut_process_buffer(void** buf) {

  // Used to determine if we had already claimed a par_id
  // before entering this method.
  bool already_claimed = false;

  // We grab the current JavaThread.
  JavaThread* thread = JavaThread::current();

  // We get the number of any par_id that this thread
  // might have already claimed.
  uint worker_i = thread->get_claimed_par_id();

  // If worker_i is not UINT_MAX then the thread has already claimed
  // a par_id. We make note of it using the already_claimed value
  if (worker_i != UINT_MAX) {
    already_claimed = true;
  } else {

    // Otherwise we need to claim a par id
    worker_i = _free_ids->claim_par_id();

    // And store the par_id value in the thread
    thread->set_claimed_par_id(worker_i);
  }

  bool b = false;
  if (worker_i != UINT_MAX) {
    b = DirtyCardQueue::apply_closure_to_buffer(_mut_process_closure, buf, 0,
                                                _sz, true, worker_i);
    if (b) Atomic::inc(&_processed_buffers_mut);

    // If we had not claimed an id before entering the method
    // then we must release the id.
    if (!already_claimed) {

      // we release the id
      _free_ids->release_par_id(worker_i);

      // and set the claimed_id in the thread to UINT_MAX
      thread->set_claimed_par_id(UINT_MAX);
    }
  }
  return b;
}
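The claim, process, release shape of mut_process_buffer can be reproduced with a small id pool. This is a simplified sketch: `IdPool`, `MutatorThread` and `help_with_buffer` are hypothetical names, and the pool here returns `UINT_MAX` immediately when no id is free, which is an assumption of this example and not a description of how FreeIdSet behaves.

```cpp
#include <cassert>
#include <climits>
#include <vector>

// A fixed pool of worker ids.
class IdPool {
 public:
  explicit IdPool(unsigned n) : in_use_(n, false) {}
  // Returns a free id, or UINT_MAX when every id is taken.
  unsigned claim() {
    for (unsigned i = 0; i < in_use_.size(); ++i) {
      if (!in_use_[i]) {
        in_use_[i] = true;
        return i;
      }
    }
    return UINT_MAX;
  }
  void release(unsigned id) { in_use_[id] = false; }
 private:
  std::vector<bool> in_use_;
};

struct MutatorThread {
  unsigned claimed_id = UINT_MAX;  // UINT_MAX means no id is held
};

// Processes one buffer under a claimed id. An id claimed inside this call is
// released again; an id the thread already held is left alone.
template <typename Process>
bool help_with_buffer(MutatorThread& t, IdPool& pool, Process process) {
  bool already_claimed = t.claimed_id != UINT_MAX;
  if (!already_claimed) t.claimed_id = pool.claim();
  if (t.claimed_id == UINT_MAX) return false;  // no id available
  bool ok = process(t.claimed_id);
  if (!already_claimed) {
    pool.release(t.claimed_id);
    t.claimed_id = UINT_MAX;
  }
  return ok;
}
```

The number of ids bounds how many mutators can process buffers at the same time; in the source above that number comes from num_par_ids().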

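How the Refine threads work

The Refine threads are initialized when the GC manager is initialized, but without enough reference changes they would just spin idly, so a mechanism is needed to activate and freeze threads dynamically. The JVM does this with wait and notify. The design is as follows: among threads 0 to n-1 (n is the number of Refine threads), each thread activates its successor when it finds itself too busy, and a successor freezes itself when it finds itself too idle. So when is thread 0 activated? It is activated by running Java threads: when a Java thread (mutator) tries to put a modified reference into a queue and thread 0 is not yet active, it sends a notify to activate it. Thread 0 can therefore be notified by any mutator, whereas threads 1 to n-1 can only be notified by the Refine thread with the preceding index. For this reason the Monitor that thread 0 waits on is a global variable, while the Monitors of threads 1 to n-1 are local to each thread.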
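The chained activation can be simulated without real threads. The sketch below is one plausible reading of the design, not the HotSpot run loop: `Worker` and `observe_backlog` are hypothetical, the thresholds are supplied by the caller, and worker 0 is assumed to have been woken by a mutator already.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Worker {
  int threshold;               // this worker is woken once the backlog exceeds this
  int deactivation_threshold;  // it sleeps once the backlog is at or below this
  bool active;
};

// One pass in which every active worker looks at the current backlog.
void observe_backlog(std::vector<Worker>& workers, int backlog) {
  for (std::size_t i = 0; i < workers.size(); ++i) {
    if (!workers[i].active) continue;
    if (i > 0 && backlog <= workers[i].deactivation_threshold) {
      workers[i].active = false;  // too idle: freeze itself
      continue;
    }
    if (i + 1 < workers.size() && !workers[i + 1].active &&
        backlog > workers[i + 1].threshold) {
      workers[i + 1].active = true;  // too busy: wake the successor
    }
  }
}

int active_count(const std::vector<Worker>& workers) {
  int n = 0;
  for (const Worker& w : workers) n += w.active ? 1 : 0;
  return n;
}
```

As the backlog grows, workers switch on one after another along the chain; as it shrinks, the highest-numbered workers switch off first. Only a worker's predecessor ever wakes it, which is why a per-thread monitor is enough for workers 1 to n-1.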

src\share\vm\gc_implementation\g1\concurrentG1RefineThread.hpp

// The G1 Concurrent Refinement Thread (could be several in the future).

class ConcurrentG1RefineThread: public ConcurrentGCThread {
  friend class VMStructs;
  friend class G1CollectedHeap;

  double _vtime_start;  // Initial virtual time.
  double _vtime_accum;  // Accumulated virtual time.
  uint _worker_id;
  uint _worker_id_offset;

  // The refinement threads collection is linked list. A predecessor can activate a successor
  // when the number of the rset update buffer crosses a certain threshold. A successor
  // would self-deactivate when the number of the buffers falls below the threshold.
  bool _active;
  ConcurrentG1RefineThread* _next;
  Monitor* _monitor;
  ConcurrentG1Refine* _cg1r;

  // The closure applied to completed log buffers.
  CardTableEntryClosure* _refine_closure;

  int _thread_threshold_step;
  // This thread activation threshold
  int _threshold;
  // This thread deactivation threshold
  int _deactivation_threshold;

  void sample_young_list_rs_lengths();
  void run_young_rs_sampling();
  void wait_for_completed_buffers();

  void set_active(bool x) { _active = x; }
  bool is_active();
  void activate();
  void deactivate();

public:
  virtual void run();
  // Constructor
  ConcurrentG1RefineThread(ConcurrentG1Refine* cg1r, ConcurrentG1RefineThread* next,
                           CardTableEntryClosure* refine_closure,
                           uint worker_id_offset, uint worker_id);

  void initialize();

  // Printing
  void print() const;
  void print_on(outputStream* st) const;

  // Total virtual time so far.
  double vtime_accum() { return _vtime_accum; }

  ConcurrentG1Refine* cg1r() { return _cg1r;     }

  // shutdown
  void stop();
};

The main work of a Refine thread is done in its run method. The code is as follows:

src\share\vm\gc_implementation\g1\concurrentG1RefineThread.cpp

ConcurrentG1RefineThread::ConcurrentG1RefineThread(ConcurrentG1Refine* cg1r, ConcurrentG1RefineThread *next,
                         CardTableEntryClosure* refine_closure,
                         uint worker_id_offset, uint worker_id) :
  ConcurrentGCThread(),
  _refine_closure(refine_closure),
  _worker_id_offset(worker_id_offset),
  _worker_id(worker_id),
  _active(false),
  _next(next),
  _monitor(NULL),
  _cg1r(cg1r),
  _vtime_accum(0.0)
{

  // Each thread has its own monitor. The i-th thread is responsible for signalling
  // to thread i+1 if the number of buffers in the queue exceeds a threshold for this
  // thread. Monitors are also used to wake up the threads during termination.
  // The 0th worker is notified by mutator threads and has a special monitor.
  // The last worker is used for young gen rset size sampling.
  if (worker_id > 0) {
    _monitor = new Monitor(Mutex::nonleaf, "Refinement monitor", true);
  } else {
    _monitor = DirtyCardQ_CBL_mon;
  }
  initialize();
  create_and_start();
}

void ConcurrentG1RefineThread::initialize() {
  if (_worker_id < cg1r()->worker_thread_num()) {
    // Current thread activation threshold
    _threshold = MIN2<int>(cg1r()->thread_threshold_step() * (_worker_id + 1) + cg1r()->green_zone(),
                           cg1r()->yellow_zone());
    // A thread deactivates once the number of buffer reached a deactivation threshold
    _deactivation_threshold = MAX2<int>(_threshold - cg1r()->thread_threshold_step(), cg1r()->green_zone());
  } else {
    set_active(true);
  }
}

void ConcurrentG1RefineThread::sample_young_list_rs_lengths() {
  SuspendibleThreadSetJoiner sts;
  G1CollectedHeap* g1h = G1CollectedHeap::heap();
  G1CollectorPolicy* g1p = g1h->g1_policy();
  if (g1p->adaptive_young_list_length()) {
    int regions_visited = 0;
    g1h->young_list()->rs_length_sampling_init();
    while (g1h->young_list()->rs_length_sampling_more()) {
      g1h->young_list()->rs_length_sampling_next();
      ++regions_visited;

      // we try to yield every time we visit 10 regions
      if (regions_visited == 10) {
        if (sts.should_yield()) {
          sts.yield();
          // we just abandon the iteration
          break;
        }
        regions_visited = 0;
      }
    }

    g1p->revise_young_list_target_length_if_necessary();
  }
}

void ConcurrentG1RefineThread::run_young_rs_sampling() {
  DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();
  _vtime_start = os::elapsedVTime();
  while(!_should_terminate) {
    sample_young_list_rs_lengths();

    if (os::supports_vtime()) {
      _vtime_accum = (os::elapsedVTime() - _vtime_start);
    } else {
      _vtime_accum = 0.0;
    }

    MutexLockerEx x(_monitor, Mutex::_no_safepoint_check_flag);
    if (_should_terminate) {
      break;
    }
    _monitor->wait(Mutex::_no_safepoint_check_flag, G1ConcRefinementServiceIntervalMillis);
  }
}

void ConcurrentG1RefineThread::wait_for_completed_buffers() {
  DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();
  MutexLockerEx x(_monitor, Mutex::_no_safepoint_check_flag);
  while (!_should_terminate && !is_active()) {
    _monitor->wait(Mutex::_no_safepoint_check_flag);
  }
}

bool ConcurrentG1RefineThread::is_active() {
  DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();
  return _worker_id > 0 ? _active : dcqs.process_completed_buffers();
}

void ConcurrentG1RefineThread::activate() {
  MutexLockerEx x(_monitor, Mutex::_no_safepoint_check_flag);
  if (_worker_id > 0) {
    if (G1TraceConcRefinement) {
      DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();
      gclog_or_tty->print_cr("G1-Refine-activated worker %d, on threshold %d, current %d",
                             _worker_id, _threshold, (int)dcqs.completed_buffers_num());
    }
    set_active(true);
  } else {
    DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();
    dcqs.set_process_completed(true);
  }
  _monitor->notify();
}

void ConcurrentG1RefineThread::deactivate() {
  MutexLockerEx x(_monitor, Mutex::_no_safepoint_check_flag);
  if (_worker_id > 0) {
    if (G1TraceConcRefinement) {
      DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();
      gclog_or_tty->print_cr("G1-Refine-deactivated worker %d, off threshold %d, current %d",
                             _worker_id, _deactivation_threshold, (int)dcqs.completed_buffers_num());
    }
    set_active(false);
  } else {
    DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();
    dcqs.set_process_completed(false);
  }
}

void ConcurrentG1RefineThread::run() {
  // Initialize thread-private state
  initialize_in_thread();
  wait_for_universe_init();

  
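  // The last Refine thread samples the young heap regions (YHR). As noted earlier,
  // sampling is used to predict the pause time and adjust the number of young regions.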
  if (_worker_id >= cg1r()->worker_thread_num()) {
    run_young_rs_sampling();
    terminate();
    return;
  }

  _vtime_start = os::elapsedVTime();

  // Threads 0 to n-1 are the real Refine threads; they process RSets
  while (!_should_terminate) {
    DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();

    // Wait for work
    // This is the hand-off described above: each thread is notified by its predecessor,
    // and thread 0 is notified by a Mutator
    wait_for_completed_buffers();

    if (_should_terminate) {
      break;
    }

    {
      SuspendibleThreadSetJoiner sts;

      do {
        int curr_buffer_num = (int)dcqs.completed_buffers_num();
        // If the number of the buffers falls down into the yellow zone,
        // that means that the transition period after the evacuation pause has ended.
        if (dcqs.completed_queue_padding() > 0 && curr_buffer_num <= cg1r()->yellow_zone()) {
          dcqs.set_completed_queue_padding(0);
        }
        // Decide from the current load whether this Refine thread should stop, and deactivate it if so.
        if (_worker_id > 0 && curr_buffer_num <= _deactivation_threshold) {
          // If the number of the buffer has fallen below our threshold
          // we should deactivate. The predecessor will reactivate this
          // thread should the number of the buffers cross the threshold again.
          deactivate();
          break;
        }

        // Check if we need to activate the next thread.
        // Decide from the current load whether another Refine thread is needed, and notify it if so.
        if (_next != NULL && !_next->is_active() && curr_buffer_num > _next->_threshold) {
          _next->activate();
        }
      } while (dcqs.apply_closure_to_completed_buffer(_refine_closure, _worker_id + _worker_id_offset, cg1r()->green_zone()));

      // We can exit the loop above while being active if there was a yield request.
      // The loop exits on a yield request so that the thread can reach a safepoint
      if (is_active()) {
        deactivate();
      }
    }

    if (os::supports_vtime()) {
      _vtime_accum = (os::elapsedVTime() - _vtime_start);
    } else {
      _vtime_accum = 0.0;
    }
  }
  assert(_should_terminate, "just checking");
  terminate();
}

void ConcurrentG1RefineThread::stop() {
  // it is ok to take late safepoints here, if needed
  {
    MutexLockerEx mu(Terminator_lock);
    _should_terminate = true;
  }

  {
    MutexLockerEx x(_monitor, Mutex::_no_safepoint_check_flag);
    _monitor->notify();
  }

  {
    MutexLockerEx mu(Terminator_lock);
    while (!_has_terminated) {
      Terminator_lock->wait();
    }
  }
  if (G1TraceConcRefinement) {
    gclog_or_tty->print_cr("G1-Refine-stop");
  }
}

void ConcurrentG1RefineThread::print() const {
  print_on(tty);
}

void ConcurrentG1RefineThread::print_on(outputStream* st) const {
  st->print("\"G1 Concurrent Refinement Thread#%d\" ", _worker_id);
  Thread::print_on(st);
  st->cr();
}

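The main job of a Refine thread is to process the DCQS. This happens in the do-while loop in run(), whose condition is dcqs.apply_closure_to_completed_buffer(_refine_closure, _worker_id + _worker_id_offset, cg1r()->green_zone()). The loop calls apply_closure_to_completed_buffer repeatedly; the method takes several parameters, described after the code: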

src\share\vm\gc_implementation\g1\dirtyCardQueue.cpp

bool DirtyCardQueueSet::apply_closure_to_completed_buffer(CardTableEntryClosure* cl,
                                                          uint worker_i,
                                                          int stop_at,
                                                          bool during_pause) {
  assert(!during_pause || stop_at == 0, "Should not leave any completed buffers during a pause");
  BufferNode* nd = get_completed_buffer(stop_at);
  bool res = apply_closure_to_completed_buffer_helper(cl, worker_i, nd);
  if (res) Atomic::inc(&_processed_buffers_rs_thread);
  return res;
}

bool DirtyCardQueueSet::apply_closure_to_completed_buffer_helper(
    CardTableEntryClosure* cl, uint worker_i, BufferNode* nd) {
  if (nd != NULL) {
    void **buf = BufferNode::make_buffer_from_node(nd);
    size_t index = nd->index();
    bool b = DirtyCardQueue::apply_closure_to_buffer(cl, buf,
                                              index, _sz,
                                              true, worker_i);
    if (b) {
      deallocate_buffer(buf);
      return true;  // In normal case, go on to next buffer.
    } else {
      enqueue_complete_buffer(buf, index);
      return false;
    }
  } else {
    return false;
  }
}

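The Closure parameter does the actual card table processing.
The _worker_id + _worker_id_offset parameter identifies the calling worker. The author describes it as the worker's starting position, which lets different Refine threads process different DCQs in the DCQS; in the code shown here it is passed through to do_card_ptr and, from there, to add_reference.
The cg1r()->green_zone() parameter is the Green Zone value. Every Refine thread leaves at least Green Zone DCQs unprocessed, that is, it ignores that many DCQs in the DCQS. It follows that the GC itself must pass 0 here, meaning that all DCQs are to be processed. See G1CollectedHeap::iterate_dirty_card_closure, which is used in the young collection described later.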
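The effect of the stop_at argument can be illustrated with a small Java model. It is only a sketch: CompletedBufferList and its methods are hypothetical names, and it assumes that a buffer is handed out only while more than stop_at completed buffers remain, which is how the Green Zone description above reads.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model (hypothetical names) of a list of completed buffers with a stop_at limit.
final class CompletedBufferList {
    private final Deque<int[]> completed = new ArrayDeque<>();

    void enqueue(int[] buffer) {
        completed.addLast(buffer);
    }

    int size() {
        return completed.size();
    }

    // Hands out the oldest buffer only while more than stopAt buffers remain;
    // otherwise returns null so the caller leaves the rest alone.
    int[] getCompletedBuffer(int stopAt) {
        if (completed.size() <= stopAt) {
            return null;
        }
        return completed.pollFirst();
    }

    // Takes buffers until the limit is reached and reports how many were taken.
    int drain(int stopAt) {
        int processed = 0;
        while (getCompletedBuffer(stopAt) != null) {
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        CompletedBufferList list = new CompletedBufferList();
        for (int i = 0; i < 10; i++) {
            list.enqueue(new int[] {i});
        }
        int greenZone = 4; // assumed value, for illustration only
        System.out.println("refine pass took " + list.drain(greenZone) + ", left " + list.size());
        System.out.println("GC pass took " + list.drain(0) + ", left " + list.size());
    }
}
```

A Refine-style pass with a limit of 4 stops with four buffers still queued, while a GC-style pass with a limit of 0 empties the list.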

For comparison, this is the GC-side caller, G1CollectedHeap::iterate_dirty_card_closure, which passes 0 as stop_at so that every completed buffer is processed:

src\share\vm\gc_implementation\g1\g1CollectedHeap.cpp

void G1CollectedHeap::iterate_dirty_card_closure(CardTableEntryClosure* cl,
                                                 DirtyCardQueue* into_cset_dcq,
                                                 bool concurrent,
                                                 uint worker_i) {
  // Clean cards in the hot card cache
  G1HotCardCache* hot_card_cache = _cg1r->hot_card_cache();
  hot_card_cache->drain(worker_i, g1_rem_set(), into_cset_dcq);

  DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();
  size_t n_completed_buffers = 0;
  while (dcqs.apply_closure_to_completed_buffer(cl, worker_i, 0, true)) {
    n_completed_buffers++;
  }
  g1_policy()->phase_times()->record_thread_work_item(G1GCPhaseTimes::UpdateRS, worker_i, n_completed_buffers);
  dcqs.clear_n_completed_buffers();
  assert(!dcqs.completed_buffers_exist_dirty(), "Completed buffers exist!");
}

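Because the queue set is shared globally, it has to be locked while it is being processed. apply_closure_to_completed_buffer, through its helper, calls DirtyCardQueue::apply_closure_to_buffer, shown below: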

src\share\vm\gc_implementation\g1\dirtyCardQueue.cpp

bool DirtyCardQueue::apply_closure_to_buffer(CardTableEntryClosure* cl,
                                             void** buf,
                                             size_t index, size_t sz,
                                             bool consume,
                                             uint worker_i) {
  if (cl == NULL) return true;
  for (size_t i = index; i < sz; i += oopSize) {
    int ind = byte_index_to_index((int)i);
    jbyte* card_ptr = (jbyte*)buf[ind];
    if (card_ptr != NULL) {
      // Set the entry to null, so we don't do it again (via the test
      // above) if we reconsider this buffer.
      // Null out the entry so that a later pass over this buffer can skip it quickly
      if (consume) buf[ind] = NULL;
      if (!cl->do_card_ptr(card_ptr, worker_i)) return false;
    }
  }
  return true;
}
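The loop can be mirrored in a few lines of Java to make the consume-and-resume behaviour easier to see. This is a simplified sketch, not HotSpot code: the buffer holds card ids as Long objects instead of card pointers, the index is an element index rather than a byte offset, and BufferProcessor and CardClosure are hypothetical names.

```java
import java.util.Arrays;

// Simplified Java model of the buffer loop above (hypothetical names).
final class BufferProcessor {

    // Stand-in for CardTableEntryClosure: returns false to ask the caller to stop.
    interface CardClosure {
        boolean doCard(long card, int workerId);
    }

    // Applies the closure to every non-null entry from index to the end of the buffer.
    // With consume == true each entry is nulled before it is processed, so a
    // second pass over the same buffer skips it.
    static boolean applyClosureToBuffer(CardClosure cl, Long[] buf, int index,
                                        boolean consume, int workerId) {
        if (cl == null) {
            return true;
        }
        for (int i = index; i < buf.length; i++) {
            Long card = buf[i];
            if (card != null) {
                if (consume) {
                    buf[i] = null;
                }
                if (!cl.doCard(card, workerId)) {
                    return false; // interrupted; the caller may re-enqueue the buffer
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Long[] buf = {10L, 11L, 12L, 13L};
        // The closure stops at card 12, as if a yield had been requested.
        boolean done = applyClosureToBuffer((card, w) -> card != 12L, buf, 0, true, 0);
        System.out.println("finished=" + done + ", remaining=" + Arrays.toString(buf));
        // The second pass only sees the entries that were never consumed.
        done = applyClosureToBuffer((card, w) -> true, buf, 0, true, 0);
        System.out.println("finished=" + done + ", remaining=" + Arrays.toString(buf));
    }
}
```

Because entries are nulled before the closure runs, a buffer that is re-enqueued after an interruption never hands the same entry to the closure twice.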

Eventually, processing reaches refine_card, shown below:

src\share\vm\gc_implementation\g1\g1RemSet.cpp

bool G1RemSet::refine_card(jbyte* card_ptr, uint worker_i,
                           bool check_for_refs_into_cset) {
  assert(_g1->is_in_exact(_ct_bs->addr_for(card_ptr)),
         err_msg("Card at " PTR_FORMAT " index " SIZE_FORMAT " representing heap at " PTR_FORMAT " (%u) must be in committed heap",
                 p2i(card_ptr),
                 _ct_bs->index_for(_ct_bs->addr_for(card_ptr)),
                 _ct_bs->addr_for(card_ptr),
                 _g1->addr_to_region(_ct_bs->addr_for(card_ptr))));

  // If the card is no longer dirty, nothing to do.
  // If the card is no longer dirty, it has already been processed, so there is nothing left to do; return immediately
  if (*card_ptr != CardTableModRefBS::dirty_card_val()) {
    // No need to return that this card contains refs that point
    // into the collection set.
    return false;
  }

  // Construct the region representing the card.
  // Find the region that contains the card
  HeapWord* start = _ct_bs->addr_for(card_ptr);
  // And find the region containing it.
  HeapRegion* r = _g1->heap_region_containing(start);

  // Why do we have to check here whether a card is on a young region,
  // given that we dirty young regions and, as a result, the
  // post-barrier is supposed to filter them out and never to enqueue
  // them? When we allocate a new region as the "allocation region" we
  // actually dirty its cards after we release the lock, since card
  // dirtying while holding the lock was a performance bottleneck. So,
  // as a result, it is possible for other threads to actually
  // allocate objects in the region (after they acquire the lock)
  // before all the cards on the region are dirtied. This is unlikely,
  // and it doesn't happen often, but it can happen. So, the extra
  // check below filters out those cards.
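  /* If the referrer is in a young region or in the CSet, no update is needed, because both are collected during the GC.
  Such references are in fact filtered when they are enqueued; see the write barrier discussion in Section 4.4.
  Why filter again here? Mainly because of concurrency, for example concurrent allocation or parallel task stealing. */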
  if (r->is_young()) {
    return false;
  }

  // While we are processing RSet buffers during the collection, we
  // actually don't want to scan any cards on the collection set,
  // since we don't want to update remembered sets with entries that
  // point into the collection set, given that live objects from the
  // collection set are about to move and such entries will be stale
  // very soon. This change also deals with a reliability issue which
  // involves scanning a card in the collection set and coming across
  // an array that was being chunked and looking malformed. Note,
  // however, that if evacuation fails, we have to scan any objects
  // that were not moved and create any missing entries.
  if (r->in_collection_set()) {
    return false;
  }

  // The result from the hot card cache insert call is either:
  //   * pointer to the current card
  //     (implying that the current card is not 'hot'),
  //   * null
  //     (meaning we had inserted the card ptr into the "hot" card cache,
  //     which had some headroom),
  //   * a pointer to a "hot" card that was evicted from the "hot" cache.
  //
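  /* The hot card cache is controlled by parameters. If a card turns out not to be hot, it is processed right away;
    if it is hot, it is kept for later batch processing.
    If the cache holds too many cards, the oldest one is evicted and processed. */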
  G1HotCardCache* hot_card_cache = _cg1r->hot_card_cache();
  if (hot_card_cache->use_cache()) {
    assert(!check_for_refs_into_cset, "sanity");
    assert(!SafepointSynchronize::is_at_safepoint(), "sanity");

    card_ptr = hot_card_cache->insert(card_ptr);
    if (card_ptr == NULL) {
      // There was no eviction. Nothing to do.
      return false;
    }

    start = _ct_bs->addr_for(card_ptr);
    r = _g1->heap_region_containing(start);

    // Checking whether the region we got back from the cache
    // is young here is inappropriate. The region could have been
    // freed, reallocated and tagged as young while in the cache.
    // Hence we could see its young type change at any time.
  }

  // Don't use addr_for(card_ptr + 1) which can ask for
  // a card beyond the heap.  This is not safe without a perm
  // gen at the upper end of the heap.
  // The memory block to process is one card, 512 bytes
  HeapWord* end   = start + CardTableModRefBS::card_size_in_words;
  MemRegion dirtyRegion(start, end);

#if CARD_REPEAT_HISTO
  init_ct_freq_table(_g1->max_capacity());
  ct_freq_note_card(_ct_bs->index_for(start));
#endif

  // Set up the closures that process the objects; the most important one is G1ParPushHeapRSClosure
  G1ParPushHeapRSClosure* oops_in_heap_closure = NULL;
  if (check_for_refs_into_cset) {
    // ConcurrentG1RefineThreads have worker numbers larger than what
    // _cset_rs_update_cl[] is set up to handle. But those threads should
    // only be active outside of a collection which means that when they
    // reach here they should have check_for_refs_into_cset == false.
    assert((size_t)worker_i < n_workers(), "index of worker larger than _cset_rs_update_cl[].length");
    oops_in_heap_closure = _cset_rs_update_cl[worker_i];
  }
  G1UpdateRSOrPushRefOopClosure update_rs_oop_cl(_g1,
                                                 _g1->g1_rem_set(),
                                                 oops_in_heap_closure,
                                                 check_for_refs_into_cset,
                                                 worker_i);
  update_rs_oop_cl.set_from(r);

  G1TriggerClosure trigger_cl;
  FilterIntoCSClosure into_cs_cl(NULL, _g1, &trigger_cl);
  G1InvokeIfNotTriggeredClosure invoke_cl(&trigger_cl, &into_cs_cl);
  G1Mux2Closure mux(&invoke_cl, &update_rs_oop_cl);

  FilterOutOfRegionClosure filter_then_update_rs_oop_cl(r,
                        (check_for_refs_into_cset ?
                                (OopClosure*)&mux :
                                (OopClosure*)&update_rs_oop_cl));

  // The region for the current card may be a young region. The
  // current card may have been a card that was evicted from the
  // card cache. When the card was inserted into the cache, we had
  // determined that its region was non-young. While in the cache,
  // the region may have been freed during a cleanup pause, reallocated
  // and tagged as young.
  //
  // We wish to filter out cards for such a region but the current
  // thread, if we're running concurrently, may "see" the young type
  // change at any time (so an earlier "is_young" check may pass or
  // fail arbitrarily). We tell the iteration code to perform this
  // filtering when it has been determined that there has been an actual
  // allocation in this region and making it safe to check the young type.

  bool card_processed =
    r->oops_on_card_seq_iterate_careful(dirtyRegion,
                                        &filter_then_update_rs_oop_cl,
                                        card_ptr);

  // If unable to process the card then we encountered an unparsable
  // part of the heap (e.g. a partially allocated object) while
  // processing a stale card.  Despite the card being stale, redirty
  // and re-enqueue, because we've already cleaned the card.  Without
  // this we could incorrectly discard a non-stale card.
  if (!card_processed) {
    assert(!_g1->is_gc_active(), "Unparsable heap during GC");
    // The card might have gotten re-dirtied and re-enqueued while we
    // worked.  (In fact, it's pretty likely.)
    if (*card_ptr != CardTableModRefBS::dirty_card_val()) {
      *card_ptr = CardTableModRefBS::dirty_card_val();
      MutexLockerEx x(Shared_DirtyCardQ_lock,
                      Mutex::_no_safepoint_check_flag);
      DirtyCardQueue* sdcq =
        JavaThread::dirty_card_queue_set().shared_dirty_card_queue();
      sdcq->enqueue(card_ptr);
    }
  } else {
    _conc_refine_cards++;
  }

  // This gets set to true if the card being refined has
  // references that point into the collection set.
  bool has_refs_into_cset = trigger_cl.triggered();

  // We should only be detecting that the card contains references
  // that point into the collection set if the current thread is
  // a GC worker thread.
  assert(!has_refs_into_cset || SafepointSynchronize::is_at_safepoint(),
           "invalid result at non safepoint");

  return has_refs_into_cset;
}

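The code above only determines which 512-byte area needs processing. But where does the first object in that area start? To find out, the heap region is walked, skipping the addresses that lie before the block, until the first object is found; every object in the 512-byte block is then treated as a referrer. This is one of the reasons floating garbage arises. The code is as follows: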

hotspot/src/share/vm/gc_implementation/g1/heapRegion.cpp
HeapWord* HeapRegion::oops_on_card_seq_iterate_careful(MemRegion mr,
                                 FilterOutOfRegionClosure* cl,
                                 bool filter_young,
                                 jbyte* card_ptr) {
  if (g1h->is_gc_active()) {
    mr = mr.intersection(MemRegion(bottom(), scan_top()));
  } else {
    mr = mr.intersection(used_region());
  }
  if (mr.is_empty()) return NULL;
  if (is_young() && filter_young)     return NULL;
  // Mark the card clean to show that this memory block is being processed
  if (card_ptr != NULL) {
    *card_ptr = CardTableModRefBS::clean_card_val();
    OrderAccess::storeload();
  }
  HeapWord* const start = mr.start();
  HeapWord* const end = mr.end();
  HeapWord* cur = block_start(start);
  // Skip objects that lie before the area being processed
  oop obj;
  HeapWord* next = cur;
  do {
    cur = next;
    obj = oop(cur);
    if (obj->klass_or_null() == NULL)     return cur;
    next = cur + block_size(cur);
  } while (next <= start);
  // Once the 512-byte block is reached, walk through it
  do {
    obj = oop(cur);
    if (obj->klass_or_null() == NULL)       return cur;
    cur = cur + block_size(cur);
    // Whether an object is dead is decided from the memory snapshot; this is covered with concurrent marking
    if (!g1h->is_obj_dead(obj)) {
      // Iterate over the object
      if (!obj->is_objArray() || (((HeapWord*)obj) >= start && cur <= end)) 
      {
        obj->oop_iterate(cl);
      } else {
        obj->oop_iterate(cl, mr);
      }
    }
  } while (cur < end);
  return NULL;
}
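The relationship between a card and the 512-byte block it covers is plain address arithmetic. The sketch below assumes a 512-byte card (a shift of 9) and a made-up heap base address; CardMath and its method names are hypothetical and exist only for illustration.

```java
// Address arithmetic for 512-byte cards (hypothetical helper, not HotSpot code).
final class CardMath {
    static final int CARD_SHIFT = 9;                 // 2^9 = 512 bytes per card
    static final long CARD_SIZE = 1L << CARD_SHIFT;

    // Index of the card that covers the given heap address.
    static long cardIndexFor(long heapBase, long address) {
        return (address - heapBase) >>> CARD_SHIFT;
    }

    // First address covered by the card with the given index.
    static long cardStart(long heapBase, long cardIndex) {
        return heapBase + (cardIndex << CARD_SHIFT);
    }

    // One past the last address covered by the card.
    static long cardEnd(long heapBase, long cardIndex) {
        return cardStart(heapBase, cardIndex) + CARD_SIZE;
    }

    // True if an object occupying [objStart, objEnd) overlaps the card. This holds
    // for objects that start inside the card and for one that starts earlier and
    // straddles the card's first address.
    static boolean overlapsCard(long heapBase, long cardIndex, long objStart, long objEnd) {
        return objStart < cardEnd(heapBase, cardIndex) && objEnd > cardStart(heapBase, cardIndex);
    }

    public static void main(String[] args) {
        long heapBase = 0xf0000000L;  // assumed base address
        long addr = heapBase + 1300;  // lies in the card covering bytes 1024..1535
        long card = cardIndexFor(heapBase, addr);
        System.out.println("card " + card + " covers ["
                + Long.toHexString(cardStart(heapBase, card)) + ", "
                + Long.toHexString(cardEnd(heapBase, card)) + ")");
    }
}
```

The overlap test shows why the walk has to begin before the card: an object that starts earlier can still extend into the block.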

Each object visited is passed to G1UpdateRSOrPushRefOopClosure, which updates the RSet. The code is as follows:

hotspot/src/share/vm/gc_implementation/g1/g1OopClosures.inline.hpp
template <class T> inline void G1UpdateRSOrPushRefOopClosure::do_oop_nv(T* p) {
  oop obj = oopDesc::load_decode_heap_oop(p);
  if (obj == NULL)     return;
  HeapRegion* to = _g1->heap_region_containing(obj);
  // Only references between different regions are handled
  if (_from == to)     return;
  if (_record_refs_into_cset && to->in_collection_set()) {
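  /* Only the evacuation case reaches this branch. In the normal case the object is pushed onto a stack for further processing; this mainly deals with references within the region, where the object only has to be copied and the reference relationship need not be maintained. Failed evacuation is handled through a special path; see Section 7.1. */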
    if (!self_forwarded(obj)) {
      // Successfully evacuated objects are pushed onto the G1ParScanThreadState queue for processing
      _push_ref_cl->do_oop(p);
    }
  } else {
    to->rem_set()->add_reference(p, _worker_i);
  }
}

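The update itself is add_reference which, as mentioned earlier, updates the PRT. The whole RSet update flow can be summed up in one sentence: starting from the referrer, find the referent, then record the reference in the RSet of the region that contains the referent. Is there a concurrency concern here? Could the referent's address change while a Refine thread is running, so that the referent can no longer be found reliably from the referrer? This cannot happen: no GC takes place while a Refine thread is working, so no object moves and all object addresses stay fixed.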
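The one-sentence summary can be written down as a toy Java model. Everything here is an assumption made for illustration: regions are fixed-size address ranges, an RSet is just a set of card indices, and ToyHeap, recordReference and the sizes are hypothetical names and values. The real PRT structure is far more involved.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of "record the reference in the RSet of the referent's region".
final class ToyHeap {
    static final long REGION_SIZE = 1L << 20; // assumed 1 MB regions
    static final int CARD_SHIFT = 9;          // 512-byte cards

    // region index -> cards (in other regions) that hold references into it
    private final Map<Long, Set<Long>> rsets = new HashMap<>();

    static long regionOf(long address) {
        return address / REGION_SIZE;
    }

    // Records that the field at fieldAddr refers to the object at targetAddr.
    // Address 0 stands for a null reference. Returns true if an RSet was updated.
    boolean recordReference(long fieldAddr, long targetAddr) {
        if (targetAddr == 0) {
            return false; // null reference: nothing to record
        }
        long from = regionOf(fieldAddr);
        long to = regionOf(targetAddr);
        if (from == to) {
            return false; // same-region reference: not tracked
        }
        long card = fieldAddr >>> CARD_SHIFT;
        rsets.computeIfAbsent(to, k -> new TreeSet<>()).add(card);
        return true;
    }

    Set<Long> rsetOf(long region) {
        return rsets.getOrDefault(region, Collections.emptySet());
    }

    public static void main(String[] args) {
        ToyHeap heap = new ToyHeap();
        long fieldInRegion0 = 4096;                  // referrer field
        long objectInRegion3 = 3 * REGION_SIZE + 64; // referent
        heap.recordReference(fieldInRegion0, objectInRegion3);
        heap.recordReference(fieldInRegion0, 8192);  // same region, ignored
        System.out.println("RSet of region 3: " + heap.rsetOf(3));
    }
}
```

The entry is stored with the referent's region and names the referrer's card, so a collector that evacuates region 3 can find incoming references without scanning the rest of the heap.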

Refinement Zone

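As described above, the main job of the Refine threads is to maintain the RSets. This is also an important part of G1 tuning. Published tests suggest that RSets often cost roughly 1% to 20% of the heap as overhead; with a 100 GB heap, for instance, as much as 20 GB may go to RSets. On the other hand, too many RSet updates can slow the Mutators down, because a Mutator that finds the DCQS too full will step in and help the Refine threads. This follows from how the Refine threads are designed. Usually several Refine threads are configured, and how many are active depends on the workload, which is controlled through the Refinement Zones. G1 provides three values, Green, Yellow and Red, which divide the queue set into four zones; call them white, green, yellow and red.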

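White zone: [0, Green). Refine threads do not process this zone; its DCQs are left to the GC threads.
Green zone: [Green, Yellow). Refine threads start up in this zone, and the number started depends on the current size of the queue set.
Yellow zone: [Yellow, Red). All Refine threads (except the sampling thread) take part in processing DCQs.
Red zone: [Red, +∞). Not only do all Refine threads process RSets, the Mutators also take part in processing DCQs.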
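The four intervals translate directly into a small classification function. This is a minimal sketch: RefinementZones and zoneFor are hypothetical names, and the boundary values in main are example numbers chosen only for illustration.

```java
// Maps a completed-DCQ count to one of the four zones (hypothetical helper).
final class RefinementZones {
    enum Zone { WHITE, GREEN, YELLOW, RED }

    // Each zone is a half-open interval: [0, green), [green, yellow), [yellow, red), [red, +inf).
    static Zone zoneFor(int completedBuffers, int green, int yellow, int red) {
        if (completedBuffers < green) {
            return Zone.WHITE;
        }
        if (completedBuffers < yellow) {
            return Zone.GREEN;
        }
        if (completedBuffers < red) {
            return Zone.YELLOW;
        }
        return Zone.RED;
    }

    public static void main(String[] args) {
        int green = 4, yellow = 12, red = 24; // example boundaries
        for (int n : new int[] {0, 4, 11, 12, 23, 24, 100}) {
            System.out.println(n + " -> " + zoneFor(n, green, yellow, red));
        }
    }
}
```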

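The Green, Yellow and Red values are set with three parameters: G1ConcRefinementGreenZone, G1ConcRefinementYellowZone and G1ConcRefinementRedZone, each with a default of 0. If they are not set, G1 infers the zone thresholds automatically, as follows: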

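G1ConcRefinementGreenZone is set to ParallelGCThreads.
G1ConcRefinementYellowZone and G1ConcRefinementRedZone are 3 times and 6 times G1ConcRefinementGreenZone, respectively. A small question for the reader: why did the JDK designers tie G1ConcRefinementGreenZone to the number of parallel GC threads, ParallelGCThreads?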

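As mentioned above, all Refine threads process DCQs in the yellow zone. How many threads is that? The number is set with the G1ConcRefinementThreads parameter, whose default is 0; when it is not set, G1 infers it heuristically and uses ParallelGCThreads. ParallelGCThreads can itself be set as a parameter, also with a default of 0; if it is not set, G1 infers it heuristically as follows: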

ParallelGCThreads = ncpus, when ncpus <= 8
ParallelGCThreads = 8 + (ncpus - 8) * 5/8, when ncpus > 8
where ncpus is the number of CPU cores.
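The rule is easy to try out for a few machine sizes. The following Java sketch simply evaluates the formula quoted above; GcThreadHeuristic is a hypothetical name, and rounding down through integer division is an assumption of this sketch.

```java
// Evaluates the ParallelGCThreads rule quoted in the text (hypothetical helper).
final class GcThreadHeuristic {

    // ncpus threads up to 8 cores, then 5/8 of a thread for each additional core.
    // Integer division (rounding down) is an assumption of this sketch.
    static int parallelGcThreads(int ncpus) {
        if (ncpus <= 8) {
            return ncpus;
        }
        return 8 + (ncpus - 8) * 5 / 8;
    }

    public static void main(String[] args) {
        for (int ncpus : new int[] {2, 8, 16, 64}) {
            System.out.println(ncpus + " cores -> " + parallelGcThreads(ncpus) + " threads");
        }
    }
}
```

Under these assumptions a 16-core machine gets 13 threads and a 64-core machine gets 43.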

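In the green zone, the number of Refine threads started depends on the size of the DCQS. One parameter controls the step between the points at which successive Refine threads start consuming the queue: G1ConcRefinementThresholdStep. If it is not set, it is inferred as (Yellow - Green) / (number of Refine worker threads + 1). Suppose ParallelGCThreads=4 and G1ConcRefinementThreads=3. The Green, Yellow and Red values are then {4, 12, 24}, and G1ConcRefinementThresholdStep is inferred as (12 - 4) / (3 + 1) = 2. There are four Refine threads in total. Thread 0 starts when the DCQS holds more than 4 DCQs and stops once the count falls back to 4. Thread 1 starts when the DCQS reaches 9 DCQs and stops once the count falls back to 6. Thread 2 starts when the DCQS reaches 11 DCQs and stops once the count falls back to 8. Thread 3 samples the young regions. When the DCQS holds more than 24 DCQs, the Mutators start helping, so the DCQS holds at most about 24 DCQs.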
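These numbers follow from the two formulas in ConcurrentG1RefineThread::initialize() shown earlier. The Java sketch below recomputes them; RefineThresholds and its method names are hypothetical, and only the min/max formulas are taken from the listing.

```java
// Recomputes the per-thread thresholds from initialize() (hypothetical helper).
final class RefineThresholds {

    // Step between activation thresholds: (yellow - green) / (workers + 1).
    static int thresholdStep(int green, int yellow, int workers) {
        return (yellow - green) / (workers + 1);
    }

    // Mirrors initialize(): min(step * (id + 1) + green, yellow).
    static int activationThreshold(int workerId, int step, int green, int yellow) {
        return Math.min(step * (workerId + 1) + green, yellow);
    }

    // Mirrors initialize(): max(activation threshold - step, green).
    static int deactivationThreshold(int workerId, int step, int green, int yellow) {
        return Math.max(activationThreshold(workerId, step, green, yellow) - step, green);
    }

    public static void main(String[] args) {
        int green = 4, yellow = 12, workers = 3; // values from the example in the text
        int step = thresholdStep(green, yellow, workers);
        System.out.println("step = " + step);
        for (int id = 0; id < workers; id++) {
            System.out.println("worker " + id
                    + ": activation threshold " + activationThreshold(id, step, green, yellow)
                    + ", deactivation threshold " + deactivationThreshold(id, step, green, yellow));
        }
    }
}
```

In run(), a worker is activated by its predecessor once the buffer count exceeds the worker's activation threshold, which is why worker 1 (threshold 8) wakes at 9 buffers and worker 2 (threshold 10) wakes at 11. Worker 0 is woken by the Mutators instead, so its computed threshold is not what starts it.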

Write barriers involved in RSet maintenance

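We have kept returning to one concept: reference relationships. Refine is mainly concerned with changes to reference relationships or, more precisely, with assignments to reference fields. How are such changes detected? That is the job of the write barrier. A write barrier is some extra work performed whenever a particular memory location is changed (that is, written). Most garbage collection algorithms use write barriers.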

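A write barrier is typically used to detect and record collection-relevant pointers (interesting pointers) at run time. When the collector reclaims only part of the heap, every pointer into that part from outside it has to be caught by the write barrier; these pointers then serve as roots from which marking starts during collection. CMS is a typical example that records reference relationships through a write barrier, and G1 does the same. For example, each time a reference in an old-generation object is changed to point to a young-generation object, the write barrier catches and records it. During a young collection, the collector can therefore avoid scanning the whole old generation to find roots. G1's RSets are maintained through the write barrier: on a reference write, an extra piece of inserted code puts the reference relationship into a DCQ, and the Refine threads later update the RSet, recording pointers to objects inside the heap region. This recording happens after the write. For a write barrier, filtering out unnecessary writes is essential: it both speeds up the mutator and lightens the collector's load.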

G1 applies three filters:

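References from the young generation to the young generation, or from the young generation to the old generation, are not recorded (because all young regions are collected in every garbage collection); this is filtered in the write barrier.
References within a single region are filtered out during RSet processing.
Null references are filtered out during RSet processing.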

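Filtering shrinks the RSets considerably. One open question is when the write barrier is triggered to update the DCQ; this is covered in more detail where write barriers come up in the discussion of mixed collections. G1's write barrier uses a two-level buffering structure (implemented with queue sets):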

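Per-thread queue set: each thread has its own queue set. Every thread first puts its write barrier records into its own queue set; when that fills up, it is handed over to the global set of filled buffers and the thread obtains a fresh queue set.
Global set of filled buffers: a single global collection, shared by all threads, that holds the filled DCQs.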
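The two-level structure can be modelled in a few lines of Java. This is a sketch under stated assumptions, not G1's implementation: ToyDirtyCardQueues and ThreadQueue are hypothetical names, cards are plain long values, and the buffer size is kept tiny so the hand-over is easy to see.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Toy model of per-thread buffers feeding one global set of filled buffers.
final class ToyDirtyCardQueues {
    static final int BUFFER_SIZE = 4; // tiny on purpose; an assumed value for the demo

    // Second level: filled buffers from all threads, shared globally.
    private final ConcurrentLinkedQueue<List<Long>> completed = new ConcurrentLinkedQueue<>();

    // First level: one private buffer per mutator thread.
    final class ThreadQueue {
        private List<Long> buffer = new ArrayList<>(BUFFER_SIZE);

        // Called by the write barrier: record the card, and hand the buffer
        // over to the global set as soon as it is full.
        void enqueue(long card) {
            buffer.add(card);
            if (buffer.size() == BUFFER_SIZE) {
                completed.add(buffer);
                buffer = new ArrayList<>(BUFFER_SIZE);
            }
        }

        int pending() {
            return buffer.size();
        }
    }

    ThreadQueue newThreadQueue() {
        return new ThreadQueue();
    }

    int completedBuffers() {
        return completed.size();
    }

    public static void main(String[] args) {
        ToyDirtyCardQueues dcqs = new ToyDirtyCardQueues();
        ThreadQueue q = dcqs.newThreadQueue();
        for (long card = 0; card < 10; card++) {
            q.enqueue(card);
        }
        System.out.println("filled buffers: " + dcqs.completedBuffers()
                + ", still local: " + q.pending());
    }
}
```

Only the hand-over touches shared state, so the common case of appending to the thread's own buffer needs no synchronization.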

Reading the logs

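To exercise the write barrier, here is an example. The code allocates large arrays so that they go straight into the old generation, which lets us see more RSet information: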

public class RSetTest {
  static Object[] largeObject1 = new Object[1024 * 1024];
  static Object[] largeObject2 = new Object[1024 * 1024];
  static int[] temp;
  public static void main(String[] args) {
    int numGCs = 200;
    for (int k = 0; k < numGCs - 1; k++) {
      for (int i = 0; i < largeObject1.length; i++) {
        largeObject1[i] = largeObject2;
      }
      for (int i = 0; i < largeObject2.length; i++) {
        largeObject2[i] = largeObject1;
      }
      for (int i = 0; i < 1024 ; i++) {
          temp = new int[1024];
      }
      System.gc();
    }
  }
}

Turn on G1TraceConcRefinement to observe what the Refine threads are doing:

-Xmx256M -XX:+UseG1GC -XX:G1ConcRefinementThreads=4 
-XX:G1ConcRefinementGreenZone=1 -XX:G1ConcRefinementYellowZone=2 
-XX:G1ConcRefinementRedZone=3 -XX:+UnlockExperimentalVMOptions 
-XX:G1LogLevel=finest -XX:+UnlockDiagnosticVMOptions  
-XX:+G1TraceConcRefinement -XX:+PrintGCTimeStamps

The resulting log is:

1.725: [Full GC (System.gc())  12M->8854K(29M), 0.0150339 secs]
  [Eden: 5120.0K(10.0M)->0.0B(10.0M) Survivors: 0.0B->0.0B Heap: 
    12.9M(29.0M)->8854.2K(29.0M)], [Metaspace: 3484K->3484K(1056768K)]
  [Times: user=0.01 sys=0.00, real=0.02 secs] 
G1-Refine-activated worker 1, on threshold 1, current 2
G1-Refine-deactivated worker 1, off threshold 1, current 1
G1-Refine-activated worker 1, on threshold 1, current 3
G1-Refine-activated worker 2, on threshold 1, current 2
G1-Refine-deactivated worker 2, off threshold 1, current 1

This log shows several Refine threads at work, each being activated or deactivated at its own threshold.

Turn on G1SummarizeRSetStats to see detailed information on RSet updates:

-Xmx256M -XX:+UseG1GC  -XX:+UnlockExperimentalVMOptions 
-XX:G1LogLevel=finest -XX:+UnlockDiagnosticVMOptions 
-XX:+G1SummarizeRSetStats  -XX:G1SummarizeRSetStatsPeriod=1  
-XX:+PrintGCTimeStamps

The log looks like this:

Cumulative RS summary
  Recent concurrent refinement statistics
    Processed 3110803 cards
    Of 12941 completed buffers:
         12941 ( 100.0%) by concurrent RS threads.
          0 (  0.0%) by mutator threads.
    Did 0 coarsenings.

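In total 3,110,803 cards were processed, using 12,941 buffers. At a maximum of 256 entries per buffer, that is at most 3,312,896 entries, so some buffers were not full when they were processed. All 12,941 buffers were processed by Refine threads. The first 0 means that no buffer was processed by a Mutator, and the second 0 means that no region's PRT was coarsened. The log below shows nine Refine threads in total: eight process RSets and one does the sampling. Two of the Refine threads spent 200 ms and 80 ms respectively, and the other six were probably never started: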

Concurrent RS threads times (s)
      0.20     0.08     0.00     0.00     0.00     0.00     0.00     0.00
Concurrent sampling threads times (s)
      0.00
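The buffer arithmetic in the summary above can be checked with a few lines of Java. The inputs are the numbers from the log and the 256-entry buffer size; RefinementStats and its method names are hypothetical.

```java
import java.util.Locale;

// Checks the capacity and fill level implied by the refinement statistics.
final class RefinementStats {

    // Maximum number of entries the given number of buffers could hold.
    static long capacity(long buffers, int entriesPerBuffer) {
        return buffers * entriesPerBuffer;
    }

    // Fraction of the available entries that actually held a card.
    static double fillRatio(long cards, long buffers, int entriesPerBuffer) {
        return (double) cards / capacity(buffers, entriesPerBuffer);
    }

    public static void main(String[] args) {
        long cards = 3_110_803L;    // "Processed 3110803 cards"
        long buffers = 12_941L;     // "Of 12941 completed buffers"
        int entriesPerBuffer = 256; // DCQ length given in the text
        System.out.println("capacity = " + capacity(buffers, entriesPerBuffer));
        System.out.println(String.format(Locale.ROOT, "fill ratio = %.1f%%",
                100.0 * fillRatio(cards, buffers, entriesPerBuffer)));
        System.out.println(String.format(Locale.ROOT, "average cards per buffer = %.1f",
                (double) cards / buffers));
    }
}
```

A fill ratio below 100% is what the text means by some buffers not being full when they were processed.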

The next part of the log shows the extra memory taken up by the RSets:

Current rem set statistics
  Total per region rem sets sizes = 85K. Max = 4K.
           2K (  3.3%) by 1 Young regions
          31K ( 36.4%) by 10 Humonguous regions
          48K ( 56.5%) by 17 Free regions
           3K (  3.8%) by 1 Old regions
   Static structures = 16K, free_lists = 0K.

The next part shows how many entries have been set in the RSets' PRTs, in other words how many times cards have been referenced:

16388 occupied cards represented.
        0 (  0.0%) entries by 1 Young regions
    16388 (100.0%) entries by 10 Humonguous regions
        0 (  0.0%) entries by 17 Free regions
        0 (  0.0%) entries by 1 Old regions
Region with largest rem set = 0:(HS)[0x00000000f0000000,0x00000000f0400010,
  0x00000000f0500000], size = 4K, occupied = 8K.

The next part shows information about JIT-compiled code in the HeapRegions:

Total heap region code root sets sizes = 0K.  Max = 0K.
         0K (  1.8%) by 1 Young regions
         0K ( 17.7%) by 10 Humonguous regions
         0K ( 30.1%) by 17 Free regions
         0K ( 50.4%) by 1 Old regions
  16 code roots represented.
          0 (  0.0%) elements by 1 Young regions
          0 (  0.0%) elements by 10 Humonguous regions
          0 (  0.0%) elements by 17 Free regions
         16 (100.0%) elements by 1 Old regions
  Region with largest amount of code roots = 10:(O)[0x00000000f0a00000,
    0x00000000f0aae898,0x00000000f0b00000], size = 0K, num_elems = 0

Parameters and tuning

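This chapter covered the Refine threads newly introduced by G1, which process references between regions so that live objects can be identified quickly. The parameters covered in this chapter and their usage are: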

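·G1ConcRefinementThreads is the number of G1 Refine threads. The default is 0, in which case G1 infers it heuristically by using the number of parallel GC threads, ParallelGCThreads, as the number of concurrent threads; ParallelGCThreads can itself be set or inferred heuristically. This parameter usually does not need to be set. As a rule of thumb, the number of parallel threads is about 5/8 of the number of CPUs; the exact inference is given above.
·G1UpdateBufferSize is the length of a DCQ. The default is 256; increasing it lets each DCQ hold more pending reference records.
·G1UseAdaptiveConcRefinement defaults to true, which allows the Refinement Zone boundaries to be adjusted dynamically, depending on whether the RSet update time meets its target.
·G1RSetUpdatingPauseTimePercent defaults to 10, meaning that RSet updating should take no more than 10% of the GC pause time. When G1UseAdaptiveConcRefinement is true, the Green Zone is updated as follows: if the RSet processing time exceeds the target, the Green Zone is scaled to 0.9 times its previous value; otherwise, if the number of buffers processed during the update is greater than the Green Zone, it is increased to 1.1 times its previous value; otherwise it is left unchanged. The Yellow Zone and Red Zone are then 3 times and 6 times the Green Zone. Note in particular that these dynamic changes can drive the Green Zone to 0, in which case the Yellow Zone and Red Zone are 0 as well. If that happens, the Refine threads no longer do any work and the RSets are processed by the Mutators, which is almost never what we want. So when tuning, either turn dynamic adjustment off or set a reasonable RSet processing time. Turning dynamic adjustment off takes more experience, so setting a reasonable RSet processing time is the more common choice.
·G1ConcRefinementThresholdStep defaults to 0. If it is not set, G1 infers it heuristically from the Yellow Zone and Green Zone. It is the step, across the whole DirtyCardQueueSet, between the points at which successive RSet-updating Refine threads start working.
·G1ConcRefinementServiceIntervalMillis defaults to 300, meaning that the thread that samples the young-generation RSets runs every 300 ms.
·G1ConcRefinementGreenZone sets the size of the Green Zone. The default is 0, and G1 can infer it heuristically. If it is 0 and dynamic adjustment is turned off, the Refine worker threads do no work, which means that the GC processes all of the queues. If it is non-zero, the Refine threads leave that many buffers behind on each pass and do not process them. For an explicit value to take effect, dynamic adjustment must be turned off. This parameter is usually left unset.
·G1ConcRefinementYellowZone sets the size of the Yellow Zone. The default is 0, and G1 can infer it heuristically as 3 times the Green Zone.
·G1ConcRefinementRedZone sets the size of the Red Zone. The default is 0, and G1 can infer it heuristically as 6 times the Green Zone. G1ConcRefinementGreenZone, G1ConcRefinementYellowZone and G1ConcRefinementRedZone normally do not need adjusting, but if RSet processing turns out to be too slow, you can turn off G1UseAdaptiveConcRefinement and set values that suit the number of Refine threads.
·G1ConcRSLogCacheSize defaults to 10, meaning that at most $2^{10}$, or 1024, hot cards are stored. What happens beyond 1024? The JVM keeps it simple: once the cache is full, the oldest card is taken out and processed, which amounts to no longer treating it as a hot card.
·G1ConcRSHotCardLimit defaults to 4: a card that has been modified 4 times is considered a hot card. Hot cards exist to reduce how often the same card is processed. Because an RSet is stored with the referenced region, several objects may refer to the same object; when that object is processed, all of those referrers can be treated as roots in one go.
·G1RSetRegionEntries defaults to 0, and G1 can infer it heuristically as base*(log(region_size/1M)+1). base defaults to 256 and can only be set in development builds, not in release builds. This value matters a great deal: if it is too small, the RSet degrades from fine-grained to coarse-grained, and tracing and marking objects takes longer. The formula also shows that the inferred value can be influenced through HeapRegionSize, for example by setting HeapRegionSize manually. In practice the value can also be set directly to suit the workload (for example, 1024), which keeps performance high. Each region's fine-grained card tables then use 1024 entries, and summed over all regions this extra space is considerable; this is one reason RSets consume so much space.
·G1SummarizeRSetStats prints RSet statistics. G1SummarizeRSetStatsPeriod=n collects the statistics once every n GCs; the default is 0, meaning that statistics are not collected periodically. Statistics collection is usually not used in production.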
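The warning about the Green Zone shrinking to 0 is easier to appreciate with a small simulation of the adjustment rule described under G1RSetUpdatingPauseTimePercent. This Java sketch follows the rule as stated in the text; AdaptiveZones and its method names are hypothetical, and truncating the scaled value to an integer is an assumption of this sketch.

```java
// Simulates the adaptive Green Zone rule as described in the text (hypothetical helper).
final class AdaptiveZones {

    // One adjustment step: shrink to 0.9x when RSet updating missed its time goal,
    // grow to 1.1x when more buffers were processed than the Green Zone allows,
    // otherwise keep the value. Truncation toward zero is an assumption here, and
    // it means that small values may not grow in this simplified model.
    static int nextGreen(int green, double updateRsTimeMs, double goalMs, int processedBuffers) {
        if (updateRsTimeMs > goalMs) {
            return (int) (green * 0.9);
        }
        if (processedBuffers > green) {
            return (int) (green * 1.1);
        }
        return green;
    }

    static int yellowFor(int green) {
        return green * 3;
    }

    static int redFor(int green) {
        return green * 6;
    }

    public static void main(String[] args) {
        int green = 4;
        // Every pause misses a 10 ms goal, so the zone keeps shrinking.
        while (green > 0) {
            green = nextGreen(green, 12.0, 10.0, 0);
            System.out.println("green=" + green + " yellow=" + yellowFor(green)
                    + " red=" + redFor(green));
        }
    }
}
```

Under these assumptions a Green Zone of 4 reaches 0 after four consecutive pauses that miss the goal, and the Yellow and Red Zones collapse with it.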