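Refine threads are a concurrent thread pool newly introduced by G1. By default the pool contains G1ConcRefinementThreads+1 threads, which serve two main functions:
- Sampling the young-generation regions and updating the number of young heap regions (YHRs) so that the pause-time target is still met. A single thread normally handles this.
- Managing the RSet, which is the main job of the Refine threads. RSet updates are not performed synchronously: G1 first places every reference change into a queue called the dirty card queue (DCQ), and threads then consume the queue to complete the update. Normally G1ConcRefinementThreads threads do this work, although GC threads and Mutators may also update the RSet in addition to the Refine threads. DCQs are managed through the Dirty Card Queue Set (DCQS). So that the work can proceed concurrently, each Refine thread is responsible for only some of the DCQs in the DCQS.
For the Refine threads that process dirty cards, two questions matter: how a Mutator places references into the DCQS for the Refine threads to process, and how a Mutator helps out when the Refine threads are too busy. We first introduce the relatively independent sampling thread and then the ordinary Refine threads.
The Sampling Thread
The last thread in the Refine thread pool is the sampling thread. Its main role is to set the number of young-generation regions so that G1 meets the predicted garbage-collection pause time. The sampling code is in run_young_rs_sampling, shown below: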
void ConcurrentG1RefineThread::run_young_rs_sampling() {
  DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();
  _vtime_start = os::elapsedVTime();
  while(!_should_terminate) {
    sample_young_list_rs_lengths();
    if (os::supports_vtime()) {
      _vtime_accum = (os::elapsedVTime() - _vtime_start);
    } else {
      _vtime_accum = 0.0;
    }
    MutexLockerEx x(_monitor, Mutex::_no_safepoint_check_flag);
    if (_should_terminate) {
      break;
    }
    /* G1ConcRefinementServiceIntervalMillis controls how often the sampling
       thread runs. In production, reduce the interval if sampling proves
       insufficient; if the system runs stably and meets the predicted pause
       time, increase it to sample less often. */
    _monitor->wait(Mutex::_no_safepoint_check_flag, G1ConcRefinementServiceIntervalMillis);
  }
}
hotspot/src/share/vm/gc_implementation/g1/concurrentG1RefineThread.cpp
void ConcurrentG1RefineThread::sample_young_list_rs_lengths() {
  SuspendibleThreadSetJoiner sts;
  G1CollectedHeap* g1h = G1CollectedHeap::heap();
  G1CollectorPolicy* g1p = g1h->g1_policy();
  if (g1p->adaptive_young_list_length()) {
    int regions_visited = 0;
    g1h->young_list()->rs_length_sampling_init();
    // young_list is a linked list of all young-generation regions
    while (g1h->young_list()->rs_length_sampling_more()) {
      /* The key call is rs_length_sampling_next. The idea is simple: it takes
         the RSet length of the current region, counting its sparse,
         fine-grained, and coarse-grained entries, and adds it to the total
         that the young collection will have to process. This also shows that
         the pause time means the time needed to collect the young regions,
         which naturally includes processing cross-region references. */
      g1h->young_list()->rs_length_sampling_next();
      ++regions_visited;
      // After every 10 regions, offer to give up the CPU so that the VMThread
      // can reach a safepoint smoothly when a GC occurs. Chapter 10 explains
      // safepoint entry in detail.
      if (regions_visited == 10) {
        if (sts.should_yield()) {
          sts.yield();
          break;
        }
        regions_visited = 0;
      }
    }
    // Use the sampled data above to update the number of young regions
    g1p->revise_young_list_target_length_if_necessary();
  }
}
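The pattern of checking for a yield request once every ten regions can be illustrated outside HotSpot. What follows is a minimal sketch under stated assumptions, not the HotSpot implementation: YieldRequest and sample_rs_lengths are hypothetical names, a plain boolean stands in for the suspendible thread set, and the young list is modelled as a vector of RSet lengths.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the suspendible thread set: it only records
// whether a safepoint (yield) has been requested.
struct YieldRequest {
    bool requested;
};

// Sums the RSet lengths of the young regions. After every `batch` regions it
// checks for a pending yield request and, if one exists, abandons the round.
// `completed` reports whether the whole list was visited.
std::size_t sample_rs_lengths(const std::vector<std::size_t>& young_rs_lengths,
                              const YieldRequest& yield,
                              bool& completed,
                              int batch = 10) {
    assert(batch > 0);
    std::size_t sampled = 0;
    int visited = 0;
    completed = true;
    for (std::size_t i = 0; i < young_rs_lengths.size(); ++i) {
        sampled += young_rs_lengths[i];
        if (++visited == batch) {
            if (yield.requested) {
                completed = false;  // give up this round, like the break above
                break;
            }
            visited = 0;
        }
    }
    return sampled;
}
```

Checking only once per batch keeps the cost of the check low while still bounding how long a safepoint request can go unnoticed.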
The code that revises the number of young-generation regions is shown below:
src\share\vm\gc_implementation\g1\g1CollectorPolicy.cpp
void G1CollectorPolicy::revise_young_list_target_length_if_necessary() {
guarantee( adaptive_young_list_length(), "should not call this otherwise" );
size_t rs_lengths = _g1->young_list()->sampled_rs_lengths();
if (rs_lengths > _rs_lengths_prediction) {
// add 10% to avoid having to recalculate often
size_t rs_lengths_prediction = rs_lengths * 1100 / 1000;
update_young_list_target_length(rs_lengths_prediction);
}
}
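The revision rule above amounts to a small piece of integer arithmetic. The sketch below restates it as a free function so the numbers can be checked by hand; revise_prediction is a hypothetical name, and the sketch leaves out everything the real method does beyond the comparison and the 10% padding.

```cpp
#include <cassert>
#include <cstddef>

// A new prediction is produced only when the sampled RSet length exceeds the
// previous prediction. It is padded by 10% with integer arithmetic, as in
// rs_lengths * 1100 / 1000. Returns true and writes `padded` on revision;
// otherwise leaves `padded` untouched.
bool revise_prediction(std::size_t sampled, std::size_t previous,
                       std::size_t& padded) {
    if (sampled <= previous) {
        return false;
    }
    padded = sampled * 1100 / 1000;
    return true;
}
```

For example, a sampled length of 2000 against a previous prediction of 1500 yields a padded prediction of 2200, whereas a sampled length of 1000 leaves the prediction unchanged.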
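The actual calculation is in update_young_list_target_length. The argument passed in is the sampled RSet length, padded by 10%. The prediction also has to respect the lower and upper bounds on the number of young-generation regions. The logic is not complicated, and it is easy to read once the idea behind the pause prediction model is understood. The source is as follows: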
src\share\vm\gc_implementation\g1\g1CollectorPolicy.cpp
void G1CollectorPolicy::update_young_list_target_length(size_t rs_lengths) {
if (rs_lengths == (size_t) -1) {
// if it's set to the default value (-1), we should predict it;
// otherwise, use the given value.
rs_lengths = (size_t) get_new_prediction(_rs_lengths_seq);
}
// Calculate the absolute and desired min bounds.
// This is how many young regions we already have (currently: the survivors).
uint base_min_length = recorded_survivor_regions();
// This is the absolute minimum young length, which ensures that we
// can allocate one eden region in the worst-case.
uint absolute_min_length = base_min_length + 1;
uint desired_min_length =
calculate_young_list_desired_min_length(base_min_length);
if (desired_min_length < absolute_min_length) {
desired_min_length = absolute_min_length;
}
// Calculate the absolute and desired max bounds.
// We will try our best not to "eat" into the reserve.
uint absolute_max_length = 0;
if (_free_regions_at_end_of_collection > _reserve_regions) {
absolute_max_length = _free_regions_at_end_of_collection - _reserve_regions;
}
uint desired_max_length = calculate_young_list_desired_max_length();
if (desired_max_length > absolute_max_length) {
desired_max_length = absolute_max_length;
}
uint young_list_target_length = 0;
if (adaptive_young_list_length()) {
if (gcs_are_young()) {
young_list_target_length =
calculate_young_list_target_length(rs_lengths,
base_min_length,
desired_min_length,
desired_max_length);
_rs_lengths_prediction = rs_lengths;
} else {
// Don't calculate anything and let the code below bound it to
// the desired_min_length, i.e., do the next GC as soon as
// possible to maximize how many old regions we can add to it.
}
} else {
// The user asked for a fixed young gen so we'll fix the young gen
// whether the next GC is young or mixed.
young_list_target_length = _young_list_fixed_length;
}
// Make sure we don't go over the desired max length, nor under the
// desired min length. In case they clash, desired_min_length wins
// which is why that test is second.
if (young_list_target_length > desired_max_length) {
young_list_target_length = desired_max_length;
}
if (young_list_target_length < desired_min_length) {
young_list_target_length = desired_min_length;
}
assert(young_list_target_length > recorded_survivor_regions(),
"we should be able to allocate at least one eden region");
assert(young_list_target_length >= absolute_min_length, "post-condition");
_young_list_target_length = young_list_target_length;
update_max_gc_locker_expansion();
}
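The bounding logic at the end of update_young_list_target_length can be isolated into a short function. This is a simplified sketch: bound_young_target is a hypothetical name, and policy_min and policy_max are assumed inputs standing in for the results of calculate_young_list_desired_min_length and calculate_young_list_desired_max_length.

```cpp
#include <cassert>

// Clamps a candidate young-list target length (in regions).
// survivors    : young regions already in use (the survivors)
// policy_min   : assumed result of the desired-min calculation
// policy_max   : assumed result of the desired-max calculation
// free_regions : free regions at the end of the last collection
// reserve      : regions kept in reserve
unsigned bound_young_target(unsigned target, unsigned survivors,
                            unsigned policy_min, unsigned policy_max,
                            unsigned free_regions, unsigned reserve) {
    // At least one eden region must be allocatable.
    unsigned absolute_min = survivors + 1;
    unsigned desired_min = policy_min < absolute_min ? absolute_min : policy_min;
    // Try not to eat into the reserve.
    unsigned absolute_max = free_regions > reserve ? free_regions - reserve : 0;
    unsigned desired_max = policy_max > absolute_max ? absolute_max : policy_max;
    if (target > desired_max) target = desired_max;
    // The minimum is tested second, so it wins when the bounds clash.
    if (target < desired_min) target = desired_min;
    return target;
}
```

With 4 survivors, a policy range of 10 to 40, 100 free regions and a reserve of 10, a candidate of 50 is cut to 40. If only 12 regions are free, the maximum drops to 2, yet the result is still 10 because the minimum takes precedence.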
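Managing the RSet
We mentioned earlier that the RSet manages object reference relationships, but not how those relationships are maintained. G1 uses the Refine threads to maintain them asynchronously. Because the processing is asynchronous, a data structure is needed to hold the references that are waiting to be processed. The JVM declares a global static DirtyCardQueueSet (DCQS) that holds DCQs. For performance reasons, all threads that process reference relationships share this one DCQS, and every Mutator thread is associated with it when it is initialized.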
src\share\vm\gc_implementation\g1\dirtyCardQueue.hpp
// A ptrQueue whose elements are "oops", pointers to object heads.
class DirtyCardQueue: public PtrQueue {
public:
DirtyCardQueue(PtrQueueSet* qset_, bool perm = false) :
// Dirty card queues are always active, so we create them with their
// active field set to true.
PtrQueue(qset_, perm, true /* active */) { }
// Flush before destroying; queue may be used to capture pending work while
// doing something else, with auto-flush on completion.
~DirtyCardQueue() { if (!is_permanent()) flush(); }
// Process queue entries and release resources.
void flush() { flush_impl(); }
// Apply the closure to all elements, and reset the index to make the
// buffer empty. If a closure application returns "false", return
// "false" immediately, halting the iteration. If "consume" is true,
// deletes processed entries from logs.
bool apply_closure(CardTableEntryClosure* cl,
bool consume = true,
uint worker_i = 0);
// Apply the closure to all elements of "buf", down to "index"
// (inclusive.) If returns "false", then a closure application returned
// "false", and we return immediately. If "consume" is true, entries are
// set to NULL as they are processed, so they will not be processed again
// later.
static bool apply_closure_to_buffer(CardTableEntryClosure* cl,
void** buf, size_t index, size_t sz,
bool consume = true,
uint worker_i = 0);
void **get_buf() { return _buf;}
void set_buf(void **buf) {_buf = buf;}
size_t get_index() { return _index;}
void reinitialize() { _buf = 0; _sz = 0; _index = 0;}
};
bool DirtyCardQueue::apply_closure(CardTableEntryClosure* cl,
bool consume,
uint worker_i) {
bool res = true;
if (_buf != NULL) {
res = apply_closure_to_buffer(cl, _buf, _index, _sz,
consume,
worker_i);
if (res && consume) _index = _sz;
}
return res;
}
bool DirtyCardQueue::apply_closure_to_buffer(CardTableEntryClosure* cl,
void** buf,
size_t index, size_t sz,
bool consume,
uint worker_i) {
if (cl == NULL) return true;
for (size_t i = index; i < sz; i += oopSize) {
int ind = byte_index_to_index((int)i);
jbyte* card_ptr = (jbyte*)buf[ind];
if (card_ptr != NULL) {
// Set the entry to null, so we don't do it again (via the test
// above) if we reconsider this buffer.
if (consume) buf[ind] = NULL;
if (!cl->do_card_ptr(card_ptr, worker_i)) return false;
}
}
return true;
}
src\share\vm\gc_implementation\g1\dirtyCardQueue.hpp
class DirtyCardQueueSet: public PtrQueueSet {
// The closure used in mut_process_buffer().
CardTableEntryClosure* _mut_process_closure;
DirtyCardQueue _shared_dirty_card_queue;
// Override.
bool mut_process_buffer(void** buf);
// Protected by the _cbl_mon.
FreeIdSet* _free_ids;
// The number of completed buffers processed by mutator and rs thread,
// respectively.
jint _processed_buffers_mut;
jint _processed_buffers_rs_thread;
// Current buffer node used for parallel iteration.
BufferNode* volatile _cur_par_buffer_node;
public:
DirtyCardQueueSet(bool notify_when_complete = true);
void initialize(CardTableEntryClosure* cl, Monitor* cbl_mon, Mutex* fl_lock,
int process_completed_threshold,
int max_completed_queue,
Mutex* lock, PtrQueueSet* fl_owner = NULL);
// The number of parallel ids that can be claimed to allow collector or
// mutator threads to do card-processing work.
static uint num_par_ids();
static void handle_zero_index_for_thread(JavaThread* t);
// Apply the given closure to all entries in all currently-active buffers.
// This should only be applied at a safepoint. (Currently must not be called
// in parallel; this should change in the future.) If "consume" is true,
// processed entries are discarded.
void iterate_closure_all_threads(CardTableEntryClosure* cl,
bool consume = true,
uint worker_i = 0);
// If there exists some completed buffer, pop it, then apply the
// specified closure to all its elements, nulling out those elements
// processed. If all elements are processed, returns "true". If no
// completed buffers exist, returns false. If a completed buffer exists,
// but is only partially completed before a "yield" happens, the
// partially completed buffer (with its processed elements set to NULL)
// is returned to the completed buffer set, and this call returns false.
bool apply_closure_to_completed_buffer(CardTableEntryClosure* cl,
uint worker_i = 0,
int stop_at = 0,
bool during_pause = false);
// Helper routine for the above.
bool apply_closure_to_completed_buffer_helper(CardTableEntryClosure* cl,
uint worker_i,
BufferNode* nd);
BufferNode* get_completed_buffer(int stop_at);
// Applies the current closure to all completed buffers,
// non-consumptively.
void apply_closure_to_all_completed_buffers(CardTableEntryClosure* cl);
void reset_for_par_iteration() { _cur_par_buffer_node = _completed_buffers_head; }
// Applies the current closure to all completed buffers, non-consumptively.
// Parallel version.
void par_apply_closure_to_all_completed_buffers(CardTableEntryClosure* cl);
DirtyCardQueue* shared_dirty_card_queue() {
return &_shared_dirty_card_queue;
}
// Deallocate any completed log buffers
void clear();
// If a full collection is happening, reset partial logs, and ignore
// completed ones: the full collection will make them all irrelevant.
void abandon_logs();
// If any threads have partial logs, add them to the global list of logs.
void concatenate_logs();
void clear_n_completed_buffers() { _n_completed_buffers = 0;}
jint processed_buffers_mut() {
return _processed_buffers_mut;
}
jint processed_buffers_rs_thread() {
return _processed_buffers_rs_thread;
}
};
DirtyCardQueueSet::DirtyCardQueueSet(bool notify_when_complete) :
PtrQueueSet(notify_when_complete),
_mut_process_closure(NULL),
_shared_dirty_card_queue(this, true /*perm*/),
_free_ids(NULL),
_processed_buffers_mut(0), _processed_buffers_rs_thread(0)
{
_all_active = true;
}
// Determines how many mutator threads can process the buffers in parallel.
uint DirtyCardQueueSet::num_par_ids() {
return (uint)os::initial_active_processor_count();
}
void DirtyCardQueueSet::initialize(CardTableEntryClosure* cl, Monitor* cbl_mon, Mutex* fl_lock,
int process_completed_threshold,
int max_completed_queue,
Mutex* lock, PtrQueueSet* fl_owner) {
_mut_process_closure = cl;
PtrQueueSet::initialize(cbl_mon, fl_lock, process_completed_threshold,
max_completed_queue, fl_owner);
set_buffer_size(G1UpdateBufferSize);
_shared_dirty_card_queue.set_lock(lock);
_free_ids = new FreeIdSet((int) num_par_ids(), _cbl_mon);
}
void DirtyCardQueueSet::handle_zero_index_for_thread(JavaThread* t) {
t->dirty_card_queue().handle_zero_index();
}
void DirtyCardQueueSet::iterate_closure_all_threads(CardTableEntryClosure* cl,
bool consume,
uint worker_i) {
assert(SafepointSynchronize::is_at_safepoint(), "Must be at safepoint.");
for(JavaThread* t = Threads::first(); t; t = t->next()) {
bool b = t->dirty_card_queue().apply_closure(cl, consume);
guarantee(b, "Should not be interrupted.");
}
bool b = shared_dirty_card_queue()->apply_closure(cl,
consume,
worker_i);
guarantee(b, "Should not be interrupted.");
}
bool DirtyCardQueueSet::mut_process_buffer(void** buf) {
// Used to determine if we had already claimed a par_id
// before entering this method.
bool already_claimed = false;
// We grab the current JavaThread.
JavaThread* thread = JavaThread::current();
// We get the number of any par_id that this thread
// might have already claimed.
uint worker_i = thread->get_claimed_par_id();
// If worker_i is not UINT_MAX then the thread has already claimed
// a par_id. We make note of it using the already_claimed value
if (worker_i != UINT_MAX) {
already_claimed = true;
} else {
// Otherwise we need to claim a par id
worker_i = _free_ids->claim_par_id();
// And store the par_id value in the thread
thread->set_claimed_par_id(worker_i);
}
bool b = false;
if (worker_i != UINT_MAX) {
b = DirtyCardQueue::apply_closure_to_buffer(_mut_process_closure, buf, 0,
_sz, true, worker_i);
if (b) Atomic::inc(&_processed_buffers_mut);
// If we had not claimed an id before entering the method
// then we must release the id.
if (!already_claimed) {
// we release the id
_free_ids->release_par_id(worker_i);
// and set the claimed_id in the thread to UINT_MAX
thread->set_claimed_par_id(UINT_MAX);
}
}
return b;
}
BufferNode*
DirtyCardQueueSet::get_completed_buffer(int stop_at) {
BufferNode* nd = NULL;
MutexLockerEx x(_cbl_mon, Mutex::_no_safepoint_check_flag);
if ((int)_n_completed_buffers <= stop_at) {
_process_completed = false;
return NULL;
}
if (_completed_buffers_head != NULL) {
nd = _completed_buffers_head;
_completed_buffers_head = nd->next();
if (_completed_buffers_head == NULL)
_completed_buffers_tail = NULL;
_n_completed_buffers--;
assert(_n_completed_buffers >= 0, "Invariant");
}
debug_only(assert_completed_buffer_list_len_correct_locked());
return nd;
}
bool DirtyCardQueueSet::
apply_closure_to_completed_buffer_helper(CardTableEntryClosure* cl,
uint worker_i,
BufferNode* nd) {
if (nd != NULL) {
void **buf = BufferNode::make_buffer_from_node(nd);
size_t index = nd->index();
bool b =
DirtyCardQueue::apply_closure_to_buffer(cl, buf,
index, _sz,
true, worker_i);
if (b) {
deallocate_buffer(buf);
return true; // In normal case, go on to next buffer.
} else {
enqueue_complete_buffer(buf, index);
return false;
}
} else {
return false;
}
}
bool DirtyCardQueueSet::apply_closure_to_completed_buffer(CardTableEntryClosure* cl,
uint worker_i,
int stop_at,
bool during_pause) {
assert(!during_pause || stop_at == 0, "Should not leave any completed buffers during a pause");
BufferNode* nd = get_completed_buffer(stop_at);
bool res = apply_closure_to_completed_buffer_helper(cl, worker_i, nd);
if (res) Atomic::inc(&_processed_buffers_rs_thread);
return res;
}
void DirtyCardQueueSet::apply_closure_to_all_completed_buffers(CardTableEntryClosure* cl) {
BufferNode* nd = _completed_buffers_head;
while (nd != NULL) {
bool b =
DirtyCardQueue::apply_closure_to_buffer(cl,
BufferNode::make_buffer_from_node(nd),
0, _sz, false);
guarantee(b, "Should not stop early.");
nd = nd->next();
}
}
void DirtyCardQueueSet::par_apply_closure_to_all_completed_buffers(CardTableEntryClosure* cl) {
BufferNode* nd = _cur_par_buffer_node;
while (nd != NULL) {
BufferNode* next = (BufferNode*)nd->next();
BufferNode* actual = (BufferNode*)Atomic::cmpxchg_ptr((void*)next, (volatile void*)&_cur_par_buffer_node, (void*)nd);
if (actual == nd) {
bool b =
DirtyCardQueue::apply_closure_to_buffer(cl,
BufferNode::make_buffer_from_node(actual),
0, _sz, false);
guarantee(b, "Should not stop early.");
nd = next;
} else {
nd = actual;
}
}
}
// Deallocates any completed log buffers
void DirtyCardQueueSet::clear() {
BufferNode* buffers_to_delete = NULL;
{
MutexLockerEx x(_cbl_mon, Mutex::_no_safepoint_check_flag);
while (_completed_buffers_head != NULL) {
BufferNode* nd = _completed_buffers_head;
_completed_buffers_head = nd->next();
nd->set_next(buffers_to_delete);
buffers_to_delete = nd;
}
_n_completed_buffers = 0;
_completed_buffers_tail = NULL;
debug_only(assert_completed_buffer_list_len_correct_locked());
}
while (buffers_to_delete != NULL) {
BufferNode* nd = buffers_to_delete;
buffers_to_delete = nd->next();
deallocate_buffer(BufferNode::make_buffer_from_node(nd));
}
}
void DirtyCardQueueSet::abandon_logs() {
assert(SafepointSynchronize::is_at_safepoint(), "Must be at safepoint.");
clear();
// Since abandon is done only at safepoints, we can safely manipulate
// these queues.
for (JavaThread* t = Threads::first(); t; t = t->next()) {
t->dirty_card_queue().reset();
}
shared_dirty_card_queue()->reset();
}
void DirtyCardQueueSet::concatenate_logs() {
// Iterate over all the threads, if we find a partial log add it to
// the global list of logs. Temporarily turn off the limit on the number
// of outstanding buffers.
int save_max_completed_queue = _max_completed_queue;
_max_completed_queue = max_jint;
assert(SafepointSynchronize::is_at_safepoint(), "Must be at safepoint.");
for (JavaThread* t = Threads::first(); t; t = t->next()) {
DirtyCardQueue& dcq = t->dirty_card_queue();
if (dcq.size() != 0) {
void **buf = t->dirty_card_queue().get_buf();
// We must NULL out the unused entries, then enqueue.
for (size_t i = 0; i < t->dirty_card_queue().get_index(); i += oopSize) {
buf[PtrQueue::byte_index_to_index((int)i)] = NULL;
}
enqueue_complete_buffer(dcq.get_buf(), dcq.get_index());
dcq.reinitialize();
}
}
if (_shared_dirty_card_queue.size() != 0) {
enqueue_complete_buffer(_shared_dirty_card_queue.get_buf(),
_shared_dirty_card_queue.get_index());
_shared_dirty_card_queue.reinitialize();
}
// Restore the completed buffer queue limit.
_max_completed_queue = save_max_completed_queue;
}
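Every Mutator has a private queue whose maximum length is set by G1UpdateBufferSize (default 256), so it holds at most 256 reference entries. When the thread creates a new reference relationship, the referrer is placed in its DCQ. Once 256 entries have accumulated, the queue is handed to the DCQS (the DCQS is shared by all threads, so this step takes a lock). A thread can also submit its current queue manually before it is full, in which case it must state how many entries the queue holds. The DCQs themselves are processed by the Refine threads. The DCQS initialization code is as follows: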
src\share\vm\gc_implementation\g1\g1CollectedHeap.cpp
JavaThread::satb_mark_queue_set().initialize(SATB_Q_CBL_mon,
SATB_Q_FL_lock,
G1SATBProcessCompletedThreshold,
Shared_SATB_Q_lock);
JavaThread::dirty_card_queue_set().initialize(_refine_cte_cl,
DirtyCardQ_CBL_mon,
DirtyCardQ_FL_lock,
concurrent_g1_refine()->yellow_zone(),
concurrent_g1_refine()->red_zone(),
Shared_DirtyCardQ_lock);
dirty_card_queue_set().initialize(NULL, // Should never be called by the Java code
DirtyCardQ_CBL_mon,
DirtyCardQ_FL_lock,
-1, // never trigger processing
-1, // no limit on length
Shared_DirtyCardQ_lock,
&JavaThread::dirty_card_queue_set());
// Initialize the card queue set used to hold cards containing
// references into the collection set.
_into_cset_dirty_card_queue_set.initialize(NULL, // Should never be called by the Java code
DirtyCardQ_CBL_mon,
DirtyCardQ_FL_lock,
-1, // never trigger processing
-1, // no limit on length
Shared_DirtyCardQ_lock,
&JavaThread::dirty_card_queue_set());
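The per-Mutator buffering described before the initialization code can be modelled in a few lines. The sketch below is a simplified illustration, not HotSpot code: SharedBufferSet and LocalCardQueue are hypothetical names, capacity is counted in entries rather than bytes, and std::mutex stands in for the HotSpot locks. As in PtrQueue, the index starts at the capacity and counts down, so an index of 0 means the buffer is full.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

typedef std::vector<const void*> CardBuffer;

// Hypothetical stand-in for the shared DCQS: a locked list of full buffers.
class SharedBufferSet {
public:
    void enqueue_completed(CardBuffer buf) {
        std::lock_guard<std::mutex> guard(lock_);  // shared by all threads
        completed_.push_back(std::move(buf));
    }
    std::size_t completed_count() {
        std::lock_guard<std::mutex> guard(lock_);
        return completed_.size();
    }
private:
    std::mutex lock_;
    std::vector<CardBuffer> completed_;
};

// Hypothetical stand-in for a Mutator's private DCQ. No lock is needed on
// the fast path because only the owning thread touches the buffer.
class LocalCardQueue {
public:
    LocalCardQueue(SharedBufferSet& shared, std::size_t capacity)
        : shared_(shared), capacity_(capacity),
          buf_(capacity, nullptr), index_(capacity) {
        assert(capacity > 0);
    }
    void enqueue(const void* card) {
        if (index_ == 0) {
            // Full: hand the buffer to the shared set and start a fresh one.
            shared_.enqueue_completed(std::move(buf_));
            buf_.assign(capacity_, nullptr);
            index_ = capacity_;
        }
        buf_[--index_] = card;
    }
    std::size_t pending() const { return capacity_ - index_; }
private:
    SharedBufferSet& shared_;
    std::size_t capacity_;
    CardBuffer buf_;
    std::size_t index_;
};
```

Note that the hand-off is lazy: a full buffer is only passed to the shared set when the next entry arrives, which matches the way the zero index is handled on the following enqueue.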
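A global Monitor, DirtyCardQ_CBL_mon, appears here. What is it for? Any Mutator can reach the static DCQS through a static method on JavaThread, and whenever a DCQ fills up it is added to the DCQS. When the DCQ has been added and a condition holds (the number of DCQs in the DCQS has reached a threshold, which is related to the Green Zone described later), this Monitor is used to send a notify that wakes Refine thread 0. Because any Mutator may notify Refine thread 0, the Monitor is a global variable accessible to every Mutator.
The method that adds a DCQ to the DCQS is enqueue_complete_buffer, defined in PtrQueueSet, the parent class of DirtyCardQueueSet. Mutators reach it through process_or_enqueue_complete_buffer. If the Mutator finds in process_or_enqueue_complete_buffer that the DCQS is already full, it stops adding to the DCQS: too many references have changed and the Refine threads are overloaded, so the Mutator suspends its application code and updates the RSet in place of the Refine threads. The code that adds an entry to a DCQ is shown below: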
hotspot/src/share/vm/gc_implementation/g1/ptrQueue.hpp
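// Put an entry into the DCQ; a DCQ is really just a buffer
void PtrQueue::enqueue(void* ptr) {
  if (!_active) return;
  else enqueue_known_active(ptr);
}
enqueue_known_active checks whether the current DCQ still has room. If it does, the entry is added directly. If not, it calls handle_zero_index, which in turn calls process_or_enqueue_complete_buffer and uses the return value to decide whether to allocate a new DCQ. The code is shown below: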
hotspot/src/share/vm/gc_implementation/g1/ptrQueue.cpp
void PtrQueue::enqueue_known_active(void* ptr) {
  // An index of 0 means the DCQ is full: it must be added to the DCQS
  // and a new DCQ allocated
  while (_index == 0) {
    handle_zero_index();
  }
  // At this point a usable DCQ is always available, because a full DCQ
  // has been replaced by a new one. Add the entry directly
  _index -= oopSize;
  _buf[byte_index_to_index((int)_index)] = ptr;
}
// Handle the case where the DCQ is full
void PtrQueue::handle_zero_index() {
  // Check _buf again first, so that the same thread does not allocate
  // repeatedly when the DCQ is full
  if (_buf != NULL) {
    if (!should_enqueue_buffer()) {
      return;
    }
    if (_lock) {
      /* Reaching here means this is the global (shared) DCQ, so multiple
         threads must be considered. In outline: put the global DCQ into the
         DCQS, then allocate new space for the global DCQ. The local variable
         buf exists to handle contention between threads. */
      void** buf = _buf;   // local pointer to completed buffer
      _buf = NULL;         // clear shared _buf field
      /* locking_enqueue_completed_buffer is almost identical to the
         enqueue_complete_buffer shown later. The only difference is lock
         handling: this is the global DCQ, so locking and unlocking are
         involved. */
      locking_enqueue_completed_buffer(buf);
      // If _buf is not NULL, another thread has already allocated space
      // for the global DCQ, so return directly
      if (_buf != NULL) return;
    } else {
      // Ordinary (per-thread) DCQ handling
      if (qset()->process_or_enqueue_complete_buffer(_buf)) {
        // A true return value means the Mutator paused its application code
        // and processed the DCQ itself, so the DCQ can be reused
        _sz = qset()->buffer_size();
        _index = _sz;
        return;
      }
    }
  }
  // Allocate new space for the DCQ
  _buf = qset()->allocate_buffer();
  _sz = qset()->buffer_size();
  _index = _sz;
}
// Process the DCQ, deciding whether the Mutator needs to step in
bool PtrQueueSet::process_or_enqueue_complete_buffer(void** buf) {
  if (Thread::current()->is_Java_thread()) {
    // If the condition holds, the Mutator has to step in. No lock is taken
    // here, so some racing is tolerated: the worst outcome of the race is
    // that the Mutator processes the buffer itself
    if (_max_completed_queue == 0 || _max_completed_queue > 0 &&
        _n_completed_buffers >= _max_completed_queue + _completed_queue_padding) {
      bool b = mut_process_buffer(buf);
      if (b) return true;
    }
  }
  // Add the buffer to the DCQS. After this the caller allocates a new buffer;
  // whether it does so depends on the return value, and false means a new
  // buffer is needed
  enqueue_complete_buffer(buf);
  return false;
}
// This function is also very simple: it links the DCQs into a list
void PtrQueueSet::enqueue_complete_buffer(void** buf, size_t index) {
  MutexLockerEx x(_cbl_mon, Mutex::_no_safepoint_check_flag);
  BufferNode* cbn = BufferNode::new_from_buffer(buf);
  cbn->set_index(index);
  if (_completed_buffers_tail == NULL) {
    assert(_completed_buffers_head == NULL, "Well-formedness");
    _completed_buffers_head = cbn;
    _completed_buffers_tail = cbn;
  } else {
    _completed_buffers_tail->set_next(cbn);
    _completed_buffers_tail = cbn;
  }
  _n_completed_buffers++;
  // Decide whether a Refine thread needs to start working; if none is
  // working, wake one with notify
  if (!_process_completed && _process_completed_threshold >= 0 &&
      _n_completed_buffers >= _process_completed_threshold) {
    _process_completed = true;
    if (_notify_when_complete)
      // This in effect notifies Refine thread 0
      _cbl_mon->notify();
  }
}
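The decision made by process_or_enqueue_complete_buffer and enqueue_complete_buffer can be condensed into a counter model. This is a single-threaded sketch under stated assumptions: QueueSetModel and submit are hypothetical names, buffers are reduced to a count, Mutator-side processing is assumed always to succeed, and a notification is merely counted instead of waking a thread.

```cpp
#include <cassert>

// Hypothetical model of the counters kept by the queue set.
struct QueueSetModel {
    int n_completed;         // completed buffers currently queued
    int process_threshold;   // negative: never request refinement
    int max_completed;       // negative: no limit on queue length
    int padding;             // extra slack added to max_completed
    bool process_completed;  // true once refinement has been requested
    int notifications;       // how many times thread 0 would be notified

    // Returns true when the calling Mutator must process the buffer itself,
    // false when the buffer was queued for the Refine threads.
    bool submit(bool is_java_thread) {
        if (is_java_thread &&
            (max_completed == 0 ||
             (max_completed > 0 && n_completed >= max_completed + padding))) {
            return true;  // queue set is full: the Mutator does the work
        }
        ++n_completed;
        if (!process_completed && process_threshold >= 0 &&
            n_completed >= process_threshold) {
            process_completed = true;  // notify only on the first crossing
            ++notifications;
        }
        return false;
    }
};
```

With a threshold of 2 and a maximum of 3, the second submission triggers the single notification, and the fourth submission from a Java thread is turned back for the Mutator to process. A non-Java thread is never turned back, so its buffer is queued even beyond the maximum.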
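As mentioned, when the Refine threads cannot keep up, G1 has Mutators help process reference changes. The number of Refine threads can be set by the user, but as the data structures above show, so many references may still be modified that the Refine threads are too busy to keep up. When a Mutator processes reference changes, application work is paused; if this happens, either too many references are being modified or too few Refine threads are configured. The G1SummarizeRSetStats flag turns on logging for RSet processing, which reveals information about the processing threads. Next we look at how a Mutator processes a DCQ.
Mutator Processing of DCQs
The maximum length of the queue set depends on the number of Refine threads and is capped at the Red Zone value (the Red Zone is covered in the next section; for now treat it as just a number). Once the number of queues in the set exceeds the Red Zone value, a Mutator that submits a queue can no longer place it in the set and instead processes the references in that queue directly. The code is as follows: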
bool DirtyCardQueueSet::mut_process_buffer(void** buf) {
// Used to determine if we had already claimed a par_id
// before entering this method.
bool already_claimed = false;
// We grab the current JavaThread.
JavaThread* thread = JavaThread::current();
// We get the number of any par_id that this thread
// might have already claimed.
uint worker_i = thread->get_claimed_par_id();
// If worker_i is not UINT_MAX then the thread has already claimed
// a par_id. We make note of it using the already_claimed value
if (worker_i != UINT_MAX) {
already_claimed = true;
} else {
// Otherwise we need to claim a par id
worker_i = _free_ids->claim_par_id();
// And store the par_id value in the thread
thread->set_claimed_par_id(worker_i);
}
bool b = false;
if (worker_i != UINT_MAX) {
b = DirtyCardQueue::apply_closure_to_buffer(_mut_process_closure, buf, 0,
_sz, true, worker_i);
if (b) Atomic::inc(&_processed_buffers_mut);
// If we had not claimed an id before entering the method
// then we must release the id.
if (!already_claimed) {
// we release the id
_free_ids->release_par_id(worker_i);
// and set the claimed_id in the thread to UINT_MAX
thread->set_claimed_par_id(UINT_MAX);
}
}
return b;
}
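How Refine Threads Work
Refine threads are initialized when the GC manager is initialized, but without enough reference changes they would simply spin idle, so a mechanism is needed to activate and deactivate them dynamically. The JVM implements this with wait and notify. The design is as follows: among threads 0 to n-1 (n is the number of Refine threads), a thread that finds itself too busy activates its successor, and a thread that finds itself too idle deactivates itself. When is thread 0 activated? It is activated by a running Java thread: when a Java thread (Mutator) tries to enqueue a modified reference and thread 0 is not yet active, it sends a notify to wake it. So thread 0 may be notified by any Mutator, whereas threads 1 to n-1 can only be notified by the Refine thread with the preceding number. Because any Mutator may notify thread 0, the Monitor it waits on is a global variable, while each of threads 1 to n-1 waits on a Monitor of its own.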
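The activation chain can be made concrete with the threshold formula that ConcurrentG1RefineThread::initialize() uses in the listing in this section. The sketch below restates that formula as plain functions; RefineThresholds, thresholds_for, and should_activate_next are hypothetical names, and the zone values in the usage note are made-up numbers chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>

struct RefineThresholds {
    int activation;    // buffer count above which this thread is woken
    int deactivation;  // buffer count at or below which it goes back to sleep
};

// Each worker's activation threshold sits one step above its predecessor's,
// starting from the green zone and capped at the yellow zone. A worker
// deactivates one step below its own activation point, but never below
// the green zone.
RefineThresholds thresholds_for(int worker_id, int step, int green, int yellow) {
    assert(worker_id >= 0 && step >= 0);
    RefineThresholds t;
    t.activation = std::min(step * (worker_id + 1) + green, yellow);
    t.deactivation = std::max(t.activation - step, green);
    return t;
}

// A running worker wakes its successor when the current number of completed
// buffers exceeds the successor's activation threshold.
bool should_activate_next(int current_buffers, const RefineThresholds& next) {
    return current_buffers > next.activation;
}
```

Assuming a green zone of 2, a yellow zone of 10, and a step of 3, worker 0 gets thresholds (5, 2), worker 1 gets (8, 5), and worker 2 is capped at (10, 7). With 9 completed buffers, worker 0 would wake worker 1, but worker 1 would not yet wake worker 2.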
src\share\vm\gc_implementation\g1\concurrentG1RefineThread.hpp
// The G1 Concurrent Refinement Thread (could be several in the future).
class ConcurrentG1RefineThread: public ConcurrentGCThread {
friend class VMStructs;
friend class G1CollectedHeap;
double _vtime_start; // Initial virtual time.
double _vtime_accum; // Accumulated virtual time.
uint _worker_id;
uint _worker_id_offset;
// The refinement threads collection is linked list. A predecessor can activate a successor
// when the number of the rset update buffer crosses a certain threshold. A successor
// would self-deactivate when the number of the buffers falls below the threshold.
bool _active;
ConcurrentG1RefineThread* _next;
Monitor* _monitor;
ConcurrentG1Refine* _cg1r;
// The closure applied to completed log buffers.
CardTableEntryClosure* _refine_closure;
int _thread_threshold_step;
// This thread activation threshold
int _threshold;
// This thread deactivation threshold
int _deactivation_threshold;
void sample_young_list_rs_lengths();
void run_young_rs_sampling();
void wait_for_completed_buffers();
void set_active(bool x) { _active = x; }
bool is_active();
void activate();
void deactivate();
public:
virtual void run();
// Constructor
ConcurrentG1RefineThread(ConcurrentG1Refine* cg1r, ConcurrentG1RefineThread* next,
CardTableEntryClosure* refine_closure,
uint worker_id_offset, uint worker_id);
void initialize();
// Printing
void print() const;
void print_on(outputStream* st) const;
// Total virtual time so far.
double vtime_accum() { return _vtime_accum; }
ConcurrentG1Refine* cg1r() { return _cg1r; }
// shutdown
void stop();
};
The main work of a Refine thread is done in its run method. The code is as follows:
src\share\vm\gc_implementation\g1\concurrentG1RefineThread.cpp
ConcurrentG1RefineThread::ConcurrentG1RefineThread(ConcurrentG1Refine* cg1r, ConcurrentG1RefineThread *next,
CardTableEntryClosure* refine_closure,
uint worker_id_offset, uint worker_id) :
ConcurrentGCThread(),
_refine_closure(refine_closure),
_worker_id_offset(worker_id_offset),
_worker_id(worker_id),
_active(false),
_next(next),
_monitor(NULL),
_cg1r(cg1r),
_vtime_accum(0.0)
{
// Each thread has its own monitor. The i-th thread is responsible for signalling
// to thread i+1 if the number of buffers in the queue exceeds a threshold for this
// thread. Monitors are also used to wake up the threads during termination.
// The 0th worker is notified by mutator threads and has a special monitor.
// The last worker is used for young gen rset size sampling.
if (worker_id > 0) {
_monitor = new Monitor(Mutex::nonleaf, "Refinement monitor", true);
} else {
_monitor = DirtyCardQ_CBL_mon;
}
initialize();
create_and_start();
}
void ConcurrentG1RefineThread::initialize() {
if (_worker_id < cg1r()->worker_thread_num()) {
// Current thread activation threshold
_threshold = MIN2<int>(cg1r()->thread_threshold_step() * (_worker_id + 1) + cg1r()->green_zone(),
cg1r()->yellow_zone());
// A thread deactivates once the number of buffer reached a deactivation threshold
_deactivation_threshold = MAX2<int>(_threshold - cg1r()->thread_threshold_step(), cg1r()->green_zone());
} else {
set_active(true);
}
}
void ConcurrentG1RefineThread::sample_young_list_rs_lengths() {
SuspendibleThreadSetJoiner sts;
G1CollectedHeap* g1h = G1CollectedHeap::heap();
G1CollectorPolicy* g1p = g1h->g1_policy();
if (g1p->adaptive_young_list_length()) {
int regions_visited = 0;
g1h->young_list()->rs_length_sampling_init();
while (g1h->young_list()->rs_length_sampling_more()) {
g1h->young_list()->rs_length_sampling_next();
++regions_visited;
// we try to yield every time we visit 10 regions
if (regions_visited == 10) {
if (sts.should_yield()) {
sts.yield();
// we just abandon the iteration
break;
}
regions_visited = 0;
}
}
g1p->revise_young_list_target_length_if_necessary();
}
}
void ConcurrentG1RefineThread::run_young_rs_sampling() {
DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();
_vtime_start = os::elapsedVTime();
while(!_should_terminate) {
sample_young_list_rs_lengths();
if (os::supports_vtime()) {
_vtime_accum = (os::elapsedVTime() - _vtime_start);
} else {
_vtime_accum = 0.0;
}
MutexLockerEx x(_monitor, Mutex::_no_safepoint_check_flag);
if (_should_terminate) {
break;
}
_monitor->wait(Mutex::_no_safepoint_check_flag, G1ConcRefinementServiceIntervalMillis);
}
}
void ConcurrentG1RefineThread::wait_for_completed_buffers() {
DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();
MutexLockerEx x(_monitor, Mutex::_no_safepoint_check_flag);
while (!_should_terminate && !is_active()) {
_monitor->wait(Mutex::_no_safepoint_check_flag);
}
}
bool ConcurrentG1RefineThread::is_active() {
DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();
return _worker_id > 0 ? _active : dcqs.process_completed_buffers();
}
void ConcurrentG1RefineThread::activate() {
MutexLockerEx x(_monitor, Mutex::_no_safepoint_check_flag);
if (_worker_id > 0) {
if (G1TraceConcRefinement) {
DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();
gclog_or_tty->print_cr("G1-Refine-activated worker %d, on threshold %d, current %d",
_worker_id, _threshold, (int)dcqs.completed_buffers_num());
}
set_active(true);
} else {
DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();
dcqs.set_process_completed(true);
}
_monitor->notify();
}
void ConcurrentG1RefineThread::deactivate() {
MutexLockerEx x(_monitor, Mutex::_no_safepoint_check_flag);
if (_worker_id > 0) {
if (G1TraceConcRefinement) {
DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();
gclog_or_tty->print_cr("G1-Refine-deactivated worker %d, off threshold %d, current %d",
_worker_id, _deactivation_threshold, (int)dcqs.completed_buffers_num());
}
set_active(false);
} else {
DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();
dcqs.set_process_completed(false);
}
}
void ConcurrentG1RefineThread::run() {
// 初始化线程私有信息
initialize_in_thread();
wait_for_universe_init();
// Refine的最后一个线程用于处理YHR的抽样,抽样的作用在前面已经提到,
// 就是为了预测停顿时间并调整分区数目
if (_worker_id >= cg1r()->worker_thread_num()) {
run_young_rs_sampling();
terminate();
return;
}
_vtime_start = os::elapsedVTime();
// 0~n-1线程是真正的Refine线程,处理RSet
while (!_should_terminate) {
DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();
// Wait for work
// 这个就是我们上面提到的前一个线程通知后一个线程,0号线程由Mutator通知
wait_for_completed_buffers();
if (_should_terminate) {
break;
}
{
SuspendibleThreadSetJoiner sts;
do {
int curr_buffer_num = (int)dcqs.completed_buffers_num();
// If the number of the buffers falls down into the yellow zone,
// that means that the transition period after the evacuation pause has ended.
if (dcqs.completed_queue_padding() > 0 && curr_buffer_num <= cg1r()->yellow_zone()) {
dcqs.set_completed_queue_padding(0);
}
// 根据负载判断是否需要停止当前的Refine线程,如果需要则停止。
if (_worker_id > 0 && curr_buffer_num <= _deactivation_threshold) {
// If the number of the buffer has fallen below our threshold
// we should deactivate. The predecessor will reactivate this
// thread should the number of the buffers cross the threshold again.
deactivate();
break;
}
// Check if we need to activate the next thread.
// 根据负载判断是否需要通知/启动新的Refine线程,如果需要则发一个通知。
if (_next != NULL && !_next->is_active() && curr_buffer_num > _next->_threshold) {
_next->activate();
}
} while (dcqs.apply_closure_to_completed_buffer(_refine_closure, _worker_id + _worker_id_offset, cg1r()->green_zone()));
// We can exit the loop above while being active if there was a yield request.
// The loop is left on a yield request so that the thread can reach a safepoint
if (is_active()) {
deactivate();
}
}
if (os::supports_vtime()) {
_vtime_accum = (os::elapsedVTime() - _vtime_start);
} else {
_vtime_accum = 0.0;
}
}
assert(_should_terminate, "just checking");
terminate();
}
void ConcurrentG1RefineThread::stop() {
// it is ok to take late safepoints here, if needed
{
MutexLockerEx mu(Terminator_lock);
_should_terminate = true;
}
{
MutexLockerEx x(_monitor, Mutex::_no_safepoint_check_flag);
_monitor->notify();
}
{
MutexLockerEx mu(Terminator_lock);
while (!_has_terminated) {
Terminator_lock->wait();
}
}
if (G1TraceConcRefinement) {
gclog_or_tty->print_cr("G1-Refine-stop");
}
}
void ConcurrentG1RefineThread::print() const {
print_on(tty);
}
void ConcurrentG1RefineThread::print_on(outputStream* st) const {
st->print("\"G1 Concurrent Refinement Thread#%d\" ", _worker_id);
Thread::print_on(st);
st->cr();
}
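The main job of a Refine thread is to process the DCQS. The work happens in the do-while loop in run() above, whose condition is dcqs.apply_closure_to_completed_buffer(_refine_closure, _worker_id + _worker_id_offset, cg1r()->green_zone()). The loop keeps calling apply_closure_to_completed_buffer. Its code is shown first, followed by a description of the arguments passed to it: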
src\share\vm\gc_implementation\g1\dirtyCardQueue.cpp
bool DirtyCardQueueSet::apply_closure_to_completed_buffer(CardTableEntryClosure* cl,
uint worker_i,
int stop_at,
bool during_pause) {
assert(!during_pause || stop_at == 0, "Should not leave any completed buffers during a pause");
BufferNode* nd = get_completed_buffer(stop_at);
bool res = apply_closure_to_completed_buffer_helper(cl, worker_i, nd);
if (res) Atomic::inc(&_processed_buffers_rs_thread);
return res;
}
bool DirtyCardQueueSet::apply_closure_to_completed_buffer_helper(
CardTableEntryClosure* cl, uint worker_i, BufferNode* nd) {
if (nd != NULL) {
void **buf = BufferNode::make_buffer_from_node(nd);
size_t index = nd->index();
bool b = DirtyCardQueue::apply_closure_to_buffer(cl, buf,
index, _sz,
true, worker_i);
if (b) {
deallocate_buffer(buf);
return true; // In normal case, go on to next buffer.
} else {
enqueue_complete_buffer(buf, index);
return false;
}
} else {
return false;
}
}
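- The Closure argument does the actual card table processing.
- The worker_i argument (_worker_id + _worker_id_offset) identifies the worker thread. In the code shown here it is forwarded unchanged to the closure and eventually reaches add_reference; the buffers themselves are taken from the shared completed-buffer list through get_completed_buffer(stop_at).
- The stop_at argument is cg1r()->green_zone(), the Green Zone value. Every Refine thread therefore leaves at least Green completed DCQs in the DCQS unprocessed. It also follows that a GC pause must pass 0 here, meaning that all DCQs are to be processed; see G1CollectedHeap::iterate_dirty_card_closure in the discussion of young collection below.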
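The effect of stop_at can be illustrated with a small toy model. This is only a sketch, not the HotSpot implementation: CardBuffer, CompletedBufferList and drain are hypothetical names, and the buffers carry plain integers instead of card pointers. It assumes, as the arguments above suggest, that a buffer is handed out only while more than stop_at buffers remain.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for one completed dirty card buffer.
using CardBuffer = std::vector<int>;

// Toy model of the shared completed-buffer list (not the HotSpot class).
class CompletedBufferList {
 public:
  void enqueue(CardBuffer buf) {
    std::lock_guard<std::mutex> guard(lock_);
    buffers_.push_back(std::move(buf));
  }

  // Hands out the oldest buffer only while more than stop_at buffers remain.
  // A Refine thread would pass the green zone value, a GC pause would pass 0.
  bool take(std::size_t stop_at, CardBuffer& out) {
    std::lock_guard<std::mutex> guard(lock_);
    if (buffers_.size() <= stop_at) return false;
    out = std::move(buffers_.front());
    buffers_.pop_front();
    return true;
  }

  std::size_t size() {
    std::lock_guard<std::mutex> guard(lock_);
    return buffers_.size();
  }

 private:
  std::mutex lock_;                 // the list is shared, so access is locked
  std::deque<CardBuffer> buffers_;
};

// Drains the list down to stop_at buffers; returns how many were taken.
inline std::size_t drain(CompletedBufferList& list, std::size_t stop_at) {
  std::size_t processed = 0;
  CardBuffer buf;
  while (list.take(stop_at, buf)) ++processed;
  return processed;
}
```

With five buffers queued, drain(list, 2) takes three and leaves two behind, which mirrors Refine threads leaving the white zone to the GC; a later drain(list, 0) takes the rest.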
The GC-time caller, G1CollectedHeap::iterate_dirty_card_closure, does pass 0 as stop_at (and true as during_pause), as shown below:
src\share\vm\gc_implementation\g1\g1CollectedHeap.cpp
void G1CollectedHeap::iterate_dirty_card_closure(CardTableEntryClosure* cl,
DirtyCardQueue* into_cset_dcq,
bool concurrent,
uint worker_i) {
// Clean cards in the hot card cache
G1HotCardCache* hot_card_cache = _cg1r->hot_card_cache();
hot_card_cache->drain(worker_i, g1_rem_set(), into_cset_dcq);
DirtyCardQueueSet& dcqs = JavaThread::dirty_card_queue_set();
size_t n_completed_buffers = 0;
while (dcqs.apply_closure_to_completed_buffer(cl, worker_i, 0, true)) {
n_completed_buffers++;
}
g1_policy()->phase_times()->record_thread_work_item(G1GCPhaseTimes::UpdateRS, worker_i, n_completed_buffers);
dcqs.clear_n_completed_buffers();
assert(!dcqs.completed_buffers_exist_dirty(), "Completed buffers exist!");
}
Because the queue set is shared globally, access to it has to be protected by a lock. apply_closure_to_completed_buffer_helper then calls DirtyCardQueue::apply_closure_to_buffer, shown below:
src\share\vm\gc_implementation\g1\dirtyCardQueue.cpp
bool DirtyCardQueue::apply_closure_to_buffer(CardTableEntryClosure* cl,
void** buf,
size_t index, size_t sz,
bool consume,
uint worker_i) {
if (cl == NULL) return true;
for (size_t i = index; i < sz; i += oopSize) {
int ind = byte_index_to_index((int)i);
jbyte* card_ptr = (jbyte*)buf[ind];
if (card_ptr != NULL) {
// Set the entry to null, so we don't do it again (via the test
// above) if we reconsider this buffer.
// Set the slot to NULL so that a later pass over buf can skip it quickly
if (consume) buf[ind] = NULL;
if (!cl->do_card_ptr(card_ptr, worker_i)) return false;
}
}
return true;
}
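The consume behaviour of the loop above can be reproduced in isolation. The following is a simplified sketch under stated assumptions: Card and apply_to_buffer are hypothetical names, and the buffer is indexed by slot rather than by byte offset as in the listing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical card type; in the listing above a card is a jbyte.
using Card = signed char;

// Walks slots [first, buf.size()) and calls visit on every non-null entry.
// With consume == true an entry is nulled before it is visited, so a second
// pass over the same buffer skips it.
// Returns false if visit asked to stop early (for example on a yield request).
template <typename Visit>
bool apply_to_buffer(std::vector<Card*>& buf, std::size_t first,
                     bool consume, Visit visit) {
  assert(first <= buf.size());
  for (std::size_t i = first; i < buf.size(); ++i) {
    Card* card = buf[i];
    if (card == nullptr) continue;      // already consumed earlier
    if (consume) buf[i] = nullptr;
    if (!visit(card)) return false;     // buffer is only partly processed
  }
  return true;
}
```

If a pass stops early, the buffer can be re-enqueued and processed again later; the entries consumed so far are skipped because their slots are already null.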
Eventually refine_card is called, as shown below:
src\share\vm\gc_implementation\g1\g1RemSet.cpp
bool G1RemSet::refine_card(jbyte* card_ptr, uint worker_i,
bool check_for_refs_into_cset) {
assert(_g1->is_in_exact(_ct_bs->addr_for(card_ptr)),
err_msg("Card at " PTR_FORMAT " index " SIZE_FORMAT " representing heap at " PTR_FORMAT " (%u) must be in committed heap",
p2i(card_ptr),
_ct_bs->index_for(_ct_bs->addr_for(card_ptr)),
_ct_bs->addr_for(card_ptr),
_g1->addr_to_region(_ct_bs->addr_for(card_ptr))));
// If the card is no longer dirty, nothing to do.
// If the card's value is no longer dirty, the card has already been processed, so there is nothing to do; return directly
if (*card_ptr != CardTableModRefBS::dirty_card_val()) {
// No need to return that this card contains refs that point
// into the collection set.
return false;
}
// Construct the region representing the card.
// Find the region that the card belongs to
HeapWord* start = _ct_bs->addr_for(card_ptr);
// And find the region containing it.
HeapRegion* r = _g1->heap_region_containing(start);
// Why do we have to check here whether a card is on a young region,
// given that we dirty young regions and, as a result, the
// post-barrier is supposed to filter them out and never to enqueue
// them? When we allocate a new region as the "allocation region" we
// actually dirty its cards after we release the lock, since card
// dirtying while holding the lock was a performance bottleneck. So,
// as a result, it is possible for other threads to actually
// allocate objects in the region (after the acquire the lock)
// before all the cards on the region are dirtied. This is unlikely,
// and it doesn't happen often, but it can happen. So, the extra
// check below filters out those cards.
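/* No update is needed when the referrer is in a young region or in the CSet, because both are
collected during the GC. Such references are in fact filtered out when they are enqueued, as
Section 4.4 on write barriers explains. Why filter again here? Mainly because of concurrency,
for example concurrent allocation or parallel task stealing. */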
if (r->is_young()) {
return false;
}
// While we are processing RSet buffers during the collection, we
// actually don't want to scan any cards on the collection set,
// since we don't want to update remembered sets with entries that
// point into the collection set, given that live objects from the
// collection set are about to move and such entries will be stale
// very soon. This change also deals with a reliability issue which
// involves scanning a card in the collection set and coming across
// an array that was being chunked and looking malformed. Note,
// however, that if evacuation fails, we have to scan any objects
// that were not moved and create any missing entries.
if (r->in_collection_set()) {
return false;
}
// The result from the hot card cache insert call is either:
// * pointer to the current card
// (implying that the current card is not 'hot'),
// * null
// (meaning we had inserted the card ptr into the "hot" card cache,
// which had some headroom),
// * a pointer to a "hot" card that was evicted from the "hot" cache.
//
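/* The hot card cache can be controlled by parameters. If a card turns out not to be hot, it is
processed right away; if it is hot, it is kept for later batch processing.
If the cache holds too many cards, the oldest one is evicted and processed. */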
G1HotCardCache* hot_card_cache = _cg1r->hot_card_cache();
if (hot_card_cache->use_cache()) {
assert(!check_for_refs_into_cset, "sanity");
assert(!SafepointSynchronize::is_at_safepoint(), "sanity");
card_ptr = hot_card_cache->insert(card_ptr);
if (card_ptr == NULL) {
// There was no eviction. Nothing to do.
return false;
}
start = _ct_bs->addr_for(card_ptr);
r = _g1->heap_region_containing(start);
// Checking whether the region we got back from the cache
// is young here is inappropriate. The region could have been
// freed, reallocated and tagged as young while in the cache.
// Hence we could see its young type change at any time.
}
// Don't use addr_for(card_ptr + 1) which can ask for
// a card beyond the heap. This is not safe without a perm
// gen at the upper end of the heap.
// The memory block to process is 512 bytes (one card)
HeapWord* end = start + CardTableModRefBS::card_size_in_words;
MemRegion dirtyRegion(start, end);
#if CARD_REPEAT_HISTO
init_ct_freq_table(_g1->max_capacity());
ct_freq_note_card(_ct_bs->index_for(start));
#endif
// Define the closures that process the objects; the most important one is G1ParPushHeapRSClosure
G1ParPushHeapRSClosure* oops_in_heap_closure = NULL;
if (check_for_refs_into_cset) {
// ConcurrentG1RefineThreads have worker numbers larger than what
// _cset_rs_update_cl[] is set up to handle. But those threads should
// only be active outside of a collection which means that when they
// reach here they should have check_for_refs_into_cset == false.
assert((size_t)worker_i < n_workers(), "index of worker larger than _cset_rs_update_cl[].length");
oops_in_heap_closure = _cset_rs_update_cl[worker_i];
}
G1UpdateRSOrPushRefOopClosure update_rs_oop_cl(_g1,
_g1->g1_rem_set(),
oops_in_heap_closure,
check_for_refs_into_cset,
worker_i);
update_rs_oop_cl.set_from(r);
G1TriggerClosure trigger_cl;
FilterIntoCSClosure into_cs_cl(NULL, _g1, &trigger_cl);
G1InvokeIfNotTriggeredClosure invoke_cl(&trigger_cl, &into_cs_cl);
G1Mux2Closure mux(&invoke_cl, &update_rs_oop_cl);
FilterOutOfRegionClosure filter_then_update_rs_oop_cl(r,
(check_for_refs_into_cset ?
(OopClosure*)&mux :
(OopClosure*)&update_rs_oop_cl));
// The region for the current card may be a young region. The
// current card may have been a card that was evicted from the
// card cache. When the card was inserted into the cache, we had
// determined that its region was non-young. While in the cache,
// the region may have been freed during a cleanup pause, reallocated
// and tagged as young.
//
// We wish to filter out cards for such a region but the current
// thread, if we're running concurrently, may "see" the young type
// change at any time (so an earlier "is_young" check may pass or
// fail arbitrarily). We tell the iteration code to perform this
// filtering when it has been determined that there has been an actual
// allocation in this region and making it safe to check the young type.
bool card_processed =
r->oops_on_card_seq_iterate_careful(dirtyRegion,
&filter_then_update_rs_oop_cl,
card_ptr);
// If unable to process the card then we encountered an unparsable
// part of the heap (e.g. a partially allocated object) while
// processing a stale card. Despite the card being stale, redirty
// and re-enqueue, because we've already cleaned the card. Without
// this we could incorrectly discard a non-stale card.
if (!card_processed) {
assert(!_g1->is_gc_active(), "Unparsable heap during GC");
// The card might have gotten re-dirtied and re-enqueued while we
// worked. (In fact, it's pretty likely.)
if (*card_ptr != CardTableModRefBS::dirty_card_val()) {
*card_ptr = CardTableModRefBS::dirty_card_val();
MutexLockerEx x(Shared_DirtyCardQ_lock,
Mutex::_no_safepoint_check_flag);
DirtyCardQueue* sdcq =
JavaThread::dirty_card_queue_set().shared_dirty_card_queue();
sdcq->enqueue(card_ptr);
}
} else {
_conc_refine_cards++;
}
// This gets set to true if the card being refined has
// references that point into the collection set.
bool has_refs_into_cset = trigger_cl.triggered();
// We should only be detecting that the card contains references
// that point into the collection set if the current thread is
// a GC worker thread.
assert(!has_refs_into_cset || SafepointSynchronize::is_at_safepoint(),
"invalid result at non safepoint");
return has_refs_into_cset;
}
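The code above only establishes that a 512-byte area has to be processed. But where does the first object in that area start? To find out, the heap region is walked, skipping the objects that lie before this memory block, until the first object of the block is found; the objects within these 512 bytes are then all treated as referrers. This is one of the reasons why floating garbage is produced. The code is shown below: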
hotspot/src/share/vm/gc_implementation/g1/heapRegion.cpp
HeapWord* HeapRegion::oops_on_card_seq_iterate_careful(MemRegion mr,
FilterOutOfRegionClosure* cl,
bool filter_young,
jbyte* card_ptr) {
if (g1h->is_gc_active()) {
mr = mr.intersection(MemRegion(bottom(), scan_top()));
} else {
mr = mr.intersection(used_region());
}
if (mr.is_empty()) return NULL;
if (is_young() && filter_young) return NULL;
// Mark the card clean, which indicates that this memory block is being processed
if (card_ptr != NULL) {
*card_ptr = CardTableModRefBS::clean_card_val();
OrderAccess::storeload();
}
HeapWord* const start = mr.start();
HeapWord* const end = mr.end();
HeapWord* cur = block_start(start);
// Skip the objects that lie before the area to process
oop obj;
HeapWord* next = cur;
do {
cur = next;
obj = oop(cur);
if (obj->klass_or_null() == NULL) return cur;
next = cur + block_size(cur);
} while (next <= start);
// Once the 512-byte block is reached, iterate over it
do {
obj = oop(cur);
if (obj->klass_or_null() == NULL) return cur;
cur = cur + block_size(cur);
// Whether an object is dead is decided from the heap snapshot; this is covered under concurrent marking
if (!g1h->is_obj_dead(obj)) {
// Iterate over the object
if (!obj->is_objArray() || (((HeapWord*)obj) >= start && cur <= end))
{
obj->oop_iterate(cl);
} else {
obj->oop_iterate(cl, mr);
}
}
} while (cur < end);
return NULL;
}
Every object visited is passed to G1UpdateRSOrPushRefOopClosure, which updates the RSet, as shown below:
hotspot/src/share/vm/gc_implementation/g1/g1OopClosures.inline.hpp
template <class T> inline void G1UpdateRSOrPushRefOopClosure::do_oop_nv(T* p) {
oop obj = oopDesc::load_decode_heap_oop(p);
if (obj == NULL) return;
HeapRegion* to = _g1->heap_region_containing(obj);
// Only references between different regions are handled
if (_from == to) return;
if (_record_refs_into_cset && to->in_collection_set()) {
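/* This branch is reached only during evacuation. In the normal case the reference is pushed onto a stack for further processing: the referenced object is in the CSet and only needs to be copied, so no reference relationship has to be maintained. Evacuation failure is handled through a special path, see Section 7.1 */
if (!self_forwarded(obj)) {
// References to successfully evacuated objects are put on the G1ParScanThreadState queue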
_push_ref_cl->do_oop(p);
}
} else {
to->rem_set()->add_reference(p, _worker_i);
}
}
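The update is done by add_reference, which, as mentioned earlier, updates the PRT. The whole RSet update flow can be summed up in one sentence: starting from the referrer, find the referenced object, then record the reference in the RSet of the region that contains the referenced object. Does concurrency raise a problem here? Could the address of the referenced object change while a Refine thread is running, so that the referenced object can no longer be located from the referrer? This cannot happen: no GC takes place while a Refine thread is running, so no object moves and all object addresses stay fixed.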
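The one-sentence summary above can be sketched as a toy heap that records cross-region references. The names ToyHeap, region_of and record_reference are hypothetical, the 1 MB region size is an assumption made for the example, and a std::set of card indices stands in for the PRT.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

constexpr unsigned kLogRegionBytes = 20;  // assumed region size: 1 MB
constexpr unsigned kLogCardBytes = 9;     // card size from the text: 512 bytes

struct ToyHeap {
  std::uintptr_t base;
  // One remembered set per region: the cards that hold references into it.
  std::vector<std::set<std::uintptr_t>> rsets;

  ToyHeap(std::uintptr_t heap_base, std::size_t regions)
      : base(heap_base), rsets(regions) {}

  std::size_t region_of(std::uintptr_t addr) const {
    assert(addr >= base);
    return (addr - base) >> kLogRegionBytes;
  }

  // field_addr: address of the referring field; target: referenced object
  // (0 stands for null). Returns true if an entry was recorded.
  bool record_reference(std::uintptr_t field_addr, std::uintptr_t target) {
    if (target == 0) return false;            // null reference
    const std::size_t from = region_of(field_addr);
    const std::size_t to = region_of(target);
    if (from == to) return false;             // same region: nothing to record
    // The entry is stored in the RSet of the referenced region.
    rsets[to].insert((field_addr - base) >> kLogCardBytes);
    return true;
  }
};
```

Note that the entry is keyed by the card of the referrer, not by the object itself, which matches the card-granular processing shown in the listings.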
Refinement Zone
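As described above, the main job of the Refine threads is to maintain the RSet. This is also an important part of G1 tuning. According to published test data, the RSet often costs roughly 1% to 20% of extra space; for a 100 GB heap, as much as 20 GB may go to the RSet. On the other hand, too many RSet updates can slow the Mutator down, because a Mutator that finds the DCQS too full helps the Refine threads with the processing. This follows from how the Refine threads are designed. Usually several Refine threads are configured, and a different number of them is active under different workloads; the workload is measured against the Refinement Zones. G1 provides three values, Green, Yellow and Red, which divide the queue set into four zones, called here white, green, yellow and red.
- White zone: [0, Green). Refine threads do not process this zone; the DCQs are left to the GC threads.
- Green zone: [Green, Yellow). Refine threads start working here, and the number of active Refine threads depends on how many DCQs the queue set holds.
- Yellow zone: [Yellow, Red). All Refine threads (except the sampling thread) take part in processing DCQs.
- Red zone: [Red, +infinity). Not only do all Refine threads process the RSet, the Mutators also take part in processing DCQs.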
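The four half-open intervals in the list above translate directly into a classification function. This is a minimal sketch; Zone, ZoneLimits and classify are hypothetical names that do not appear in HotSpot.

```cpp
#include <cassert>

enum class Zone { White, Green, Yellow, Red };

// The three configured boundaries (number of completed DCQs).
struct ZoneLimits {
  int green;
  int yellow;
  int red;
};

// Maps the current number of completed DCQs to its zone:
// [0, green) white, [green, yellow) green, [yellow, red) yellow, [red, inf) red.
inline Zone classify(int completed_buffers, const ZoneLimits& z) {
  assert(z.green <= z.yellow && z.yellow <= z.red);
  if (completed_buffers < z.green) return Zone::White;
  if (completed_buffers < z.yellow) return Zone::Green;
  if (completed_buffers < z.red) return Zone::Yellow;
  return Zone::Red;
}
```

For limits {4, 12, 24}, 3 buffers fall in the white zone, 4 to 11 in the green zone, 12 to 23 in the yellow zone, and 24 or more in the red zone.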
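The three values are set with three parameters: G1ConcRefinementGreenZone, G1ConcRefinementYellowZone and G1ConcRefinementRedZone, each with a default of 0. If they are not set, G1 infers the zone thresholds automatically as follows:
- G1ConcRefinementGreenZone is ParallelGCThreads.
- G1ConcRefinementYellowZone and G1ConcRefinementRedZone are 3 times and 6 times G1ConcRefinementGreenZone. A small question is left to the reader here: why did the JDK designers tie G1ConcRefinementGreenZone to the number of parallel threads, ParallelGCThreads?
As noted above, all Refine threads process DCQs in the yellow zone. How many threads is that? The number can be set with G1ConcRefinementThreads, whose default is 0; when it is not set, G1 infers it heuristically as ParallelGCThreads. ParallelGCThreads can itself be set with a parameter and also defaults to 0; if it is not set, G1 infers it heuristically as follows: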
ParallelGCThreads = ncpus, when ncpus <= 8
ParallelGCThreads = 8 + (ncpus - 8) * 5/8, when ncpus > 8
where ncpus is the number of CPU cores.
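The piecewise rule can be written as a small function. This is a sketch only: default_parallel_gc_threads is a hypothetical name, and rounding down through integer division is an assumption.

```cpp
#include <cassert>

// Heuristic default for ParallelGCThreads as described above:
// all cores up to 8, then 5/8 of every additional core.
inline unsigned default_parallel_gc_threads(unsigned ncpus) {
  assert(ncpus > 0);
  if (ncpus <= 8) return ncpus;
  return 8 + (ncpus - 8) * 5 / 8;  // integer division rounds down (assumption)
}
```

For example, 16 cores give 8 + 8 * 5 / 8 = 13 threads, and 64 cores give 8 + 56 * 5 / 8 = 43.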
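In the green zone the number of active Refine threads depends on how many DCQs the DCQS holds. The parameter G1ConcRefinementThresholdStep controls the step between the thresholds of successive Refine threads. If it is not set, it is inferred as (Yellow - Green) / (worknum + 1), where worknum is the number of Refine worker threads. Suppose ParallelGCThreads=4 and G1ConcRefinementThreads=3. The Green, Yellow and Red values are then {4, 12, 24}, and the step is inferred as (12 - 4) / (3 + 1) = 2. There are 4 Refine threads in total. Thread 0 starts once the DCQS holds more than 4 DCQs and stops when the count falls back to 4 or fewer. Thread 1 starts when the count reaches 9 and stops when it falls to 6 or fewer. Thread 2 starts when the count reaches 11 and stops when it falls to 8 or fewer. Thread 3 samples the young regions. Once the DCQS holds more than 24 DCQs the Mutators start to help, so the DCQS normally holds at most about 24 DCQs.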
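The numbers in this example can be reproduced with a short calculation. The step formula is the one given in the text; the per-thread rule (activate above green + step * (id + 1), deactivate at one step lower) is an assumption chosen to match the example, and WorkerThresholds, threshold_step and thresholds_for are hypothetical names. Thread 0 is left out because it is woken by a Mutator rather than by a predecessor.

```cpp
#include <algorithm>
#include <cassert>

struct WorkerThresholds {
  int activate_above;  // the predecessor wakes this thread when count > this
  int deactivate_at;   // the thread stops itself when count <= this
};

// Step between the thresholds of successive Refine threads.
inline int threshold_step(int green, int yellow, int workers) {
  assert(workers > 0 && green <= yellow);
  return (yellow - green) / (workers + 1);
}

// Thresholds for Refine thread `id` (1 <= id < workers); assumed rule.
inline WorkerThresholds thresholds_for(int id, int green, int yellow,
                                       int workers) {
  assert(id >= 1 && id < workers);
  const int step = threshold_step(green, yellow, workers);
  const int on = std::min(step * (id + 1) + green, yellow);
  const int off = std::max(on - step, green);
  return WorkerThresholds{on, off};
}
```

With green = 4, yellow = 12 and 3 workers, the step is 2; thread 1 gets {8, 6}, so it starts at 9 buffers and stops at 6, and thread 2 gets {10, 8}, so it starts at 11 and stops at 8.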
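Write barriers and the RSet
We have repeatedly referred to reference relationships. What Refine mainly cares about is changes to reference relationships, or more precisely, assignments of object references. How are such changes detected? This requires a write barrier. A write barrier is a set of extra actions performed when the value of particular memory is changed (that is, when memory is written). Most garbage collection algorithms use write barriers.
Write barriers are typically used to detect and record, at run time, the pointers that matter to collection (interesting pointers). When the collector reclaims only part of the heap, every pointer coming from outside that part must be captured by the write barrier; these pointers serve as roots for marking during collection. CMS records reference relationships through a write barrier, and so does G1. For example, every time a reference in an old generation object is changed to point to a young generation object, the write barrier captures and records it. A young collection can therefore avoid scanning the whole old generation to find roots. The RSet of the G1 collector is maintained through the write barrier: on a reference store, an extra piece of inserted code puts the reference into a DCQ, and the Refine threads later update the RSet, which records pointers to objects inside the heap region. This recording happens after the write. For a write barrier it is essential to filter out unnecessary writes: filtering both speeds up the mutator and lightens the collector's load.
The G1 collector applies three filters:
- References from young to young or from young to old are not recorded (young regions are always collected during a GC). This filtering is done in the write barrier.
- References within the same region are filtered out during RSet processing.
- Null references are filtered out during RSet processing.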
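The three filters can be combined into one predicate. This is an illustration only: needs_rset_entry is a hypothetical name, the 1 MB region size is assumed, and the XOR test presumes that regions are aligned to their size. It also applies all three checks in one place, whereas the list above distributes them between the write barrier and RSet processing.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kLogRegionBytes = 20;  // assumed region size: 1 MB

// Returns true if a reference store has to be remembered.
// field_addr: address of the written field; new_value: stored reference
// (0 stands for null); holder_is_young: the written object is in a young region.
inline bool needs_rset_entry(std::uintptr_t field_addr,
                             std::uintptr_t new_value,
                             bool holder_is_young) {
  if (holder_is_young) return false;  // young regions are always collected
  if (new_value == 0) return false;   // null reference
  // Same region: both addresses agree in every bit above the region offset.
  if (((field_addr ^ new_value) >> kLogRegionBytes) == 0) return false;
  return true;
}
```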
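After filtering, the RSet is much smaller. One question remains, namely when the write barrier is triggered to update the DCQ; this is covered in more detail where write barriers come up in the discussion of mixed collection. The write barrier of the G1 collector uses a two-level buffer structure (implemented with queue sets):
- Per-thread queue: each thread has its own queue (DCQ). Every thread first puts its write barrier records into its own queue; once the queue is full it is moved to the global set of filled buffers, and the thread then obtains a new queue.
- Global set of filled buffers: a single global set, shared by all threads, that holds the filled DCQs.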
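The two-level structure can be sketched as follows. GlobalFilledBuffers and ThreadLocalQueue are hypothetical names, the entries are plain card indices, and the capacity of 256 used in the usage note is the default DCQ length quoted in the parameter list of this chapter.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

using CardIndex = std::size_t;
using Buffer = std::vector<CardIndex>;

// Second level: the shared set of filled buffers, protected by a lock.
class GlobalFilledBuffers {
 public:
  void add(Buffer full) {
    std::lock_guard<std::mutex> guard(lock_);
    filled_.push_back(std::move(full));
  }
  std::size_t count() {
    std::lock_guard<std::mutex> guard(lock_);
    return filled_.size();
  }

 private:
  std::mutex lock_;
  std::vector<Buffer> filled_;
};

// First level: one queue per thread, so enqueueing needs no lock.
class ThreadLocalQueue {
 public:
  ThreadLocalQueue(GlobalFilledBuffers& global, std::size_t capacity)
      : global_(global), capacity_(capacity) {
    assert(capacity > 0);
    current_.reserve(capacity_);
  }

  // Called by the write barrier: record one dirty card.
  void enqueue(CardIndex card) {
    current_.push_back(card);
    if (current_.size() == capacity_) {
      global_.add(std::move(current_));  // hand the full buffer over
      current_ = Buffer();               // and start a fresh one
      current_.reserve(capacity_);
    }
  }

  std::size_t pending() const { return current_.size(); }

 private:
  GlobalFilledBuffers& global_;
  std::size_t capacity_;
  Buffer current_;
};
```

Enqueueing 600 cards into a queue of capacity 256 hands two full buffers to the global set and leaves 88 cards pending in the thread's current buffer. The lock is only taken once per full buffer, which is the point of the two-level design.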
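Reading the logs
To exercise the write barrier, here is an example that allocates large arrays so that they are placed directly in the old generation, which lets us see more information about the RSet: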
public class RSetTest {
static Object[] largeObject1 = new Object[1024 * 1024];
static Object[] largeObject2 = new Object[1024 * 1024];
static int[] temp;
public static void main(String[] args) {
int numGCs = 200;
for (int k = 0; k < numGCs - 1; k++) {
for (int i = 0; i < largeObject1.length; i++) {
largeObject1[i] = largeObject2;
}
for (int i = 0; i < largeObject2.length; i++) {
largeObject2[i] = largeObject1;
}
for (int i = 0; i < 1024 ; i++) {
temp = new int[1024];
}
System.gc();
}
}
}
Turn on G1TraceConcRefinement to observe what the Refine threads do:
-Xmx256M -XX:+UseG1GC -XX:G1ConcRefinementThreads=4
-XX:G1ConcRefinementGreenZone=1 -XX:G1ConcRefinementYellowZone=2
-XX:G1ConcRefinementRedZone=3 -XX:+UnlockExperimentalVMOptions
-XX:G1LogLevel=finest -XX:+UnlockDiagnosticVMOptions
-XX:+G1TraceConcRefinement -XX:+PrintGCTimeStamps
The resulting log is:
1.725: [Full GC (System.gc()) 12M->8854K(29M), 0.0150339 secs]
[Eden: 5120.0K(10.0M)->0.0B(10.0M) Survivors: 0.0B->0.0B Heap:
12.9M(29.0M)->8854.2K(29.0M)], [Metaspace: 3484K->3484K(1056768K)]
[Times: user=0.01 sys=0.00, real=0.02 secs]
G1-Refine-activated worker 1, on threshold 1, current 2
G1-Refine-deactivated worker 1, off threshold 1, current 1
G1-Refine-activated worker 1, on threshold 1, current 3
G1-Refine-activated worker 2, on threshold 1, current 2
G1-Refine-deactivated worker 2, off threshold 1, current 1
The log shows several Refine threads at work, being activated and deactivated at different thresholds.
Turn on G1SummarizeRSetStats to see detailed information about RSet updates:
-Xmx256M -XX:+UseG1GC -XX:+UnlockExperimentalVMOptions
-XX:G1LogLevel=finest -XX:+UnlockDiagnosticVMOptions
-XX:+G1SummarizeRSetStats -XX:G1SummarizeRSetStatsPeriod=1
-XX:+PrintGCTimeStamps
The log is as follows:
Cumulative RS summary
Recent concurrent refinement statistics
Processed 3110803 cards
Of 12941 completed buffers:
12941 ( 100.0%) by concurrent RS threads.
0 ( 0.0%) by mutator threads.
Did 0 coarsenings.
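A total of 3,110,803 cards were processed, using 12,941 completed buffers. At a maximum of 256 entries per buffer, that is at most 12,941 × 256 = 3,312,896 entries, so some buffers were not full when they were processed. All 12,941 buffers were processed by Refine threads and 0 by mutator threads, meaning that no Mutator took part. The 0 coarsenings means that no PRT in any region was coarsened. The log below shows 9 Refine threads, 8 for processing the RSet and 1 for sampling. Two of the Refine threads spent 200 ms and 80 ms respectively; the other 6 were probably never activated: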
Concurrent RS threads times (s)
0.20 0.08 0.00 0.00 0.00 0.00 0.00 0.00
Concurrent sampling threads times (s)
0.00
This part shows the extra memory taken up by the RSet:
Current rem set statistics
Total per region rem sets sizes = 85K. Max = 4K.
2K ( 3.3%) by 1 Young regions
31K ( 36.4%) by 10 Humonguous regions
48K ( 56.5%) by 17 Free regions
3K ( 3.8%) by 1 Old regions
Static structures = 16K, free_lists = 0K.
This part shows how many entries are set in the PRTs of the RSet, which can also be read as how many times memory blocks are referenced:
16388 occupied cards represented.
0 ( 0.0%) entries by 1 Young regions
16388 (100.0%) entries by 10 Humonguous regions
0 ( 0.0%) entries by 17 Free regions
0 ( 0.0%) entries by 1 Old regions
Region with largest rem set = 0:(HS)[0x00000000f0000000,0x00000000f0400010,
0x00000000f0500000], size = 4K, occupied = 8K.
This part shows information about JIT-compiled code in the HeapRegions:
Total heap region code root sets sizes = 0K. Max = 0K.
0K ( 1.8%) by 1 Young regions
0K ( 17.7%) by 10 Humonguous regions
0K ( 30.1%) by 17 Free regions
0K ( 50.4%) by 1 Old regions
16 code roots represented.
0 ( 0.0%) elements by 1 Young regions
0 ( 0.0%) elements by 10 Humonguous regions
0 ( 0.0%) elements by 17 Free regions
16 (100.0%) elements by 1 Old regions
Region with largest amount of code roots = 10:(O)[0x00000000f0a00000,
0x00000000f0aae898,0x00000000f0b00000], size = 0K, num_elems = 0
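Parameters and tuning
This chapter discussed the Refine threads newly introduced by G1, which handle references between regions so that live objects can be identified quickly. The parameters covered in this chapter and their usage are:
- G1ConcRefinementThreads: the number of G1 Refine threads. The default is 0, in which case G1 infers it heuristically and uses the number of parallel threads, ParallelGCThreads, as the number of concurrent threads; ParallelGCThreads can itself be set or inferred. This parameter usually does not need to be set. As a rule of thumb the number of parallel threads is about 5/8 of the CPU count; the exact inference is given above.
- G1UpdateBufferSize: the length of a DCQ. The default is 256; a larger value allows more pending references to be held.
- G1UseAdaptiveConcRefinement: default true. The Refinement Zone boundaries are adjusted dynamically, depending on whether the RSet processing time meets its target.
- G1RSetUpdatingPauseTimePercent: default 10, meaning that the total time spent on the RSet should not exceed 10% of the GC pause time. When G1UseAdaptiveConcRefinement is true, the Green Zone is updated as follows: if the RSet processing time exceeds the target, the Green Zone becomes 0.9 times its previous value; otherwise, if the number of processed buffers is larger than the Green Zone, it becomes 1.1 times its previous value; otherwise it stays unchanged. The Yellow Zone and Red Zone are 3 times and 6 times the Green Zone. Note in particular that dynamic adjustment can drive the Green Zone to 0, in which case the Yellow Zone and Red Zone are also 0. If that happens, the Refine threads stop working and the Mutators process the RSet, which is rarely what we want. So either turn off dynamic adjustment or set a reasonable RSet processing time. Turning off dynamic adjustment takes more experience, so setting a reasonable RSet processing time is more common.
- G1ConcRefinementThresholdStep: default 0. If it is not set, G1 infers it heuristically from the Yellow Zone and Green Zone. It is the step, measured against the size of the DirtyCardQueueSet, between the thresholds of the Refine threads that update the RSet.
- G1ConcRefinementServiceIntervalMillis: default 300, meaning that the thread sampling the RSets of the young regions runs every 300 ms.
- G1ConcRefinementGreenZone: the size of the Green Zone. The default is 0, in which case G1 infers it heuristically. If it is 0 and dynamic adjustment is off, the Refine worker threads do no work and the GC processes all the queues. If it is not 0, the Refine threads leave that many buffers unprocessed each time they work. For an explicit value to take effect, dynamic adjustment must be turned off. This parameter is usually not set.
- G1ConcRefinementYellowZone: the size of the Yellow Zone. The default is 0, in which case G1 infers it heuristically as 3 times the Green Zone.
- G1ConcRefinementRedZone: the size of the Red Zone. The default is 0, in which case G1 infers it heuristically as 6 times the Green Zone. G1ConcRefinementGreenZone, G1ConcRefinementYellowZone and G1ConcRefinementRedZone usually do not need tuning, but if RSet processing is too slow, G1UseAdaptiveConcRefinement can be turned off and the three values set sensibly according to the number of Refine threads.
- G1ConcRSLogCacheSize: default 10, meaning that at most 2^10, that is 1024, hot cards are stored. What happens beyond 1024? The JVM keeps it simple: once the limit is exceeded, the oldest card is taken out and processed, which amounts to treating it as no longer hot.
- G1ConcRSHotCardLimit: default 4. A card that has been modified 4 times is considered a hot card. The purpose of the hot card design is to reduce the number of times such a card is processed: because the RSet is stored in the referenced region, several objects may reference the same object, and when that object is processed all of them can be treated as roots in one go.
- G1RSetRegionEntries: default 0, in which case G1 infers it heuristically as base*(log(region_size/1M)+1). base defaults to 256 and can only be set in development builds, not in release builds. This value is critical: if it is too small, the RSet coarsens from fine-grained to coarse-grained, and tracing and marking objects takes longer. The formula also shows that the inferred value can be influenced through HeapRegionSize, for example by setting HeapRegionSize manually. In practice the value can also be set directly to suit the workload (for example 1024), which keeps performance high; each region's fine-grained table then uses 1024 entries, and summed over all regions this extra space is considerable, which is where much of the RSet's space overhead comes from.
- G1SummarizeRSetStats prints RSet statistics. G1SummarizeRSetStatsPeriod=n collects them once every n GCs; the default is 0, meaning that no statistics are collected periodically. Statistics collection is usually not enabled in production.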
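Two of the rules in the list above lend themselves to a short worked sketch: the adaptive Green Zone update and the inference of G1RSetRegionEntries. The names Zones, adjust_zones and rset_region_entries are hypothetical. Truncating to an integer after scaling and reading log as the base-2 logarithm are both assumptions, not statements about the HotSpot source.

```cpp
#include <cassert>

struct Zones {
  int green;
  int yellow;
  int red;
};

// One adaptive step: shrink Green to 0.9x when RSet processing exceeded its
// goal, grow it to 1.1x when more buffers than Green were processed,
// otherwise keep it. Yellow and Red follow as 3x and 6x Green.
inline Zones adjust_zones(Zones z, double update_rs_ms, double goal_ms,
                          int processed_buffers) {
  int g = z.green;
  if (update_rs_ms > goal_ms) {
    g = static_cast<int>(g * 0.9);  // truncation (assumed): 1 becomes 0
  } else if (processed_buffers > g) {
    g = static_cast<int>(g * 1.1);
  }
  return Zones{g, g * 3, g * 6};
}

// base * (log2(region size in MB) + 1); log base 2 is an assumption.
inline int rset_region_entries(int base, unsigned region_size_mb) {
  assert(region_size_mb >= 1);
  int log_mb = 0;
  while ((1u << (log_mb + 1)) <= region_size_mb) ++log_mb;
  return base * (log_mb + 1);
}
```

Starting from Green = 1, a single pause that misses the goal yields {0, 0, 0}, the degenerate case the list warns about. With base = 256, a 1 MB region gives 256 entries, a 4 MB region 768 and a 32 MB region 1536 under these assumptions.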