The source analyzed here is HBase 0.98.1.
HRegionServer starts the MemStoreFlusher thread:
private void initializeThreads() throws IOException {
// Cache flushing thread.
this.cacheFlusher = new MemStoreFlusher(conf, this);
// Compaction thread
this.compactSplitThread = new CompactSplitThread(this);
.......
private void startServiceThreads() throws IOException {
String n = Thread.currentThread().getName();
......
this.cacheFlusher.start(uncaughtExceptionHandler);
Threads.setDaemonThreadRunning(this.compactionChecker.getThread(), n +
".compactionChecker", uncaughtExceptionHandler);
.....
/*
* Run init. Sets up hlog and starts up all server threads.
*
* @param c Extra configuration.
*/
protected void handleReportForDutyResponse(final RegionServerStartupResponse c)
throws IOException {
....
startServiceThreads();
.....
public void run() {
try {
// Do pre-registration initializations; zookeeper, lease threads, etc.
preRegistrationInitialization();
} catch (Throwable e) {
abort("Fatal exception during initialization", e);
}
try {
// Try and register with the Master; tell it we are here. Break if
// server is stopped or the clusterup flag is down or hdfs went wacky.
while (keepLooping()) {
RegionServerStartupResponse w = reportForDuty();
if (w == null) {
LOG.warn("reportForDuty failed; sleeping and then retrying.");
this.sleeper.sleep();
} else {
handleReportForDutyResponse(w); // starts all HRegionServer service threads
break;
}
}
....
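Both cacheFlusher.start(uncaughtExceptionHandler) and Threads.setDaemonThreadRunning(...) boil down to starting a named daemon thread wired to a shared UncaughtExceptionHandler. A minimal sketch of that pattern in plain Java (names are illustrative, this is not the HBase Threads utility itself):

// Start a flush worker as a named daemon thread with an uncaught-exception handler.
Thread.UncaughtExceptionHandler handler = (t, e) ->
    System.err.println("Thread " + t.getName() + " died: " + e);

Runnable flushLoop = () -> { /* poll the flush queue and flush regions */ };

Thread flusher = new Thread(flushLoop, "regionserver.MemStoreFlusher");
flusher.setDaemon(true);                      // dies together with the region server process
flusher.setUncaughtExceptionHandler(handler); // abort/log instead of dying silently
flusher.start();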
The main class and method: MemStoreFlusher.flushRegion
private boolean flushRegion(final HRegion region, final boolean emergencyFlush) {
synchronized (this.regionsInQueue) {
FlushRegionEntry fqe = this.regionsInQueue.remove(region);
if (fqe != null && emergencyFlush) {
// Need to remove from region from delay queue. When NOT an
// emergencyFlush, then item was removed via a flushQueue.poll.
flushQueue.remove(fqe);
}
}
lock.readLock().lock();
try {
boolean shouldCompact = region.flushcache();
// We just want to check the size
boolean shouldSplit = region.checkSplit() != null;
if (shouldSplit) {
this.server.compactSplitThread.requestSplit(region);
} else if (shouldCompact) {
server.compactSplitThread.requestSystemCompaction(
region, Thread.currentThread().getName());
}
......
A FlushRegionEntry is taken from the flushQueue and flushed; a simplified sketch of this consumer loop follows the list:
- acquire the read lock
- call HRegion.flushcache(), which returns whether a compaction is needed
- ask HRegion (checkSplit) whether a split is needed
- if (split) request a split; else if (compact) request a compaction
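A minimal, self-contained sketch of that loop (hypothetical SimpleFlusher class and flushRegion stub; the real MemStoreFlusher uses an inner FlushHandler, a DelayQueue and wakeup entries):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Simplified flush-queue consumer; names are illustrative, not the 0.98 API.
public class SimpleFlusher implements Runnable {
  private final BlockingQueue<String> flushQueue = new LinkedBlockingQueue<>(); // region names to flush
  private volatile boolean running = true;

  public void requestFlush(String regionName) {
    flushQueue.offer(regionName);
  }

  @Override
  public void run() {
    while (running) {
      try {
        // Block until a flush request arrives (the real code polls a DelayQueue
        // so that low-priority requests can be delayed).
        String region = flushQueue.poll(10, TimeUnit.SECONDS);
        if (region == null) continue;           // timed out, loop again
        boolean shouldCompact = flushRegion(region);
        if (shouldCompact) {
          System.out.println("would request a compaction for " + region);
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        running = false;
      }
    }
  }

  // Stand-in for MemStoreFlusher.flushRegion: flush, then report whether to compact.
  private boolean flushRegion(String regionName) {
    System.out.println("flushing " + regionName);
    return false;
  }
}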
The detailed steps are examined below:
--------------------------------------------------------------------------------------------------------------------
1.HRegion
protected boolean internalFlushcache(
final HLog wal, final long myseqid, MonitoredTask status)
throws IOException {
if (this.rsServices != null && this.rsServices.isAborted()) {
// Don't flush when server aborting, it's unsafe
throw new IOException("Aborting flush because server is abortted...");
}
final long startTime = EnvironmentEdgeManager.currentTimeMillis();
// Clear flush flag.
// If nothing to flush, return and avoid logging start/stop flush.
if (this.memstoreSize.get() <= 0) {
if(LOG.isDebugEnabled()) {
LOG.debug("Empty memstore size for the current region "+this);
}
return false;
}
if (LOG.isDebugEnabled()) {
LOG.debug("Started memstore flush for " + this +
", current region memstore size " +
StringUtils.humanReadableInt(this.memstoreSize.get()) +
((wal != null)? "": "; wal is null, using passed sequenceid=" + myseqid));
}
// Stop updates while we snapshot the memstore of all stores. We only have
// to do this for a moment. Its quick. The subsequent sequence id that
// goes into the HLog after we've flushed all these snapshots also goes
// into the info file that sits beside the flushed files.
// We also set the memstore size to zero here before we allow updates
// again so its value will represent the size of the updates received
// during the flush
MultiVersionConsistencyControl.WriteEntry w = null;
// We have to take a write lock during snapshot, or else a write could
// end up in both snapshot and memstore (makes it difficult to do atomic
// rows then)
status.setStatus("Obtaining lock to block concurrent updates");
// block waiting for the lock for internal flush
this.updatesLock.writeLock().lock();
long totalFlushableSize = 0;
status.setStatus("Preparing to flush by snapshotting stores");
List<StoreFlushContext> storeFlushCtxs = new ArrayList<StoreFlushContext>(stores.size());
long flushSeqId = -1L;
try {
// Record the mvcc for all transactions in progress.
w = mvcc.beginMemstoreInsert();
mvcc.advanceMemstore(w);
// check if it is not closing.
if (wal != null) {
if (!wal.startCacheFlush(this.getRegionInfo().getEncodedNameAsBytes())) {
status.setStatus("Flush will not be started for ["
+ this.getRegionInfo().getEncodedName() + "] - because the WAL is closing.");
return false;
}
flushSeqId = this.sequenceId.incrementAndGet();
} else {
// use the provided sequence Id as WAL is not being used for this flush.
flushSeqId = myseqid;
}
for (Store s : stores.values()) {
totalFlushableSize += s.getFlushableSize();
storeFlushCtxs.add(s.createFlushContext(flushSeqId));
}
// prepare flush (take a snapshot)
for (StoreFlushContext flush : storeFlushCtxs) {
// Step 1: prepare - snapshot each store's memstore
flush.prepare();
}
} finally {
this.updatesLock.writeLock().unlock();
}
String s = "Finished memstore snapshotting " + this +
", syncing WAL and waiting on mvcc, flushsize=" + totalFlushableSize;
status.setStatus(s);
if (LOG.isTraceEnabled()) LOG.trace(s);
// sync unflushed WAL changes when deferred log sync is enabled
// see HBASE-8208 for details
if (wal != null && !shouldSyncLog()) {
// Step 2: sync unflushed WAL edits
wal.sync();
}
// wait for all in-progress transactions to commit to HLog before
// we can start the flush. This prevents
// uncommitted transactions from being written into HFiles.
// We have to block before we start the flush, otherwise keys that
// were removed via a rollbackMemstore could be written to Hfiles.
mvcc.waitForRead(w);
s = "Flushing stores of " + this;
status.setStatus(s);
if (LOG.isTraceEnabled()) LOG.trace(s);
// Any failure from here on out will be catastrophic requiring server
// restart so hlog content can be replayed and put back into the memstore.
// Otherwise, the snapshot content while backed up in the hlog, it will not
// be part of the current running servers state.
boolean compactionRequested = false;
try {
// A. Flush memstore to all the HStores.
// Keep running vector of all store files that includes both old and the
// just-made new flush store file. The new flushed file is still in the
// tmp directory.
for (StoreFlushContext flush : storeFlushCtxs) {
// Step 3: flush each store's snapshot to a tmp file
flush.flushCache(status);
}
// Switch snapshot (in memstore) -> new hfile (thus causing
// all the store scanners to reset/reseek).
for (StoreFlushContext flush : storeFlushCtxs) {
// Step 4: commit the tmp file as a store file and check whether compaction is needed
boolean needsCompaction = flush.commit(status);
if (needsCompaction) {
compactionRequested = true;
}
}
storeFlushCtxs.clear();
// Set down the memstore size by amount of flush.
this.addAndGetGlobalMemstoreSize(-totalFlushableSize);
} catch (Throwable t) {
// An exception here means that the snapshot was not persisted.
// The hlog needs to be replayed so its content is restored to memstore.
// Currently, only a server restart will do this.
// We used to only catch IOEs but its possible that we'd get other
// exceptions -- e.g. HBASE-659 was about an NPE -- so now we catch
// all and sundry.
if (wal != null) {
wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
}
DroppedSnapshotException dse = new DroppedSnapshotException("region: " +
Bytes.toStringBinary(getRegionName()));
dse.initCause(t);
status.abort("Flush failed: " + StringUtils.stringifyException(t));
throw dse;
}
// If we get to here, the HStores have been written.
if (wal != null) {
wal.completeCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
}
// Record latest flush time
this.lastFlushTime = EnvironmentEdgeManager.currentTimeMillis();
// Update the last flushed sequence id for region
completeSequenceId = flushSeqId;
// C. Finally notify anyone waiting on memstore to clear:
// e.g. checkResources().
synchronized (this) {
notifyAll(); // FindBugs NN_NAKED_NOTIFY
}
long time = EnvironmentEdgeManager.currentTimeMillis() - startTime;
long memstoresize = this.memstoreSize.get();
String msg = "Finished memstore flush of ~" +
StringUtils.humanReadableInt(totalFlushableSize) + "/" + totalFlushableSize +
", currentsize=" +
StringUtils.humanReadableInt(memstoresize) + "/" + memstoresize +
" for region " + this + " in " + time + "ms, sequenceid=" + flushSeqId +
", compaction requested=" + compactionRequested +
((wal == null)? "; wal=null": "");
LOG.info(msg);
status.setStatus(msg);
this.recentFlushes.add(new Pair<Long,Long>(time/1000, totalFlushableSize));
return compactionRequested;
}
Inside HRegion.internalFlushcache:
1. HRegion line 1661 -> HStore line 1941 prepare (takes the write lock): MemStore copies its kvset into a snapshot, which becomes the in-memory data set for this flush.
(Every flush covers all stores in the region, so the smallest flush unit is the region, not the store; this is one of the reasons multiple column families are discouraged.)
2. HRegion line 1674: call the WAL and wait for the sync to complete.
3. HRegion line 1700: HStore flushCache writes the snapshot to a tmp file (one tmp file per HStore, even though tmpfiles is a List).
4. HRegion line 1706: HStore wraps the newly written tmp files as StoreFile objects.
HStore then calls updateStorefiles: it takes the write lock, adds the new files to the StoreFileManager list so they begin serving reads, and clears the snapshot.
HStore line 951 needsCompaction calls RatioBasedCompactionPolicy.needsCompaction to decide whether the store needs a compaction
(the check: the number of HFiles exceeds the larger of hbase.hstore.compaction.min and hbase.hstore.compactionThreshold, default 3); a simplified sketch of this check follows.
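A plain-Java sketch of that threshold check (parameter values are assumptions; the real logic sits in CompactionConfiguration and RatioBasedCompactionPolicy):

// Simplified "does this store need a compaction" check as described above.
// compactionMin / compactionThreshold stand in for hbase.hstore.compaction.min
// and hbase.hstore.compactionThreshold (both default to 3 here, values assumed).
static boolean needsCompaction(int storefileCount, int filesAlreadyCompacting,
                               int compactionMin, int compactionThreshold) {
  int minFilesToCompact = Math.max(compactionMin, compactionThreshold);
  int candidates = storefileCount - filesAlreadyCompacting;
  return candidates >= minFilesToCompact;
}

// Example: needsCompaction(4, 0, 3, 3) == true  (4 HFiles, none compacting, threshold 3)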
--------------------------------------------------------------------------------------------------------------------
2. HRegion checks whether the region should split; the policy implementation is IncreasingToUpperBoundRegionSplitPolicy
@Override
protected boolean shouldSplit() {
if (region.shouldForceSplit()) return true;
boolean foundABigStore = false;
// Get count of regions that have the same common table as this.region
int tableRegionsCount = getCountOfCommonTableRegions();
// Get size to check
long sizeToCheck = getSizeToCheck(tableRegionsCount);
for (Store store : region.getStores().values()) {
// If any of the stores is unable to split (eg they contain reference files)
// then don't split
if ((!store.canSplit())) {
return false;
}
// Mark if any store is big enough
long size = store.getSize();
if (size > sizeToCheck) {
LOG.debug("ShouldSplit because " + store.getColumnFamilyName() +
" size=" + size + ", sizeToCheck=" + sizeToCheck +
", regionsWithCommonTable=" + tableRegionsCount);
foundABigStore = true;
}
}
return foundABigStore;
}
IncreasingToUpperBoundRegionSplitPolicy line 65 shouldSplit decides whether this region needs to split.
(Again the decision is made per region, another reason multiple column families are a bad idea.)
(init: initialSize = hbase.increasing.policy.initial.size if set, otherwise hbase.hregion.memstore.flush.size, the memstore flush size.)
getCountOfCommonTableRegions returns the number of regions belonging to the same table as this.region; call it regioncount.
When regioncount is between 0 and 100, the threshold is the minimum of hbase.hregion.max.filesize (default 10G) and initialSize * regioncount^3; otherwise it is simply hbase.hregion.max.filesize (default 10G).
For example, with a 128M initial size:
1 region: 128 * 1^3 = 128M
2 regions: 128 * 2^3 = 1024M
3 regions: 128 * 3^3 = 3456M
4 regions: 128 * 4^3 = 8192M
5 regions: 128 * 5^3 = 16000M (~15.6G) => capped at 10G, so from 5 regions on the configured maximum applies.
A short sketch of this calculation follows.
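A plain-Java sketch of the calculation (128M initial size and 10G max file size are assumed values; this mirrors getSizeToCheck in the 0.98 policy):

// Split threshold as described above: grows with the cube of the region count,
// capped by hbase.hregion.max.filesize.
static long getSizeToCheck(long maxFileSize, long initialSize, int tableRegionsCount) {
  if (tableRegionsCount <= 0 || tableRegionsCount > 100) {
    return maxFileSize;
  }
  return Math.min(maxFileSize,
      initialSize * tableRegionsCount * tableRegionsCount * tableRegionsCount);
}

// With initialSize = 128M and maxFileSize = 10G:
//   1 region -> 128M, 2 -> 1024M, 3 -> 3456M, 4 -> 8192M, 5+ -> 10240M (capped)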
--------------------------------------------------------------------------------------------------------------------
3. if (split) request a split; else if (compact) request a compaction
http://blackproof.iteye.com/blog/2037159
I wrote notes on this before and had almost forgotten them, so here is another write-up of the region split flow.
public PairOfSameType<HRegion> stepsBeforePONR(final Server server,
final RegionServerServices services, boolean testing) throws IOException {
// Set ephemeral SPLITTING znode up in zk. Mocked servers sometimes don't
// have zookeeper so don't do zk stuff if server or zookeeper is null
if (server != null && server.getZooKeeper() != null) {
try {
// Step 1: create the PENDING_SPLIT znode in ZooKeeper
createNodeSplitting(server.getZooKeeper(),
parent.getRegionInfo(), server.getServerName(), hri_a, hri_b);
} catch (KeeperException e) {
throw new IOException("Failed creating PENDING_SPLIT znode on " +
this.parent.getRegionNameAsString(), e);
}
}
this.journal.add(JournalEntry.SET_SPLITTING_IN_ZK);
if (server != null && server.getZooKeeper() != null) {
// After creating the split node, wait for master to transition it
// from PENDING_SPLIT to SPLITTING so that we can move on. We want master
// knows about it and won't transition any region which is splitting.
// Step 2: wait for the master to transition the znode from PENDING_SPLIT to SPLITTING
znodeVersion = getZKNode(server, services);
}
// Step 3: create the .splits directory under the parent region
this.parent.getRegionFileSystem().createSplitsDir();
this.journal.add(JournalEntry.CREATE_SPLIT_DIR);
Map<byte[], List<StoreFile>> hstoreFilesToSplit = null;
Exception exceptionToThrow = null;
try{
// Step 4: close the parent region and collect its store files
hstoreFilesToSplit = this.parent.close(false);
} catch (Exception e) {
exceptionToThrow = e;
}
if (exceptionToThrow == null && hstoreFilesToSplit == null) {
// The region was closed by a concurrent thread. We can't continue
// with the split, instead we must just abandon the split. If we
// reopen or split this could cause problems because the region has
// probably already been moved to a different server, or is in the
// process of moving to a different server.
exceptionToThrow = closedByOtherException;
}
if (exceptionToThrow != closedByOtherException) {
this.journal.add(JournalEntry.CLOSED_PARENT_REGION);
}
if (exceptionToThrow != null) {
if (exceptionToThrow instanceof IOException) throw (IOException)exceptionToThrow;
throw new IOException(exceptionToThrow);
}
if (!testing) {
// Step 5: remove the parent region from the online regions list
services.removeFromOnlineRegions(this.parent, null);
}
this.journal.add(JournalEntry.OFFLINED_PARENT);
// TODO: If splitStoreFiles were multithreaded would we complete steps in
// less elapsed time? St.Ack 20100920
//
// splitStoreFiles creates daughter region dirs under the parent splits dir
// Nothing to unroll here if failure -- clean up of CREATE_SPLIT_DIR will
// clean this up.
// Step 6: split the store files (write reference files for the daughters)
splitStoreFiles(hstoreFilesToSplit);
// Log to the journal that we are creating region A, the first daughter
// region. We could fail halfway through. If we do, we could have left
// stuff in fs that needs cleanup -- a storefile or two. Thats why we
// add entry to journal BEFORE rather than AFTER the change.
// Step 7: create daughter regions A and B from the split files
this.journal.add(JournalEntry.STARTED_REGION_A_CREATION);
HRegion a = this.parent.createDaughterRegionFromSplits(this.hri_a);
// Ditto
this.journal.add(JournalEntry.STARTED_REGION_B_CREATION);
HRegion b = this.parent.createDaughterRegionFromSplits(this.hri_b);
return new PairOfSameType<HRegion>(a, b);
}
1. RegionSplitPolicy.getSplitPoint() picks the split point: the midkey of the largest store is used as the split point.
2. SplitRequest.run()
   instantiates a SplitTransaction
   st.prepare(): pre-split checks: is the region closed, are any of its HFiles still referenced
   st.execute(): performs the split
   1. createDaughters creates the two daughter regions, taking the parent region's write lock:
      1. create an ephemeral PENDING_SPLIT znode in ZooKeeper
      2. wait until the master has transitioned the znode to SPLITTING
      3. create the .splits directory
      4. wait for the region's flushes and compactions to finish, then close the region
      5. remove the parent from the HRegionServer's online regions and add it to the offlined regions
      6. do the actual split: build a thread pool and use StoreFileSplitter to split every HFile (StoreFile) of the region
         (the HFiles are not rewritten at the split row; reference files are written into each daughter region's directory instead)
      7. create the left and right daughter regions, delete the parent from meta, build the daughters' regioninfo from the reference files and write it to HDFS
   2. stepsAfterPONR calls DaughterOpener.run to open the two daughter regions and call initialize:
      a) write the .regioninfo file to HDFS so it can be recovered if meta is lost
      b) initialize the HStores underneath, mainly via loadStoreFiles:
         for each store it builds StoreFile objects from the paths and files on HDFS, one StoreFile
         object per file; for each StoreFile a HalfStoreFileReader is created to read the corresponding
         file of the parent region, i.e. what the daughter currently holds are reference files pointing
         at the parent region's files, so reads on the daughter are actually served from the parent's files
      then the daughter regions are added to the RS online-region list and to the meta table.
A conceptual sketch of what such a reference file represents follows.
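A conceptual sketch of a daughter's reference file (hypothetical class, not the real HBase Reference / HalfStoreFileReader API): the daughter copies no data, it only records the parent HFile path, the split row, and which half belongs to it; a half reader then filters the parent file accordingly.

// Conceptual model of a daughter region's reference file; illustrative only.
class HalfReference {
  enum Range { TOP, BOTTOM }            // TOP: keys >= split row, BOTTOM: keys < split row

  final String parentHFilePath;         // the file still lives under the parent region
  final byte[] splitRow;
  final Range range;

  HalfReference(String parentHFilePath, byte[] splitRow, Range range) {
    this.parentHFilePath = parentHFilePath;
    this.splitRow = splitRow;
    this.range = range;
  }

  // A half reader would apply this filter while scanning the parent file.
  boolean contains(byte[] rowKey) {
    int cmp = compare(rowKey, splitRow);
    return range == Range.TOP ? cmp >= 0 : cmp < 0;
  }

  private static int compare(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) return d;
    }
    return a.length - b.length;
  }
}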
In short, this post walked through how the MemStoreFlusher thread in HRegionServer performs memstore flushes in HBase and how region splits are triggered by data size, covering the flush mechanism, the split policy, and the logic and implementation details behind them.