Configuring HBase Memstore: What You Should Know

In this post we discuss what HBase users should know about one of the internal parts of HBase: the Memstore. Understanding the processes around the Memstore will help you configure an HBase cluster for better performance.

HBase Memstore

Let’s take a look at the write and read paths in HBase to understand what the Memstore is, and where, how, and why it is used.


Memstore Usage in HBase Read/Write Paths

(picture taken from the Intro to HBase Internals and Schema Design presentation)

When a RegionServer (RS) receives a write request, it directs the request to a specific Region. Each Region stores a set of rows. Row data can be separated into multiple column families (CFs). Data of a particular CF is stored in an HStore, which consists of a Memstore and a set of HFiles. The Memstore is kept in the RS main memory, while HFiles are written to HDFS. When a write request is processed, data is first written into the Memstore. Then, when certain thresholds are met (main memory is, after all, limited), Memstore data gets flushed to an HFile.
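To make the write path concrete, here is a minimal client-side sketch using the HBase Java client API (the table, family, and qualifier names are invented for illustration). The put below is buffered in the Memstore of the hosting Region and reaches an HFile only on a later flush:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MemstoreWritePath {
 public static void main(String[] args) throws Exception {
  Configuration conf = HBaseConfiguration.create();
  try (Connection connection = ConnectionFactory.createConnection(conf);
       Table table = connection.getTable(TableName.valueOf("mytable"))) {
   Put put = new Put(Bytes.toBytes("row1"));
   put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
   // The RS hosting this row buffers the cell in the Region's Memstore;
   // an HFile is written only on a later flush.
   table.put(put);
  }
 }
}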

The main reason for using the Memstore is the need to store data on DFS ordered by row key. As HDFS is designed for sequential reads/writes, with no file modifications allowed, HBase cannot efficiently write data to disk as it is being received: the written data would not be sorted (unless the input happens to be sorted), which means it is not optimized for future retrieval. To solve this problem HBase buffers the most recently received data in memory (in the Memstore), “sorts” it before flushing, and then writes it to HDFS using fast sequential writes. Note that in reality an HFile is not just a simple list of sorted rows; it is much more than that.
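As a rough mental model (a toy sketch, not HBase’s actual implementation; real Memstore keys are cells made of row, CF, qualifier, and timestamp, and real flushes write HFiles to HDFS), think of the Memstore as a sorted in-memory map that gets drained in key order once it grows past a threshold:

import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Toy model of a Memstore: a sorted map buffered in memory.
public class ToyMemstore {
 private final ConcurrentSkipListMap<String, String> buffer = new ConcurrentSkipListMap<>();
 private final long flushThreshold;
 private long approxSize = 0;

 public ToyMemstore(long flushThresholdBytes) {
  this.flushThreshold = flushThresholdBytes;
 }

 public void put(String rowKey, String value) {
  buffer.put(rowKey, value); // the map keeps entries sorted by key on insert
  approxSize += rowKey.length() + value.length();
  if (approxSize >= flushThreshold) {
   flush();
  }
 }

 private void flush() {
  // Iterating a skip list yields ascending key order, so the output
  // "file" is sorted without a separate sort step -- one sequential write.
  for (Map.Entry<String, String> e : buffer.entrySet()) {
   System.out.println(e.getKey() + " => " + e.getValue());
  }
  buffer.clear();
  approxSize = 0;
 }
}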

Apart from solving the “non-ordered” problem, the Memstore has other benefits too, e.g.:

  • It acts as an in-memory cache that keeps the most recently added data. This is useful in the numerous cases when recently written data is accessed more frequently than older data
  • There are certain optimizations that can be applied to rows/cells while they are still in memory, before they are written to the persistent store. E.g. when a CF is configured to store only one version of a cell and the Memstore contains multiple updates for that cell, only the most recent one needs to be kept; the older ones can be omitted (and never written to an HFile), as sketched after this list.
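A minimal sketch of that single-version setup, using the HBase 2.x admin API (table and family names are invented; older clients use HColumnDescriptor instead of the builders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleVersionTable {
 public static void main(String[] args) throws Exception {
  Configuration conf = HBaseConfiguration.create();
  try (Connection connection = ConnectionFactory.createConnection(conf);
       Admin admin = connection.getAdmin()) {
   // VERSIONS = 1: when the Memstore holds several updates to the
   // same cell, only the newest has to survive the flush.
   admin.createTable(TableDescriptorBuilder
       .newBuilder(TableName.valueOf("mytable"))
       .setColumnFamily(ColumnFamilyDescriptorBuilder
           .newBuilder(Bytes.toBytes("cf"))
           .setMaxVersions(1)
           .build())
       .build());
  }
 }
}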

An important thing to note is that every Memstore flush creates one HFile per CF.

On the reading end things are simple: HBase first checks whether the requested data is in the Memstore, then checks the HFiles, and returns the merged result to the user.
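Nothing special is required on the client side; a plain get transparently returns the merged view (names invented as before):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MemstoreReadPath {
 public static void main(String[] args) throws Exception {
  Configuration conf = HBaseConfiguration.create();
  try (Connection connection = ConnectionFactory.createConnection(conf);
       Table table = connection.getTable(TableName.valueOf("mytable"))) {
   // The RS checks the Memstore first, then the HFiles, and hands
   // back a single merged view of the row.
   Result result = table.get(new Get(Bytes.toBytes("row1")));
   byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"));
   System.out.println(value == null ? "not found" : Bytes.toString(value));
  }
 }
}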

What to Care about

There are a number of reasons HBase users and/or administrators should be aware of what the Memstore is and how it is used:

  • There are a number of configuration options for the Memstore that one can use to achieve better performance and avoid issues. HBase will not adjust these settings for you based on your usage pattern.
  • Frequent Memstore flushes can affect read performance and bring additional load to the system
  • The way Memstore flushes work may affect your schema design

Let’s take a closer look at these points.

Configuring Memstore Flushes

Basically, there are two groups of configuration properties (leaving out region pre-close flushes):

  • The first determines when a flush should be triggered
  • The second determines when a flush should be triggered and when updates should be blocked during flushing

The first group is about triggering “regular” flushes, which happen in parallel with serving write requests. The properties for configuring the flush thresholds are:

  • hbase.hregion.memstore.flush.size
<property>
 <name>hbase.hregion.memstore.flush.size</name>
 <value>134217728</value>
 <description>
 Memstore will be flushed to disk if size of the memstore
 exceeds this number of bytes. Value is checked by a thread that runs
 every hbase.server.thread.wakefrequency.
 </description>
</property>
  • hbase.regionserver.global.memstore.lowerLimit
<property>
 <name>hbase.regionserver.global.memstore.lowerLimit</name>
 <value>0.35</value>
 <description>Maximum size of all memstores in a region server before
 flushes are forced. Defaults to 35% of heap.
 Setting this value equal to hbase.regionserver.global.memstore.upperLimit causes
 the minimum possible flushing to occur when updates are blocked due to
 memstore limiting.
 </description>
</property>

Note that the first setting is the size per Memstore. I.e., when you define it you should take into account the number of regions served by each RS. When the number of regions grows (and you configured the setting when there were few of them), Memstore flushes are likely to be triggered by the second threshold earlier.
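To illustrate with made-up numbers: on a RegionServer with a 10 GB heap and the lowerLimit of 0.35 shown above, the global threshold is about 3.5 GB. With the per-Memstore flush size of 128 MB, roughly 28 actively written regions (28 × 128 MB ≈ 3.5 GB) are enough for the aggregate to hit the global limit before any single Memstore reaches its own flush size.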

The second group of settings exists for safety reasons: sometimes the write load is so high that flushing cannot keep up with it, and since we don’t want the Memstore to grow without limit, in this situation writes are blocked until the Memstore gets back to a “manageable” size. These thresholds are configured with:

  • hbase.regionserver.global.memstore.upperLimit
<property>
 <name>hbase.regionserver.global.memstore.upperLimit</name>
 <value>0.4</value>
 <description>Maximum size of all memstores in a region server before new
 updates are blocked and flushes are forced. Defaults to 40% of heap.
 Updates are blocked and flushes are forced until size of all memstores
 in a region server hits hbase.regionserver.global.memstore.lowerLimit.
 </description>
</property>
  • hbase.hregion.memstore.block.multiplier
<property>
 <name>hbase.hregion.memstore.block.multiplier</name>
 <value>2</value>
 <description>
 Block updates if the memstore reaches hbase.hregion.memstore.block.multiplier
 times hbase.hregion.memstore.flush.size bytes. Useful for preventing
 runaway memstore growth during spikes in update traffic. Without an
 upper bound, the memstore fills such that when it flushes, the
 resultant flush files take a long time to compact or split, or,
 worse, we OOME.
 </description>
</property>
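Putting numbers to the per-region safeguard: with the values shown above, writes to a region are blocked once its Memstore reaches hbase.hregion.memstore.block.multiplier × hbase.hregion.memstore.flush.size = 2 × 128 MB = 256 MB; independently, the upperLimit blocks writes RS-wide once all Memstores together exceed 40% of the heap.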

Blocking writes on a particular RS may be a big issue on its own, but there’s more to it. Since in HBase, by design, one Region is served by a single RS, when the write load is evenly distributed over the cluster (over Regions), having one such “slow” RS will make the whole cluster work slower (basically, at its speed).

Hint: watch the Memstore Size and the Memstore Flush Queue size. Ideally, Memstore Size should not reach the upper Memstore limit, and the Memstore Flush Queue size should not grow constantly.

Frequent Memstore Flushes

Since we want to avoid blocking writes, it may seem like a good approach to flush earlier, while we are still far from the “writes-blocking” thresholds. However, this will cause flushes to happen too frequently, which can affect read performance and bring additional load to the cluster.

Every time a Memstore flush happens, one HFile is created for each CF. Frequent flushes may create tons of HFiles. Since during a read HBase will have to look at many HFiles, the read speed can suffer.

To prevent opening too many HFiles and avoid read performance deterioration, there’s the HFile compaction process. HBase will periodically (when certain configurable thresholds are met) compact multiple smaller HFiles into a big one. Obviously, the more files created by Memstore flushes, the more work (extra load) for the system. On top of that: while the compaction process is usually performed in parallel with serving other requests, when HBase cannot keep up with compacting HFiles (yes, there are configurable thresholds for that too ;)) it will again block writes on the RS. As mentioned above, this is highly undesirable.
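For reference, the thresholds alluded to are controlled by properties such as hbase.hstore.compactionThreshold (how many StoreFiles in a Store it takes to trigger a compaction) and hbase.hstore.blockingStoreFiles (the StoreFile count at which updates to a region are blocked until compaction catches up, or until hbase.hstore.blockingWaitTime elapses); defaults vary by HBase version, so check the hbase-default.xml shipped with yours.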

Hint: watch the Compaction Queue size on RSs. If it is constantly growing, you should take action before it causes problems.

More on HFile creation & compaction can be found here.

So, ideally the Memstore should use as much memory as it can (as configured, not all of the RS heap: there are also in-memory caches), but not cross the upper limit. This picture (a screenshot taken from our SPM monitoring service) shows a somewhat good situation:


Memstore Size: Good Situation

“Somewhat”, because we could configure the lower limit to be closer to the upper one, since we barely ever go over it.

Multiple Column Families & Memstore Flush

The Memstores of all column families are flushed together (this might change in the future). This means that each flush creates N HFiles, one per CF. Thus, uneven amounts of data across CFs will cause too many HFiles to be created: when the Memstore of one CF reaches its threshold, the Memstores of all the other CFs are flushed too. As stated above, too-frequent flush operations and too many HFiles may affect cluster performance.

Hint: in many cases, having one CF is the best schema design. One common way to get there is sketched below.
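The sketch folds would-be column families into qualifiers of a single family, so that a flush of one logical “family” cannot force flushes of the others (all names are invented for illustration):

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleCfSchema {
 // Instead of separate CFs "profile" and "activity" (which would be
 // flushed together anyway), use one CF "d" and fold the old family
 // name into the qualifier.
 static Put userRow(String userId) {
  Put put = new Put(Bytes.toBytes(userId));
  put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("profile:name"), Bytes.toBytes("Alice"));
  put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("activity:last_login"), Bytes.toBytes("2012-07-16"));
  return put;
 }
}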

Ref: http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know/
