HBase change causes Kylin restart problem (Kylin 2.0 HBase 0.98)

This post takes a close look at the forced-restart problem Kylin 2.0 hits on HBase 0.98. By comparing against Kylin 3.0 on HBase 1.4.8, it locates and fixes a logic error in HBase's retry mechanism and provides a detailed solution.

Background: Some of our Kylin clusters (Kylin 2.0 on HBase 0.98) must restart every Kylin node after HBase operations such as decommissioning nodes, changing an RSGroup, replacing a RegionServer, or splitting an HTable. In theory these operations are either transparent to upper-layer applications or cause only a brief, self-recovering unavailability, yet these Kylin clusters could only resume serving after a reboot. That had a significant impact on the services above them and made HBase operations much more complicated. Meanwhile, our other production cluster (Kylin 3.0 on HBase 1.4.8) never showed this problem, which puzzled one of our teams for a long time. This blog summarizes how we investigated and solved the problem; if you run into it, you won't have to be as confused as we were.

Note: You can skip the first two steps (the troubleshooting process) and start reading directly from the third step (checking the difference between the HBase 0.98 and 1.4.8 code), which covers the real cause of the problem.

1、Locating the problem

First, pick a cube on both the Kylin 3.0 and 2.0 clusters, find the HBase table backing that cube's storage, and change the table's RSGroup so that it is served by different machines. It turns out Kylin 3.0 can still query the cube afterwards, but 2.0 cannot get any result. The exception thrown by Kylin 2.0 is as follows:

2019-07-18 16:43:59,148 ERROR [http-bio-8088-exec-438] controller.BasicController:54 :
org.apache.kylin.rest.exception.InternalErrorException: Error while executing SQL "select * from DPS_DATA_CENTER.SYS_PROBE limit 2": java.net.SocketTimeoutException: callTimeout=1200000, callDuration=1200108: row '' on table 'DPS_DATA_CENTER:KYLIN_N59AHNZIMB' at region=DPS_DATA_CENTER:KYLIN_N59AHNZIMB,,1542085088058.6b1f069c03aa1cfc6649b6762bc79451., hostname=bigdata-dnn-hbase33.gs.com,60020,1551705541907, seqNum=3
        at org.apache.kylin.rest.service.QueryService.doQueryWithCache(QueryService.java:402)
        at org.apache.kylin.rest.controller.QueryController.query(QueryController.java:71)
        at sun.reflect.GeneratedMethodAccessor201.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:221)
        at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:136)
        at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:104)
        at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandleMethod(RequestMappingHandlerAdapter.java:743)
        at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:672)
        at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:82)
        at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:933)
        at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:867)
        at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:951)
        at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:853)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:650)
        at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:827)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:731)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)

The exception shows that a query against a region of this table still went to the old machine (bigdata-dnn-hbase33.gs.com). In theory the query should automatically retry against the new machine, but it obviously kept hitting the previous node. Based on past experience, and since other HBase users do not see this problem, our first guess was that either we had not configured the HBase retry policy (it was in fact configured, with 35 retries) or the calling code in Kylin 2.0 had a problem, which led to the following comparison of the 2.0 and 3.0 code.

2、Troubleshooting the Kylin 2.0 and 3.0 code

First, locate the Kylin code path for the query service (essentially, Kylin calls the coprocessor on the node hosting the corresponding HBase table via RPC) and compare the Kylin 2.0 and 3.0 code. Although the two versions differ slightly in style, there is no difference in the lowest-level communication with HBase. We then made the following attempts (a sketch of the retry-count configuration follows the list):

  • Debug the 3.0 service to determine whether Kylin itself resolves the region location or performs its own retry on failure (in fact Kylin's code does no retrying; it relies entirely on HBase's own failure-retry policy);

  • Set the HBase retry count used by Kylin 3.0 to 0 (with 0 retries, data cannot be fetched correctly whether or not the RSGroup is moved, unlike with retries greater than 0, which shows the configured HBase retry parameter takes effect);

  • Set the retry count to 1 (queries return correct results before the move; after the move the query fails with an exception, indicating more than one attempt is needed);

  • Set the retry count to a value greater than 1 (queries return correct results both before and after the move).
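
For reference, the retry count used in these experiments is governed by HBase's standard hbase.client.retries.number client property. Below is a minimal sketch of varying it programmatically (illustrative only; on the real clusters we changed it through configuration files, and the exact plumbing into Kylin is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RetryCountExperiment {
    public static void main(String[] args) {
        // Standard HBase client configuration; picks up hbase-site.xml if present.
        Configuration conf = HBaseConfiguration.create();
        // Observed behavior per value:
        //   0  -> queries fail whether or not the RSGroup is moved
        //   1  -> queries succeed before the move but fail after it
        //  >1  -> queries succeed both before and after the move
        conf.setInt("hbase.client.retries.number", 2);
    }
}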

From these four attempts and the debugging we drew two conclusions. First, the HBase retry configuration takes effect. Second, after moving the RSGroup the first query fails and a second attempt is needed, and the retry code itself belongs to HBase. This basically rules out a problem in the Kylin code. Because the Kylin 2.0 cluster had just been taken over from another team and was running many production cubes, with no test environment available for it, we could only troubleshoot by reading code and debugging the 3.0 cluster, which was very time-consuming.

3、Checking the difference between the HBase 0.98 and 1.4.8 code (the real cause of the problem)

With all the previous guesses (configuration, or a difference in how the two Kylin versions call HBase) basically ruled out, we could only revisit the exception thrown by the 2.0 query. Strangely, 2.0 throws a java.net.SocketTimeoutException, whereas the 3.0 retry mechanism throws exceptions containing “Call exception, tries=”. We decided to examine HBase's retry code, and sure enough, HBase's RpcRetryingCaller class revealed why java.net.SocketTimeoutException was thrown. HBase uses two checks to decide whether to keep retrying:

  • (1) If the number of retries reaches the configured count, it stops retrying and throws an exception containing “Call exception, tries=” (the experiments on 3.0 illustrate this);
  • (2) If the projected total duration of the retry process exceeds the configured timeout, it stops retrying and throws a java.net.SocketTimeoutException (exactly the exception seen on 2.0).

HBase's problematic code is precisely the code behind this timeout exception. Next, let's look at the specific code of the failure retry:

/**
 * Retries if invocation fails.
 * @param callTimeout Timeout for this call
 * @param callable The {@link RetryingCallable} to run.
 * @return an object of type T
 * @throws IOException if a remote or network exception occurs
 * @throws RuntimeException other unspecified error
 */
@edu.umd.cs.findbugs.annotations.SuppressWarnings
    (value = "SWL_SLEEP_WITH_LOCK_HELD", justification = "na")
public synchronized T callWithRetries(RetryingCallable<T> callable, int callTimeout)
throws IOException, RuntimeException {
  this.callTimeout = callTimeout;
  List<RetriesExhaustedException.ThrowableWithExtraContext> exceptions =
    new ArrayList<RetriesExhaustedException.ThrowableWithExtraContext>();
  this.globalStartTime = EnvironmentEdgeManager.currentTimeMillis();
  for (int tries = 0;; tries++) {
    long expectedSleep = 0;
    try {
      beforeCall();
      callable.prepare(tries != 0); 
      return callable.call();
    } catch (Throwable t) {
      if (tries > startLogErrorsCnt) {
        LOG.info("Call exception, tries=" + tries + ", retries=" + retries + ", retryTime=" +
            (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime) + "ms, msg="
            + callable.getExceptionMessageAdditionalDetail());
      }
      
      t = translateException(t);
      callable.throwable(t, retries != 1);
      RetriesExhaustedException.ThrowableWithExtraContext qt =
          new RetriesExhaustedException.ThrowableWithExtraContext(t,
              EnvironmentEdgeManager.currentTimeMillis(), toString());
      exceptions.add(qt);
      ExceptionUtil.rethrowIfInterrupt(t);
      if (tries >= retries - 1) {
        throw new RetriesExhaustedException(tries, exceptions);
      }
      expectedSleep = callable.sleep(pause, tries);
 
      // If, after the planned sleep, there won't be enough time left, we stop now.
      long duration = singleCallDuration(expectedSleep); // Problematic code; see the singleCallDuration method below for its logic
      if (duration > this.callTimeout) { // Because of how singleCallDuration is implemented and how the timeouts are configured, duration is always greater than callTimeout, so a genuine failure can never be retried.
        String msg = "callTimeout=" + this.callTimeout + ", callDuration=" + duration +
            ": " + callable.getExceptionMessageAdditionalDetail();
        throw (SocketTimeoutException)(new SocketTimeoutException(msg).initCause(t));
      }
    } finally {
      afterCall();
    }
    try {
      Thread.sleep(expectedSleep);
    } catch (InterruptedException e) {
      throw new InterruptedIOException("Interrupted after " + tries + " tries  on " + retries);
    }
  }
}

singleCallDuration method source code:

/**
 * @param expectedSleep
 * @return Calculate how long a single call took
 */
private long singleCallDuration(final long expectedSleep) {
  /*
   * The intent of this code should be to compute how long the whole call will
   * have taken by the time of the next attempt: the time elapsed so far plus
   * the sleep before the next retry. But it also adds rpcTimeout (a value the
   * user can configure in hbase-site.xml; we configured it to 1200000).
   * HBase 1.4.8's code does not add rpcTimeout, and 1.4.8 is right:
   * rpcTimeout should not be added here.
   */
  int timeout = rpcTimeout > 0 ? rpcTimeout : MIN_RPC_TIMEOUT;
  return (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime)
    + timeout + expectedSleep;
}
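
To make the arithmetic concrete, here is a tiny standalone illustration (not HBase code; the 1200000 values match our configuration, and splitting the 108 ms from the exception's callDuration=1200108 into elapsed time and sleep is hypothetical):

public class DurationCheckDemo {
    public static void main(String[] args) {
        long callTimeout   = 1200000; // the call/operation timeout from the exception
        long rpcTimeout    = 1200000; // our configured hbase.rpc.timeout
        long elapsed       = 8;       // time already spent (ms), illustrative
        long expectedSleep = 100;     // pause before the next retry (ms), illustrative

        // HBase 0.98: duration = elapsed + rpcTimeout + expectedSleep
        long duration098 = elapsed + rpcTimeout + expectedSleep; // 1200108
        System.out.println(duration098 > callTimeout); // true -> SocketTimeoutException, no retry ever happens

        // HBase 1.4.8: duration = elapsed + expectedSleep
        long duration148 = elapsed + expectedSleep; // 108
        System.out.println(duration148 > callTimeout); // false -> the retry proceeds
    }
}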

Summary of the cause:
When an HBase call fails, HBase decides whether the operation has timed out before retrying. It computes the projected duration up to the next attempt, but the duration logic is flawed: duration = time already spent + rpcTimeout (1200000) + expectedSleep (the pause before the next attempt). This duration is then compared with callTimeout (default 1200000, exactly equal to our configured rpcTimeout). From the formula it is clear that whenever the configured rpcTimeout is equal to or greater than callTimeout, the duration always exceeds callTimeout and the call can never be retried. HBase 1.4.8 has fixed this problem: the duration formula was changed to (time already spent + expectedSleep).
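
Based on the 1.4.8 behavior described above, the fix amounts to dropping rpcTimeout from the sum. A sketch of the corrected method (this mirrors the 1.4.8 formula summarized above; the exact patched source may differ in details):

/**
 * @return projected duration of the operation by the time the next attempt
 *         starts: time already spent plus the upcoming retry pause.
 *         rpcTimeout is intentionally no longer part of the sum.
 */
private long singleCallDuration(final long expectedSleep) {
  return (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime)
    + expectedSleep;
}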

4、Solving the problem

  1. Step 1: Modify the hbase.rpc.timeout configuration of every HBase client in the Kylin 2.0 cluster, lowering it from 1200000 to 120000, and restart the Kylin cluster once. After moving the RSGroup again to validate, queries return correct results without restarting any Kylin node (see the sketch after this list).
  2. Step 2: Fix HBase's timeout-check logic itself, which solves the problem completely.
  3. One more point: since a single call request may contain multiple RPCs, the rpcTimeout value should preferably be set smaller than callTimeout.
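
For step 1, the change boils down to keeping the per-RPC timeout well below the call timeout. We edited hbase-site.xml on every Kylin node; below is a minimal sketch of the equivalent programmatic client configuration (the property name hbase.rpc.timeout is standard HBase):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RpcTimeoutWorkaround {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Was 1200000, exactly equal to callTimeout, so the buggy 0.98 formula
        // (elapsed + rpcTimeout + expectedSleep) always exceeded callTimeout.
        // 120000 leaves plenty of headroom below the 1200000 call timeout.
        conf.setInt("hbase.rpc.timeout", 120000);
    }
}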

If you run into a similar problem while using Kylin, I hope this article helps.
