问题现象
Solr进程OOM,且Cache页面出现大量FieldCacheImpl积压,每个平均大小在50MB左右。
The heap dump file has been generated, you can analyze it with heap analyzer tools like Eclipse Memory Analyzer (https://www.eclipse.org/mat/)
十万个为什么
什么场景下类org.apache.solr.uninverting.FieldCacheImpl的对象会大量激增?FieldCacheImpl和IndexFingerprint有啥关系吗?
- 跟踪日志和代码,在SolrCore.java中发现大量FieldCacheImpl会被加到perSegmentFingerprintCache中
- SolrIndexSearcher.java中getIndexFingerprint方法的注释:gets a cached version of the IndexFingerprint for this searcher
- IndexFingerprint.java中getFingerprint方法的注释:Opens a new realtime searcher and returns it’s (possibly cached) fingerprint
- Fingerprint官方说明:The IndexFingerprint is cached per index searcher. it is quite useless during high throughput indexing.
- 好了,useless,还能说啥。
这里的uninverting是什么?invert又是什么?为什么需要un invert?
Apache Solr uses Lucene to store fields in an inverted index. (倒排索引是为了加快全文检索)This works great for full text search and finding documents quickly. (正排索引是为了加快排序分面)For sorting and faceting, Solr must instead deal with the uninverted value instead of the inverted value. There are two ways that Solr does this:
- If the field is indexed, Solr can uninvert the field on heap and store it in the fieldCache. This takes query time to generate the first time and heap to store the result.
- If the field has docValues enabled, Solr stores the uninverted value at index time. This takes index time to save but when used in a query there is no heap usage since the values can be read from disk/OS memory.
什么是DocValues?
Solr官方对Doc Values的介绍:DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.
ES官方对Doc Values的介绍:Doc Values就是一种列式存储结构,默认情况下每个字段的 Doc Values 都是激活的,Doc Values 是在索引时创建的,当字段索引时,Elasticsearch 为了能够快速检索,会把字段的值加入倒排索引中,同时它也会存储该字段的Doc Values
。
日志分析
2019-04-16 21:33:05,708 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | Caching fingerprint for searcher:Searcher@2dc0b56c[hot_tb_20190216_shard28_replica2]
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_1(5.3.1):C6453961) Uninverting(_0(5.3.1):C6454208) Uninverting(_2(5.3.1):C6450256) Uninverting(_3(5.3.1):C6450486) Uninverting(_5(5.3.1):C6457048) Uninverting(_4(5.3.1):C6456367) Uninverting(_6(5.3.1):C6449312) Uninverting(_7(5.3.1):C6451240) Uninverting(_8(5.3.1):C6451535) Uninverting(_9(5.3.1):C6453108) Uninverting(_a(5.3.1):C6452285) Uninverting(_b(5.3.1):c6449094) Uninverting(_c(5.3.1):c6450295) Uninverting(_d(5.3.1):c6451743) Uninverting(_e(5.3.1):c6457054) Uninverting(_f(5.3.1):c6455839) Uninverting(_h(5.3.1):c6454142) Uninverting(_g(5.3.1):c6453632) Uninverting(_i(5.3.1):c6453033) Uninverting(_j(5.3.1):c6452370)))} leafReaderContext:LeafReaderContext(FilterLeafReader(Uninverting(_j(5.3.1):c6452370)) docBase=122604638 ord=19) mavVersion:9223372036854775807 | org.apache.solr.core.SolrCore.getIndexFingerprint(SolrCore.java:1625)
2019-04-16 21:33:05,708 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | IndexFingerprint millis:30544.0 result:{maxVersionSpecified=9223372036854775807, maxVersionEncountered=1629434765087080448, maxInHash=1629434765087080448, versionsHash=3122101942921649893, numVersions=129057008, numDocs=129057008, maxDoc=6457054} | org.apache.solr.update.IndexFingerprint.getFingerprint(IndexFingerprint.java:97)
2019-04-16 21:33:05,709 | WARN | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | no frame of reference to tell if we’ve missed updates | org.apache.solr.update.PeerSync.sync(PeerSync.java:243)
2019-04-16 21:33:05,709 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | PeerSync Recovery was not successful - trying replication. | org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:402)
2019-04-16 21:33:05,709 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | Starting Replication Recovery. | org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:410)
2019-04-16 21:33:05,709 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | Attempting to replicate from [https://ip:port/solr/hot_tb_20190216_shard28_replica1/]. | org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:139)
2019-04-16 21:33:05,710 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | Setting up SPNego auth with config: /opt/huawei/Bigdata/FusionInsight_HD_V100R002C70SPC200/1_9_SolrServer5/etc/jaas.conf | org.apache.solr.client.solrj.impl.Krb5HttpClientConfigurer.configure(Krb5HttpClientConfigurer.java:77)
2019-04-16 21:33:05,751 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | Setting up SPNego auth with config: /opt/huawei/Bigdata/FusionInsight_HD_V100R002C70SPC200/1_9_SolrServer5/etc/jaas.conf | org.apache.solr.client.solrj.impl.Krb5HttpClientConfigurer.configure(Krb5HttpClientConfigurer.java:77)
2019-04-16 21:33:05,768 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | Master’s generation: 11 | org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:297)
2019-04-16 21:33:05,768 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | Master’s version: 1554149442243 | org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:298)
2019-04-16 21:33:05,768 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | Slave’s generation: 11 | org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:317)
2019-04-16 21:33:05,768 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | Slave in sync with master. | org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:341)
2019-04-16 21:33:05,768 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | No replay needed. | org.apache.solr.cloud.RecoveryStrategy.replay(RecoveryStrategy.java:538)
2019-04-16 21:33:05,768 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | Replication Recovery was successful. | org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:429)
2019-04-16 21:33:05,768 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | Registering as Active after recovery. | org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:458)
2019-04-16 21:33:05,768 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | publishing state=active | org.apache.solr.cloud.ZkController.publish(ZkController.java:1259)
2019-04-16 21:33:05,770 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | Updating version bucket highest from index after successful recovery. | org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:526)
2019-04-16 21:33:05,770 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | Looking up max value of version field to seed version buckets | org.apache.solr.update.UpdateLog.seedBucketsWithHighestVersion(UpdateLog.java:1609)
2019-04-16 21:33:05,770 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | Refreshing highest value of version for 65536 version buckets from index | org.apache.solr.update.VersionInfo.getMaxVersionFromIndex(VersionInfo.java:226)
2019-04-16 21:33:05,773 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | Found MAX value 1629434765087080448 from Terms for version in index | org.apache.solr.update.VersionInfo.getMaxVersionFromIndex(VersionInfo.java:236)
2019-04-16 21:33:05,775 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | Took 4.0ms to seed version buckets with highest version 1629434765087080448 | org.apache.solr.update.UpdateLog.seedBucketsWithHighestVersion(UpdateLog.java:1602)
2019-04-16 21:33:05,775 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | Finished recovery process, successful=[true] | org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:530)
分析日志,确认是走了alreadyInSync函数。
fingerprint是什么?
PeerSync中sync(),关闭alreadyInSync()有什么影响或者风险?
SOLR-8690:对alreadyInSync()增加了配置开关doFingerprint。
SOLR-8586:提到可以关闭fingerprint校验以避免发生OOM。
同时经过对代码的分析,目前的情况是会走到alreadyInSync(),Fingerprint校验耗时超过30多秒,但是最终PeerSync Recovery失败,并尝试进行Replication Recovery,参考这里。
2019-04-16 21:33:05,708 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | IndexFingerprint millis:30544.0 result:{maxVersionSpecified=9223372036854775807, maxVersionEncountered=1629434765087080448, maxInHash=1629434765087080448, versionsHash=3122101942921649893, numVersions=129057008, numDocs=129057008, maxDoc=6457054} | org.apache.solr.update.IndexFingerprint.getFingerprint(IndexFingerprint.java:97)
2019-04-16 21:33:05,709 | WARN | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | no frame of reference to tell if we’ve missed updates | org.apache.solr.update.PeerSync.sync(PeerSync.java:243)
2019-04-16 21:33:05,709 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | PeerSync Recovery was not successful - trying replication. | org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:402)
2019-04-16 21:33:05,709 | INFO | recoveryExecutor-3-thread-17-processing-n:10.110.0.125:21116_solr x:hot_tb_20190216_shard28_replica2 s:shard28 c:hot_tb_20190216 r:core_node77 | Starting Replication Recovery. | org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:410)
规避方案
因此,建议利用doFingerprint开关直接跳过alreadyInSync(),直接进行Replication Recovery,这样做的好处是不针对每个Searcher校验Fingerprint,减少内存占用降低发生OOM的概率,还可能会降低整体Recovery耗时。潜在风险是Recovery均采用Replication Recovery,可能增加Replication Recovery次数。
Reference
https://blog.youkuaiyun.com/weixin_42257250/article/details/83934082
https://issues.apache.org/jira/browse/SOLR-8586
https://issues.apache.org/jira/browse/SOLR-8690
https://issues.apache.org/jira/browse/SOLR-9310
https://issues.apache.org/jira/browse/SOLR-9506
https://issues.apache.org/jira/browse/SOLR-12969
https://risdenk.github.io/2017/12/18/ambari-infra-solr-ranger.html
https://www.elastic.co/guide/cn/elasticsearch/guide/current/docvalues-intro.html