Best Practices for Indexing Documents in Solr
@(OTHERS)[solr]
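Once Solr documents have been generated, they must be submitted to the Solr cluster. There are three ways to do this:
(I) Direct commit
Submit each document to Solr as soon as it is generated: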
CloudSolrClient client = new CloudSolrClient(SOLR_ZK);
SolrInputDocument doc2 = new SolrInputDocument();
doc2.addField("id", "ljhtest3");
doc2.addField("key_ss", map); // map is built elsewhere; it is not defined in this snippet
client.add(doc2);
client.commit();
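In other words, every add is followed by a commit. This approach is simple to implement but performs poorly.
*Note: it is not only commit() that is slow. Calling client.add() once per document is also very inefficient, so add all documents to a collection first and then call client.add(collection) once.*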
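List<SolrInputDocument> docList = new LinkedList<SolrInputDocument>();
for (int i = 0; i < DOC_NUM; i++) {
SolrInputDocument doc2 = new SolrInputDocument();
doc2.addField("id", "way2" + i);
// Values for the multi-valued field.
Set<String> set = new HashSet<String>();
for (String s : "abc,edf,kkk,lll".split(",")) {
set.add(s);
}
// A Map field value is sent as an atomic update; "set" replaces the field's contents.
Map<String, Object> map = new HashMap<String, Object>();
map.put("set", set);
doc2.addField("key_ss", map);
docList.add(doc2);
}
// One add request for the whole list, then a single commit.
client.add(docList);
client.commit();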
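When DOC_NUM is large, holding every document in one list may use more memory than is comfortable. The sketch below shows one possible way to keep the "add a collection, not a document" idea while bounding the batch size. It uses only the JDK: BatchBuffer and BatchBufferDemo are hypothetical names, and the sink lambda is a stand-in for a call such as client.add(batch), not SolrJ API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Buffers items and hands them to a sink in fixed-size batches (hypothetical helper). */
final class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink; // assumption: in real code this would wrap client.add(batch)
    private final List<T> pending;

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
        this.pending = new ArrayList<>(batchSize);
    }

    /** Queues one item and flushes automatically once batchSize items are pending. */
    void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is pending (possibly a partial batch) to the sink. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending)); // copy, so the sink owns its own list
        pending.clear();
    }
}

final class BatchBufferDemo {
    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        // The lambda records batch sizes; it stands in for sending the batch to Solr.
        BatchBuffer<String> buffer = new BatchBuffer<>(100, batch -> batchSizes.add(batch.size()));
        for (int i = 0; i < 250; i++) {
            buffer.add("way2" + i);
        }
        buffer.flush(); // send the final partial batch
        System.out.println(batchSizes); // prints [100, 100, 50]
    }
}
```

With a real client, the sink would need to handle the checked exceptions that client.add() declares, for example by catching them and rethrowing an unchecked exception.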
(II) AutoCommit
An automatic commit policy can be configured in the updateHandler section of solrconfig.xml:
<!-- Enables a transaction log, used for real-time get, durability, and
solr cloud replica recovery. The log can grow as big as
uncommitted changes to the index, so use of a hard autoCommit
is recommended (see below).
"dir" - the target directory for transaction logs, defaults to the
solr data directory.
"numVersionBuckets" - sets the number of buckets used to keep
track of max version values when checking for re-ordered
updates; increase this value to reduce the cost of
synchronizing access to version buckets during high-volume
indexing, this requires 8 bytes (long) * numVersionBuckets
of heap space per Solr core.
-->
<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
<int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
</updateLog>
<!-- AutoCommit
Perform a hard commit automatically under certain conditions.
Instead of enabling autoCommit, consider using "commitWithin"
when adding documents.
http://wiki.apache.org/solr/UpdateXmlMessages
maxDocs - Maximum number of documents to add since the last
commit before automatically triggering a new commit.
maxTime - Maximum amount of time in ms that is allowed to pass
since a document was added before automatically
triggering a new commit.
openSearcher - if false, the commit causes recent index changes
to be flushed to stable storage, but does not cause a new
searcher to be opened to make those changes visible.
If the updateLog is enabled, then it's highly recommended to
have some sort of hard autoCommit to limit the log size.
-->
<autoCommit>
<maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
<!-- softAutoCommit is like autoCommit except it causes a
'soft' commit which only ensures that changes are visible
but does not ensure that data is synced to disk. This is
faster and more near-realtime friendly than a hard commit.
-->
<autoSoftCommit>
<maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>
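A commit can be triggered after a given time interval (maxTime) or after a given number of documents (maxDocs).
There are two kinds of commit, hard and soft. A soft commit makes changes visible to searches but does not sync the data to disk.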
(III) commitWithin
client.add(DEFAULT_COLLECTION, doc,1000);
Documents passed to client.add() in this way are committed to Solr within 1000 ms, with no explicit commit() call.
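(IV) Recommendations and conclusion
1. Single-threaded case
(1) Add the documents to be submitted to a collection.
(2) Have the client add that collection rather than adding documents one at a time.
(3) Pass the commitWithin parameter to the client.
Reference code: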
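List<SolrInputDocument> docList = new LinkedList<SolrInputDocument>();
for (int i = 0; i < DOC_NUM; i++) {
SolrInputDocument doc2 = new SolrInputDocument();
doc2.addField("id", "way2" + i);
// Values for the multi-valued field.
Set<String> set = new HashSet<String>();
for (String s : "abc,edf,kkk,lll".split(",")) {
set.add(s);
}
// A Map field value is sent as an atomic update; "set" replaces the field's contents.
Map<String, Object> map = new HashMap<String, Object>();
map.put("set", set);
doc2.addField("key_ss", map);
docList.add(doc2);
}
// Add the whole list in one request with commitWithin = 1000 ms; no explicit commit() is needed.
client.add(docList, 1000);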
2. Multi-threaded case
(1) Writes to HBase should be made from multiple threads.
(2) Coprocessors execute in parallel across multiple regions.
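As a rough illustration of the multi-threaded case on the client side, the sketch below splits a document list into batches and submits them from a fixed thread pool. It is JDK-only and makes several assumptions: ParallelIndexer and indexInParallel are hypothetical names, the sink stands in for a batch add on a shared client, and that client is assumed to be safe to call from several threads at once. It does not model HBase or coprocessors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

final class ParallelIndexer {
    /**
     * Splits docs into batches of at most batchSize and passes each batch to the
     * sink on a pool of worker threads. Returns the number of batches sent.
     * Assumption: the sink is thread-safe.
     */
    static <T> int indexInParallel(List<T> docs, int batchSize, int threads,
                                   Consumer<List<T>> sink) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<Void>> futures = new ArrayList<>();
            for (int start = 0; start < docs.size(); start += batchSize) {
                int end = Math.min(start + batchSize, docs.size());
                List<T> batch = new ArrayList<>(docs.subList(start, end));
                futures.add(CompletableFuture.runAsync(() -> sink.accept(batch), pool));
            }
            for (CompletableFuture<Void> f : futures) {
                f.join(); // wait for the batch; rethrows a sink failure as CompletionException
            }
            return futures.size();
        } finally {
            pool.shutdown();
        }
    }
}

final class ParallelIndexerDemo {
    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            docs.add("way2" + i);
        }
        AtomicInteger indexed = new AtomicInteger();
        // The lambda only counts documents; it stands in for sending the batch to Solr.
        int batches = ParallelIndexer.indexInParallel(
                docs, 100, 4, batch -> indexed.addAndGet(batch.size()));
        System.out.println(batches + " batches, " + indexed.get() + " docs");
        // prints "10 batches, 1000 docs"
    }
}
```

The thread count and batch size here are arbitrary; suitable values depend on the cluster and would need to be measured.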