SOLR Performance and SolrJ(2)Compress and Post File

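This article describes a way to index data by compressing files before sending them to SOLR, and mentions SolrCloud with sharding as a way to increase indexer bandwidth. It shows how to generate SOLR XML files, compress and transfer them to the indexer machine, and monitor and process the files there.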

The idea is to compress the post data before sending it to the SOLR indexer. Alternatively, we could use SolrCloud with shards to increase the bandwidth available to the indexer machines.

In the end, we decided to generate the XML file, compress it, and SCP the compressed file to the indexer machine. On the indexer machine, a monitor unzips the file and posts it to SOLR on localhost, so the uncompressed post no longer uses network bandwidth.

I am using XMLWriter to generate the SOLR XML. The code looks similar to this:
public function __construct($ioc)
{
    $this->ioc = $ioc;

    // Services looked up from the IoC container
    $logger = $this->ioc->getService("logger");
    $config = $this->ioc->getService("config");

    $this->xmlWriter = new \XMLWriter();
}

// Open the output file and write the opening <update> element
public function addStart($file)
{
    $this->xmlWriter->openURI($file);
    $this->xmlWriter->setIndent(true);

    $this->xmlWriter->startElement('update');
}
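The snippet above only opens the file and the `<update>` element. As a rough sketch of how the rest of the file might be written with the same XMLWriter calls, the following standalone function emits the `<delete>` and `<add>` sections and closes the document. The function name `write_solr_update` and its parameters are hypothetical and do not come from the original class. It uses `text()`, which escapes special characters, where the sample XML further down uses CDATA sections.

```php
<?php
// Hypothetical helper (not from the original class): writes one SOLR update
// file containing optional <delete> and <add> sections.
function write_solr_update(string $file, array $deleteIds, array $docs): void
{
    $w = new XMLWriter();
    $w->openUri($file);
    $w->setIndent(true);

    $w->startElement('update');

    if ($deleteIds) {
        $w->startElement('delete');
        foreach ($deleteIds as $id) {
            $w->writeElement('id', (string) $id);
        }
        $w->endElement(); // </delete>
    }

    if ($docs) {
        $w->startElement('add');
        foreach ($docs as $doc) {
            $w->startElement('doc');
            foreach ($doc as $name => $value) {
                $w->startElement('field');
                $w->writeAttribute('name', (string) $name);
                $w->text((string) $value); // text() escapes &, < and >
                $w->endElement(); // </field>
            }
            $w->endElement(); // </doc>
        }
        $w->endElement(); // </add>
    }

    $w->endElement(); // </update>
    $w->flush();      // write the buffered XML to the file
}
```

A call such as `write_solr_update('/tmp/delta.xml', [2136083108], [['id' => 1, 'paused' => 'false']])` would produce a file in the shape of the sample shown at the end of this post. Booleans should be passed as the strings `'true'` or `'false'`, since casting PHP's `false` to a string gives an empty string.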

Gzip the file and SCP it to the target machine:
system("gzip -f {$file_path}");
system("scp -i /share/ec2.id -o StrictHostKeyChecking=no {$file_path}.gz ec2-user@{$ip}:" . $this->XML_FOLDER);
unlink($file_path . ".gz");
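The three calls above delete the local `.gz` even when `gzip` or `scp` fails. A slightly more defensive variant, sketched below, checks each exit status and escapes the shell arguments. The function name `compress_and_ship` and its parameters are assumptions made for illustration; the `gzip` and `scp` options are the ones used above.

```php
<?php
// Hypothetical helper: gzip a file, copy it with scp, and delete the local
// .gz only after the copy succeeded. Returns true on success.
function compress_and_ship(string $filePath, string $target, string $keyFile): bool
{
    $gzPath = $filePath . '.gz';

    // gzip -f replaces $filePath with $filePath.gz, overwriting a stale .gz.
    exec('gzip -f ' . escapeshellarg($filePath), $out, $status);
    if ($status !== 0) {
        return false;
    }

    $cmd = sprintf(
        'scp -i %s -o StrictHostKeyChecking=no %s %s',
        escapeshellarg($keyFile),
        escapeshellarg($gzPath),
        escapeshellarg($target)
    );
    exec($cmd, $out, $status);
    if ($status !== 0) {
        return false; // keep the .gz so the transfer can be retried
    }

    return unlink($gzPath);
}
```

It could be called as `compress_and_ship($file_path, "ec2-user@{$ip}:" . $this->XML_FOLDER, '/share/ec2.id')`. When it returns false, the caller can retry or raise an alert.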

On the target machine, we watch the directory and run a curl POST request against the SOLR server. PHP makes this really easy. First, list the pending files, oldest first:

$delta_files = array();
exec('ls -tr --time=ctime /mnt/ad_feed/*.gz 2>/dev/null', $delta_files);
$delta_count = count($delta_files);
if(DEBUG) echo "delta count: ".$delta_count."\n";
if($delta_count == 0) continue;

Check how many curl processes are already running:
$curl_processes = array();
exec('ps -ef | grep "curl --fail http://localhost:8983/job/update -d @/mnt/ad_feed/" | grep -v grep', $curl_processes);
$curl_count = count($curl_processes);
if(DEBUG) echo "curl count: ".$curl_count."\n";
if($curl_count >= MAX_PROCS) continue;

Run the worker in the background, so that exec returns immediately and multiple processes can run in parallel:
$curl_command = "php delta_curl.php $cur_file > /dev/null 2>&1 &"; //parallel processes
exec($curl_command);
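The fragments above come from inside a polling loop that the post does not show in full. One way to assemble such a loop is sketched below. Note that it swaps the `ps | grep` count for a different technique: it remembers which files have been dispatched and treats a file as finished once its worker has deleted it, which also keeps the same file from being dispatched twice. The function names, the one-second sleep and the parameter values are all assumptions.

```php
<?php
// List pending feed files, oldest first (by ctime, like `ls -tr --time=ctime`).
function list_feed_files(string $pattern): array
{
    $files = glob($pattern) ?: [];
    usort($files, fn($a, $b) => filectime($a) <=> filectime($b));
    return $files;
}

// Choose files to dispatch now: skip files already being processed and
// never exceed $maxProcs concurrent workers.
function pick_files(array $files, array $inFlight, int $maxProcs): array
{
    $pending = array_values(array_diff($files, $inFlight));
    $free = max(0, $maxProcs - count($inFlight));
    return array_slice($pending, 0, $free);
}

// Hypothetical monitor loop; runs until the process is killed.
function monitor_loop(string $pattern, int $maxProcs): void
{
    $inFlight = [];
    while (true) {
        $files = list_feed_files($pattern);
        // A worker deletes its file when done, so a dispatched file that
        // has disappeared means that worker has finished.
        $inFlight = array_values(array_intersect($inFlight, $files));

        foreach (pick_files($files, $inFlight, $maxProcs) as $file) {
            // Trailing & runs the worker in the background.
            exec('php delta_curl.php ' . escapeshellarg($file) . ' > /dev/null 2>&1 &');
            $inFlight[] = $file;
        }
        sleep(1);
    }
}
```

Calling `monitor_loop('/mnt/ad_feed/*.gz', 4)` would start the monitor with at most four workers. The sketch relies on the worker always deleting its input file, as the posting script does.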

Post the XML file and send an alert if curl fails:
exec("curl --fail http://localhost:8983/job/update -d @{$argv[1]} -H Content-type:application/xml", $output, $status);

if(0 != $status)
{
send_delta_alert($argv[1]);
}
unlink($argv[1]);
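The watched files are `.gz` archives, and the unzip step mentioned earlier is not shown in the posting snippet. A minimal sketch of a worker that decompresses first and then posts could look like the following. It assumes PHP's zlib extension is available, and the function names `gunzip_file` and `post_feed_file` are hypothetical. The curl call is the one from the snippet above.

```php
<?php
// Decompress $gzPath next to itself (dropping the .gz suffix) and return
// the path of the decompressed file.
function gunzip_file(string $gzPath): string
{
    $xmlPath = preg_replace('/\.gz$/', '', $gzPath);
    if ($xmlPath === $gzPath) {
        throw new InvalidArgumentException("not a .gz file: $gzPath");
    }
    $in = gzopen($gzPath, 'rb');
    $out = fopen($xmlPath, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException("cannot decompress $gzPath");
    }
    while (!gzeof($in)) {
        fwrite($out, gzread($in, 65536)); // copy in 64 KiB chunks
    }
    gzclose($in);
    fclose($out);
    return $xmlPath;
}

// Decompress, post to SOLR with the same curl call as above, then clean up.
// Returns curl's exit status (0 on success).
function post_feed_file(string $gzPath): int
{
    $xmlPath = gunzip_file($gzPath);
    $cmd = 'curl --fail http://localhost:8983/job/update -d '
         . escapeshellarg('@' . $xmlPath)
         . ' -H Content-type:application/xml';
    exec($cmd, $output, $status);
    unlink($xmlPath);
    unlink($gzPath);
    return $status;
}
```

In the worker script this would be used as `if (0 != post_feed_file($argv[1])) { send_delta_alert($argv[1]); }`. One detail worth knowing: curl's `-d` strips carriage returns and newlines when it reads the data from a file, while `--data-binary` sends the file unchanged. If the option is changed, the `grep` pattern in the monitor has to change with it.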

The XML has the following format:
<update>
  <delete>
    <id>2136083108</id>
    <id>2136083113</id>
    <id>2136083114</id>
  </delete>
  <add>
    <doc>
      <field name="id">2136083xx</field>
      <field name="customer_id">2xx</field>
      <field name="pool_id">20xx</field>
      <field name="source_id">23xx</field>
      <field name="campaign_id">3xxx</field>
      <field name="segment_id">0</field>
      <field name="job_reference">468-1239-xxxx4</field>
      <field name="title"><![CDATA[CDL-A xxxxx ]]></field>
      <field name="url"><![CDATA[http://www.xxxxxx]]></field>
      <field name="company_id">11xxx7</field>
      <field name="company">Hub xxxxx</field>
      <field name="title_com">CDL-xxxx</field>
      <field name="campaign_com">3396xxx</field>
      <field name="zipcode">3xxxx</field>
      <field name="cities">Atlanta,GA</field>
      <field name="jlocation">33.8444,-84.4741</field>
      <field name="state_id">11</field>
      <field name="cpc">125</field>
      <field name="reg_cpc">130</field>
      <field name="qq_multiplier">0</field>
      <field name="j2c_apply">0</field>
      <field name="created">2016-09-02T06:02:42Z</field>
      <field name="posted">2016-09-02T06:02:42Z</field>
      <field name="experience">2</field>
      <field name="salary">150</field>
      <field name="education">2</field>
      <field name="jobtype">1</field>
      <field name="quality_score">60</field>
      <field name="boost_factor">20.81</field>
      <field name="industry">20</field>
      <field name="industries">20</field>
      <field name="paused">false</field>
      <field name="email"></field>
      <field name="srcseg_id">23xx</field>
      <field name="srccamp_id">23xxx</field>
      <field name="top_spot_type">7</field>
      <field name="top_spot_industries">20</field>
      <field name="is_ad">2</field>
      <field name="daily_capped">0</field>
      <field name="mobile_friendly">1</field>
      <field name="excluded_company">false</field>
    </doc>
  </add>
</update>
