SOLR Performance and SolrJ (2): Compress and Post File
The idea is to compress the post data before sending it to the SOLR indexer. Alternatively, we could use SOLR cloud with sharding to spread the load across the bandwidth of several indexer machines.
In the end, we decided to generate the XML file, compress it, and SCP the compressed file to the indexer machine. On the indexer machine, a monitor unzips the file and posts it to SOLR on localhost. Because that post goes over localhost, it does not use network bandwidth.
I am using XMLWriter to generate the SOLR XML. The code is similar to this:
public function __construct($ioc)
{
    $this->ioc = $ioc;
    // Services come from the IoC container; $logger and $config are not used in this excerpt.
    $logger = $this->ioc->getService("logger");
    $config = $this->ioc->getService("config");
    $this->xmlWriter = new \XMLWriter();
}

// Open the output file and write the root <update> element.
public function addStart($file)
{
    $this->xmlWriter->openURI($file);
    $this->xmlWriter->setIndent(true);
    $this->xmlWriter->startElement('update');
}
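The excerpt above only opens the file and the root element. A minimal sketch of how the rest of the document might be written follows, assuming the same `<delete>`/`<add>` layout as the sample later in this post. The function `buildUpdateXml()` and its parameters are hypothetical names, and it writes to memory rather than to a file so that it can be run on its own.

```php
<?php
// Sketch only: buildUpdateXml() and its parameters are hypothetical,
// not part of the original class.
function buildUpdateXml(array $deleteIds, array $docs, array $cdataFields = ['title', 'url']): string
{
    $w = new XMLWriter();
    $w->openMemory();              // the post uses openURI($file) to write to disk
    $w->setIndent(true);
    $w->startElement('update');

    if ($deleteIds) {
        $w->startElement('delete');
        foreach ($deleteIds as $id) {
            $w->writeElement('id', (string) $id);
        }
        $w->endElement();          // </delete>
    }

    if ($docs) {
        $w->startElement('add');
        foreach ($docs as $doc) {
            $w->startElement('doc');
            foreach ($doc as $name => $value) {
                $w->startElement('field');
                $w->writeAttribute('name', (string) $name);
                if (in_array($name, $cdataFields, true)) {
                    $w->writeCdata((string) $value);  // free text goes into CDATA
                } else {
                    $w->text((string) $value);        // text() escapes &, < and >
                }
                $w->endElement();  // </field>
            }
            $w->endElement();      // </doc>
        }
        $w->endElement();          // </add>
    }

    $w->endElement();              // </update>
    return $w->outputMemory();
}

$xml = buildUpdateXml(
    [2136083108, 2136083113],
    [['id' => '2136083100', 'title' => 'CDL-A Driver & Trainer', 'cpc' => 125]]
);
echo $xml;
```

Values written with `writeCdata()` must not contain the sequence `]]>`, so a real feed would need to check for that or fall back to `text()`.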
Gzip the file and SCP it to the target machine:
system("gzip -f {$file_path}");
system("scp -i /share/ec2.id -o StrictHostKeyChecking=no {$file_path}.gz ec2-user@{$ip}:" . $this->XML_FOLDER);
unlink($file_path . ".gz");
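The three lines above delete the local `.gz` file whether or not the transfer succeeded. A hedged sketch of a more defensive variant follows. The helpers `gzipFile()`, `buildScpCommand()` and `shipFile()` are hypothetical names, the compression uses PHP's zlib extension instead of the `gzip` binary, and the key file and user are the ones shown in the post.

```php
<?php
// Sketch only: gzipFile(), buildScpCommand() and shipFile() are hypothetical helpers.

// Compress $path to "$path.gz" with the zlib extension and return the new path.
function gzipFile(string $path): string
{
    $gzPath = $path . '.gz';
    $in = fopen($path, 'rb');
    $out = gzopen($gzPath, 'wb9');
    if ($in === false || $out === false) {
        throw new RuntimeException("cannot gzip {$path}");
    }
    while (!feof($in)) {
        $chunk = fread($in, 1 << 20);          // copy in 1 MiB chunks
        if ($chunk === false) {
            throw new RuntimeException("read error on {$path}");
        }
        gzwrite($out, $chunk);
    }
    fclose($in);
    gzclose($out);
    unlink($path);                             // like gzip, remove the uncompressed original
    return $gzPath;
}

// Build the scp command with every argument shell-escaped.
function buildScpCommand(string $gzPath, string $keyFile, string $user, string $host, string $remoteDir): string
{
    return sprintf(
        'scp -i %s -o StrictHostKeyChecking=no %s %s',
        escapeshellarg($keyFile),
        escapeshellarg($gzPath),
        escapeshellarg("{$user}@{$host}:{$remoteDir}")
    );
}

// Compress, copy, and delete the local .gz only if scp exited with status 0.
function shipFile(string $path, string $host, string $remoteDir): bool
{
    $gzPath = gzipFile($path);
    exec(buildScpCommand($gzPath, '/share/ec2.id', 'ec2-user', $host, $remoteDir), $output, $status);
    if ($status !== 0) {
        return false;                          // keep the .gz so the transfer can be retried
    }
    unlink($gzPath);
    return true;
}
```

Keeping the file on failure means a later run can retry the copy instead of silently losing a delta.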
On the target machine, we watch the directory and run a curl POST against the SOLR server for each file. PHP makes this easy. First, list the pending files, oldest first:
$delta_files = array();
exec('ls -tr --time=ctime /mnt/ad_feed/*.gz 2>/dev/null', $delta_files);
$delta_count = count($delta_files);
if(DEBUG) echo "delta count: ".$delta_count."\n";
if($delta_count == 0) continue;
Check how many curl processes are already running:
$curl_processes = array();
exec('ps -ef | grep "curl --fail http://localhost:8983/job/update -d @/mnt/ad_feed/" | grep -v grep', $curl_processes);
$curl_count = count($curl_processes);
if(DEBUG) echo "curl count: ".$curl_count."\n";
if($curl_count >= MAX_PROCS) continue;
Run the command in the background so that exec returns immediately and several processes can work in parallel:
$curl_command = "php delta_curl.php $cur_file > /dev/null 2>&1 &"; //parallel processes
exec($curl_command);
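The three fragments above sit inside a loop that the post does not show. A rough sketch of how they might fit together follows. The function names, the 5-second sleep, and the step that renames a file before launching its worker are all assumptions of mine. Without the rename, a file that is still being posted would be listed and launched again on the next pass.

```php
<?php
// Sketch only: function names, the sleep interval and the ".processing"
// rename are assumptions, not taken from the original watcher.

// Pending .gz files, oldest first (the PHP equivalent of `ls -tr --time=ctime`).
function listDeltaFiles(string $dir): array
{
    $files = glob(rtrim($dir, '/') . '/*.gz') ?: [];
    usort($files, fn($a, $b) => filectime($a) <=> filectime($b));
    return $files;
}

// Number of workers that may still be started.
function freeSlots(int $running, int $maxProcs): int
{
    return max(0, $maxProcs - $running);
}

// Rename the file so the next listing no longer matches it.
function claimFile(string $file): string
{
    $claimed = $file . '.processing';
    if (!rename($file, $claimed)) {
        throw new RuntimeException("cannot claim {$file}");
    }
    return $claimed;
}

function watch(string $dir, int $maxProcs): void
{
    while (true) {
        $running = [];
        exec('ps -ef | grep "curl --fail http://localhost:8983/job/update" | grep -v grep', $running);
        $slots = freeSlots(count($running), $maxProcs);
        foreach (array_slice(listDeltaFiles($dir), 0, $slots) as $file) {
            $claimed = claimFile($file);
            // Trailing "&" puts the worker in the background so exec() returns at once.
            exec('php delta_curl.php ' . escapeshellarg($claimed) . ' > /dev/null 2>&1 &');
        }
        sleep(5);
    }
}

// watch('/mnt/ad_feed', 4);   // example call; 4 is an arbitrary value for MAX_PROCS
```

Counting processes with `ps` is only approximate: a worker that has started but not yet spawned curl is not counted, so the limit can be exceeded briefly.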
Post the XML file. The file path arrives as the first command-line argument ($argv[1]):
exec("curl --fail http://localhost:8983/job/update -d @{$argv[1]} -H Content-type:application/xml", $output, $status);
if(0 != $status)
{
send_delta_alert($argv[1]);
}
unlink($argv[1]);
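The snippet above posts whatever path it receives, while the watcher hands it `.gz` files, so the unzip step mentioned earlier is not shown. A sketch of one way to fill that gap follows, assuming the worker decompresses to a temporary file before calling curl. `gunzipFile()`, `buildPostCommand()` and `postDelta()` are hypothetical helpers. The sketch also uses `--data-binary` because curl's `-d @file` strips carriage returns and newlines from the file.

```php
<?php
// Sketch only: gunzipFile(), buildPostCommand() and postDelta() are hypothetical helpers.

// Decompress $gzPath into $xmlPath using the zlib extension.
function gunzipFile(string $gzPath, string $xmlPath): void
{
    $in = gzopen($gzPath, 'rb');
    $out = fopen($xmlPath, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException("cannot decompress {$gzPath}");
    }
    while (!gzeof($in)) {
        $chunk = gzread($in, 1 << 20);         // read 1 MiB of decompressed data
        if ($chunk === false) {
            throw new RuntimeException("read error on {$gzPath}");
        }
        fwrite($out, $chunk);
    }
    gzclose($in);
    fclose($out);
}

// Build the curl command; --data-binary sends the file byte for byte.
function buildPostCommand(string $xmlPath, string $url = 'http://localhost:8983/job/update'): string
{
    return sprintf(
        'curl --fail %s --data-binary %s -H %s',
        escapeshellarg($url),
        escapeshellarg('@' . $xmlPath),
        escapeshellarg('Content-Type: application/xml')
    );
}

// Decompress, post, and report whether curl exited with status 0.
function postDelta(string $gzPath): bool
{
    $xmlPath = $gzPath . '.xml';
    gunzipFile($gzPath, $xmlPath);
    exec(buildPostCommand($xmlPath), $output, $status);
    unlink($xmlPath);                          // the temporary XML is no longer needed
    return $status === 0;
}

// In the worker script, mirroring the original flow:
// if (!postDelta($argv[1])) { send_delta_alert($argv[1]); }
// unlink($argv[1]);
```

If the command line changes like this, the `ps ... | grep` pattern that counts running curl processes has to be adjusted to match it.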
A sample of the XML format follows:
<update>
<delete>
<id>2136083108</id>
<id>2136083113</id>
<id>2136083114</id>
</delete>
<add>
<doc>
<field name="id">2136083xx</field>
<field name="customer_id">2xx</field>
<field name="pool_id">20xx</field>
<field name="source_id">23xx</field>
<field name="campaign_id">3xxx</field>
<field name="segment_id">0</field>
<field name="job_reference">468-1239-xxxx4</field>
<field name="title"><![CDATA[CDL-A xxxxx ]]></field>
<field name="url"><![CDATA[http://www.xxxxxx]]></field>
<field name="company_id">11xxx7</field>
<field name="company">Hub xxxxx</field>
<field name="title_com">CDL-xxxx</field>
<field name="campaign_com">3396xxx</field>
<field name="zipcode">3xxxx</field>
<field name="cities">Atlanta,GA</field>
<field name="jlocation">33.8444,-84.4741</field>
<field name="state_id">11</field>
<field name="cpc">125</field>
<field name="reg_cpc">130</field>
<field name="qq_multiplier">0</field>
<field name="j2c_apply">0</field>
<field name="created">2016-09-02T06:02:42Z</field>
<field name="posted">2016-09-02T06:02:42Z</field>
<field name="experience">2</field>
<field name="salary">150</field>
<field name="education">2</field>
<field name="jobtype">1</field>
<field name="quality_score">60</field>
<field name="boost_factor">20.81</field>
<field name="industry">20</field>
<field name="industries">20</field>
<field name="paused">false</field>
<field name="email"></field>
<field name="srcseg_id">23xx</field>
<field name="srccamp_id">23xxx</field>
<field name="top_spot_type">7</field>
<field name="top_spot_industries">20</field>
<field name="is_ad">2</field>
<field name="daily_capped">0</field>
<field name="mobile_friendly">1</field>
<field name="excluded_company">false</field>
</doc>
</add>
</update>
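Before shipping a file in this format, it can be worth checking that it parses and contains what you expect. The sketch below is a hypothetical helper, `summarizeUpdateXml()`, that counts the `<delete>` ids and `<add>` documents with SimpleXML. It loads the whole document into memory, so it only suits small or medium files.

```php
<?php
// Sketch only: summarizeUpdateXml() is a hypothetical helper.
function summarizeUpdateXml(string $xml): array
{
    libxml_use_internal_errors(true);          // collect parse errors instead of emitting warnings
    $root = simplexml_load_string($xml);
    if ($root === false || $root->getName() !== 'update') {
        throw new InvalidArgumentException('not a well-formed <update> document');
    }

    $deletes = $root->xpath('delete/id') ?: [];
    $docs = $root->xpath('add/doc') ?: [];

    // Collect the value of the "id" field of every added document.
    $addIds = [];
    foreach ($docs as $doc) {
        foreach ($doc->xpath('field[@name="id"]') ?: [] as $field) {
            $addIds[] = (string) $field;
        }
    }

    return [
        'deletes' => count($deletes),
        'adds'    => count($docs),
        'add_ids' => $addIds,
    ];
}

$sample = '<update><delete><id>2136083108</id><id>2136083113</id></delete>'
        . '<add><doc><field name="id">2136083100</field><field name="cpc">125</field></doc></add></update>';
print_r(summarizeUpdateXml($sample));
```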