#Run CTS Flow
#Try the Sample flow ReadDocumentBasic.flow
#Develop a CTS flow using the ESP crawler
The callback needs to be enabled for both the nullWriter and the ESPWriter operators in the flow.
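As a rough illustration only: the callback is switched on in each writer operator's configuration in Visual Studio. The fragment below is a hypothetical sketch of what that might look like in the .flow XML; "enableCallback" is a placeholder name, not the documented property, so set the real option through the operator's property grid.
<!-- hypothetical fragment; element and property names are placeholders -->
<Operator name="ESPWriter">
<Property name="enableCallback" value="true" />
</Operator>
<Operator name="nullWriter">
<Property name="enableCallback" value="true" />
</Operator>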
#Configure an ESP pipeline to work with the CTS flow
Get the CTS stages ctsAnnotationsImporter and ctsParser from the feedingOverlay package.
Do not use ctsAnnotationsImporter or the related scope-processing stages (scopifier and xmlifier), especially for CJK content.
They do not appear to work correctly at the moment; otherwise you will get "FIXML has illegal UTF-8 byte sequences".
Create a pipeline named CrawlerCts using the sitesearch pipeline as a template:
1. Instantiate a stage named CtsParserCrawler based on CtsParser and set:
GenerateScopesFromAnnotations 0
2. Put CtsParserCrawler after Docinit.
3. Remove the following stages:
DocumentRetriever
URLProcessor
Decompressor
FormatDetector
SimpleConverter
FlashConverter
PDFConverter
XPSConverter
SearchExportConverter
FastHTMLParser
#For CJK, do not remove:
LanguageAndEncodingDetector
EncodingNormalizer
#If you need WebAnalyser, do not remove:
WAAttributeLookup
WALinkRankAnchorTextFormatter
WACrawlerLinkFilter
WARankDocument
4. The CtsAnnotationsImporter stage is not necessary if you do not need scope searching.
Tip: define your collection name in the Mapper operator.
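In practice this means mapping a constant collection name onto the record before it reaches the ESPWriter. The fragment below is only a conceptual sketch; the element and attribute names are placeholders, and the actual mapping is configured through the Mapper operator's property grid in Visual Studio.
<!-- conceptual sketch only; names are placeholders, not the real Mapper schema -->
<Mapper>
<Mapping target="collection" value="cntv1" />
</Mapper>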
#Using ESP Crawler with CTS
In c:/esp/etc/CrawlerGlobalDefaults.xml, add a section named cde:
...
<section name="cde">
<attrib name="contentdistributors" type="list-string">
<member> localhost:17078 </member>
</attrib>
</section>
...
Restart the crawler so it picks up the change:
nctrl stop crawler
nctrl start crawler
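After the restart it is worth confirming that the crawler process came back up. Assuming this ESP installation provides the usual system status listing, the following shows each process and its run state:
nctrl sysstatus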
Configure the crawler's feeding destinations parameter in the Admin GUI.
(What the FSIS documentation describes will not work: if no feeding destination is defined, the exported config file will be empty for this group of parameters.)
Name: cde
Target Collection: cntv1;fsistraining.crawlingvideo
Destination: cde
Pause ESP feeding: no
Primary: yes
crawleradmin.exe -G cntv1 > crawler_cntv1.xml
notepad ./crawler_cntv1.xml
#Confirm the feeding destination parameter
section name="feeding">
<section name="cde">
<attrib name="collection" type="string"> cntv1;fsistraining.crawlingvideo </attrib>
<attrib name="destination" type="string"> cde </attrib>
<attrib name="paused" type="boolean"> no </attrib>
<attrib name="primary" type="boolean"> yes </attrib>
</section>
</section>
#Change start_uris and include_uris to define what you are going to crawl
<attrib name="start_uris" type="list-string">
<member> http://kejiao.cntv.cn/nature/kexueshijie/classpage/video/20100812/101064.shtml </member>
</attrib>
<section name="include_uris">
<attrib name="exact" type="list-string">
<member> http://kejiao.cntv.cn/nature/kexueshijie/classpage/video/20100812/101064.shtml </member>
</attrib>
</section>
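With only an exact rule, the crawler is limited to that single page. To crawl a whole section of the site, a broader include rule can be used instead; the sketch below assumes this crawler version also accepts a prefix rule inside include_uris (verify against the exported config or the crawler documentation):
<section name="include_uris">
<attrib name="prefix" type="list-string">
<member> http://kejiao.cntv.cn/nature/ </member>
</attrib>
</section>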
Run the flow from Visual Studio or the FSIS Admin GUI.
Remove the crawler datasource definition from the collection cntv1, then feed the exported config back to the crawler:
crawleradmin -f ./crawler_cntv1.xml
Enterprise Crawler 6.7.8 - Admin Client
Copyright (C) 2008 FAST, A Microsoft(R) Subsidiary
Added collection config(s): Scheduled collection for crawling
#Watching the crawler from the command window
crawleradmin --status
Enterprise Crawler 6.7.8 - Admin Client
Copyright (C) 2008 FAST, A Microsoft(R) Subsidiary
Collection    Status    Feed Status    Active Sites    Stored Docs    Doc Rate
-------------------------------------------------------------------------------
cntv10        Idle      Feeding        0               1              N/A
cntv11        Idle      Feeding        0               1              N/A
cntv12        Idle      Feeding        0               1              N/A
cntv13        Idle      Feeding        0               1              N/A
cntv14        Idle      Feeding        0               1              N/A
cntv8         Idle      Feeding        0               2              N/A
cntv9         Idle      Feeding        0               5              N/A
                                       0               12             0.0 dps
#Watching doclog
doclog -l
doclog -a http://xxx/xxxx/
#Watching CTS Flow log from
C:/Users/FSIS Service/AppData/Local/FSIS/Nodes/Fsis/ContentEngineNode1/Logs/ContentProcessing
#Watching Crawler log from
C:/esp/var/log/crawler
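To follow one of these logs while the crawl runs, a plain PowerShell tail works; the log file name below is an assumption, so substitute whatever file actually appears in the directory:
powershell -Command "Get-Content 'C:/esp/var/log/crawler/crawler.log' -Wait"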
#Adding a Spy stage into the ESP pipeline to monitor
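A Spy stage is a debugging stage that logs the document's attributes as it passes through the pipeline, so placing one after CtsParserCrawler shows what the crawler and the CTS flow actually delivered at that point. Its output goes to the document processor (procserver) logs, typically under C:/esp/var/log/procserver (directory name assumed by analogy with the crawler log path above).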
Summary: this article has covered how to configure a CTS flow and combine it with the ESP crawler for document crawling and processing, including building the CTS flow, configuring the pipeline, and setting the crawler parameters.