We need to collect the URLs of posts on a forum and, based on statistics over the collected URLs, add parsing support for the URL patterns that appear most often. This post records the statistics part.
Taking the search result pages of ZOL (Zhongguancun Online) as an example, we collect the first 5 pages of results for the keywords "huawei" and "xiaomi".
//ZolGetNewsInfo.js
var casper = require('casper').create({
    viewportSize: {
        width: 800,
        height: 600
    },
    pageSettings: {
        loadImages: false,
        loadPlugins: false,
        userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'
    },
    verbose: true,
    logLevel: 'info',
    stepTimeout: 300000,
    onStepTimeout: function(timeout, stepNum) {
        // Dump whatever has been collected so far before bailing out.
        casper.echo('step ' + stepNum + ': time out!');
        require('utils').dump(aList);
    }
});
var aList = [];
var time_out_value = 30000;
var i = 1;
function getUrls() {
    // Each search result title is an <h3><a>; accumulate the hrefs across pages.
    aList = aList.concat(casper.getElementsAttribute('h3 a', 'href'));
}
casper.start().repeat(5, function() {
    var base = casper.cli.get(0);
    var url = base + '&page=' + i;
    i++;
    casper.log('current url: ' + url, 'info');
    casper.thenOpen(url);
    // Wait for the result links to appear instead of sleeping a fixed 1000 ms.
    casper.waitForSelector('h3 a', function() {
        getUrls();
    }, null, time_out_value);
    casper.then(function() {
        require('utils').dump(aList);
    });
});
casper.run();
Run the script once per keyword and redirect the dumped results into a text file:
$ casperjs ZolGetNewsInfo.js 'http://search.zol.com.cn/s/article_more.php?kword=xiaomi' > zol.txt
$ casperjs ZolGetNewsInfo.js 'http://search.zol.com.cn/s/article_more.php?kword=huawei' >> zol.txt
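Since utils.dump() prints the collected array as indented JSON, a single clean dump could in principle be parsed with Python's json module instead of line-by-line string cleanup. A minimal sketch (the URLs here are made up for illustration; the real zol.txt also interleaves CasperJS log lines with the JSON, which is why the processing script filters line by line):

```python
import json

# A small example of what utils.dump() emits for an array of hrefs
# (these URLs are hypothetical):
sample = '''[
    "http://mobile.zol.com.cn/565/5650970.html",
    "http://pad.zol.com.cn/566/5669999.html"
]'''

# A clean dump is valid JSON and parses directly:
urls = json.loads(sample)
print(len(urls))
```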
Then clean up, deduplicate, and tally the results:
# getUrl.py
def get_url_from_casper_file(file_path):
    # Keep only the dumped "http..." lines and strip the JSON quoting.
    url_list = []
    try:
        with open(file_path) as f:
            for each_line in f:
                stripped = each_line.strip()
                if stripped.startswith('"http'):
                    url_list.append(stripped.strip('",'))
    except IOError:
        pass
    return url_list

def deduplicate_sort_urls(url_list):
    # set() drops duplicates; sorting puts URLs with the same host prefix
    # next to each other so they can be counted in one pass.
    return sorted(set(url_list))

def get_host_prefix(url):
    # A crude host key: 20 characters are enough to tell the ZOL sub-domains apart.
    return url[:20]

url_list = get_url_from_casper_file('../data/zol.txt')
clean_url = deduplicate_sort_urls(url_list)
count = 0
host = ''
print len(clean_url)
for item in clean_url:
    if count == 0:
        host = get_host_prefix(item)
        count += 1
    elif host == get_host_prefix(item):
        count += 1
    else:
        print 'host: ' + host + ', number of urls: ' + str(count)
        count = 1
        host = get_host_prefix(item)
print 'host: ' + host + ', number of urls: ' + str(count)
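The counting loop above emits a line each time the 20-character prefix changes, which works because the list is sorted. The same grouping can be sketched more compactly with itertools.groupby, which likewise requires sorted input (count_by_host is a made-up helper name and the URLs are illustrative):

```python
from itertools import groupby

def count_by_host(sorted_urls, prefix_len=20):
    # groupby collapses runs of equal keys, so the input must be sorted first.
    return [(host, len(list(group)))
            for host, group in groupby(sorted_urls, key=lambda u: u[:prefix_len])]

print(count_by_host(sorted([
    "http://4g.zol.com.cn/a.html",
    "http://mobile.zol.com.cn/c.html",
    "http://4g.zol.com.cn/b.html",
])))
```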
Each ZOL search result page contains 10 results; with the two keywords above and 5 pages each, that makes 10 * 5 * 2 = 100 URLs in total (6 of them were duplicates and got removed). Running python getUrl.py prints:
94
host: http://4g.zol.com.cn, number of urls: 2
host: http://dealer.zol.co, number of urls: 1
host: http://gz.zol.com.cn, number of urls: 13
host: http://mobile.zol.co, number of urls: 54
host: http://net.zol.com.c, number of urls: 8
host: http://news.zol.com., number of urls: 1
host: http://pad.zol.com.c, number of urls: 7
host: http://server.zol.co, number of urls: 1
host: http://smartwear.zol, number of urls: 1
host: http://sz.zol.com.cn, number of urls: 4
host: http://tuan.zol.com/, number of urls: 2
Host prefixes that appear only once or twice will not get dedicated parsing support: they occur rarely in the sample, and since the sampling is random, they can be expected to remain rare in the future. Skipping them should not noticeably hurt the overall extraction coverage.
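That cutoff can be made explicit with a frequency threshold on the prefix counts. A sketch using collections.Counter; the helper name frequent_hosts, the min_count of 3, and the sample URLs are assumptions for illustration, not part of the original script:

```python
from collections import Counter

def frequent_hosts(urls, prefix_len=20, min_count=3):
    # Count URLs per host prefix and keep only the ones worth a dedicated parser.
    counts = Counter(u[:prefix_len] for u in urls)
    return {host: n for host, n in counts.items() if n >= min_count}

hosts = frequent_hosts([
    "http://4g.zol.com.cn/a.html",       # appears once -> dropped
    "http://mobile.zol.com.cn/1.html",
    "http://mobile.zol.com.cn/2.html",
    "http://mobile.zol.com.cn/3.html",
])
print(hosts)
```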