We need to collect the URLs of posts on a forum and, based on statistics over the collected URLs, add parsing support for the URL patterns that appear most often. This post records the statistics part.
Taking the search result pages of ZOL (Zhongguancun Online) as an example, we collect the first 5 pages of results for the keywords "huawei" and "xiaomi".
//ZolGetNewsInfo.js
var casper = require('casper').create({
    viewportSize: {
        width: 800,
        height: 600
    },
    pageSettings: {
        loadImages: false,
        loadPlugins: false,
        userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'
    },
    verbose: true,
    logLevel: 'info',
    stepTimeout: 300000,
    onStepTimeout: function(timeout, stepNum) {
        // Dump whatever has been collected so far before bailing out.
        casper.echo('step ' + stepNum + ': time out!');
        require('utils').dump(aList);
    }
});
var aList = [];
var time_out_value = 30000;
var i = 1;
function getUrls() {
    // Each search result title is an <h3><a>; accumulate the hrefs across pages.
    aList = aList.concat(casper.getElementsAttribute('h3 a', 'href'));
}
casper.start().repeat(5, function() {
    var base = casper.cli.get(0);
    var url = base + '&page=' + i;
    i++;
    casper.log('current url: ' + url, 'info');
    casper.thenOpen(url);
    // Wait for the result links to appear instead of sleeping a fixed 1000 ms.
    casper.waitForSelector('h3 a', function() {
        getUrls();
    }, null, time_out_value);
    casper.then(function() {
        require('utils').dump(aList);
    });
});
casper.run();
Run the script once per keyword and redirect the dumped results into a text file:
$ casperjs ZolGetNewsInfo.js 'http://search.zol.com.cn/s/article_more.php?kword=xiaomi' > zol.txt
$ casperjs ZolGetNewsInfo.js 'http://search.zol.com.cn/s/article_more.php?kword=huawei' >> zol.txt
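Since utils.dump() prints the collected array as indented JSON, a single clean dump could in principle be parsed with Python's json module instead of line-by-line string cleanup. A minimal sketch (the URLs here are made up for illustration; the real zol.txt also interleaves CasperJS log lines with the JSON, which is why the processing script filters line by line):

```python
import json

# A small example of what utils.dump() emits for an array of hrefs
# (these URLs are hypothetical):
sample = '''[
    "http://mobile.zol.com.cn/565/5650970.html",
    "http://pad.zol.com.cn/566/5669999.html"
]'''

# A clean dump is valid JSON and parses directly:
urls = json.loads(sample)
print(len(urls))
```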
Then clean up, deduplicate, and tally the results:
# getUrl.py
def get_url_from_casper_file(file_path):
    # Keep only the dumped "http..." lines and strip the JSON quoting.
    url_list = []
    try:
        with open(file_path) as f:
            for each_line in f:
                stripped = each_line.strip()
                if stripped.startswith('"http'):
                    url_list.append(stripped.strip('",'))
    except IOError:
        pass
    return url_list

def deduplicate_sort_urls(url_list):
    # set() drops duplicates; sorting puts URLs with the same host prefix
    # next to each other so they can be counted in one pass.
    return sorted(set(url_list))

def get_host_prefix(url):
    # A crude host key: 20 characters are enough to tell the ZOL sub-domains apart.
    return url[:20]

url_list = get_url_from_casper_file('../data/zol.txt')
clean_url = deduplicate_sort_urls(url_list)
count = 0
host = ''
print len(clean_url)
for item in clean_url:
    if count == 0:
        host = get_host_prefix(item)
        count += 1
    elif host == get_host_prefix(item):
        count += 1
    else:
        print 'host: ' + host + ', number of urls: ' + str(count)
        count = 1
        host = get_host_prefix(item)
print 'host: ' + host + ', number of urls: ' + str(count)
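The counting loop above emits a line each time the 20-character prefix changes, which works because the list is sorted. The same grouping can be sketched more compactly with itertools.groupby, which likewise requires sorted input (count_by_host is a made-up helper name and the URLs are illustrative):

```python
from itertools import groupby

def count_by_host(sorted_urls, prefix_len=20):
    # groupby collapses runs of equal keys, so the input must be sorted first.
    return [(host, len(list(group)))
            for host, group in groupby(sorted_urls, key=lambda u: u[:prefix_len])]

print(count_by_host(sorted([
    "http://4g.zol.com.cn/a.html",
    "http://mobile.zol.com.cn/c.html",
    "http://4g.zol.com.cn/b.html",
])))
```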
Each ZOL search result page contains 10 results; with the two keywords above and 5 pages each, that makes 10 * 5 * 2 = 100 URLs in total (6 of them were duplicates and got removed). Running python getUrl.py prints:
94
host: http://4g.zol.com.cn, number of urls: 2
host: http://dealer.zol.co, number of urls: 1
host: http://gz.zol.com.cn, number of urls: 13
host: http://mobile.zol.co, number of urls: 54
host: http://net.zol.com.c, number of urls: 8
host: http://news.zol.com., number of urls: 1
host: http://pad.zol.com.c, number of urls: 7
host: http://server.zol.co, number of urls: 1
host: http://smartwear.zol, number of urls: 1
host: http://sz.zol.com.cn, number of urls: 4
host: http://tuan.zol.com/, number of urls: 2
Host prefixes that appear only once or twice will not get dedicated parsing support: they occur rarely in the sample, and since the sampling is random, they can be expected to remain rare in the future. Skipping them should not noticeably hurt the overall extraction coverage.
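That cutoff can be made explicit with a frequency threshold on the prefix counts. A sketch using collections.Counter; the helper name frequent_hosts, the min_count of 3, and the sample URLs are assumptions for illustration, not part of the original script:

```python
from collections import Counter

def frequent_hosts(urls, prefix_len=20, min_count=3):
    # Count URLs per host prefix and keep only the ones worth a dedicated parser.
    counts = Counter(u[:prefix_len] for u in urls)
    return {host: n for host, n in counts.items() if n >= min_count}

hosts = frequent_hosts([
    "http://4g.zol.com.cn/a.html",       # appears once -> dropped
    "http://mobile.zol.com.cn/1.html",
    "http://mobile.zol.com.cn/2.html",
    "http://mobile.zol.com.cn/3.html",
])
print(hosts)
```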