Crawling World Wild Web at Scale

This post covers techniques for scraping web pages with Python, including scraping static pages and handling dynamically loaded ones, and explores the design of a distributed crawler.

In this post we discuss some of the existing technologies for scraping, parsing and analyzing web pages. We also talk about some of the challenges software engineers might face while scraping dynamic web pages.

Scraping/Parsing/Mining Web Pages

In September 2012, the iPhone 5 was released. We were interested in finding out people's reactions to the new iPhone, so we wanted to write a simple Python script to scrape and parse online reviews and run a sentiment analysis on the collected reviews. There are many applications for automated opinion mining, where companies are interested in finding out their customers' reactions to new product releases.

For scraping reviews we used the Python urllib module. For parsing page contents and grabbing the required HTML elements we used a Python library called BeautifulSoup. For sentiment analysis, we found an API built on top of the Python NLTK classification library: the sentiment API offered by the text-processing website. Finally, for sending HTTP requests to the text-processing website, we used another Python library called PycURL, which is basically a Python interface to libcurl. You can pull/view the opinion mining code from its GitHub repo: Opinion Mining.
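To make the last step concrete, here is a minimal sketch of posting one review to the text-processing sentiment endpoint with PycURL; the exact endpoint path and response fields are our assumptions, so verify them against the API documentation:

          import urllib
          import pycurl
          from StringIO import StringIO

          buf = StringIO()
          c = pycurl.Curl()
          # Endpoint path is an assumption -- check the text-processing API docs.
          c.setopt(c.URL, 'http://text-processing.com/api/sentiment/')
          c.setopt(c.POSTFIELDS, urllib.urlencode({'text': 'I love my new iPhone!'}))
          c.setopt(c.WRITEFUNCTION, buf.write)   # collect the response body
          c.perform()
          c.close()
          print buf.getvalue()   # JSON with a sentiment label and probabilities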

Please note that there are other solutions for scraping the web. For instance, another open source framework for scraping websites and extracting the needed information is Scrapy.
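To give a flavor of that alternative, a minimal Scrapy spider might look like the sketch below; the spider and field names are our own, and the CSS selector mirrors the Amazon example later in this post:

          import scrapy

          class ReviewSpider(scrapy.Spider):
              name = 'reviews'
              start_urls = ['...']   #add the product page link here

              def parse(self, response):
                  #grab the text of every review container on the page
                  for review in response.css('div.reviewText::text').extract():
                      yield {'review': review}

You would run it with scrapy runspider review_spider.py -o reviews.json and let Scrapy handle request scheduling, retries, and output serialization.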

Scraping Amazon Reviews

For another research project, we were interested in scraping reviews for the George Foreman Grill from the Amazon website. To scrape this page, you can open it in your Chrome browser and use the Chrome inspector to inspect the review elements and figure out which HTML elements you need to grab when parsing the page. If you inspect one of the reviews, you will see that the review is wrapped in a 'div' element using 'reviewText' as its style class.

The code snippet for scraping and parsing Amazon reviews is shown below:


          import urllib
          from bs4 import BeautifulSoup

          amazon_url = "..." #add the link to amazon page here
          ur = urllib.urlopen(amazon_url)
          soup = BeautifulSoup(ur.read())
          posts = soup.select("div.reviewText")   #all review divs, by style class
          print posts[0].text   #this prints the first review
      

See how we grab the div elements for the reviews by filtering on their style class. You can check the Beautiful Soup documentation for more details. With the above snippet, one can successfully get all the reviews.

Scraping Macy's Reviews

We also wanted to scrape the reviews for the same product, but from the Macy's website. So, let's try the same approach as before and see what we get. The only difference is that when you inspect the Macy's page, you see that each review is wrapped in a 'span' element using the style class 'BVRRReviewText'. So, we make the following change to our snippet:


          import urllib
          from bs4 import BeautifulSoup

          macys_url = "..." #add the link to macys page here
          ur = urllib.urlopen(macys_url)
          soup = BeautifulSoup(ur.read())
          posts = soup.select("span.BVRRReviewText")
          print posts[0].text   #this should print first review in theory!
      

If you try the above code, you won't get any review content: the posts list comes back empty, so the last line fails. And more interestingly, if you print ur.read() after the second line and ignore the rest of the code, you'll see that the returned HTML contains no review text at all. Why?

The issue is that the Macy's reviews are populated by Ajax calls to their web server after the initial page load. In other words, this is not a statically loaded HTML page, so simply using urllib does not work here.

How to Scrape Dynamically Loaded Web Pages?

To resolve the above issue, one option is to figure out the POST call that Macy's makes to its web server to populate the reviews, and then replay that request yourself. The other possible solution is to use a framework/library that simulates the operation of a browser.
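A minimal sketch of the first, replay-the-request approach is shown below; the endpoint and parameter names are placeholders of our own, and the real ones can be found in the Network tab of the Chrome inspector while the reviews load:

          import urllib
          import urllib2

          # Hypothetical endpoint and parameters -- replace them with the
          # actual request captured in the Chrome inspector's Network tab.
          review_url = 'http://www1.macys.com/path/to/reviews/endpoint'
          params = urllib.urlencode({'productId': '797879', 'page': 1})
          response = urllib2.urlopen(review_url, params)   # passing data makes it a POST
          print response.read()   # raw review payload, typically JSON or HTML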

Here, we take the second route and use PhantomJS, a headless WebKit browser scriptable with a JavaScript API, to scrape the reviews from Macy's. You can download and build PhantomJS on your machine by following these instructions: How to build PhantomJS. The code below is our hack for getting the Macy's reviews using PhantomJS:


// Get Macys reviews
var page = require('webpage').create(),
    url = 'http://www1.macys.com/shop/product/george-foreman-grp95r-grill-6-servings?ID=797879';

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        // Runs inside the page context: collect every span styled as a review
        var results = page.evaluate(function () {
            var allSpans = document.getElementsByTagName('span');
            var reviews = [];
            for (var i = 0; i < allSpans.length; i++) {
                if (allSpans[i].className === 'BVRRReviewText') {
                    reviews.push(allSpans[i].innerHTML);
                }
            }
            return reviews;
        });
        console.log(results.join('\n'));
    }
    phantom.exit();
});

Note in the above code how we grab the reviews by getting the span elements with the style class 'BVRRReviewText'. Another possible solution that we have found is ghost.py, but we haven't had a chance to test it.
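Incidentally, to run the PhantomJS script above yourself, save it as, say, macys_reviews.js (the file name is just our choice) and launch it from the command line:

          phantomjs macys_reviews.js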

To check out the simple crawler/sentiment analyzer that we developed for crawling Amazon/Macy's reviews, visit our GitHub repository here: opinion-mining.

Scraping the Web at Scale

One problem that we have often come across when scraping the web at scale is the time-consuming nature of scraping: we frequently need to scrape billions of pages. This gives rise to the need for a distributed solution that reduces the total scraping time.

One simple solution that we came up with for designing a distributed web crawler was to use a work queue to distribute time-consuming tasks among multiple workers (i.e., web crawlers). The basic idea is to use work queues to schedule tasks for scraping/parsing many pages by running multiple workers simultaneously, as sketched below.
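As a rough sketch of the worker side, assuming a RabbitMQ broker on localhost and the pika client library in its 0.x-era API (the queue name and callback are our own choices):

          import pika

          connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
          channel = connection.channel()
          channel.queue_declare(queue='scrape_tasks', durable=True)

          def handle_task(ch, method, properties, body):
              print 'scraping %s' % body    # fetch and parse the page here
              ch.basic_ack(delivery_tag=method.delivery_tag)

          channel.basic_qos(prefetch_count=1)   # hand each worker one page at a time
          channel.basic_consume(handle_task, queue='scrape_tasks')
          channel.start_consuming()

A separate producer script pushes the URLs to scrape onto the scrape_tasks queue, and you scale out simply by starting more worker processes.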

You can view a simple example of a distributed web crawler here: Distributed Crawling with RMQ.

Last words

So, this was a quick review of scraping web pages and some of the challenges you may encounter when you go out to the world wild web.

Source: http://www.aioptify.com/crawling.php
