Crawling World Wild Web at Scale

This post covers techniques for scraping web pages with Python, including scraping static pages and handling dynamically loaded ones, and explores the design of a distributed crawler.

In this post we discuss some of the existing technologies for scraping, parsing and analyzing web pages. We also talk about some of the challenges software engineers might face while scraping dynamic web pages.

Scraping/Parsing/Mining Web Pages

In September 2012, the iPhone 5 was released. We were interested in finding out people's reactions to the new iPhone, so we wanted to write a simple Python script to scrape and parse online reviews and run a sentiment analysis on the collected reviews. There are many applications for automated opinion mining, where companies are interested in finding out their customers' reactions to new product releases.

For scraping reviews we used the Python urllib module. For parsing page contents and grabbing the required HTML elements we used a Python library called BeautifulSoup. For sentiment analysis, we found an API built on top of the Python NLTK classification library: the sentiment API offered by the text-processing website. Finally, for sending HTTP requests to the text-processing website, we used another Python library called PycURL, which is basically a Python interface to libcurl. You can pull/view the opinion mining code from its GitHub repo: Opinion Mining.
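To make the last step concrete, here is a minimal sketch of posting one review to the text-processing sentiment endpoint with PycURL; the exact endpoint path and response fields are our assumptions, so verify them against the API documentation:

          import urllib
          import pycurl
          from StringIO import StringIO

          buf = StringIO()
          c = pycurl.Curl()
          # Endpoint path is an assumption -- check the text-processing API docs.
          c.setopt(c.URL, 'http://text-processing.com/api/sentiment/')
          c.setopt(c.POSTFIELDS, urllib.urlencode({'text': 'I love my new iPhone!'}))
          c.setopt(c.WRITEFUNCTION, buf.write)   # collect the response body
          c.perform()
          c.close()
          print buf.getvalue()   # JSON with a sentiment label and probabilities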

Please note that there are other solutions for scraping the web. For instance, another open source framework for scraping websites and extracting the needed information is Scrapy.
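To give a flavor of that alternative, a minimal Scrapy spider might look like the sketch below; the spider and field names are our own, and the CSS selector mirrors the Amazon example later in this post:

          import scrapy

          class ReviewSpider(scrapy.Spider):
              name = 'reviews'
              start_urls = ['...']   #add the product page link here

              def parse(self, response):
                  #grab the text of every review container on the page
                  for review in response.css('div.reviewText::text').extract():
                      yield {'review': review}

You would run it with scrapy runspider review_spider.py -o reviews.json and let Scrapy handle request scheduling, retries, and output serialization.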

Scraping Amazon Reviews

For another research project, we were interested in scraping reviews for the George Foreman Grill from the Amazon website. To scrape this page, you can open it in your Chrome browser and use the Chrome inspector to inspect the review elements and figure out which HTML elements you need to grab when parsing the page. If you inspect one of the reviews, you will see that the review is wrapped in a 'div' element using 'reviewText' as its style class.

The code snippet for scraping and parsing Amazon reviews is shown below:


          import urllib
          from bs4 import BeautifulSoup

          amazon_url = "..." #add the link to amazon page here
          ur = urllib.urlopen(amazon_url)
          soup = BeautifulSoup(ur.read())
          posts = soup.select("div.reviewText")   #all review divs, by style class
          print posts[0].text   #this prints the first review
      

See how we grab the div elements for the reviews by filtering on their style class. You can check the Beautiful Soup documentation for more details. With the above snippet, one can successfully get all the reviews.

Scraping Macy's Reviews

We also wanted to scrape the reviews for the same product, but from the Macy's website. So, let's try the same approach as before and see what we get. The only difference is that when you inspect the Macy's page, you see that each review is wrapped in a 'span' element using the style class 'BVRRReviewText'. So, we make the following change to our snippet:


          import urllib
          from bs4 import BeautifulSoup

          macys_url = "..." #add the link to macys page here
          ur = urllib.urlopen(macys_url)
          soup = BeautifulSoup(ur.read())
          posts = soup.select("span.BVRRReviewText")
          print posts[0].text   #this should print first review in theory!
      

If you try the above code, you won't get any review content: the posts list comes back empty, so the last line fails. And more interestingly, if you print ur.read() after the second line and ignore the rest of the code, you'll see that the returned HTML contains no review text at all. Why?

The issue is that the Macy's reviews are populated by Ajax calls to their web server after the initial page load. In other words, this is not a statically loaded HTML page, so simply using urllib does not work here.

How to Scrape Dynamically Loaded Web Pages?

To resolve the above issue, one option is to figure out the POST call that Macy's makes to its web server to populate the reviews, and then replay that request yourself. The other possible solution is to use a framework/library that simulates the operation of a browser.
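A minimal sketch of the first, replay-the-request approach is shown below; the endpoint and parameter names are placeholders of our own, and the real ones can be found in the Network tab of the Chrome inspector while the reviews load:

          import urllib
          import urllib2

          # Hypothetical endpoint and parameters -- replace them with the
          # actual request captured in the Chrome inspector's Network tab.
          review_url = 'http://www1.macys.com/path/to/reviews/endpoint'
          params = urllib.urlencode({'productId': '797879', 'page': 1})
          response = urllib2.urlopen(review_url, params)   # passing data makes it a POST
          print response.read()   # raw review payload, typically JSON or HTML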

Here, we take the second route and use PhantomJS, a headless WebKit browser scriptable with a JavaScript API, to scrape the reviews from Macy's. You can download and build PhantomJS on your machine by following these instructions: How to build PhantomJS. The code below is our hack for getting the Macy's reviews using PhantomJS:


// Get Macys reviews
var page = require('webpage').create(),
    url = 'http://www1.macys.com/shop/product/george-foreman-grp95r-grill-6-servings?ID=797879';

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        // Runs inside the page context: collect every span styled as a review
        var results = page.evaluate(function () {
            var allSpans = document.getElementsByTagName('span');
            var reviews = [];
            for (var i = 0; i < allSpans.length; i++) {
                if (allSpans[i].className === 'BVRRReviewText') {
                    reviews.push(allSpans[i].innerHTML);
                }
            }
            return reviews;
        });
        console.log(results.join('\n'));
    }
    phantom.exit();
});

Note in the above code how we grab the reviews by getting the span elements with the style class 'BVRRReviewText'. Another possible solution that we have found is ghost.py, but we haven't had a chance to test it.
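Incidentally, to run the PhantomJS script above yourself, save it as, say, macys_reviews.js (the file name is just our choice) and launch it from the command line:

          phantomjs macys_reviews.js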

To check out the simple crawler/sentiment analyzer that we developed for crawling Amazon/Macy's reviews, visit our GitHub repository here: opinion-mining.

Scraping the Web at Scale

One problem that we have often come across when scraping the web at scale is the time-consuming nature of scraping: we frequently need to scrape billions of pages. This gives rise to the need for a distributed solution that reduces the total scraping time.

One simple solution that we came up with for designing a distributed web crawler was to use a work queue to distribute time-consuming tasks among multiple workers (i.e., web crawlers). The basic idea is to use work queues to schedule tasks for scraping/parsing many pages by running multiple workers simultaneously, as sketched below.
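As a rough sketch of the worker side, assuming a RabbitMQ broker on localhost and the pika client library in its 0.x-era API (the queue name and callback are our own choices):

          import pika

          connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
          channel = connection.channel()
          channel.queue_declare(queue='scrape_tasks', durable=True)

          def handle_task(ch, method, properties, body):
              print 'scraping %s' % body    # fetch and parse the page here
              ch.basic_ack(delivery_tag=method.delivery_tag)

          channel.basic_qos(prefetch_count=1)   # hand each worker one page at a time
          channel.basic_consume(handle_task, queue='scrape_tasks')
          channel.start_consuming()

A separate producer script pushes the URLs to scrape onto the scrape_tasks queue, and you scale out simply by starting more worker processes.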

You can view a simple example of a distributed web crawler here: Distributed Crawling with RMQ.

Last words

So, this was a quick review of scraping web pages and some of the challenges you may encounter when you go out to the world wild web.

Source: http://www.aioptify.com/crawling.php
