Scraping with Python Selenium and PhantomJS

In previous posts, I covered scraping using mechanize as the browser. Sometimes, though, a site uses so much Javascript to dynamically render its pages that using a tool like mechanize (which can't handle Javascript) isn't really feasible. For these cases, we have to use a browser that can run the Javascript required to generate the pages.

Overview

PhantomJS is a headless (non-GUI) browser. Selenium is a tool for automating browsers. In this post, we'll use the two together to scrape a Javascript-heavy site. First we'll navigate to the site and then, after the HTML has been dynamically generated, we'll feed it into BeautifulSoup for parsing.
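
To make that flow concrete, here's a minimal sketch (the URL is a placeholder, not the site we'll scrape below): PhantomJS renders the page, and the generated HTML in driver.page_source is what we hand to BeautifulSoup, rather than the raw response a non-Javascript client would see.

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()
driver.get('http://example.com/')  # placeholder URL
soup = BeautifulSoup(driver.page_source)  # parse the rendered DOM
print soup.title.text
driver.quit()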

First let's set up our environment by installing PhantomJS along with the Selenium bindings for Python:

$ mkdir scraper && cd scraper
$ brew install phantomjs
$ virtualenv venv
$ source venv/bin/activate
$ pip install selenium
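
With everything installed, a quick sanity check (optional, and assuming the phantomjs binary is on your PATH) confirms that the Selenium bindings can drive PhantomJS:

$ python -c "from selenium import webdriver; d = webdriver.PhantomJS(); d.get('http://example.com/'); print d.title; d.quit()"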

Now, let's look at the site we'll use for our example, the job search page for the company L-3 Klein Associates. They use the Taleo Applicant Tracking System and the pages are almost entirely generated via Javascript:

https://l3com.taleo.net/careersection/l3_ext_us/jobsearch.ftl

In this post, we'll develop a script that can scrape, and then print out, all of the jobs listed on their Applicant Tracking System.

Let's get started.

Implementation

First, let's sketch out our class, TaleoJobScraper. In the constructor, we'll instantiate a webdriver for PhantomJS. Our main method will be scrape(). It will call scrape_job_links() to iterate through the job listings, and then call driver.quit() once it's complete.

#!/usr/bin/env python

import re, urlparse

from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

link = 'https://l3com.taleo.net/careersection/l3_ext_us/jobsearch.ftl'

class TaleoJobScraper(object):
    def __init__(self):
        # Drive a headless PhantomJS browser through Selenium
        self.driver = webdriver.PhantomJS()
        self.driver.set_window_size(1120, 550)

    def scrape(self):
        # Collect every job listing, print each one, then shut the browser down
        jobs = self.scrape_job_links()
        for job in jobs:
            print job

        self.driver.quit()

if __name__ == '__main__':
    scraper = TaleoJobScraper()
    scraper.scrape()

Now let's take a look at the scrape_job_links() method, which is listed next:

def scrape_job_links(self):
    self.driver.get(link)

    jobs = []
    pageno = 2

    while True:
        # Parse the rendered HTML for the current page of results
        s = BeautifulSoup(self.driver.page_source)
        r = re.compile(r'jobdetail\.ftl\?job=\d+$')

        for a in s.findAll('a', href=r):
            tr = a.findParent('tr')
            td = tr.findAll('td')

            job = {}
            job['title'] = a.text
            job['url'] = urlparse.urljoin(link, a['href'])
            job['location'] = td[2].text
            jobs.append(job)

        # Advance to the next page of results, if there is one
        next_page_elem = self.driver.find_element_by_id('next')
        next_page_link = s.find('a', text='%d' % pageno)

        if next_page_link:
            next_page_elem.click()
            pageno += 1
            sleep(.75)
        else:
            break

    return jobs

First, we open the page with driver.get(). After get() returns, we feed the rendered HTML in driver.page_source into BeautifulSoup. Then we match against the href attribute of the job links. For each job link we extract the title, url, and location.
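
As an aside, here's what that regular expression is doing (the example hrefs below are made up, but follow Taleo's jobdetail.ftl?job=<id> pattern). BeautifulSoup applies the compiled pattern's search() to each anchor's href, so only the job detail links are returned:

import re

r = re.compile(r'jobdetail\.ftl\?job=\d+$')

print bool(r.search('jobdetail.ftl?job=12345'))  # True  - a job detail link
print bool(r.search('jobsearch.ftl?lang=en'))    # False - not a job link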

To get all of the jobs, we also need to handle pagination. There's a pager at the bottom of the job listings; a user can click a page number or the Next link to navigate through them.

We use the Next link to iterate through every page of the results by first finding the Next element using the driver's find_element_by_id method and then calling click() if we're not on the last page.

next_page_elem = self.driver.find_element_by_id('next')
next_page_link = s.find('a', text='%d' % pageno)

if next_page_link:
    next_page_elem.click()
    pageno += 1
else:
    break

To determine if we're on the last page, we search for a link whose text equals the current page number plus one. If no such link exists, we've reached the last page of results and break.
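
One caveat: the sleep(.75) after each click gives the next page of results time to render, but a fixed delay can be flaky on a slow connection. A more robust alternative (just a sketch, not part of the original script) is Selenium's explicit wait, polling until the rendered HTML actually changes after the click:

from selenium.webdriver.support.ui import WebDriverWait

# Capture the current HTML, click Next, then wait (up to 10 seconds)
# for the rendered page source to change before re-parsing it.
old_source = self.driver.page_source
next_page_elem.click()
WebDriverWait(self.driver, 10).until(lambda d: d.page_source != old_source)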

If you'd like to see a working version of the code developed in this post, it's available on GitHub here.

Shameless Plug

Have a scraping project you'd like done? I'm available for hire. Contact me with some details about your project and I'll give you a quote.
