selenium + PhantomJS mini example: Python code for crawling all movies on Douban

Latest recommended article published on 2025-09-12 15:58:40

weixin_30877755

License: CC BY-SA 4.0

Tags: python

Original link: http://www.cnblogs.com/reyinever/p/9250467.html

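This post presents a Douban movie crawler written with the Python Selenium library. The program loads the page through the PhantomJS browser driver, clicks the links that expand more content, and saves each movie's cover image URL, title, and rating to a local file.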

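#coding=utf-8
# Targets the legacy stack of this post: an older Selenium release (3.x) that
# still ships webdriver.PhantomJS and the find_element_by_* helpers.
# The __future__ import and io.open let it run on both Python 2 and Python 3.
from __future__ import print_function
import io
import time

from selenium import webdriver

def crawMovie():
    driver=webdriver.PhantomJS()
    try:
        driver.get("https://movie.douban.com/")
        movie_list=[]
        # Open the full movie list through the first "more" link on the home page.
        more_btn=driver.find_element_by_xpath('(//a[@class="more-link"])[1]')
        more_btn.click()

        while True:
            # Select only the items that come after the ones already collected.
            start_index=len(movie_list)
            xpath_str='//a[@class="item"][position()>%d]'%start_index
            item_tags=driver.find_elements_by_xpath(xpath_str)
            print("start_index:",start_index)
            print(item_tags)
            print("number:",len(item_tags))
            for item_tag in item_tags:
                img_tag=item_tag.find_element_by_tag_name('img')
                cover=img_tag.get_attribute("src")
                title=img_tag.get_attribute("alt")
                rating=item_tag.find_element_by_xpath(".//p/strong").text

                movie=u"cover:%s,title:%s,rating:%s"%(cover,title,rating)

                print(u"Movie title: "+title)
                movie_list.append(movie+u"\n")
            print("--"*20)
            # A non-empty inline style on the "load more" link is treated as the
            # end-of-list signal (presumably the link is hidden at that point).
            load_more_btn=driver.find_element_by_xpath('//a[@class="more"]')
            if load_more_btn.get_attribute("style"):
                break
            load_more_btn.click()
            # Crude pause so the next batch can render; not in the original script.
            time.sleep(2)
    finally:
        # Always shut down the PhantomJS process, even if the crawl fails.
        driver.quit()

    # The file is written as GBK text, matching the original encode("gbk") call.
    with io.open("e:\\movie_list.txt","w",encoding="gbk") as fp:
        fp.writelines(movie_list)

if __name__=="__main__":
    crawMovie()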
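
Newer Selenium releases (4.x) no longer ship the PhantomJS driver or the find_element_by_* helpers, so the script above only runs on an older install. The following is a rough Python 3 sketch of the same loop with the `find_element(By.XPATH, ...)` style and headless Chrome, not a tested replacement. The XPath selectors are copied from the script above and may no longer match the live site. The names `item_xpath`, `format_movie`, `crawl_movies`, `save_movies`, the `pause` parameter, and the UTF-8 default are assumptions introduced here for illustration, and a local Chrome install with a matching driver is assumed.

```python
import time


def item_xpath(start_index):
    """XPath for the movie links that come after the first `start_index` ones."""
    return '//a[@class="item"][position()>%d]' % start_index


def format_movie(cover, title, rating):
    """Build one output line in the same layout as the original script."""
    return "cover:%s,title:%s,rating:%s\n" % (cover, title, rating)


def crawl_movies(pause=2.0):
    """Collect one formatted line per movie; `pause` is a hypothetical knob."""
    # Imported here so the helpers above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumes a recent Chrome
    driver = webdriver.Chrome(options=options)
    movies = []
    try:
        driver.get("https://movie.douban.com/")
        # Selectors below are taken from the original script, not re-verified.
        driver.find_element(By.XPATH, '(//a[@class="more-link"])[1]').click()
        while True:
            # Fetch only the items added since the previous pass.
            for item in driver.find_elements(By.XPATH, item_xpath(len(movies))):
                img = item.find_element(By.TAG_NAME, "img")
                rating = item.find_element(By.XPATH, ".//p/strong").text
                movies.append(format_movie(
                    img.get_attribute("src"), img.get_attribute("alt"), rating))
            more = driver.find_element(By.XPATH, '//a[@class="more"]')
            # Same stop condition as the original: an inline style is set.
            if more.get_attribute("style"):
                break
            more.click()
            time.sleep(pause)  # crude wait for the next batch to render
    finally:
        driver.quit()
    return movies


def save_movies(path, movies, encoding="utf-8"):
    """Write the collected lines; UTF-8 is an assumption (the original used GBK)."""
    with open(path, "w", encoding=encoding) as fp:
        fp.writelines(movies)
```

Calling `save_movies("movie_list.txt", crawl_movies())` would run the crawl and write the result. Keeping the XPath and line formatting in small pure functions means those parts can be checked without launching a browser.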