Web Scraping Notes 6: Scraping the Douban Top 250 Books with XPath (Saving to CSV)

This post walks through a simple Python crawler that scrapes the Douban Books Top 250 list, collects each book's title, author, publisher and other details, and saves the data to a CSV file.


From the Douban Books Top 250 pages we scrape, for every book, the title, author, publisher, publication date, price, star rating and number of reviews. The code is simple and follows the usual principle: grab the big container first, then the small fields inside it, and find the repeating element to loop over. A short sketch of that pattern is shown below, followed by the full script.
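To make the "big container first, then small fields, then find the loop point" idea concrete, here is a minimal, self-contained sketch. The HTML fragment and the field names in it are invented purely for illustration and are not part of the Douban page:

```python
from lxml import etree

# Made-up fragment: each <tr class="item"> is the "big" repeating unit,
# and the attributes inside it are the "small" fields we pull out per row.
sample = '''
<table>
  <tr class="item"><td><a title="书A" href="https://example.com/a">书A</a></td></tr>
  <tr class="item"><td><a title="书B" href="https://example.com/b">书B</a></td></tr>
</table>
'''

tree = etree.HTML(sample)
for row in tree.xpath('//tr[@class="item"]'):   # grab the big container first
    title = row.xpath('td/a/@title')[0]          # then the small fields, relative to the row
    link = row.xpath('td/a/@href')[0]
    print(title, link)
```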


```python
from lxml import etree
import requests
import csv

# Pretend to be a normal browser so Douban does not reject the requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'
}

# utf-8-sig writes a BOM so the Chinese text opens cleanly in Excel
fp = open('douban.csv', 'w', newline='', encoding='utf-8-sig')
writer = csv.writer(fp)
writer.writerow(('书名', '链接', '作者', '出版社', '出版日期', '价格', '星级', '评论数'))

# The list is paginated 25 books per page: start=0, 25, 50, ..., 225
urls = ['https://book.douban.com/top250?start={}'.format(i) for i in range(0, 250, 25)]
for url in urls:
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    # Grab the big container first: each book sits in a <tr class="item">
    infos = selector.xpath('//tr[@class="item"]')
    for info in infos:
        # Then pull the small fields with XPath relative to the row
        name = info.xpath('td/div/a/@title')[0]
        link = info.xpath('td/div/a/@href')[0]
        book_infos = info.xpath('td/p/text()')[0]
        # The info line has the form "作者 / 出版社 / 出版日期 / 价格"
        author = book_infos.split('/')[0]
        publisher = book_infos.split('/')[-3]
        date = book_infos.split('/')[-2]
        price = book_infos.split('/')[-1]
        rate = info.xpath('td/div/span[2]/text()')[0]
        comments = info.xpath('td/div/span[3]/text()')
        # Fall back to a placeholder when the review-count span is missing
        comment = comments[0] if len(comments) != 0 else "空"
        writer.writerow((name, link, author, publisher, date, price, rate, comment))
fp.close()
```
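The loop above fires all ten page requests back to back and assumes every one succeeds. If Douban starts refusing connections or rate-limiting, the fetch step can be wrapped with a timeout, a couple of retries and a short pause. The helper below is only a sketch; the retry count and delay are arbitrary illustrative values, not part of the original script:

```python
import time
import requests

def fetch(url, headers, retries=3, delay=2):
    """Fetch a page with a timeout, a few retries and a polite pause between tries.
    The retries/delay defaults are illustrative, not tuned values."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException as exc:
            print('attempt {} failed for {}: {}'.format(attempt + 1, url, exc))
            time.sleep(delay)   # wait a moment before retrying
    return None
```

In the main loop, `etree.HTML(html.text)` would then become `etree.HTML(fetch(url, headers))`, skipping the page whenever the helper returns `None`.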

The results are saved to douban.csv. Because the file is written with utf-8-sig encoding, the Chinese text does not come out garbled, and part of the result can be inspected as shown below.
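As a quick spot-check, the file can be read back with the csv module. This is a minimal sketch that assumes the douban.csv produced by the script above:

```python
import csv

# Re-open the file with the same encoding to confirm the header and a few rows
with open('douban.csv', encoding='utf-8-sig', newline='') as f:
    reader = csv.reader(f)
    print(next(reader))        # header row
    for i, row in enumerate(reader):
        print(row)
        if i == 4:             # show only the first five books
            break
```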


### Scraping Douban Top 250 movie data with Python and saving it to a CSV file

To achieve this, a complete workflow can be put together as follows.

#### Set up the environment and install dependencies

Make sure the required Python packages `requests` and `lxml` are available; they handle the HTTP requests and the HTML parsing, while the standard-library `csv` module takes care of writing the CSV file. The third-party packages can be installed with pip:

```bash
pip install requests lxml
```

#### Write the crawler logic

The following simplified example shows how a Python program can visit the pages, extract the required information with XPath, and store it in a CSV file on the local disk.

```python
import csv
import requests
from lxml import etree


def fetch_page(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    }
    response = requests.get(url, headers=headers)
    return response.text if response.status_code == 200 else None


def parse_html(html_content):
    tree = etree.HTML(html_content)
    items = []
    titles_cn = tree.xpath('//*[@id="content"]/div/div[1]/ol/li//span[@class="title"][1]/text()')
    titles_en = tree.xpath('//*[@id="content"]/div/div[1]/ol/li//span[@class="other"]/text()')
    links = tree.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[1]/a/@href')
    directors_and_casts = tree.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/p/text()[1]')
    release_years = tree.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/div/span[2]/text()')
    countries_genres_ratings_votes = tree.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/p')

    for i in range(len(titles_cn)):
        item = {}
        # Clean up the raw text nodes
        director_actor_text = ''.join(directors_and_casts[i].split())
        country_genre_rating_vote_info = ''.join(
            countries_genres_ratings_votes[i].xpath('string(.)').split())

        item['movie_title_zh'] = titles_cn[i]
        item['movie_title_en'] = titles_en[i][3:] if len(titles_en) > i and titles_en[i][:3] == '/' else ''
        item['detail_link'] = links[i]

        parts = director_actor_text.strip().split('/')
        item['director'] = parts[0].strip()
        item['actors'] = '/'.join(parts[1:]) if len(parts) > 1 else ''

        info_parts = country_genre_rating_vote_info.strip().replace('\n', '').split('<br/>')[0].split('/')
        item['year'] = int(release_years[i][1:-1]) if '(' in release_years[i] else ''
        item['country'] = info_parts[-2].strip() if len(info_parts) >= 2 else ''
        item['genre'] = info_parts[-1].strip()

        rating_element = tree.xpath(f'//*[@id="content"]/div/div[1]/ol/li[{i + 1}]/div/div[2]/div[2]/div/span[2]')
        vote_count_element = tree.xpath(f'//*[@id="content"]/div/div[1]/ol/li[{i + 1}]/div/div[2]/div[2]/div/span[4]')
        item['rating'] = float(rating_element[0].text) if rating_element else ''
        item['vote_count'] = int(vote_count_element[0].text[:-3]) if vote_count_element else ''

        items.append(item)
    return items


def save_to_csv(data_list, filename='./douban_top250.csv'):
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        fieldnames = ['电影中文名', '电影英文名', '电影详情页链接', '导演', '演员',
                      '上映年份', '国籍', '类型', '评分', '评分人数']
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in data_list:
            try:
                writer.writerow({
                    '电影中文名': row['movie_title_zh'],
                    '电影英文名': row['movie_title_en'],
                    '电影详情页链接': row['detail_link'],
                    '导演': row['director'],
                    '演员': row['actors'],
                    '上映年份': row['year'],
                    '国籍': row['country'],
                    '类型': row['genre'],
                    '评分': row['rating'],
                    '评分人数': row['vote_count']
                })
            except Exception as e:
                print(e)


if __name__ == '__main__':
    base_url = 'https://movie.douban.com/top250?start={}&filter='
    all_items = []
    for start_num in range(0, 250, 25):
        url = base_url.format(start_num)
        html = fetch_page(url)
        if html is None:   # skip pages that failed to download
            continue
        page_data = parse_html(html)
        all_items.extend(page_data)
    save_to_csv(all_items)
```

This code downloads the pages at the given URLs, uses XPath expressions to locate the relevant nodes and extract the useful fields, and finally writes the cleaned-up records row by row into the CSV file.