Straight to the code.
- This project was mainly about learning the RFM model: R (Recency), F (Frequency), M (Monetary); a minimal scoring sketch follows this list. I first set my sights on shopping sites like Taobao and JD.com, but my skills were not up to it, so I settled for trying to crawl Dangdang instead.
- After writing this crawler from a reference, I found that the data it retrieved was far from ideal, so I dropped the idea.
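
Before the crawler, here is a minimal sketch of how RFM scores could be computed with pandas, just to make the model concrete. The orders table and its columns (customer_id, order_date, amount) are entirely hypothetical; a product-search crawler like the one below never sees per-customer transactions.

import pandas as pd

# Made-up order records, purely for illustration.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime([
        "2023-01-05", "2023-03-20", "2023-02-14",
        "2023-01-01", "2023-02-01", "2023-03-25",
    ]),
    "amount": [120.0, 80.0, 300.0, 50.0, 60.0, 70.0],
})

# Reference date: one day after the latest order in the data.
snapshot = orders["order_date"].max() + pd.Timedelta(days=1)

rfm = orders.groupby("customer_id").agg(
    recency=("order_date", lambda d: (snapshot - d.max()).days),  # R: days since last purchase
    frequency=("order_date", "count"),                            # F: number of purchases
    monetary=("amount", "sum"),                                   # M: total spend
)

# Score each dimension from 1 to 3; a smaller recency is better, so its labels are reversed.
rfm["R"] = pd.qcut(rfm["recency"], 3, labels=[3, 2, 1]).astype(int)
rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 3, labels=[1, 2, 3]).astype(int)
rfm["M"] = pd.qcut(rfm["monetary"].rank(method="first"), 3, labels=[1, 2, 3]).astype(int)
print(rfm)

Now, the crawler itself.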
import requests
from lxml import etree
import pandas as pd

# Take a quick look at the raw HTML of a Dangdang search-result page for the keyword "数据分析".
test_url = 'http://search.dangdang.com/?key=' + '数据分析'
content_page = requests.get(test_url).text
print(content_page[:1000])

def content(content_page):
    """Parse one search-result page into a list of [title, publisher info, price, stars, comment count]."""
    books = []
    page = etree.HTML(content_page)
    # Book title: the title attribute of the item link.
    book_name = page.xpath('//li/p/a[@name="itemlist-title"]/@title')
    # Author / publisher / publication date sit in one <p>; take its full text content.
    pub_info = page.xpath('//li/p[@class="search_book_author"]')
    pub_info = [book_pub.xpath('string(.)') for book_pub in pub_info]
    # Current selling price.
    price_now = page.xpath('//li//span[@class="search_now_price"]/text()')
    # Star rating: extracted as the inline style attribute of the black-star span.
    stars = page.xpath('//li/p[@class="search_star_line"]/span[@class="search_star_black"]/span/@style')
    # Comment count: the text of the comment link.
    comment_num = page.xpath('//li/p[@class="search_star_line"]/a[@class="search_comment_num"]/text()')
    for book in zip(book_name, pub_info, price_now, stars, comment_num):
        books.append(list(book))
    return books

# Fetch the search page again and parse it into structured records.
test_url = 'http://search.dangdang.com/?key=' + '数据分析'
content_page = requests.get(test_url).text
books = content(content_page)
print(books[:5])

# Collect the results into a DataFrame and save them as CSV and Excel.
books_df = pd.DataFrame(data=books, columns=["书名", "出版信息", "当前价格", "星级", "评论数"])
books_df.to_csv("./2.csv", encoding="utf8", sep="\t", index=False)
books_df.to_excel("./2.xlsx", sheet_name="sheet1", index=False)
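
Part of why the data felt unusable is that the 星级 and 评论数 columns come back as raw strings (a CSS style attribute and the text of the comment link). A rough cleaning sketch, assuming the style string looks like "width: 90%;" and the comment text begins with a number, might be:

def clean_books(df):
    out = df.copy()
    # Assumption: 星级 is a style string such as "width: 90%;"; map the percentage to a 0-5 scale.
    out["星级"] = out["星级"].str.extract(r"(\d+(?:\.\d+)?)%")[0].astype(float) / 100 * 5
    # Assumption: 评论数 contains a leading number, e.g. "1234条评论".
    out["评论数"] = out["评论数"].str.extract(r"(\d+)")[0].astype(float)
    # 当前价格 usually carries a currency symbol; keep just the number.
    out["当前价格"] = out["当前价格"].str.extract(r"(\d+(?:\.\d+)?)")[0].astype(float)
    return out

print(clean_books(books_df).head())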
