Week 9 - Lesson 3
A small but complete crawler project
-
Task generator
The component that generates crawl tasks. Its biggest value is establishing a producer-consumer model: decoupling the producer from the consumer makes it possible to pause and restart the program.
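The producer-consumer split described above can be sketched with the standard library's queue module (a minimal illustration; the function names are hypothetical and not part of this project):

```python
from queue import Queue

def produce(keywords, task_queue):
    # Producer: turn each search keyword into a pending task
    for kw in keywords:
        task_queue.put(kw)

def consume(task_queue):
    # Consumer: drain pending tasks. Because the queue decouples the two
    # sides, the consumer can stop and later pick up where it left off.
    done = []
    while not task_queue.empty():
        done.append(task_queue.get())
    return done

q = Queue()
produce(["鼠标", "键盘"], q)
print(consume(q))  # → ['鼠标', '键盘']
```

In a real pause/resume setup the pending tasks would be persisted (e.g. in Redis or a file) rather than held in an in-memory queue; this only shows the decoupling itself.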
-
Configuration file
Holds the basic configuration for the current crawler project. The goal is to centralize configuration and avoid editing the same values in multiple places.
-
Main function / scheduler
Coordinates the components through control flow to complete the crawl; it carries a degree of scheduling responsibility.
-
Downloader
The component that talks to the target server and fetches the data.
-
Parser
Parses unstructured page content to extract the desired data.
-
Storage
Persists the parsed data
- Database
- Local files: JSON is the recommended format; strictly structured data can be saved as CSV
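Both file options can be sketched with the standard library (the field names mirror the jd_search columns used later; the sample row is made up):

```python
import csv
import json

fields = ["sku_id", "img", "price", "title", "shop", "icons"]
items = [("100012345", "img.jpg", "199.00", "Wireless mouse", "Some shop", "[]")]

# JSON: flexible and human-readable; preserves nesting (e.g. the icons list)
with open("jd_search.json", "w", encoding="utf-8") as f:
    json.dump([dict(zip(fields, row)) for row in items], f,
              ensure_ascii=False, indent=2)

# CSV: compact; fits data where every record has the same rigid columns
with open("jd_search.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(fields)   # header row
    writer.writerows(items)
```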
Project layout
main.py code:
# -*- coding: UTF-8 -*-
"""
@File :main.py
@Author :Super
@Date :2021/2/28
@Desc :
"""
import random
import pymysql
import requests
from parsers.search import parse_jd_item
from settings import MYSQL_CONF, HEADERS
def save(item_array):
    """
    Persist the scraped results
    :param item_array:
    :return:
    """
    cursor = mysql_con.cursor()
    SQL = """INSERT INTO jd_search(sku_id, img, price, title, shop, icons)
             VALUES (%s, %s, %s, %s, %s, %s)"""
    cursor.executemany(SQL, item_array)
    mysql_con.commit()
    cursor.close()
def downloader(task):
    """
    Request the target URL
    :param task:
    :return:
    """
    url = "https://search.jd.com/Search"
    params = {
        "keyword": task
    }
    res = requests.get(url, params=params, headers=HEADERS)
    return res
def main(task_array):
    """
    Schedule the crawl tasks
    :param task_array:
    :return:
    """
    for task in task_array:
        result = downloader(task)
        item_array = parse_jd_item(result.text)
        print(item_array)
        save(item_array)
if __name__ == '__main__':
    mysql_con = pymysql.connect(**MYSQL_CONF)
    task_array = ["鼠标", "键盘", "显卡", "耳机"]
    main(task_array)
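The save() function above assumes a jd_search table already exists in MySQL. The executemany pattern it uses can be tried locally with the stdlib sqlite3 driver (a sketch only; the real project uses pymysql, and SQLite uses ? placeholders instead of %s):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Same columns as the INSERT in save(); types are simplified for the sketch
con.execute("""CREATE TABLE jd_search(
    sku_id TEXT, img TEXT, price TEXT, title TEXT, shop TEXT, icons TEXT)""")

rows = [
    ("100011", "a.jpg", "99.00", "Mouse A", "Shop A", "[]"),
    ("100012", "b.jpg", "199.00", "Mouse B", "Shop B", "[]"),
]
# executemany runs the statement once per tuple, all in a single call
con.executemany("INSERT INTO jd_search VALUES (?, ?, ?, ?, ?, ?)", rows)
con.commit()
print(con.execute("SELECT COUNT(*) FROM jd_search").fetchone()[0])  # → 2
```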
settings.py code:
# -*- coding: UTF-8 -*-
"""
@File :settings.py
@Author :Super
@Date :2021/2/28
@Desc :Basic configuration file for the current crawler project; centralizes settings to avoid repeated edits
"""
HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36",
    "upgrade-insecure-requests": "1"
}
MYSQL_CONF = {
    "host": "127.0.0.1",
    "user": "root",
    "password": "123456",
    "db": "world"
}
parsers/search.py code:
# -*- coding: UTF-8 -*-
"""
@File :search.py
@Author :Super
@Date :2021/2/28
@Desc :
"""
import json
from bs4 import BeautifulSoup
def parse_jd_item(html):
    result = []
    soup = BeautifulSoup(html, "lxml")
    item_array = soup.select("ul[class='gl-warp clearfix'] li[class='gl-item']")
    for item in item_array:
        sku_id = item.attrs['data-sku']
        img = item.select("img[data-img='1']")
        price = item.select("div[class='p-price']")
        title = item.select("div[class='p-name p-name-type-2']")
        shop = item.select("div[class='p-shop']")
        icons = item.select("div[class='p-icons']")
        # Each select() returns a list; guard against missing nodes before indexing
        img = img[0].attrs['data-lazy-img'] if img else ""
        price = price[0].strong.i.text if price else ""
        title = title[0].text.strip() if title else ""
        shop = shop[0].span.a.attrs['title'] if shop and shop[0].text.strip() else ""
        icons = json.dumps([tag_ele.text for tag_ele in icons[0].select('i')]) if icons else '[]'
        result.append((sku_id, img, price, title, shop, icons))
    return result
if __name__ == '__main__':
    with open("../test/search_jd.html", "r", encoding="utf-8") as f:
        html = f.read()
    result = parse_jd_item(html)
    print(result)
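To see the selector pattern used by parse_jd_item in isolation, here is a tiny made-up HTML fragment run through the same CSS attribute selectors (using the built-in html.parser backend so lxml is not required):

```python
from bs4 import BeautifulSoup

html = """
<ul class="gl-warp clearfix">
  <li class="gl-item" data-sku="100001">
    <div class="p-price"><strong><i>199.00</i></strong></div>
    <div class="p-name p-name-type-2">Wireless Mouse</div>
  </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
for item in soup.select("ul[class='gl-warp clearfix'] li[class='gl-item']"):
    sku_id = item.attrs["data-sku"]
    price = item.select("div[class='p-price']")
    # select() returns a list, so guard before indexing, as in the parser above
    price = price[0].strong.i.text if price else ""
    print(sku_id, price)  # 100001 199.00
```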