Python Web Scraping in Practice: Scraping People's Daily Data into MySQL

Article link: https://blog.youkuaiyun.com/stevenyu1986/article/details/102689768

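1. Purpose of this crawler:
After studying Python and SQL for a while, I wrote this crawler mainly to practice scraping a static website and working with the scraped data in a database.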
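2. Analysis of the target website:
The People's Daily website is a static site: navigating between pages is done by changing the URL, and all of a page's data is already present in the HTML when it loads. We only need to extract the relevant data from the HTML; no dynamic loading techniques such as Ajax are involved.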
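Because each issue is reached through a URL that depends only on its date, the address of a given day's front page can be computed rather than discovered. The following is a minimal sketch, assuming the URL pattern used in this article's code still holds; the helper names `build_front_page_url` and `iter_front_page_urls` are hypothetical and not part of the original code.

```python
from datetime import date, timedelta

BASE_URL = 'http://paper.people.com.cn/rmrb/html/'


def build_front_page_url(day):
    """Return the front-page URL for the given date (hypothetical helper)."""
    # Pattern taken from the article's code:
    # <base>YYYY-MM/DD/nbs.D110000renmrb_01.htm
    return f"{BASE_URL}{day:%Y-%m}/{day:%d}/nbs.D110000renmrb_01.htm"


def iter_front_page_urls(start, end):
    """Yield one front-page URL per day from start to end, inclusive."""
    current = start
    while current <= end:
        yield build_front_page_url(current)
        current += timedelta(days=1)


if __name__ == "__main__":
    for url in iter_front_page_urls(date(2019, 10, 1), date(2019, 10, 3)):
        print(url)
```

Formatting the date with `%Y-%m` and `%d` keeps the month and day zero-padded, which avoids assembling the string pieces by hand.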
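3. Main libraries used:
Based on the analysis above, the crawler uses the third-party libraries requests, lxml and pymysql, plus datetime from the Python standard library. requests sends the HTTP requests and receives the responses, lxml parses the HTML content with XPath, and pymysql writes the scraped data to the MySQL database.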
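To illustrate the parsing step in isolation, here is a small sketch of extracting links with lxml and XPath. The HTML fragment, the `pageList` id, the file names and the helper `extract_links` are all hypothetical stand-ins; the real site's markup is not shown here and may differ.

```python
from lxml import etree

# Hypothetical HTML fragment standing in for a downloaded page;
# the structure of the real People's Daily pages may differ.
SAMPLE_HTML = """
<html><body>
  <div id="pageList">
    <a href="page_01.htm">Page 01</a>
    <a href="page_02.htm">Page 02</a>
  </div>
</body></html>
"""


def extract_links(html_text):
    """Parse HTML text and return (href, label) pairs for links in #pageList."""
    tree = etree.HTML(html_text)
    # Select every <a> element that is a direct child of the div with id "pageList".
    anchors = tree.xpath('//div[@id="pageList"]/a')
    return [(a.get('href'), a.text) for a in anchors]


if __name__ == "__main__":
    for href, label in extract_links(SAMPLE_HTML):
        print(href, label)
```

The same two calls, `etree.HTML` to build the tree and `.xpath` to query it, are what the crawler relies on once requests has downloaded the page text.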
4. Code:

import requests
from lxml import etree
import pymysql
from datetime import datetime,timedelta
import time

def download_people_daily(year, month, day):
    
    # Fetch the target page and return its HTML parsed into an lxml element tree.
    def get_html_text(url):
        headers = {
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
        }
        response = requests.get(url, headers=headers)
        # Raise an exception for 4xx/5xx responses instead of parsing an error page.
        response.raise_for_status()
        response.encoding = "utf-8"
        return etree.HTML(response.text)

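    # year, month and day must be strings (they are concatenated into the URL), e.g. '2019', '10', '22'.
    url = 'http://paper.people.com.cn/rmrb/html/' + year + '-' + month + '/' + day + '/nbs.D110000renmrb_01.htm'

    # Scrape the links to each page of the day's paper, collect them in a list, and return it: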
    def get_page_link(year,month,day):    
        selector1=get_html_text(url)
        temp_pagelink=selector1