Learning Web Scraping with Python (Part 2)

Article link: https://blog.youkuaiyun.com/zou_gr/article/details/107224235

Overview

This chapter covers how to store content scraped with Python in a JSON file and in a MySQL database.

Storing as a JSON file

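We use BeautifulSoup to parse the page, scrape the text and target URLs of the top ten entries in the Sogou WeChat hot list, and save them as a JSON file. Saving JSON from Python requires the json library. Because the code opens the file with a relative path, it is saved in the current working directory.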

import json

import requests
from bs4 import BeautifulSoup

rqq = requests.get('https://weixin.sogou.com/')  # send the HTTP request
soup = BeautifulSoup(rqq.content, 'lxml')  # parse the HTML (requires the lxml package)
soup.select('#topwords')  # inspect the element whose id attribute is "topwords"
dat = soup.select('.hot-news > li > a')  # the links in the hot list
# [i.text for i in dat]      # extract the link text
# [i['title'] for i in dat]  # extract the title attribute

names = [i.text for i in dat]  # entry text
href = [i['href'] for i in dat]  # target URLs
print(names, href)

# ensure_ascii=False writes Chinese characters as-is, so open the file as UTF-8
with open('./temp.json', 'w', encoding='utf-8') as f:
    json.dump({'names': names, 'href': href}, f, ensure_ascii=False)
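
To see what the selector '.hot-news > li > a' matches without touching the network, here is a minimal sketch on a hand-written HTML fragment. The fragment, the example.com URLs, and the variable names are made up for illustration and do not reflect the real structure of the Sogou page. It uses the built-in 'html.parser' so that lxml is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real page markup may differ.
sample_html = """
<ul class="hot-news">
  <li><a href="https://example.com/1" title="First topic">First topic</a></li>
  <li><span><a href="https://example.com/nested">Nested link</a></span></li>
  <li><a href="https://example.com/2">Second topic</a></li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# '>' is the child combinator: only <a> tags that are direct children of
# an <li> that is itself a direct child of .hot-news are selected.
links = sample_soup.select('.hot-news > li > a')

sample_names = [a.text for a in links]
sample_href = [a['href'] for a in links]

# a['title'] raises KeyError when the attribute is missing;
# a.get('title') returns None instead.
sample_titles = [a.get('title') for a in links]

print(sample_names, sample_href, sample_titles)
```

The nested link is skipped because its parent is a span rather than an li. When an attribute might be absent, .get() is the safer choice for a scraper that should not stop on one malformed entry.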

The resulting temp.json opens without problems in Notepad or Notepad++. Its content is a single JSON object, which looks just like a Python dictionary.
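
As a quick check that the saved file really maps back to a dictionary, the sketch below writes sample data and reads it back with json.load. The sample entries and the temporary directory are assumptions for illustration, not scraped results.

```python
import json
import os
import tempfile

# Made-up sample data standing in for the scraped lists.
data = {
    'names': ['热点一', '热点二'],
    'href': ['https://example.com/1', 'https://example.com/2'],
}

# Write to a temporary directory so the sketch leaves the working directory alone.
path = os.path.join(tempfile.mkdtemp(), 'temp.json')

with open(path, 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)

# Read the raw text, then parse it back into Python objects.
with open(path, encoding='utf-8') as f:
    raw = f.read()
loaded = json.loads(raw)

# With the default ensure_ascii=True, non-ASCII text is written as \uXXXX escapes.
escaped = json.dumps('热点一')

print(type(loaded).__name__, loaded == data)
```

With ensure_ascii=False the Chinese text stays readable in an editor, which is why the file should be opened with an explicit UTF-8 encoding on both the write and the read side.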

Storing in a MySQL database

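To store the scraped data in a database, first convert it to a DataFrame, then create a database connection with SQLAlchemy's create_engine function, using pymysql as the driver. The connection string passed to create_engine has the form:
dialect+driver://username:password@host:port/database?charset=encoding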
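
The connection string can be assembled piece by piece. The sketch below uses a hypothetical helper, build_mysql_url, which is not part of SQLAlchemy or pymysql. It URL-encodes the user name and password, because characters such as '@', ':' or '/' in a password would otherwise be misread as URL delimiters. The credentials are made up, and the default port 3306 is the standard MySQL port.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, database, port=3306, charset='utf8'):
    """Hypothetical helper: build a mysql+pymysql connection string."""
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}?charset={charset}"
    )


# Made-up credentials; the password contains characters that need escaping.
url = build_mysql_url('root', 'p@ss:word/1', 'localhost', 'test')
print(url)
```

The resulting string can be passed to create_engine in place of a hand-written one.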
The data is written with the pandas to_sql method:
DataFrame.to_sql(name, con, schema=None, if_exists='fail', index=True, index_label=None, dtype=None)

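import pandas as pd
from sqlalchemy import create_engine

# MySQL through the pymysql driver: user root, password 123, database test
con = create_engine('mysql+pymysql://root:123@localhost/test?charset=utf8')
# the test database must be created beforehand; to_sql creates the temp29 table
pd.DataFrame({'names': names, 'href': href}).to_sql('temp29', con)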
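
The if_exists parameter matters as soon as the script runs more than once. The sketch below shows its effect. It swaps in an in-memory SQLite database through the standard sqlite3 module so that it runs without a MySQL server, which is an assumption for illustration and not the setup used above. The sample rows are made up.

```python
import sqlite3

import pandas as pd

# Made-up sample rows standing in for the scraped data.
df = pd.DataFrame({
    'names': ['topic A', 'topic B'],
    'href': ['https://example.com/a', 'https://example.com/b'],
})

# In-memory SQLite stands in for MySQL here (assumption for illustration).
conn = sqlite3.connect(':memory:')

# First write creates the table; index=False leaves out the DataFrame index column.
df.to_sql('temp29', conn, index=False)

# A second write with the default if_exists='fail' raises ValueError.
try:
    df.to_sql('temp29', conn, index=False)
    second_write_failed = False
except ValueError:
    second_write_failed = True

# if_exists='append' adds rows to the existing table instead.
df.to_sql('temp29', conn, index=False, if_exists='append')

row_count = conn.execute('SELECT COUNT(*) FROM temp29').fetchone()[0]
print(second_write_failed, row_count)
conn.close()
```

For a scraper that runs on a schedule, 'append' accumulates a history of results, while 'replace' keeps only the latest run.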