昨天模拟了csdn的登录和博客的抓取,今天通过python将抓取的博客标题和链接加载到数据库,具体如下
1. 测试数据库的连通性, 详细代码如下
def testDBConnection(self):
"""test db connection"""
db = pymysql.connect(host="localhost",user="root",password="xxxx",db="python", port="3306", charset="utf8" )
cursor = db.cursor()
cursor.execute("SELECT VERSION()")
data = cursor.fetchone()
print ("Database version : %s " % data)
db.close()
2. 数据库能够连接后,加载数据,只是简单的insert, 这里需要注意的地方是中文入库问题,要在db连接加上charset="utf8",创建数据库的时候要指定字符集为utf8, 创建表的时候也要指定字符集为utf8
简单的创建数据库和建表语句
CREATE DATABASE `python` CHARACTER SET utf8 COLLATE utf8_general_ci;
create table articles(id integer primary key auto_increment,
title varchar(1000) not null,
url varchar(1000) not null,
comments int,
thumbs int
)ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1;
实现数据加载的db操作如下:
def insertArticles(self, title, url, comments, thumbs):
"""store new article to database"""
db = pymysql.connect(host="localhost",user="root",password="xxxx",db="python", port=3306, charset="utf8" )
sqlStr = 'insert into articles (title, url, comments, thumbs) values("%s", "%s", "%d", "%d")' % (title, url, comments, thumbs)
print(sqlStr)
try:
cursor = db.cursor()
cursor.execute("SET NAMES utf8")
cursor.execute(sqlStr)
db.commit()
db.close()
print("insert successfully!!")
except:
print("insert failed!!")
db.rollback()
db.close()
3. 在昨天的抓取模块中导入db,从而调用db中的函数
from DBHelper import DBHelper
def readArticles(self):
"""Get article"""
blog_count, page_count = self._get_blog_count()
for index in range(1, page_count + 1):
url = 'http://write.blog.youkuaiyun.com/postlist/0/0/enabled/' + str(index)
print(url)
response = self._session.get(url)
page = BeautifulSoup(response.text, 'lxml')
links = page.find_all('a', href=re.compile(r'http://blog.youkuaiyun.com/flsmgf/article/details/(\d+)'))
for link in links:
blog_name = link.string
blog_url = link['href']
#print(blog_name +","+ blog_url)
db = DBHelper()
db.insertArticles(blog_name, blog_url, 1, 1)
4. 查询数据库结果如图: