The previous posts covered the crawler and the page-parsing module; this one starts with the storage module:
Per my advisor's requirement, storage uses a NoSQL database: MongoDB. Fortunately it is fairly easy to pick up; after installing it and studying it briefly, I was able to write the storage program right away. All novel data goes into a single table (called a collection in MongoDB), with one record (called a document) per novel, and the individual fields stored inside each document. The program is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
File: save_for17k.py
Author: civis
Date: 2015/01/22 20:09:49
'''
import os
import pymongo

def save(domain="17k"):
    """Save the parsed novel info files for $domain into MongoDB."""
    # open a connection to MongoDB
    count = 0
    conn = pymongo.Connection(host='localhost', port=27017)
    db = conn.novel_pool_db        # database
    novel_pool = db.novel_pool     # novel_pool collection
    # handle each info file in the domain directory
    path = 'novel/' + domain
    files = []
    if os.path.isdir(path):
        files = os.listdir(path)
    else:
        print "*** Error: invalid path: %s ***" % path
        return
    for fn in files:
        doc = {}
        doc['introduction'] = ""
        try:
            fl = open(path + '/' + fn, 'r')
            all = fl.readlines()   # every line of the info file
            fl.close()
        except IOError, e:
            print "*** Error: failed to read file: %s %s ***" % (fn, str(e))
            continue
        if len(all) < 10:
            print "*** Bad novel introduction, skip: %s" % fn
            continue
        doc['title'] = all[0].split(':')[1].strip()
        doc['author'] = all[1].split(':')[1].strip()
        doc['tags'] = all[-1].split(':')[1].strip()
        doc['score'] = all[-2].split(':')[1].strip()
        doc['links'] = all[-3].split(':')[1].strip()
        doc['wordcount'] = all[-4].split(':')[1].strip()
        doc['image'] = all[-5].split(':')[1].strip()
        doc['category'] = all[-6].split(':')[1].strip()
        try:
            # store the score as a number so sorting and summing work later
            doc['score'] = int(doc['score'])
        except ValueError:
            pass
        for i in range(3, len(all) - 6):
            doc['introduction'] += all[i]
        try:
            novel_pool.insert(doc)
            count += 1
            print '( ', count, ' )'
            print "Successfully inserted: %s" % doc['title']
        except Exception, e:
            print "*** Error: insert failed! " + str(e)

if __name__ == '__main__':
    domain = "17k"
    save(domain)
The program uses the pymongo library to talk to MongoDB. Performance is acceptable, but opening so many small files still slows things down. It really feels like the page-parsing module should dump the parsed novel info as JSON into one large file instead.
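A rough sketch of that idea (the file name novel/17k.jsonl and the helper names below are my own assumptions, not part of the current code): the parser appends each novel's dictionary as one JSON line to a single file, and the storage step then reads that file once and bulk-inserts all the records:

import json
import pymongo

def dump_novel(info, path='novel/17k.jsonl'):
    """Called by the parser: append one parsed novel (a dict) as a JSON line."""
    with open(path, 'a') as f:
        f.write(json.dumps(info) + '\n')

def bulk_save(path='novel/17k.jsonl'):
    """Read the JSON-lines file once and insert all records in one call."""
    conn = pymongo.Connection(host='localhost', port=27017)
    novel_pool = conn.novel_pool_db.novel_pool
    with open(path) as f:
        docs = [json.loads(line) for line in f if line.strip()]
    if docs:
        novel_pool.insert(docs)  # insert() also accepts a list of documents
    return len(docs)

This way the storage module touches the disk only once instead of opening thousands of small info files.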
Next, the data access interface module. It needs to provide five functions:
a) Look up the information of the novel titled xxx;
b) List the novels by author xxx, sorted by popularity;
c) List the novels in category xxx, sorted by popularity;
d) List the novels tagged xxx, sorted by relevance;
e) Author ranking: give the author's name, introduction, and popularity.
The program is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
File: visit_novel.py
Author: civis
Date: 2015/01/23 14:19:44
'''
import pymongo

conn = pymongo.Connection('localhost', 27017)
db = conn.novel_pool_db
novel_pool = db.novel_pool

def get_total_number():
    """Return the total number of novels in the database."""
    return novel_pool.find({}).count()

def get_novel_by_name(name="Life of Civis"):
    """Return the information of the novel titled $name
    as a dictionary, or 'None information' if it is not found.
    """
    intro = novel_pool.find_one({'title': name})
    if intro is not None:
        return intro
    return "None information"

def get_novel_by_author(author="Civis"):
    """Return a list of the novels written by $author,
    ordered by score.
    """
    results = []
    novels = novel_pool.find({'author': author}).sort('score', -1)
    for novel in novels:
        results.append(novel)
    return results

def get_novel_by_categ(category="autobiography"):
    """Return a list of the novels in $category,
    ordered by score.
    """
    results = []
    novels = novel_pool.find({'category': category}).sort('score', -1)
    for novel in novels:
        results.append(novel)
    return results

def get_author_info(author="Civis"):
    """Return the total score and the list of works of $author."""
    works = []
    novels = novel_pool.find({'author': author})
    score = 0
    for novel in novels:
        try:
            works.append(novel['title'])
        except KeyError, e:
            continue
        # only add up scores that were stored as numbers
        if isinstance(novel.get('score'), int):
            score += novel['score']
    return score, works
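Interface (d), the tag query, does not appear in the listing above. A minimal sketch of one possible version that could sit alongside the functions in visit_novel.py; it assumes the tags field is the raw string written by save_for17k.py, and it simply uses score as a stand-in for real relevance ranking:

import re

def get_novel_by_tag(tag="fantasy"):
    """Return novels whose tags field contains $tag, ordered by score."""
    pattern = re.compile(re.escape(tag))   # regex match inside the tags string
    novels = novel_pool.find({'tags': pattern}).sort('score', -1)
    return [novel for novel in novels]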
Interface (e) is still not fully implemented, mainly because the database stores one document per novel page, so a simple lookup cannot produce an author ranking directly; I will keep working on it.
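One direction worth trying for interface (e) is MongoDB's aggregation framework: group the documents by author, sum the scores, and sort. A sketch only, not the current implementation; it assumes score is stored as a number, and it returns the author's works instead of an author introduction, which the database does not hold:

def get_author_ranking(limit=10):
    """Return the top $limit authors ordered by the sum of their novels' scores."""
    pipeline = [
        {'$group': {'_id': '$author',
                    'total_score': {'$sum': '$score'},
                    'works': {'$push': '$title'}}},
        {'$sort': {'total_score': -1}},
        {'$limit': limit},
    ]
    result = novel_pool.aggregate(pipeline)
    # older pymongo versions wrap the documents in {'result': [...]}
    if isinstance(result, dict):
        result = result.get('result', [])
    return list(result)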
Summary: that is about it for the practice project notes, mainly a record of last week's work. There is still a lot to improve; over the coming week I will work on the performance-related issues, such as how the crawler stores pages, how the parser locates the pages to process, and how the parser stores its data.