A Single-Node Novel Repository (Part 2)

This post details the implementation of the database storage module and the data-access interface: novel data is stored in MongoDB, and the interface supports looking up novels by title, author, category, and tag. It also discusses the current performance bottlenecks and the planned improvements.


The previous post covered the crawler and the page-parsing module; this one starts with the storage module.

Per my advisor's requirement, storage uses a NoSQL database: MongoDB. Fortunately it is fairly easy to learn; after installing it and studying it briefly, I wrote the storage program right away. All novel data lives in a single table (called a collection in MongoDB), with one record (called a document) per novel, and the individual fields stored inside that document. The program is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
File: save_for17k.py
Author: civis
Date: 2015/01/22 20:09:49
'''

import os

import pymongo

def save(domain="17k"):
    """Save the parsed info files for "domain" into MongoDB."""
    count = 0
    # Open a connection to MongoDB and pick the database/collection
    conn = pymongo.Connection(host='localhost', port=27017)
    db = conn.novel_pool_db    # database
    novel_pool = db.novel_pool # novel_pool collection

    # Handle each info file in the domain directory
    path = 'novel/' + domain
    if os.path.isdir(path):
        files = os.listdir(path)
    else:
        print "*** Error: invalid path: %s ***" % path
        return
    for fn in files:
        record = {}
        record['introduction'] = ""
        try:
            fl = open(path + '/' + fn, 'r')
            lines = fl.readlines()  # one list entry per line of the file
            fl.close()
        except IOError, e:
            print "*** Error: failed to read file: %s %s ***" % (fn, str(e))
            continue
        if len(lines) < 10:
            print "*** Bad novel introduction, skipping: %s" % fn
            continue

        # Fixed fields sit at the head and tail of the file;
        # everything in between is the introduction text.
        record['title'] = lines[0].split(':')[1].strip()
        record['author'] = lines[1].split(':')[1].strip()
        record['tags'] = lines[-1].split(':')[1].strip()
        try:
            # store the score as a number so that sorting by score works
            record['score'] = int(lines[-2].split(':')[1].strip())
        except ValueError:
            record['score'] = 0
        record['links'] = lines[-3].split(':')[1].strip()
        record['wordcount'] = lines[-4].split(':')[1].strip()
        record['image'] = lines[-5].split(':')[1].strip()
        record['category'] = lines[-6].split(':')[1].strip()

        for i in range(3, len(lines) - 6):
            record['introduction'] += lines[i]

        try:
            novel_pool.insert(record)
            count += 1
            print "( %d ) Successfully inserted: %s" % (count, record['title'])
        except Exception, e:
            print "*** Error: insert failed! " + str(e)

if __name__ == '__main__':
    domain = "17k"
    save(domain)
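
For reference, after a successful run a stored document looks roughly like this (the field values below are made up for illustration):

{
    "_id" : ObjectId("..."),
    "title" : "Example Novel",
    "author" : "Example Author",
    "category" : "fantasy",
    "image" : "http://example.com/cover.jpg",
    "wordcount" : "123456",
    "links" : "http://example.com/novel/123",
    "score" : 8942,
    "tags" : "fantasy adventure",
    "introduction" : "A few paragraphs of introduction text..."
}
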
The program uses the pymongo library to operate on MongoDB. Performance is acceptable, but opening so many small files one by one still slows things down. I strongly feel the page-parsing module should instead dump the parsed novel info as JSON records into one big file; a sketch of that idea follows.
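
A minimal sketch of that JSON approach (the file name, function name, and record layout are my own assumptions, not part of the current code): the parser would append one json.dumps(record) per line, and the loader would then stream the big file straight into MongoDB:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json

import pymongo

def save_from_jsonl(path='novel/17k.jsonl'):
    """Load one-JSON-object-per-line novel records into MongoDB,
       avoiding one open() call per novel.
    """
    conn = pymongo.Connection(host='localhost', port=27017)
    novel_pool = conn.novel_pool_db.novel_pool

    count = 0
    fl = open(path, 'r')
    for line in fl:                     # stream the big file line by line
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)   # e.g. {"title": ..., "author": ...}
            novel_pool.insert(record)
            count += 1
        except Exception, e:
            print "*** Error: bad record or insert failed: %s" % str(e)
    fl.close()
    print "Inserted %d novels" % count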

Next, the data-access interface module. It needs to implement five functions:

a) query the information of the novel titled xxx;

b) query the list of novels by author xxx, sorted by popularity;

c) query the list of novels in category xxx, sorted by popularity;

d) query the list of novels tagged xxx, sorted by relevance;

e) author ranking, giving the author's name, bio, and popularity.

The program is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
File: visit_novel.py
Author: civis
Date: 2015/01/23 14:19:44
'''
import pymongo

# Shared connection used by all query functions below
conn = pymongo.Connection('localhost', 27017)
db = conn.novel_pool_db
novel_pool = db.novel_pool

def get_total_number():
    """Return the total number of novels in the database."""
    return novel_pool.find({}).count()

def get_novel_by_name(name="Life of Civis"):
    """Return the information of the novel `name` as a dictionary,
       or the string 'None information' if it is not found.
    """
    intro = novel_pool.find_one({'title': name})
    if intro is not None:
        return intro
    return "None information"

def get_novel_by_author(author="Civis"):
    """Return a list of all novels written by `author`,
       ordered by score (descending).
    """
    results = []
    novels = novel_pool.find({'author': author}).sort('score', -1)
    for novel in novels:
        results.append(novel)
    return results

def get_novel_by_categ(category="autobiography"):
    """Return a list of the novels in `category`,
       ordered by score (descending).
    """
    results = []
    novels = novel_pool.find({'category': category}).sort('score', -1)
    for novel in novels:
        results.append(novel)
    return results

def get_author_info(author="Civis"):
    """Return the total score and the list of works of `author`."""
    works = []
    novels = novel_pool.find({'author': author})
    score = 0
    for novel in novels:
        try:
            works.append(novel['title'])
            s = novel['score']
        except KeyError:
            continue
        if isinstance(s, int):  # score is stored as an int by save_for17k.py
            score += s
    return score, works
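
Interface d (querying novels by tag, sorted by relevance) is missing from the script above. Since save_for17k.py stores `tags` as a single string, a minimal sketch, assuming score is an acceptable stand-in for relevance (the function name is my own), could match the tag with a regular expression:

import re

def get_novel_by_tag(tag="fantasy"):
    """Sketch for interface d: find novels whose tags string
       contains `tag`, sorted by score as a crude proxy for
       relevance. A real relevance ranking would need the tags
       stored as a list plus a proper scoring scheme.
    """
    results = []
    pattern = re.compile(re.escape(tag))        # pymongo accepts compiled regexes
    novels = novel_pool.find({'tags': pattern}).sort('score', -1)
    for novel in novels:
        results.append(novel)
    return results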

Interface e still can't be fully implemented, mainly because the database stores one document per novel page, so an author-level ranking can't be answered by a single lookup; I'll look into it later. One possible direction is sketched below.
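
That sketch (an assumption on my part, not the project's current method) uses MongoDB's aggregation framework to group the per-novel documents by author and sum their scores; it needs `score` stored numerically, and author bios would still be missing since they are not in the collection:

def rank_authors(limit=10):
    """Sketch for interface e: group novels by author, sum the
       scores, and return the top `limit` authors by total score.
       Older pymongo versions return {'result': [...]} from
       aggregate(); newer ones return a cursor.
    """
    pipeline = [
        {'$group': {'_id': '$author',
                    'total_score': {'$sum': '$score'},
                    'works': {'$push': '$title'}}},
        {'$sort': {'total_score': -1}},
        {'$limit': limit},
    ]
    result = novel_pool.aggregate(pipeline)
    if isinstance(result, dict):        # pymongo 2.x style reply
        return result.get('result', [])
    return list(result)                 # newer pymongo: a cursor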

To sum up: that's it for the practice-project notes, mainly a record of last week's work. There is still plenty to improve; over the coming week I'll work on performance-related issues such as the crawler's storage scheme, how the parser locates pages, and how the parsed data is stored.



