The previous posts covered the crawler and the page-parsing module; this one starts with the storage module:
Per my advisor's requirement, storage uses a NoSQL database: MongoDB. Fortunately it is fairly easy to pick up; after installing it and studying it briefly, I was able to write the storage program right away. All novel data goes into a single table (called a collection in MongoDB), with one record (called a document) per novel, and the individual fields stored inside each document. The program is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
File: save_for17k.py
Author: civis
Date: 2015/01/22 20:09:49
'''
import os
import pymongo

def save(domain="17k"):
    """Save the parsed novel info files for $domain into MongoDB."""
    # open a connection to MongoDB
    count = 0
    conn = pymongo.Connection(host='localhost', port=27017)
    db = conn.novel_pool_db        # database
    novel_pool = db.novel_pool     # novel_pool collection
    # handle each info file in the domain directory
    path = 'novel/' + domain
    files = []
    if os.path.isdir(path):
        files = os.listdir(path)
    else:
        print "*** Error: invalid path: %s ***" % path
        return
    for fn in files:
        doc = {}
        doc['introduction'] = ""
        try:
            fl = open(path + '/' + fn, 'r')
            all = fl.readlines()   # every line of the info file
            fl.close()
        except IOError, e:
            print "*** Error: failed to read file: %s %s ***" % (fn, str(e))
            continue
        if len(all) < 10:
            print "*** Bad novel introduction, skip: %s" % fn
            continue
        doc['title'] = all[0].split(':')[1].strip()
        doc['author'] = all[1].split(':')[1].strip()
        doc['tags'] = all[-1].split(':')[1].strip()
        doc['score'] = all[-2].split(':')[1].strip()
        doc['links'] = all[-3].split(':')[1].strip()
        doc['wordcount'] = all[-4].split(':')[1].strip()
        doc['image'] = all[-5].split(':')[1].strip()
        doc['category'] = all[-6].split(':')[1].strip()
        try:
            # store the score as a number so sorting and summing work later
            doc['score'] = int(doc['score'])
        except ValueError:
            pass
        for i in range(3, len(all) - 6):
            doc['introduction'] += all[i]
        try:
            novel_pool.insert(doc)
            count += 1
            print '( ', count, ' )'
            print "Successfully inserted: %s" % doc['title']
        except Exception, e:
            print "*** Error: insert failed! " + str(e)

if __name__ == '__main__':
    domain = "17k"
    save(domain)
The program uses the pymongo library to talk to MongoDB. Performance is acceptable, but opening so many small files still slows things down. It really feels like the page-parsing module should dump the parsed novel info as JSON into one large file instead.
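A rough sketch of that idea (the file name novel/17k.jsonl and the helper names below are my own assumptions, not part of the current code): the parser appends each novel's dictionary as one JSON line to a single file, and the storage step then reads that file once and bulk-inserts all the records:

import json
import pymongo

def dump_novel(info, path='novel/17k.jsonl'):
    """Called by the parser: append one parsed novel (a dict) as a JSON line."""
    with open(path, 'a') as f:
        f.write(json.dumps(info) + '\n')

def bulk_save(path='novel/17k.jsonl'):
    """Read the JSON-lines file once and insert all records in one call."""
    conn = pymongo.Connection(host='localhost', port=27017)
    novel_pool = conn.novel_pool_db.novel_pool
    with open(path) as f:
        docs = [json.loads(line) for line in f if line.strip()]
    if docs:
        novel_pool.insert(docs)  # insert() also accepts a list of documents
    return len(docs)

This way the storage module touches the disk only once instead of opening thousands of small info files.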
Next, the data access interface module. It needs to provide five functions:
a) Look up the information of the novel titled xxx;
b) List the novels by author xxx, sorted by popularity;
c) List the novels in category xxx, sorted by popularity;
d) List the novels tagged xxx, sorted by relevance;
e) Author ranking: give the author's name, introduction, and popularity.
The program is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
File: visit_novel.py
Author: civis
Date: 2015/01/23 14:19:44
'''
import pymongo

conn = pymongo.Connection('localhost', 27017)
db = conn.novel_pool_db
novel_pool = db.novel_pool

def get_total_number():
    """Return the total number of novels in the database."""
    return novel_pool.find({}).count()

def get_novel_by_name(name="Life of Civis"):
    """Return the information of the novel titled $name
    as a dictionary, or 'None information' if it is not found.
    """
    intro = novel_pool.find_one({'title': name})
    if intro is not None:
        return intro
    return "None information"

def get_novel_by_author(author="Civis"):
    """Return a list of the novels written by $author,
    ordered by score.
    """
    results = []
    novels = novel_pool.find({'author': author}).sort('score', -1)
    for novel in novels:
        results.append(novel)
    return results

def get_novel_by_categ(category="autobiography"):
    """Return a list of the novels in $category,
    ordered by score.
    """
    results = []
    novels = novel_pool.find({'category': category}).sort('score', -1)
    for novel in novels:
        results.append(novel)
    return results

def get_author_info(author="Civis"):
    """Return the total score and the list of works of $author."""
    works = []
    novels = novel_pool.find({'author': author})
    score = 0
    for novel in novels:
        try:
            works.append(novel['title'])
        except KeyError, e:
            continue
        # only add up scores that were stored as numbers
        if isinstance(novel.get('score'), int):
            score += novel['score']
    return score, works
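Interface (d), the tag query, does not appear in the listing above. A minimal sketch of one possible version that could sit alongside the functions in visit_novel.py; it assumes the tags field is the raw string written by save_for17k.py, and it simply uses score as a stand-in for real relevance ranking:

import re

def get_novel_by_tag(tag="fantasy"):
    """Return novels whose tags field contains $tag, ordered by score."""
    pattern = re.compile(re.escape(tag))   # regex match inside the tags string
    novels = novel_pool.find({'tags': pattern}).sort('score', -1)
    return [novel for novel in novels]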
Interface (e) is still not fully implemented, mainly because the database stores one document per novel page, so a simple lookup cannot produce an author ranking directly; I will keep working on it.
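One direction worth trying for interface (e) is MongoDB's aggregation framework: group the documents by author, sum the scores, and sort. A sketch only, not the current implementation; it assumes score is stored as a number, and it returns the author's works instead of an author introduction, which the database does not hold:

def get_author_ranking(limit=10):
    """Return the top $limit authors ordered by the sum of their novels' scores."""
    pipeline = [
        {'$group': {'_id': '$author',
                    'total_score': {'$sum': '$score'},
                    'works': {'$push': '$title'}}},
        {'$sort': {'total_score': -1}},
        {'$limit': limit},
    ]
    result = novel_pool.aggregate(pipeline)
    # older pymongo versions wrap the documents in {'result': [...]}
    if isinstance(result, dict):
        result = result.get('result', [])
    return list(result)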
Summary: that is about it for the practice project notes, mainly a record of last week's work. There is still a lot to improve; over the coming week I will work on the performance-related issues, such as how the crawler stores pages, how the parser locates the pages to process, and how the parser stores its data.