A Beginner's Notes on Pitfalls: Building a Simple Crawler Display Site with Django + MongoDB

This post pieces together a simple website based on the book 《Django企业开发实战》 and assorted online material, and more or less gets the frontend, backend and database working together with the crawler code. I have to say that Django and MongoDB really do not go well together, and I strongly recommend against this combination.

Getting Started: the conda Environment

Everything here runs on Ubuntu 18.04 LTS inside an Anaconda virtual environment with Python 3.6.10. The main library versions are django (1.11.4), mongoengine (0.20.0) and pymongo (3.10.1), plus requests (2.24.0) and beautifulsoup4 (4.9.1) for the crawlers. The crawler itself was covered in detail in the previous post; here it is lightly adapted and reused.
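
For reference, the environment can be recreated with something like the following (the env name search_env is just a placeholder; the version pins match the list above):

conda create -n search_env python=3.6
conda activate search_env
pip install django==1.11.4 mongoengine==0.20.0 pymongo==3.10.1 requests==2.24.0 beautifulsoup4==4.9.1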

Part 1: Making Django Work with MongoDB

I will skip the creation of the Django project itself for now; Chapter 3 of 《Django企业开发实战》 covers it, and I may fill in the details later. The commands involved are sketched below.
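
For completeness, creating the project and the app boils down to the standard Django commands (the project and app names match what the rest of this post uses):

django-admin startproject search_project
cd search_project
python manage.py startapp show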

The finished project ends up with the file structure below (there is one more search_project folder on the outside, the forbidden triple nesting):
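
Roughly (reconstructed from the files referenced throughout this post; the separate background-crawler script is omitted since its location is not specified):

search_project/                      # outermost folder
└── search_project/                  # created by startproject; holds manage.py
    ├── manage.py
    ├── search_project/              # settings.py, urls.py, wsgi.py
    └── show/                        # the app
        ├── models.py
        ├── views.py
        ├── admin.py                 # unused
        ├── crawler_jiangsu.py
        ├── crawler_global_user.py
        └── templates/
            ├── index.html
            ├── jiangsu_result.html
            └── global_result.html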

After installing MongoDB locally and creating the Django project, the first step is to disable the perfectly serviceable default sqlite backend in settings.py and add the MongoDB pieces, which need third-party support and are thinly documented. My main references were the blog posts 《Django 数据库操作(MongoDB+Django)》, 《Django+MongoDB》, 《Scrapy爬取数据存储到Mongodb数据库》, 《python+django从mongo读取数据和图片展示在html》 and 《Django 框架之 使用MangoDB数据库》.

My settings.py looks like this after the changes:

"""
Django settings for search_project project.

Generated by 'django-admin startproject' using Django 1.11.4.

For more information on this file, see
https://docs.djangoproject.com/en/1.11/topics/settings/

For the full list of settings and their values, see
https://docs.djangoproject.com/en/1.11/ref/settings/
"""

import os
import mongoengine

# Build paths inside the project like this: os.path.join(BASE_DIR, ...)
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))


# Quick-start development settings - unsuitable for production
# See https://docs.djangoproject.com/en/1.11/howto/deployment/checklist/

# SECURITY WARNING: keep the secret key used in production secret!
SECRET_KEY = '**^!7vv0ylwqme#^a6zs*$x)zzwgff-(u_6va%q@u-_g5wz8q#'  # the key generated with the project; every project gets a different one

# SECURITY WARNING: don't run with debug turned on in production!
DEBUG = False

ALLOWED_HOSTS = ['*']


# Application definition
#show is my app; mongoengine is a third-party component that also needs to be registered as an app
INSTALLED_APPS = [
    'show',
    'mongoengine',

    'django.contrib.admin',
    'django.contrib.auth',

    'rest_framework.authtoken',

    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
]
#MongoDB database name, host and timezone awareness (kept from the reference blogs; the actual connection is made by mongoengine.connect further down)
MONGODB_DATABASES = {
    "default": {
        "name": "search_record",
        "host": '127.0.0.1',
        'tz_aware': True,
    }
}

MIDDLEWARE = [
    'django.middleware.security.SecurityMiddleware',
    'django.contrib.sessions.middleware.SessionMiddleware',
    'django.middleware.common.CommonMiddleware',
    'django.middleware.csrf.CsrfViewMiddleware',
    'django.contrib.auth.middleware.AuthenticationMiddleware',
    'django.contrib.messages.middleware.MessageMiddleware',
    'django.middleware.clickjacking.XFrameOptionsMiddleware',
]

ROOT_URLCONF = 'search_project.urls'

TEMPLATES = [
    {
        'BACKEND': 'django.template.backends.django.DjangoTemplates',
        'DIRS': [],
        'APP_DIRS': True,
        'OPTIONS': {
            'context_processors': [
                'django.template.context_processors.debug',
                'django.template.context_processors.request',
                'django.contrib.auth.context_processors.auth',
                'django.contrib.messages.context_processors.messages',
            ],
        },
    },
]

WSGI_APPLICATION = 'search_project.wsgi.application'


# Database
# https://docs.djangoproject.com/en/1.11/ref/settings/#databases
#comment out the sqlite backend; the default engine is set to dummy so Django's own ORM is effectively disabled
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.dummy',
        #'ENGINE': 'django.db.backends.sqlite3',
        #'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
    }
}
#connect mongoengine to the database
mongoengine.connect('search_record', host='127.0.0.1', port=27017)

# Password validation
# https://docs.djangoproject.com/en/1.11/ref/settings/#auth-password-validators

AUTH_PASSWORD_VALIDATORS = [
    {
        'NAME': 'django.contrib.auth.password_validation.UserAttributeSimilarityValidator',
    },
    {
        'NAME': 'django.contrib.auth.password_validation.MinimumLengthValidator',
    },
    {
        'NAME': 'django.contrib.auth.password_validation.CommonPasswordValidator',
    },
    {
        'NAME': 'django.contrib.auth.password_validation.NumericPasswordValidator',
    },
]


# Internationalization
# https://docs.djangoproject.com/en/1.11/topics/i18n/

LANGUAGE_CODE = 'zh-hans'

TIME_ZONE = 'Asia/Shanghai'

USE_I18N = True

USE_L10N = True

USE_TZ = True

# Static files (CSS, JavaScript, Images)
# https://docs.djangoproject.com/en/1.11/howto/static-files/

STATIC_URL = '/static/'
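
Before moving on, it is worth confirming that mongoengine can actually reach the local MongoDB instance. A minimal stand-alone check, assuming mongod is running on the default port:

import mongoengine

# mongoengine.connect returns the underlying pymongo MongoClient
conn = mongoengine.connect('search_record', host='127.0.0.1', port=27017)
print(conn.server_info()['version'])  # raises an error if MongoDB is unreachable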

Part 2: Creating the Models

Since the requirements are simple and no admin or user system is involved (Django + MongoDB is very unfriendly to beginners on that front), only two models are needed, one for each crawled site's records. In the show app's models.py, create two classes matching the Jiangsu and nationwide crawler data formats.

from django.db import models
import datetime
from mongoengine import *

# Create your models here.
"""
To make storing dates and times easy, declare these attributes as DateTimeField;
datetime objects can then be saved directly without any conversion.
In meta, 'ordering' is the sort key; the leading "-" sorts in descending order.
Make sure the attribute names here match the field names actually stored in the
database, otherwise the page will fail with a 500 error.
"""
class JiangSu_Record(Document):
    title = StringField(max_length=2048, required=True)
    url = StringField(max_length=2048, required=True)
    start_time = DateTimeField(required=True)
    deadline = DateTimeField(required=True)
    issue_time = DateTimeField(required=True)
    project_type = StringField(max_length=32, required=True)
    created_time = DateTimeField(default=datetime.datetime.now, required=True)  # pass the callable, not now(), so the default is evaluated per document
    meta = {'collection': 'JiangSu', 'ordering': ['-created_time'], }
#Jiangsu records carry a title, URL, start time, deadline, publication time, project type and creation time, stored in the JiangSu collection of the database

class Global_Record(Document):
    title = StringField(max_length=2048, required=True)
    url = StringField(max_length=2048, required=True)
    area_city = StringField(max_length=64, required=True)
    issue_date = DateTimeField(required=True)
    created_time = DateTimeField(default=datetime.datetime.now, required=True)  # again, pass the callable rather than calling now() once at import time
    meta = {'collection': 'Global', 'ordering': ['-created_time'], }
#nationwide records carry a title, URL, region, publication date and creation time, stored in the Global collection of the database
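
As a quick sanity check, the models can be exercised from python manage.py shell (a minimal sketch; the sample values are made up):

import datetime
from show.models import JiangSu_Record

# save() writes straight into the JiangSu collection of the search_record database
JiangSu_Record(
    title='测试公告',
    url='http://example.com/notice/1',
    start_time=datetime.datetime(2020, 8, 1, 9, 0),
    deadline=datetime.datetime(2020, 8, 20, 17, 0),
    issue_time=datetime.datetime(2020, 8, 1, 8, 0),
    project_type='施工',
).save()

print(JiangSu_Record.objects.count())  # number of documents in the JiangSu collection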

Since Django's admin does not work on top of MongoDB, the admin.py configuration is skipped entirely, and so are the usual steps of creating migration files, creating tables and creating a superuser.

Part 3: Views, HTML Templates and URL Configuration

In the app's views.py, create a view function for each page: the home page, the Jiangsu search page and the nationwide search page. The QuerySet-style interface used for the MongoDB operations comes from mongoengine, which deliberately mimics Django's own ORM, so Django's QuerySet API reference and other blog posts are still a usable guide.

from django.shortcuts import render
import datetime
from .models import JiangSu_Record, Global_Record
import os, django
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "search_project.settings")  # "<project name>.settings"
django.setup()
from .crawler_jiangsu import is_date, visit_jiangsu_home
from .crawler_global_user import update_or_not
#the pages are rendered directly with render()
def index(request):
    return render(request, 'index.html')

def jiangsu_result(request):
    visit_jiangsu_home()  # crawl first on every visit; on POST, the posted data is then used to filter the results
    if request.method == "POST":
        jiangsu_key_word = request.POST.get('jiangsu_key_word', None)
        jiangsu_date = request.POST.get('jiangsu_date', None)
        jiangsu_date = is_date(jiangsu_date)
        target_results = JiangSu_Record.objects.all()
        if jiangsu_key_word is not None:  # match multiple keywords (all must appear in the title)
            keys = jiangsu_key_word.split(" ")
            for key in keys:
                target_results = target_results.filter(title__contains=key)
        else:
            pass

        if jiangsu_date is not None:
            target_results = target_results.filter(deadline__gte=jiangsu_date)
        else:
            pass
        context = {'results': target_results}
        return render(request, 'jiangsu_result.html', context=context)
    else:
        jiangsu_results = JiangSu_Record.objects.all()
        context = {'results': jiangsu_results}
        return render(request, 'jiangsu_result.html', context=context)


def global_result(request):
    update_or_not()
    if request.method == "POST":
        global_key_word = request.POST.get('global_key_word', None)
        global_date = request.POST.get('global_date', None)
        global_date = is_date(global_date)
        target_results = Global_Record.objects.all()
        if global_key_word is not None:
            keys = global_key_word.split(" ")
            for key in keys:
                target_results = target_results.filter(title__contains=key)
        else:
            pass
        if global_date is not None:
            target_results = target_results.filter(issue_date__gte=global_date)
        else:
            pass
        target_results = target_results[:100]  # show only the first 100 rows (the 100 most recently created)
        context = {'results': target_results}
        return render(request, 'global_result.html', context=context)
    else:
        global_results = Global_Record.objects.all()[:100]
        context = {'results': global_results}
        return render(request, 'global_result.html', context=context)
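
The only non-obvious part of the views above is the multi-keyword matching: mongoengine querysets are lazy, so calling filter() repeatedly just narrows a single query instead of hitting the database once per keyword. A minimal sketch with a made-up keyword string:

from show.models import JiangSu_Record

results = JiangSu_Record.objects.all()
for key in "江苏 高速".split(" "):          # every keyword must appear in the title
    results = results.filter(title__contains=key)
print(results.count())                      # the query is only executed here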

After writing views.py, create a templates folder (note the trailing "s") inside the app directory and put the page files there, so the view functions can render them directly.

Three pages are needed. First, the home page index.html:

<!DOCTYPE html>
<html>
<head>
    <title>首页</title>
</head>
<body>
    <hr/>
        <a target="_blank" href="/jiangsu_result/">江苏搜索</a>
    <hr/>
        <a target="_blank" href="/global_result/">全国搜索</a>
    <hr/>
</body>
</html>

This page does nothing more than provide simple links to the two search pages.

The Jiangsu search results page jiangsu_result.html:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>江苏-搜索结果</title>
</head>
<body>
    <hr/>
    <form action="/jiangsu_result/" method="post" align="center">
        {% csrf_token %}
        关键字:<input type="text" name="jiangsu_key_word" />
        公告截止日期:<input type="text" name="jiangsu_date" />
        <input type="submit" value="江苏搜索" />
    </form>
    <hr/>
    <table border="1" align="center">
        <tbody>
        <tr>
            <th>公告名称</th>
            <th>公告截止日期</th>
            <th>公告发布日期</th>
            <th>项目类型</th>
            <th>公告链接</th>
        </tr>
            {% for item in results %}
                <tr>
                    <td width="1000" align="left">{{ item.title }}</td>
                    <td width="250" align="center">{{ item.deadline }}</td>
                    <td width="250" align="center">{{ item.issue_time }}</td>
                    <td width="200" align="center">{{ item.project_type }}</td>
                    <td width="100" align="center"><a href={{ item.url }}>链接</a></td>
                </tr>
            {% endfor %}
        </tbody>
    </table>
</body>
</html>

Here results is exactly what was passed as the context argument to render() in views.py.

The nationwide search results page global_result.html:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>全国-搜索结果</title>
</head>
<body>
    <hr/>
    <form action="/global_result/" method="post" align="center">
        {% csrf_token %}
        关键字:<input type="text" name="global_key_word" />
        起始公布日期:<input type="text" name="global_date" />
        <input type="submit" value="全国搜索" />
    </form>
    <hr/>
    <table border="1" align="center">
        <tbody>
        <tr>
            <th>公告名称</th>
            <th>地区</th>
            <th>公告发布日期</th>
            <th>公告链接</th>
        </tr>
            {% for item in results %}
                <tr>
                    <td width="1000" align="left">{{ item.title }}</td>
                    <td width="150" align="center">{{ item.area_city }}</td>
                    <td width="250" align="center">{{ item.issue_date }}</td>
                    <td width="100" align="center"><a href={{ item.url }}>链接</a></td>
                </tr>
            {% endfor %}
        </tbody>
    </table>
</body>
</html>

Finally, configure the URLs by editing urls.py, the file that sits next to settings.py:

from django.conf.urls import url
from django.contrib import admin
from django.contrib.staticfiles.urls import staticfiles_urlpatterns
from show import views
#one URL mapping per page, plus admin
urlpatterns = [
    url(r'^$', views.index, name='index'),
    url(r'^jiangsu_result', views.jiangsu_result, name='jiangsu_result'),
    url(r'^global_result', views.global_result, name='global_result'),
    url(r'^admin/', admin.site.urls),
]
urlpatterns += staticfiles_urlpatterns()  # adds URL patterns for serving static files in development (a no-op unless DEBUG is True)

Part 4: Finishing Up

Back in the project directory, run python manage.py runserver on the command line to start the project; the home page is then available at http://127.0.0.1:8000.

Extras:

The crawlers from the previous post were modified a bit, adding the corresponding database operations and a background auto-update of the database; the code follows.

The Jiangsu site produces few updates, so it is enough to refresh it once before each query instead of continuously updating it in the background:

# -*- coding: utf-8 -*-
import requests, json
from bs4 import BeautifulSoup
import time
import datetime
import sys
from pymongo import MongoClient, InsertOne
import re

def visit_jiangsu_home(x=10):  # crawl the first 10 pages by default
    #fake the request headers
    headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh,zh-CN;q=0.9,en-US;q=0.8,en;q=0.7',
    'Connection': 'keep-alive',
    'Content-Length': '14',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cookie': 'JSESSIONID=6938F495DAA5F25B2E458C7AB108BEDF',
    'Host': '218.2.208.144:8094',
    'Origin': 'http://218.2.208.144:8094',
    'Referer': 'http://218.2.208.144:8094/EBTS/publish/announcement/paglist1?type=1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
    }
    
    #fake the form data
    form_data = {
    'page':'1',
    'rows': '10'
    }
    #home_dict = {}  # leftover from an earlier version that collected results into a dict
    #start from the first page
    times = 1
    repeat_num = 0
    while times <= x:
        form_data['page'] = times
        #note: this url is the request URL shown under "Name" in the browser dev tools, not the page address
        url = 'http://218.2.208.144:8094/EBTS/publish/announcement/getList?placard_type=1'
        response = requests.post(url, data=form_data, headers=headers, stream=True)
        #parse the response, then convert it to a string for simple split-based extraction
        soup = BeautifulSoup(response.content, "html5lib")
        soup = str(soup)
        for i in range(1,11):  #each page has 10 rows, so splitting on the key gives 11 segments; the useful ones are segments 2 to 11
            id = soup.split('placard_id":"')[i].split('","project_id')[0]  #extract the announcement ID
            mark = soup.split('is_new_tuisong":"')[i].split('","remark_id')[0]  #extract whether this is a correction-type announcement
            project_type = soup.split('tender_project_type":"')[i].split('","publish_remark_by_zzw')[0]  #extract the tender type
            station = visit_single_website(id, mark, project_type)  #visit the single announcement page and insert it into the database
            if station == False:
                repeat_num+=1
            if repeat_num>10:
                print("重复过多")
                return
        times= times+1
        time.sleep(0.5)  #avoid hammering the server
    #return home_dict
    return

def visit_single_website(id, mark, project_type):
    data_address_first1 = 'http://218.2.208.148:9092/api/BiddingNotice/GetByKeys?BNID='  # JSON URL prefix for ordinary announcements
    data_address_first2 = 'http://218.2.208.148:9092/api/BiddingNoticeAdditional/GetByKeys?TNAID='  # JSON URL prefix for correction announcements
    page_address_first1 = 'http://218.2.208.148:9092/OP/Bidding/BiddingNoticeAdditionalView.aspx?bnid='  # page URL prefix for ordinary announcements
    page_address_first2 = 'http://218.2.208.148:9092/OP/Bidding/BiddingNoticeAdditionalView2.aspx?action=view&TNAID='  # page URL prefix for correction announcements
    if int(mark)==2:  #this is a correction announcement
        page_url = page_address_first2 + id
        address_url = data_address_first2 + id
        try:
            return_data = requests.get(address_url, stream=True)  #fetch the full JSON content
            all_content = json.loads(return_data.content)
            try:
                title = str(all_content.get('TNANAME'))  # title of a correction announcement; ordinary announcements have no TNANAME field
            except:
                title = "无公告内容"
            try:
                start_date = str(all_content.get('APPLYBEGINTIME')).replace("T", " ")  # application start time
                start_date = datetime.datetime.strptime(start_date, '%Y-%m-%d %H:%M:%S')
            except:
                start_date = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
            try:
                deadline = str(all_content.get('APPLYENDTIME')).replace("T", " ")  # application deadline
                deadline = datetime.datetime.strptime(deadline, '%Y-%m-%d %H:%M:%S')
            except:
                deadline = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
            try:
                issue_date = str(all_content.get('CREATETIME')).replace("T", " ")  # publication date
                issue_date = datetime.datetime.strptime(issue_date, '%Y-%m-%d %H:%M:%S')
            except:
                issue_date = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
        except:
            title = "无公告内容"
            start_date = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
            deadline = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
            issue_date = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
    else:
        page_url = page_address_first1 + id
        address_url = data_address_first1 + id
        try:
            return_data = requests.get(address_url, stream=True)
            all_content = json.loads(return_data.content)
            try:
                title = str(all_content.get('BNNAME'))  # title of an ordinary announcement
            except:
                title = "无公告内容"
            try:
                start_date = str(all_content.get('APPLYBEGINTIME')).replace("T", " ")  # application start time
                start_date = datetime.datetime.strptime(start_date, '%Y-%m-%d %H:%M:%S')
            except:
                start_date = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
            try:
                deadline = str(all_content.get('APPLYENDTIME')).replace("T", " ")  # application deadline
                deadline = datetime.datetime.strptime(deadline, '%Y-%m-%d %H:%M:%S')
            except:
                deadline = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
            try:
                issue_date = str(all_content.get('CREATETIME')).replace("T", " ")  # publication date
                issue_date = datetime.datetime.strptime(issue_date, '%Y-%m-%d %H:%M:%S')
            except:
                issue_date = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
        except:
            title = "无公告内容"
            start_date = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
            deadline = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
            issue_date = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
    state = insert_single_data(title, page_url, start_date, deadline, issue_date, project_type)  #True/False, used for the duplicate counter
    return state


def insert_single_data(title, page_url, start_date, deadline, issue_date, project_type):
    conn = MongoClient("127.0.0.1:27017", maxPoolSize=None)
    my_db = conn['search_record']
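    # NOTE: insert_one only reports a duplicate (and fails) if the collection has a unique
    # index, e.g. on "url"; without one, repeated announcements are simply inserted again.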
    try:

        my_db['JiangSu'].insert_one({"title": title,
                                     "url": page_url,
                                     "start_time": start_date,
                                     "deadline": deadline,
                                     "issue_time": issue_date,
                                     "project_type": project_type,
                                     "created_time": datetime.datetime.now()})
        conn.close()
        print(title, page_url, start_date, deadline, issue_date, project_type)
        print("插入成功")
        return True
    except:
        print("重复,插入失败")
        conn.close()
        return False


def search_jiangsu_record(key_word=None, deadline=None):
    conn = MongoClient("127.0.0.1:27017", maxPoolSize=None)
    my_db = conn['search_record']
    my_col = my_db['JiangSu']
    search_result = []
    if deadline==None:
        deadline = datetime.datetime.now()
    if key_word!=None:
        key_word = make_key_word(key_word)
        for x in my_col.find({"title": re.compile(key_word), "deadline": {"$gte":deadline}}):
            search_result.append(x)
        conn.close()
    else:
        for x in my_col.find({"deadline": {"$gte":deadline}}):
            search_result.append(x)
        conn.close()
    if len(search_result)==0:
        search_result="No Result"
    return search_result

def make_key_word(str_input):
    str_list = str_input.split(" ")
    print(str_list)
    str_more = ""
    for i in str_list:
        str_more = str_more+"(?=.*?"+i+")"
    str_more = "^"+str_more+".+"
    print(str_more)
    return str_more

def crawler_and_search(jiangsu_key_word, jiangsu_date):
    visit_jiangsu_home(x=10)
    str_more = make_key_word(jiangsu_key_word)
    deadline = is_date(jiangsu_date)
    return search_jiangsu_record(key_word=str_more, deadline=deadline)

def is_date(date):
    try:
        deadline = datetime.datetime.strptime(date, '%Y-%m-%d')
    except:
        deadline = None
    return deadline

if __name__ == "__main__":
    jiangsu_key_word = input("关键字:")
    jiangsu_date = input("日期:")
    print(crawler_and_search(jiangsu_key_word, jiangsu_date))

The nationwide site updates quickly, so a background updater crawls it every fifteen minutes and refreshes the database:

# -*- coding: utf-8 -*-
import requests
import os
from bs4 import BeautifulSoup
import time
import datetime
import sys,json
from pymongo import MongoClient, InsertOne
import re


# create the marker file
def create_mark_file(mark_file_path):
    mark_file = open(mark_file_path, "w")
    mark_file.close()


# delete the marker file
def delete_mark_file(mark_file_path):
    if os.path.exists(mark_file_path):
        os.remove(mark_file_path)
    else:
        pass


def visit_global_home(key_word=None, max_repeat=200):
    # fake the request headers
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh,zh-CN;q=0.9,en-US;q=0.8,en;q=0.7',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Content-Length': '181',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Cookie': 'JSESSIONID=585E887A70EA60F7982104CE88AE9B7E; UM_distinctid=1735f8daa6a299-0f79f479f2fe33-b7a1334-144000-1735f8daa6b447; CNZZdate2114438=cnzz_eid%3D1721879153-1595032889-http%253A%252F%252Fwww.infobidding.com%252F%26ntime%3D1595032889',
        'Host': 'www.infobidding.com',
        'Origin': 'http://www.infobidding.com',
        'Referer': 'http://www.infobidding.com/listAction.do?count=25&type=2&freeflg=&tradeid=',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
    }

    # fake the form data
    form_data = {
        'cdiqu': '-1',
        'pdiqu': '-1',
        'term': None,
        'areaid': '-1',
        'townid': '-1',
        'pageNum': '1',
        '#request.state': '1',
        '#request.tname': None,
        '#request.type': '2',
        '#request.flg': None,
        '#request.tradeid': None,
    }

    form_data['term'] = key_word
    url = 'http://www.infobidding.com/listAction.do?count=25&type=2&freeflg=&tradeid='
    response = requests.post(url, data=form_data, headers=headers)
    # parse the response (string splitting is used later for simple extraction)
    soup = BeautifulSoup(response.content, "html5lib")
    if soup.find_all(colspan="5") != []:
        # notice_list.append("没有查到相关数据!")
        # no results: return without touching the database
        print("无内容")
        return
    else:
        all_repeat_num = 0  # total duplicate count
        max_repeat_num = max_repeat  # maximum allowed duplicates
        all_repeat_num = all_repeat_num + get_name_date_area(soup)  # insert page 1 and add its duplicate count to the total
        #print("完成第1页")
        try:  # if there are more pages, get the total page count and continue from page 2
            page_num_is = \
            str(soup.find_all(style="text-align: center;", colspan="4")[0]).split("</font>/")[1].split("页")[0]
            #print("有多页内容")
            time.sleep(1)
            for i in range(2, int(page_num_is) + 1):
                form_data['pageNum'] = str(i)
                response = requests.post(url, data=form_data, headers=headers)
                # parse the response for this page
                soup = BeautifulSoup(response.content, "html5lib")
                all_repeat_num = all_repeat_num + get_name_date_area(soup)  # add this page's duplicate count to the total
                #print("完成第" + str(i) + "页")
                if all_repeat_num > max_repeat_num:  # too many duplicates: stop
                    return
                time.sleep(1)
        except:
            print("无多页内容")
            return


def get_name_date_area(soup):  # soup holds the full page content
    repeat_num = 0  # duplicate count for this page
    for x in soup.find_all(align="left", height="28px"):
        title = str(x.find_all(width="68%")).split('hand;">')[1].split('</span>')[0].strip()
        url_ads = "http://www.infobidding.com" + str(x.find_all(width="68%")).split('href="')[1].split('" target=')[
            0].strip()
        area = str(x.find_all(width="11%")).split('color="red">')[1].split('</font>')[0].strip()
        date = str(x.find_all(width="10%")).split('color="red">')[1].split('</font>')[0].strip()
        date = datetime.datetime.strptime(date, "%Y-%m-%d")

        insertion = insert_global_data(title, url_ads, area, date)  # each row is inserted as its own atomic operation
        if insertion == False:
            repeat_num += 1
    return repeat_num


# insert a single record
def insert_global_data(title, url_ads, area, date):
    conn = MongoClient("127.0.0.1:27017", maxPoolSize=None)
    my_db = conn['search_record']
    try:
        my_db['Global'].insert_one({"title": title,
                                    "url": url_ads,
                                    "area_city": area,
                                    "issue_date": date,
                                    "created_time": datetime.datetime.now()})
        conn.close()
        #print("插入成功")
        return True  # inserted successfully
    except:
        conn.close()
        #print("重复,插入失败")
        return False  # insert failed


def is_date(notice_list, date=''):
    if notice_list[0] == "没有查到相关数据!":
        lan = "没有查到该日期的数据,请重新输入关键字或日期!\n"
        print(lan)
        return lan
    elif date == '':
        text_content = as_text(notice_list)
        return text_content
    else:
        date_notice = []
        for i in range(len(notice_list)):
            if notice_list[i][3].strip() == date:
                date_notice.append(notice_list[i])
        if len(date_notice) == 0:
            lan = "没有查到该日期的数据,请重新输入关键字或日期!\n"
            print(lan)
            return lan
        text_content = as_text(date_notice)
        return text_content


def as_text(list_text):
    text = ''
    for i in range(len(list_text)):
        text = text + "No." + str(i + 1) + "\n" + "日期:" + str(list_text[i][3]) + "\t" + "地区:" + str(list_text[i][2]) \
               + "\n" + "公告名称:" + str(list_text[i][0]) + "\n" + "链接:" + str(list_text[i][1]) + "\n"
    return text


def search_global_record(key_word=None, issue_date=None):
    conn = MongoClient("127.0.0.1:27017", maxPoolSize=None)
    my_db = conn['search_record']
    my_col = my_db['Global']
    search_result = []
    # build the query; issue_date is stored as a datetime, so filter with $gte rather than a regex
    query = {}
    if key_word is not None:
        query["title"] = re.compile(key_word)
    if issue_date is not None:
        query["issue_date"] = {"$gte": issue_date}
    for x in my_col.find(query):
        print(x)
        search_result.append(x)
    conn.close()
    return search_result


def make_key_word(str_input):
    str_list = str_input.split(" ")
    print(str_list)
    str_more = ""
    for i in str_list:
        str_more = str_more + "(?=.*?" + i + ")"
    str_more = "^" + str_more + ".+"
    print(str_more)
    return str_more

def is_the_time():
    the_time = str(datetime.datetime.now()).split(" ")[1].split(".")[0].split(":")
    # run the background crawl only at :00/:15/:30/:45 between 6 and 22 o'clock
    if (int(the_time[0])>=6 and int(the_time[0])<=22 and (int(the_time[1])==0 or int(the_time[1])==30 or int(the_time[1])==15 or int(the_time[1])==45)):
        return True
    else:
        return False

def backend_crawler():
    mark_file_path = os.getcwd() + "/inserting.txt"
    print("启动后端爬虫更新")
    while(True):
        if is_the_time():
            print(datetime.datetime.now(), "到点爬虫开始")
            if os.path.exists(mark_file_path):
                print("数据库正在手动更新,自动更新暂停,休眠5分钟")
                time.sleep(300)  # sleep for 5 minutes
                while(os.path.exists(mark_file_path)):  # if the manual update is still running after 5 minutes, keep waiting
                    print("数据库正在手动更新,自动更新暂停,再次休眠5分钟")
                    time.sleep(300)  # sleep for another 5 minutes
                # once the manual update has finished, start the automatic update
                create_mark_file(mark_file_path)  # create the marker file
                visit_global_home(max_repeat=200)  # crawl and insert
                delete_mark_file(mark_file_path)  # delete the marker file
                print(datetime.datetime.now(), "数据库自动更新完成")
            else:  # no update in progress: start the automatic update right away
                create_mark_file(mark_file_path)  # create the marker file
                visit_global_home(max_repeat=500)  # crawl and insert
                delete_mark_file(mark_file_path)  # delete the marker file
                # update finished; sleep until the next time slot
                print(datetime.datetime.now(), "数据库自动更新完成")
        else:
            #print(datetime.datetime.now(), "没到点")
            pass
        #check once a minute whether the next scheduled time has arrived
        time.sleep(60)


if __name__=="__main__":
    backend_crawler()
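
The background updater is meant to run as its own long-lived process alongside python manage.py runserver; for example (crawler_global_backend.py is just a placeholder for whatever name this script is saved under):

nohup python crawler_global_backend.py > crawler_backend.log 2>&1 &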

For manual use, where the database is updated first and then queried, the following code is used:

# -*- coding: utf-8 -*-
import requests
import os
from bs4 import BeautifulSoup
import time
import datetime
from pymongo import MongoClient
import re

def search_first_or_later(max_repeat=100, key_word=None, issue_date=None):
    mark_file_path = os.getcwd() + "/inserting.txt"
    key_word = make_key_word(key_word)
    issue_date = is_date(issue_date)  # accept a 'YYYY-MM-DD' string; becomes None if it cannot be parsed
    if os.path.exists(mark_file_path):  # if the marker file exists, the data was refreshed very recently
        print("刚更新过,直接搜索")
        return search_global_record(key_word=key_word, issue_date=issue_date)  # search directly
    else:
        print("先更新再搜索")
        create_mark_file(mark_file_path)
        visit_global_home(max_repeat=max_repeat)  # crawl and insert first
        result = search_global_record(key_word=key_word, issue_date=issue_date)  # then search
        delete_mark_file(mark_file_path)
        return result

def update_or_not(max_repeat=100):
    mark_file_path = os.getcwd() + "/inserting.txt"
    if os.path.exists(mark_file_path):  # if the marker file exists, the data was refreshed very recently
        print("刚更新过")
        pass
    else:
        print("先更新")
        create_mark_file(mark_file_path)
        visit_global_home(max_repeat=max_repeat)  # crawl and insert
        delete_mark_file(mark_file_path)

# create the marker file
def create_mark_file(mark_file_path):
    mark_file = open(mark_file_path, "w")
    mark_file.close()

# delete the marker file
def delete_mark_file(mark_file_path):
    if os.path.exists(mark_file_path):
        os.remove(mark_file_path)
    else:
        pass

def visit_global_home(key_word=None, max_repeat=200):
    #fake the request headers
    headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh,zh-CN;q=0.9,en-US;q=0.8,en;q=0.7',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Content-Length': '181',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cookie': 'JSESSIONID=585E887A70EA60F7982104CE88AE9B7E; UM_distinctid=1735f8daa6a299-0f79f479f2fe33-b7a1334-144000-1735f8daa6b447; CNZZdate2114438=cnzz_eid%3D1721879153-1595032889-http%253A%252F%252Fwww.infobidding.com%252F%26ntime%3D1595032889',
    'Host': 'www.infobidding.com',
    'Origin': 'http://www.infobidding.com',
    'Referer': 'http://www.infobidding.com/listAction.do?count=25&type=2&freeflg=&tradeid=',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
    }
    
    #fake the form data
    form_data = {
    'cdiqu': '-1',
    'pdiqu': '-1',
    'term': None,
    'areaid': '-1',
    'townid': '-1',
    'pageNum': '1',
    '#request.state': '1',
    '#request.tname': None,
    '#request.type': '2',
    '#request.flg': None,
    '#request.tradeid': None,
    }

    form_data['term'] = key_word
    url = 'http://www.infobidding.com/listAction.do?count=25&type=2&freeflg=&tradeid='
    response = requests.post(url, data=form_data, headers=headers)
    # parse the response (string splitting is used later for simple extraction)
    soup = BeautifulSoup(response.content, "html5lib")
    if soup.find_all(colspan="5") != []:
        #notice_list.append("没有查到相关数据!")
        #no results: return without touching the database
        print("无内容")
        return
    else:
        all_repeat_num = 0  # total duplicate count
        max_repeat_num = max_repeat  # maximum allowed duplicates
        all_repeat_num = all_repeat_num + get_name_date_area(soup)  #insert page 1 and add its duplicate count to the total
        print("完成第1页")
        try:  #if there are more pages, get the total page count and continue from page 2
            page_num_is = str(soup.find_all(style="text-align: center;", colspan="4")[0]).split("</font>/")[1].split("页")[0]
            print("有多页内容")
            time.sleep(1)
            for i in range(2, int(page_num_is)+1):
                form_data['pageNum'] = str(i)
                response = requests.post(url, data=form_data, headers=headers)
                # parse the response for this page
                soup = BeautifulSoup(response.content, "html5lib")
                all_repeat_num = all_repeat_num + get_name_date_area(soup)  #add this page's duplicate count to the total
                print("完成第" + str(i) + "页")
                if all_repeat_num > max_repeat_num:  #too many duplicates: stop
                    return
                time.sleep(1)
        except:
            print("无多页内容")
            return



def get_name_date_area(soup):  #soup holds the full page content
    repeat_num = 0  #duplicate count for this page
    for x in soup.find_all(align="left", height="28px"):
        title = str(x.find_all(width="68%")).split('hand;">')[1].split('</span>')[0].strip()
        url_ads = "http://www.infobidding.com"+str(x.find_all(width="68%")).split('href="')[1].split('" target=')[0].strip()
        area = str(x.find_all(width="11%")).split('color="red">')[1].split('</font>')[0].strip()
        date = str(x.find_all(width="10%")).split('color="red">')[1].split('</font>')[0].strip()
        date = datetime.datetime.strptime(date, "%Y-%m-%d")

        insertion = insert_global_data(title, url_ads, area, date)  #each row is inserted as its own atomic operation
        if insertion==False:
            repeat_num += 1
    return repeat_num

# insert a single record
def insert_global_data(title, url_ads, area, date):
    conn = MongoClient("127.0.0.1:27017", maxPoolSize=None)
    my_db = conn['search_record']
    try:
        my_db['Global'].insert_one({"title": title,
                                     "url": url_ads,
                                     "area_city": area,
                                     "issue_date": date,
                                     "created_time": datetime.datetime.now()})
        conn.close()
        print("插入成功")
        return True  #inserted successfully
    except:
        conn.close()
        print("重复,插入失败")
        return False  #insert failed


def is_date(notice_list, date=''):
    if notice_list[0] == "没有查到相关数据!":
        lan = "没有查到该日期的数据,请重新输入关键字或日期!\n"
        print(lan)
        return lan
    elif date == '':
        text_content = as_text(notice_list)
        return text_content
    else:
        date_notice = []
        for i in range(len(notice_list)):
            if notice_list[i][3].strip() == date:
                date_notice.append(notice_list[i])
        if len(date_notice)==0:
            lan = "没有查到该日期的数据,请重新输入关键字或日期!\n"
            print(lan)
            return lan
        text_content = as_text(date_notice)
        return text_content

def as_text(list_text):
    text = ''
    for i in range(len(list_text)):
        text = text+"No."+str(i+1)+"\n"+"日期:"+str(list_text[i][3])+"\t"+"地区:"+str(list_text[i][2])\
               +"\n"+"公告名称:"+str(list_text[i][0])+"\n"+"链接:"+str(list_text[i][1])+"\n"
    return text



def search_global_record(key_word=None, issue_date=None):
    conn = MongoClient("127.0.0.1:27017", maxPoolSize=None)
    my_db = conn['search_record']
    my_col = my_db['Global']
    search_result = []
    # build the query; issue_date is stored as a datetime, so filter with $gte rather than a regex
    query = {}
    if key_word is not None:
        query["title"] = re.compile(key_word)
    if issue_date is not None:
        query["issue_date"] = {"$gte": issue_date}
    for x in my_col.find(query):
        print(x)
        search_result.append(x)
    conn.close()
    return search_result

def make_key_word(str_input):
    str_list = str_input.split(" ")
    print(str_list)
    str_more = ""
    for i in str_list:
        str_more = str_more+"(?=.*?"+i+")"
    str_more = "^"+str_more+".+"
    print(str_more)
    return str_more

def is_date(date):
    try:
        deadline = datetime.datetime.strptime(date, '%Y-%m-%d')
    except:
        deadline = None
    return deadline

if __name__=="__main__":
    search_first_or_later(key_word="江苏 有限公司 高速", issue_date="2020-08-20")

Done.
