Web Crawler Code

bitwind

Published 2018-10-22 23:13:15

Licensed under CC 4.0 BY-SA

Category: python. Tags: python crawler

Article link: https://blog.youkuaiyun.com/bitwind/article/details/83280640

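This code implements a web crawler in Python. It uses the BeautifulSoup library to parse HTML pages, extracts a novel's chapter titles and content, and saves them to a local file. The crawl fetches the page HTML, parses the chapter links, downloads the chapters in order, and shows progress while downloading.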
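
To make the "parse chapter links, order them, report progress" steps concrete, here is a minimal offline sketch. It is not the script from this post: it swaps BeautifulSoup for the standard library's `html.parser` so it runs without third-party packages or network access. The sample HTML, `ChapterLinkParser`, `list_chapters`, and `progress_line` are all hypothetical names made up for illustration, and the real site's markup may differ.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample of a chapter index page; the real site's markup may differ.
SAMPLE_HTML = """
<h1>Sample Novel</h1>
<div id="list">
  <a href="/14_14055/9194142.html">Chapter 3</a>
  <a href="/14_14055/9194140.html">Chapter 1</a>
  <a href="/about.html">About</a>
  <a href="/14_14055/9194141.html">Chapter 2</a>
</div>
"""

# Matches hrefs whose final path segment is all digits followed by ".html".
CHAPTER_HREF = re.compile(r'(?:^|/)(\d+)\.html$')


class ChapterLinkParser(HTMLParser):
    """Collects (chapter_id, href, title) for every chapter-like <a> tag."""

    def __init__(self):
        super().__init__()
        self.chapters = []
        self._current = None  # (chapter_id, href) while inside a matching <a>
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        match = CHAPTER_HREF.search(href)
        if match:
            self._current = (int(match.group(1)), href)
            self._text = []

    def handle_data(self, data):
        # Accumulate link text only while inside a matching <a> tag.
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._current is not None:
            chapter_id, href = self._current
            self.chapters.append((chapter_id, href, ''.join(self._text).strip()))
            self._current = None


def list_chapters(html):
    """Return chapter tuples sorted by numeric chapter id."""
    parser = ChapterLinkParser()
    parser.feed(html)
    return sorted(parser.chapters)


def progress_line(done, total):
    """Format a simple progress message such as 'Downloaded 1/4 (25%)'."""
    return 'Downloaded {}/{} ({:.0%})'.format(done, total, done / total)


chapters = list_chapters(SAMPLE_HTML)
for index, (chapter_id, href, title) in enumerate(chapters, start=1):
    # A real crawler would fetch and save the chapter here before reporting.
    print(progress_line(index, len(chapters)), title)
```

Sorting by the numeric id rather than by the raw href string avoids misordering when ids have different digit counts, which is presumably why the original script converts the href prefix to `int`.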

Crawler code, kept here as a note to myself.
#coding=utf-8
#__author__ = chengzhipeng

import re
import os
import sys
from bs4 import BeautifulSoup
from urllib import request
import ssl
# url = 'http://www.biqiuge.com/book/4772/'
# url = 'https://www.qu.la/book/1/'
url = 'http://www.biquge.com.tw/14_14055/'

def getHtmlCode(url):
    page = request.urlopen(url)
    html = page.read()
    htmlTree = BeautifulSoup(html,'html.parser')
    return htmlTree
    #return htmlTree.prettify()
def getKeyContent(url):
    htmlTree = getHtmlCode(url)

def parserCaption(url):
    htmlTree = getHtmlCode(url)
    storyName = htmlTree.h1.get_text() + '.txt'

    print('Novel file name:',storyName)
    aList = htmlTree.find_all('a',href=re.compile(r'\d+\.html'))  # aList is a list of Tag objects (class Tag); convert each one to str before writing it to a file
    #print(int(aList[1]['href'][0:-5]))
    print(aList)
    aDealList = []
    for line in