Python crawler with a process pool: code explanation and errors encountered

布吃

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/younger_to_older/article/details/96169974

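This article walks through an example that crawls the Maoyan movie board with Python, covering the page request, regular-expression parsing, and multi-process crawling, and fixes some common problems met along the way.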

1. The code

# -*- coding:utf-8 -*-
import json
import re
from multiprocessing import Pool

import requests
from requests.exceptions import RequestException


def get_one_page(url):
    try:
        response = requests.get(url)
        # HTTP status code 200 means the request succeeded,
        # so the page data has been received
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    # pattern = re.compile('<img.*?src="(.*?)".*?alt="(.*?)".*?/>', re.S)
    # The closing </i> and </p> tags are assumed from the page markup; each
    # lazy group needs one after it, otherwise (.*?) would match an empty string.
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # skip the 3-character label before the names
            'date': item[4].strip()[5:],   # skip the 5-character label before the date
            'score': item[5] + item[6]     # integer part + fractional part
        }


def write_to_file(content):
    # The with block closes the file, so no explicit f.close() is needed
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(offset):
    url = "https://maoyan.com/board/4?offset=" + str(offset)
    html = get_one_page(url)
    if html is None:
        # The request failed, so there is nothing to parse
        return
    for item in parse_one_page(html):
        write_to_file(item)
    # print(html)


if __name__ == '__main__':
    # for i in range(10):
    #     main(i*10)
    pool = Pool()
    pool.map(main, [i*10 for i in range(10)])
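
To check what the parsing step produces without touching the network, here is a minimal sketch that runs the same kind of pattern over a hand-written fragment. The fragment, the example.com URL, the film and actor names, and the two Chinese label prefixes are all made up for illustration; they only assume that the real page follows the structure the pattern expects.

```python
import re

# Hypothetical fragment, modelled on what the pattern expects (not real page data)
SAMPLE = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="https://example.com/poster.jpg" alt="" class="board-img" />
  <p class="name"><a href="/films/1" title="Sample Film">Sample Film</a></p>
  <p class="star">
    主演：Actor A,Actor B
  </p>
  <p class="releasetime">上映时间：1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''

PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
    r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)


def parse_sample(html):
    # Same field layout as parse_one_page in the article
    for item in re.findall(PATTERN, html):
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # skip the 3-character label
            'date': item[4].strip()[5:],   # skip the 5-character label
            'score': item[5] + item[6],
        }


rows = list(parse_sample(SAMPLE))
print(rows)
```

Testing the pattern on a small fragment like this makes it easier to tell a regex problem from a request problem when the crawler writes nothing.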

2. Explanation

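pattern = re.compile('......', re.S): the page HTML contains line breaks, so re.S is required. Without it, `.` does not match newline characters, the pattern cannot span lines, and the crawler finds nothing.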
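
The effect of re.S can be seen on a tiny example. This is only a sketch: the three-line HTML string is made up, and the pattern is a shortened version of the one used in the crawler.

```python
import re

# Made-up fragment with line breaks between the tags
html = '<dd>\n<i class="board-index">1</i>\n</dd>'
pattern_text = r'<dd>.*?board-index.*?>(\d+)</i>.*?</dd>'

# Without re.S, "." stops at every newline, so nothing matches
without_flag = re.findall(pattern_text, html)

# With re.S, "." also matches newlines and the pattern can span lines
with_flag = re.findall(pattern_text, html, re.S)

print(without_flag)  # []
print(with_flag)     # ['1']
```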

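The crawled content is written to result.txt in the current working directory (the directory of the .py file when the script is run from there).
json.dumps() converts each item, previously held as a dict, into a string that is then written to the file. By default, however, every non-ASCII character in that string is escaped as a \uXXXX sequence, so the Chinese text in the file is unreadable:
def write_to_file(content):
    with open('result.txt', 'a') as f:
        f.write(json.dumps(content) + '\n')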

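The fix is to open the file with encoding='utf-8' and pass ensure_ascii=False to json.dumps():
def write_to_file(content):
    # The with block closes the file, so no explicit f.close() is needed
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')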
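
The difference between the two calls can be checked in isolation. A minimal sketch, using a made-up sample record rather than real crawl output:

```python
import json

# Hypothetical record, only for illustration
record = {'title': '霸王别姬', 'score': '9.5'}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keep the characters as they are

print(escaped)   # the title appears as \uXXXX escape sequences
print(readable)  # the title appears as readable Chinese text

# Both strings decode back to the same dict; only the on-disk form differs
print(json.loads(escaped) == json.loads(readable))
```

Since both forms load back to the same data, ensure_ascii=False only matters when the file is meant to be read by a person.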

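Start a process pool (multiprocessing.Pool creates worker processes, not threads). pool.map() calls main once for each element of the list and spreads those calls across the worker processes, so several pages are crawled in parallel.
The pool provides a fixed number of worker processes; Pool() with no argument uses the number of CPUs. A task runs immediately if a worker is free, and waits when all workers are busy.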
pool = Pool()
pool.map(main,[i*10 for i in range(10)])
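
The behaviour of pool.map() can be tried without any network access. In this sketch, build_url and page_offsets are hypothetical helpers standing in for main and the offset list, and the pool size of 4 is an arbitrary choice made only for the example.

```python
from multiprocessing import Pool


def build_url(offset):
    # Stand-in for main(): only builds the page URL, no request is sent
    return "https://maoyan.com/board/4?offset=" + str(offset)


def page_offsets(pages=10, page_size=10):
    # Offsets 0, 10, ..., 90: one per page of the board
    return [i * page_size for i in range(pages)]


if __name__ == '__main__':
    # Four workers share ten tasks; a task waits whenever all workers are busy
    with Pool(processes=4) as pool:
        urls = pool.map(build_url, page_offsets())
    # map() returns the results in input order, whichever worker finished first
    for url in urls:
        print(url)
```

The function passed to pool.map() has to be defined at module level so it can be sent to the worker processes, and the `if __name__ == '__main__':` guard keeps the workers from re-running the pool setup on platforms that start processes by importing the script.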