Threads and Processes
When you go for a walk, you can hum a tune, listen to music, and keep walking all at once; each of those actions is a thread, while you as a whole are the process. Multiprocessing is you and a few buddies strolling, listening to music, and so on together...
When we scrape the web, the bottleneck is mostly network speed rather than squeezing every drop of performance out of the CPU, so it is enough to hand out several thread tasks inside a single process. That is why the threading library is used here.
A basic introduction to threading
Think of opening a web page as the counting task below, with a one-second pause between counts. A single thread needs about ten seconds, but two threads can get through 20 counts in those same ten seconds.
import threading
import time

def count_numbers(start, end):
    for i in range(start, end):
        print(f"Thread {threading.current_thread().name}: {i}")
        time.sleep(1)

if __name__ == "__main__":
    # Create two threads
    thread1 = threading.Thread(target=count_numbers, args=(1, 10), name='Thread-1')
    thread2 = threading.Thread(target=count_numbers, args=(1, 10), name='Thread-2')
    # Start both threads
    thread1.start()
    thread2.start()
A simple use of join()
Suppose the CPU gives one process 100 operations per second; with two threads sharing it, each thread gets roughly 50 of those operations. Now suppose a certain task needs thread 1 to do 30 operations and thread 2 to do 70. In code:
import threading

def count_numbers(start, end):
    for i in range(start, end):
        print(f"Thread {threading.current_thread().name}: {i}")

if __name__ == "__main__":
    # Create two threads
    thread1 = threading.Thread(target=count_numbers, args=(1, 31), name='Thread-1')
    thread2 = threading.Thread(target=count_numbers, args=(1, 71), name='Thread-2')  # thread 2 gets the larger counting workload
    # Start both threads
    thread1.start()
    thread2.start()
Clearly thread 1 finishes before thread 2, even though it was holding half of the process's speed while it ran. By calling thread1.join() we make the main thread wait until thread 1 has finished; from that point on thread 1's share is freed and thread 2 has the whole process to itself. The code is as follows.
import threading

def count_numbers(start, end):
    for i in range(start, end):
        print(f"Thread {threading.current_thread().name}: {i}")

if __name__ == "__main__":
    # Create two threads
    thread1 = threading.Thread(target=count_numbers, args=(1, 31), name='Thread-1')
    thread2 = threading.Thread(target=count_numbers, args=(1, 71), name='Thread-2')  # thread 2 gets the larger counting workload
    # Start both threads
    thread1.start()
    thread2.start()
    # Wait here until thread 1 has finished
    thread1.join()
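To put a rough number on the earlier speed-up claim, here is a minimal sketch that reuses the one-second count_numbers task from the first example and joins both threads before reading the clock: two threads that each sleep for about nine seconds still finish together in roughly nine seconds of wall-clock time, not eighteen.

import threading
import time

def count_numbers(start, end):
    # The same task as the first example: one count per second
    for i in range(start, end):
        print(f"Thread {threading.current_thread().name}: {i}")
        time.sleep(1)

if __name__ == "__main__":
    start_time = time.time()
    thread1 = threading.Thread(target=count_numbers, args=(1, 10), name='Thread-1')
    thread2 = threading.Thread(target=count_numbers, args=(1, 10), name='Thread-2')
    thread1.start()
    thread2.start()
    # Wait for both threads, then measure the elapsed wall-clock time
    thread1.join()
    thread2.join()
    print(f"Both threads finished after about {time.time() - start_time:.1f} seconds")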
The queue library
queue is exactly what its name says: a queue. Queue.put() puts elements into the queue one at a time, and Queue.get() takes them out again in the same order. A simple example:
from queue import Queue

line = Queue()
for i in range(6):
    line.put(i)
for i in range(6):
    print(line.get())
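Before the hands-on part, a small sketch of how threading and Queue fit together: a couple of worker threads drain a shared queue, each taking whichever item comes next. get_nowait() together with the Empty exception lets a worker stop cleanly once the queue runs dry.

import threading
from queue import Queue, Empty

def worker(q):
    # Keep pulling items until the queue is empty, then stop
    while True:
        try:
            item = q.get_nowait()
        except Empty:
            break
        print(f"{threading.current_thread().name} handled {item}")

if __name__ == "__main__":
    q = Queue()
    for i in range(6):
        q.put(i)
    workers = [threading.Thread(target=worker, args=(q,), name=f'Worker-{n}') for n in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()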
Hands-on practice: extracting transcript sequences from NCBI with multiple threads
Assuming we already have the gene names, and after looking at the two libraries above, the plan is roughly:
1. Use requests to scrape all the protein accessions for each gene name and store them in a queue.
2. Start one thread per transcript, with no explicit cap, and work through the queue (a single gene never has an absurd number of transcripts, so this generally will not freeze the machine).
The code is as follows:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests
import os
import threading
from queue import Queue

headers = {
    'User-Agent': 'your User-Agent string'
}

# Configure the ChromeService
current_directory = os.getcwd()
chrome_path = 'your path/chromedriver.exe'
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_service = ChromeService(chrome_path)

# Read the gene IDs, one per line
def get_geneid():
    with open(f'{current_directory}/geneid.fasta', 'r') as geneid:
        lines = geneid.readlines()
        return [line.strip() for line in lines]

# Get the protein (NP) accessions for one gene
def get_proid(genename):
    a = requests.get(f'https://www.ncbi.nlm.nih.gov/gene/{genename}', headers=headers).text
    soup = BeautifulSoup(a, 'html.parser')
    allprname = soup.find_all('a')
    proid = []
    for i in allprname:
        pro = str(i.string)
        if pro.startswith('NP'):
            proid.append(pro)
    return proid

# Fetch the FASTA record for one protein and append it to the gene's file
def getfasta(proid, genename):
    driver = webdriver.Chrome(service=chrome_service, options=chrome_options)
    driver.get(f'https://www.ncbi.nlm.nih.gov/protein/{proid}?report=fasta')
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'pre')))
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'html.parser')
    fasta = soup.find_all('pre')
    with open(f'{current_directory}/{genename}.fasta', 'a') as b:
        for m in fasta:
            b.write(m.text)
    driver.quit()  # close this thread's browser instance

geneline = get_geneid()
for i in geneline:
    proid = get_proid(i)
    protein_queue = Queue()
    for j in proid:
        protein_queue.put(j)
    # One thread per protein accession pulled from the queue
    while not protein_queue.empty():
        thread = threading.Thread(target=getfasta, args=(protein_queue.get(), i))
        thread.start()
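One optional refinement: keep the threads started for each gene in a list and join() them before moving on to the next gene, so the loop never races ahead while a FASTA file is still being written. A minimal sketch of just that final loop, assuming the same get_geneid, get_proid, getfasta and protein_queue as above:

geneline = get_geneid()
for i in geneline:
    proid = get_proid(i)
    protein_queue = Queue()
    for j in proid:
        protein_queue.put(j)
    threads = []
    while not protein_queue.empty():
        t = threading.Thread(target=getfasta, args=(protein_queue.get(), i))
        t.start()
        threads.append(t)
    # Wait until every transcript of this gene has been written out
    for t in threads:
        t.join()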
So far I have tested this on about ten genes, and it is already far faster than the single-threaded version; switching from the headed browser used in the previous post to headless mode speeds things up further.