Threads and Processes
When you go for a walk, you can hum a tune, listen to music, and keep walking all at once; each of those actions is a thread, while you as a whole are the process. Multiprocessing is you and a few buddies strolling, listening to music, and so on together...
When we scrape the web, the bottleneck is mostly network speed rather than squeezing every drop of performance out of the CPU, so it is enough to hand out several thread tasks inside a single process. That is why the threading library is used here.
A basic introduction to threading
Think of opening a web page as the counting task below, with a one-second pause between counts. A single thread needs about ten seconds, but two threads can get through 20 counts in those same ten seconds.
import threading
import time

def count_numbers(start, end):
    for i in range(start, end):
        print(f"Thread {threading.current_thread().name}: {i}")
        time.sleep(1)

if __name__ == "__main__":
    # Create two threads
    thread1 = threading.Thread(target=count_numbers, args=(1, 10), name='Thread-1')
    thread2 = threading.Thread(target=count_numbers, args=(1, 10), name='Thread-2')
    # Start both threads
    thread1.start()
    thread2.start()
A simple use of join()
Suppose the CPU gives one process 100 operations per second; with two threads sharing it, each thread gets roughly 50 of those operations. Now suppose a certain task needs thread 1 to do 30 operations and thread 2 to do 70. In code:
import threading

def count_numbers(start, end):
    for i in range(start, end):
        print(f"Thread {threading.current_thread().name}: {i}")

if __name__ == "__main__":
    # Create two threads
    thread1 = threading.Thread(target=count_numbers, args=(1, 31), name='Thread-1')
    thread2 = threading.Thread(target=count_numbers, args=(1, 71), name='Thread-2')  # thread 2 gets the larger counting workload
    # Start both threads
    thread1.start()
    thread2.start()
Clearly thread 1 finishes before thread 2, even though it was holding half of the process's speed while it ran. By calling thread1.join() we make the main thread wait until thread 1 has finished; from that point on thread 1's share is freed and thread 2 has the whole process to itself. The code is as follows.
import threading

def count_numbers(start, end):
    for i in range(start, end):
        print(f"Thread {threading.current_thread().name}: {i}")

if __name__ == "__main__":
    # Create two threads
    thread1 = threading.Thread(target=count_numbers, args=(1, 31), name='Thread-1')
    thread2 = threading.Thread(target=count_numbers, args=(1, 71), name='Thread-2')  # thread 2 gets the larger counting workload
    # Start both threads
    thread1.start()
    thread2.start()
    # Wait here until thread 1 has finished
    thread1.join()
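To put a rough number on the earlier speed-up claim, here is a minimal sketch that reuses the one-second count_numbers task from the first example and joins both threads before reading the clock: two threads that each sleep for about nine seconds still finish together in roughly nine seconds of wall-clock time, not eighteen.

import threading
import time

def count_numbers(start, end):
    # The same task as the first example: one count per second
    for i in range(start, end):
        print(f"Thread {threading.current_thread().name}: {i}")
        time.sleep(1)

if __name__ == "__main__":
    start_time = time.time()
    thread1 = threading.Thread(target=count_numbers, args=(1, 10), name='Thread-1')
    thread2 = threading.Thread(target=count_numbers, args=(1, 10), name='Thread-2')
    thread1.start()
    thread2.start()
    # Wait for both threads, then measure the elapsed wall-clock time
    thread1.join()
    thread2.join()
    print(f"Both threads finished after about {time.time() - start_time:.1f} seconds")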
The queue library
queue is exactly what its name says: a queue. Queue.put() puts elements into the queue one at a time, and Queue.get() takes them out again in the same order. A simple example:
from queue import Queue

line = Queue()
for i in range(6):
    line.put(i)
for i in range(6):
    print(line.get())
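Before the hands-on part, a small sketch of how threading and Queue fit together: a couple of worker threads drain a shared queue, each taking whichever item comes next. get_nowait() together with the Empty exception lets a worker stop cleanly once the queue runs dry.

import threading
from queue import Queue, Empty

def worker(q):
    # Keep pulling items until the queue is empty, then stop
    while True:
        try:
            item = q.get_nowait()
        except Empty:
            break
        print(f"{threading.current_thread().name} handled {item}")

if __name__ == "__main__":
    q = Queue()
    for i in range(6):
        q.put(i)
    workers = [threading.Thread(target=worker, args=(q,), name=f'Worker-{n}') for n in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()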
Hands-on practice: extracting transcript sequences from NCBI with multiple threads
Assuming we already have the gene names, and after looking at the two libraries above, the plan is roughly:
1. Use requests to scrape all the protein accessions for each gene name and store them in a queue.
2. Start one thread per transcript, with no explicit cap, and work through the queue (a single gene never has an absurd number of transcripts, so this generally will not freeze the machine).
The code is as follows:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests
import os
import threading
from queue import Queue

headers = {
    'User-Agent': 'your User-Agent string'
}

# Configure the ChromeService
current_directory = os.getcwd()
chrome_path = 'your path/chromedriver.exe'
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_service = ChromeService(chrome_path)

# Read the gene IDs, one per line
def get_geneid():
    with open(f'{current_directory}/geneid.fasta', 'r') as geneid:
        lines = geneid.readlines()
        return [line.strip() for line in lines]

# Get the protein (NP) accessions for one gene
def get_proid(genename):
    a = requests.get(f'https://www.ncbi.nlm.nih.gov/gene/{genename}', headers=headers).text
    soup = BeautifulSoup(a, 'html.parser')
    allprname = soup.find_all('a')
    proid = []
    for i in allprname:
        pro = str(i.string)
        if pro.startswith('NP'):
            proid.append(pro)
    return proid

# Fetch the FASTA record for one protein and append it to the gene's file
def getfasta(proid, genename):
    driver = webdriver.Chrome(service=chrome_service, options=chrome_options)
    driver.get(f'https://www.ncbi.nlm.nih.gov/protein/{proid}?report=fasta')
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'pre')))
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'html.parser')
    fasta = soup.find_all('pre')
    with open(f'{current_directory}/{genename}.fasta', 'a') as b:
        for m in fasta:
            b.write(m.text)
    driver.quit()  # close this thread's browser instance

geneline = get_geneid()
for i in geneline:
    proid = get_proid(i)
    protein_queue = Queue()
    for j in proid:
        protein_queue.put(j)
    # One thread per protein accession pulled from the queue
    while not protein_queue.empty():
        thread = threading.Thread(target=getfasta, args=(protein_queue.get(), i))
        thread.start()
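One optional refinement: keep the threads started for each gene in a list and join() them before moving on to the next gene, so the loop never races ahead while a FASTA file is still being written. A minimal sketch of just that final loop, assuming the same get_geneid, get_proid, getfasta and protein_queue as above:

geneline = get_geneid()
for i in geneline:
    proid = get_proid(i)
    protein_queue = Queue()
    for j in proid:
        protein_queue.put(j)
    threads = []
    while not protein_queue.empty():
        t = threading.Thread(target=getfasta, args=(protein_queue.get(), i))
        t.start()
        threads.append(t)
    # Wait until every transcript of this gene has been written out
    for t in threads:
        t.join()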
So far I have tested this on about ten genes, and it is already far faster than the single-threaded version; switching from the headed browser used in the previous post to headless mode speeds things up further.