To download CNKI (知网) journal articles automatically, first sign in to your CNKI account in the Chrome browser. The script then types a search term into the search box and clicks the search button. Mind your network speed: set the pause durations (in seconds) generously before letting the later steps run.
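One way to make a previous manual sign-in available to the script is to point Chrome at an existing user profile, so the cookies from that login are reused. This is a minimal sketch, not part of the original script; the user-data path shown is a hypothetical example you must replace with your own Chrome profile directory.

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
# Hypothetical profile path -- replace with your own Chrome user-data directory.
# Reusing a real profile lets Selenium inherit the cookies from your manual CNKI login.
options.add_argument(r"--user-data-dir=C:\Users\me\AppData\Local\Google\Chrome\User Data")
options.add_argument("--profile-directory=Default")

service = Service(executable_path="./driver/chromedriver.exe")
driver = webdriver.Chrome(service=service, options=options)
```

Note that Chrome refuses to attach to a profile that is already open in another Chrome window, so close the browser before starting the script.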
app.py
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
from selenium.webdriver import ActionChains

service = Service(executable_path="./driver/chromedriver.exe")
driver = webdriver.Chrome(service=service)

url = "https://www.cnki.net/"
driver.get(url=url)

try:
    # type the search term into the search box and click the search button
    title_input = driver.find_element(By.XPATH, '//input[@id="txt_SearchText"]')
    ActionChains(driver=driver).click(title_input).send_keys("教学设计").perform()
    title_enter = driver.find_element(By.XPATH, '//input[@class="search-btn"]')
    ActionChains(driver).click(title_enter).perform()
    # ActionChains(driver).scroll_by_amount(0, 200).perform()
    time.sleep(5)  # wait for the asynchronously loaded result list
    tag_as = driver.find_elements(By.XPATH, '//a[@class="fz14"]')
    for index, tag_a in enumerate(tag_as):
        if index == 0:
            tag_url = tag_a.get_attribute("href")
            driver.get(tag_url)
            time.sleep(5)  # wait for the article page to load
            pdf = driver.find_element(By.XPATH, '//a[@id="pdfDown"]')
            ActionChains(driver).click(pdf).perform()
            time.sleep(10)  # give the PDF download time to finish
            # page_footer = driver.find_element(By.XPATH, '//div[@class="footer"]')
            # ActionChains(driver).scroll_to_element(page_footer).perform()
except Exception as e:
    print(e)

time.sleep(3)
driver.quit()
After the browser jumps to the search-results page, pause for about 5 seconds until the page finishes loading; only then can XPath find the relevant DOM elements, because the result list is loaded asynchronously into its container.
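Instead of a fixed time.sleep, Selenium also offers explicit waits (WebDriverWait with expected_conditions) that poll until an element actually appears, which is both faster and more robust on a slow network. The polling idea can be illustrated with a small pure-Python helper (wait_until is a hypothetical name for this sketch, not a Selenium API):

```python
import time

def wait_until(condition, timeout=5.0, poll=0.1):
    """Poll `condition` until it returns a truthy value or `timeout` elapses.

    Mirrors the idea behind Selenium's WebDriverWait: rather than sleeping
    a fixed number of seconds, check repeatedly and return as soon as the
    asynchronously loaded content is ready.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition not met within %.1f s" % timeout)

# Simulate a result list that only becomes non-empty after a short delay.
links = []
start = time.monotonic()

def list_loaded():
    if not links and time.monotonic() - start > 0.3:
        links.extend(["paper1", "paper2"])
    return links

found = wait_until(list_loaded, timeout=2.0)  # returns as soon as links appear
```

With real Selenium the equivalent would be WebDriverWait(driver, 5) combined with expected_conditions.presence_of_all_elements_located.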
After fetching the list of article links with
tag_as = driver.find_elements(By.XPATH, '//a[@class="fz14"]')
you can loop over that list of a-tag DOM elements and visit every article link.
for index, tag_a in enumerate(tag_as):
    if index == 0:
        tag_url = tag_a.get_attribute("href")
        driver.get(tag_url)
        time.sleep(5)
        pdf = driver.find_element(By.XPATH, '//a[@id="pdfDown"]')
        ActionChains(driver).click(pdf).perform()
        time.sleep(10)
The for loop reads each a tag's href attribute, opens the article page with driver.get(), pauses 5 seconds for the page to load, and then clicks the PDF button to download the article's PDF automatically. Note that this example only handles the first result (index == 0); to download every result, collect all the href values into a plain list before calling driver.get(), because navigating away from the results page invalidates the previously found elements.
By default, Chrome saves the files to its download directory, \Documents\Downloads.
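If you prefer the PDFs to land somewhere else, Chrome's download path can be overridden through preferences before the driver starts. A sketch, assuming a hypothetical target folder D:\cnki_pdfs:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
# Hypothetical target folder -- change to wherever you want the PDFs saved.
options.add_experimental_option("prefs", {
    "download.default_directory": r"D:\cnki_pdfs",
    "download.prompt_for_download": False,  # save directly, without a dialog
})

service = Service(executable_path="./driver/chromedriver.exe")
driver = webdriver.Chrome(service=service, options=options)
```

Pass these options when constructing the driver at the top of app.py and the rest of the script is unchanged.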