Zhihu Crawler (Based on Selenium)

License: CC 4.0 BY-SA

Original article: http://www.cnblogs.com/sniper-huohuohuo/p/8678050.html

Tags: #python #crawler #database

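This article walks through a Zhihu crawler built with Selenium, covering the whole process from logging in to scraping the data and saving it to a text file.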

Today's post is about a Zhihu crawler that uses Selenium to scrape the data.

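Approach: open the site and go to the login screen -> choose QR-code login -> click "发现" (Discover) -> type the search query into the search box and press Enter -> scroll down to the very bottom of the page -> collect all the results and write them to a txt file.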
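
The "scroll down to the very bottom" step can also be sketched as a loop that keeps scrolling until the page height stops growing. This is only a minimal sketch, not the author's method (the full listing presses the Down key instead). The names `scroll_to_bottom`, `pause`, and `max_rounds` are hypothetical, and it assumes the page loads more results as you scroll.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the page.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give lazily loaded results time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so this is the bottom
        last_height = new_height
    return rounds
```

With a Selenium driver this would be called as `scroll_to_bottom(browser)`. The `max_rounds` cap is there so that a feed that never ends cannot keep the loop running forever.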

Full code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import time

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

url = 'https://www.zhihu.com/'
browser = webdriver.Chrome()
browser.get(url)

# Open the login form: wait until the login entry is clickable, then click it.
# The selectors in this script match the Zhihu page at the time of writing
# and may need updating.
a = WebDriverWait(browser, 15, 0.6).until(
    EC.element_to_be_clickable((By.XPATH, '//span[@data-reactid="94"]')))
ActionChains(browser).click(a).perform()

# Switch to QR-code login, then leave 15 seconds to scan the code by hand.
b = WebDriverWait(browser, 15, 0.7).until(
    EC.element_to_be_clickable((By.XPATH, '//button[@class="Button Button--plain"]')))
ActionChains(browser).click(b).perform()
time.sleep(15)

# After logging in, open the "发现" (Discover) page.
c = WebDriverWait(browser, 17, 0.7).until(
    EC.element_to_be_clickable((By.LINK_TEXT, '发现')))
ActionChains(browser).click(c).perform()

# Click the search box, type the query, and press Enter.
d = WebDriverWait(browser, 15, 0.5).until(
    EC.element_to_be_clickable((By.XPATH, '//input[@placeholder="搜索你感兴趣的内容..."]')))
ActionChains(browser).click(d).perform()
e = input('Enter the search query: ')
ActionChains(browser).send_keys(e).perform()
ActionChains(browser).send_keys(Keys.ENTER).perform()

# Scroll toward the bottom of the results by pressing the Down key repeatedly.
# Each key press is a separate WebDriver command, so this loop is slow.
for x in range(20000):
    ActionChains(browser).send_keys(Keys.DOWN).perform()

# Collect every result link that carries a highlighted title. Reading the title
# and the href from the same <a> element keeps each pair aligned.
link_list = browser.find_elements(
    By.XPATH, '//div[@class="List-item"]//a[span[@class="Highlight"]]')

info_dict = {}
for link in link_list:
    title = link.find_element(By.XPATH, './span[@class="Highlight"]').text
    info_dict[title] = link.get_attribute('href')

print(info_dict)

# Write the results to a text file, one "title---...---URL" line per result.
# The with-block closes the file automatically.
with open('/home/xxxxxxxx/桌面/知乎爬虫/%s.txt' % e, 'w', encoding='utf-8') as text_document:
    for ka, va in info_dict.items():
        text_document.write('%s-------------------------%s\n' % (ka, va))

browser.quit()
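
The last step, writing the titles and links to a text file, can be made a little more robust. The script builds the file name directly from the search query, so a query containing `/` or `?` could produce an invalid path. The sketch below is one possible way to handle that, not part of the original script. `safe_filename` and `save_results` are hypothetical helpers, and the default file name `results` is an assumption.

```python
import os
import re


def safe_filename(query):
    """Replace characters that are unsafe in file names with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', '_', query).strip('_')
    return cleaned or 'results'  # fall back when nothing usable is left


def save_results(info_dict, query, out_dir='.'):
    """Write one 'title---...---url' line per entry and return the file path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, safe_filename(query) + '.txt')
    # Explicit UTF-8 so Chinese titles are written correctly on any platform.
    with open(path, 'w', encoding='utf-8') as f:
        for title, link in info_dict.items():
            f.write('%s-------------------------%s\n' % (title, link))
    return path
```

In the script above, this would be called as `save_results(info_dict, e, '/home/xxxxxxxx/桌面/知乎爬虫')` in place of the `with open(...)` block.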