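Selenium can simulate logging in to a web page. For scraping you can use PhantomJS (a headless browser), but to make things easier to watch, I show both approaches below:
# driver = webdriver.PhantomJS(executable_path=r'E:\APP\phantomjs-2.1.1-windows\bin\phantomjs.exe')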
driver=webdriver.Firefox()
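The commented-out line logs in with the headless browser, and the line below it logs in with Firefox. The difference between the two is that the headless browser needs the path to its executable spelled out, but Firefox makes the scraping process easier to watch.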
loginurl='http://t.people.com.cn/indexV3.action'
driver.get(loginurl)
driver.find_element_by_xpath('//*[@id="userName"]').send_keys("your account")
driver.find_element_by_xpath('//*[@id="password_text"]').send_keys("your password")
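These lines use relative XPath expressions. (There is a shortcut for finding the XPath of a fixed element (* ̄︶ ̄*)o) Right-click the element in the browser's developer tools, choose Copy, then choose XPath, and you get its absolute path.
Of course, a relative path you still have to write yourself ╭(╯^╰)╮.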
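To show why the hand-written relative path is worth the effort, here is a small sketch that needs no browser. It uses the standard library's `xml.etree.ElementTree`, which only supports a subset of XPath, so the paths are written relative to the root element instead of starting at `/html`. The two tiny pages, `PAGE_V1` and `PAGE_V2`, and the `lookup` helper are made up for illustration and are not the real login page.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: a login form inside the first <div>.
PAGE_V1 = """<html><body>
<div><form><input id="userName"/><input id="password_text"/></form></div>
</body></html>"""

# Same form, but a banner <div> has been inserted in front of it.
PAGE_V2 = """<html><body>
<div>banner</div>
<div><form><input id="userName"/><input id="password_text"/></form></div>
</body></html>"""

# Positional path, similar in spirit to a copied absolute XPath.
POSITIONAL = "./body/div[1]/form/input[1]"
# Attribute-based path, similar in spirit to //*[@id="userName"].
BY_ID = ".//*[@id='userName']"


def lookup(page, path):
    """Return the id of the first element matching path, or None if nothing matches."""
    node = ET.fromstring(page).find(path)
    return None if node is None else node.get("id")


for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
    print(name, "positional:", lookup(page, POSITIONAL), "| by id:", lookup(page, BY_ID))
```

The positional path stops matching as soon as the layout shifts, while the id-based one keeps finding the input. That is the trade-off between the copied absolute path and the relative one.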
Scraping the weibo posts:
driver.find_elements_by_class_name('list_item')
Scrolling to the bottom of the page:
for i in range(4):
    driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
In general, the page can be scrolled down at most three times before nothing more loads.
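Instead of hard-coding the number of scrolls, one option is to stop once the page height no longer grows. Below is a minimal sketch of that idea. `scroll_to_bottom` and its `max_scrolls` and `pause` parameters are names I made up, and the function only assumes that `driver` has an `execute_script` method, as the Selenium driver above does.

```python
import time


def scroll_to_bottom(driver, max_scrolls=4, pause=1.0):
    """Scroll down until the page height stops growing or max_scrolls is reached.

    Returns the number of scrolls actually performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        scrolls += 1
        time.sleep(pause)  # give lazily loaded content a moment to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we are at the real bottom
        last_height = new_height
    return scrolls
```

With the Firefox driver from this post it would be called as `scroll_to_bottom(driver)`. The fixed `pause` is a simplification, and a slow page may need a longer one.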
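Implicit wait, for pages that cannot load instantly. Note that it does not pause the script. It sets the maximum time (in seconds) that element lookups keep retrying before they give up: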
driver.implicitly_wait(50)
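To make the idea behind the wait concrete, here is a small polling sketch in plain Python. It illustrates the concept and is not how Selenium implements it. `wait_for`, `find`, `timeout` and `poll` are hypothetical names, and `find` can be any function that returns something truthy once the element has shown up.

```python
import time


def wait_for(find, timeout=5.0, poll=0.5):
    """Call find() repeatedly until it returns something truthy or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        found = find()
        if found:
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError("nothing found within %.1f seconds" % timeout)
        time.sleep(poll)  # wait a little before looking again
```

The lookup returns as soon as the element appears, so a generous timeout such as 50 seconds only costs time when the element never shows up.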
Clicking through to the next page:
driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[1]/div[6]/div/a[10]').click()
That got me through two pages, emmmm...
The full code is as follows:
#coding:utf-8
import time
from selenium import webdriver
user_agent = 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36'
headers = {'User-Agent': user_agent}
# driver = webdriver.PhantomJS(executable_path=r'E:\APP\phantomjs-2.1.1-windows\bin\phantomjs.exe')
driver=webdriver.Firefox()
driver.maximize_window()
loginurl='http://t.people.com.cn/indexV3.action'
driver.get(loginurl)
driver.find_element_by_xpath('//*[@id="userName"]').send_keys("your account")
driver.find_element_by_xpath('//*[@id="password_text"]').send_keys("your password")
driver.find_element_by_xpath("/html/body/div[3]/div[2]/div[2]/div[2]/form/div[4]/input").click()
driver.find_element_by_xpath('/html/body/div[2]/div/div[3]/a[2]').click()
driver.find_element_by_xpath('/html/body/div[2]/div/a[1]').click()
for k in range(2):  # two pages
    for j in range(4):  # scroll four times per page
        m = driver.find_elements_by_class_name('list_item')
        for i in m:
            print(i.text)
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        driver.implicitly_wait(50)
    # go to the next page
    driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[1]/div[6]/div/a[10]').click()
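One thing the loop above does is print the same posts again on every scroll, because the items that are already loaded stay on the page. Here is a rough sketch of how the loop could collect each post only once. It assumes that two items with identical text are the same post, which may not always hold. `crawl`, `get_texts`, `scroll` and `next_page` are names I made up. The three callables stand in for the Selenium calls, so the logic does not depend on a particular driver.

```python
def crawl(get_texts, scroll, next_page, pages=2, scrolls=4):
    """Collect unique post texts over several pages, in first-seen order."""
    seen = set()
    posts = []

    def collect():
        # Keep only texts that have not been recorded yet.
        for text in get_texts():
            if text not in seen:
                seen.add(text)
                posts.append(text)

    for page in range(pages):
        collect()
        for _ in range(scrolls):
            scroll()
            collect()  # pick up whatever the scroll just loaded
        if page < pages - 1:
            next_page()  # no click needed after the last page
    return posts
```

With the driver from this post, `get_texts` could be `lambda: [e.text for e in driver.find_elements_by_class_name('list_item')]`, `scroll` the `execute_script` call, and `next_page` the click on the next-page link.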