Scraping Taobao Product Information with Selenium


License: CC 4.0 BY-SA

Tags: #selenium #crawler


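This article walks through scraping Taobao product information with Selenium: opening the site, entering a search term, parsing the product data, paging through the results, and finally writing the data to a CSV file.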


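1. Send the request

from selenium import webdriver

driver = webdriver.Chrome()
# On some pages an ad overlay covers the element you located; setting the window size avoids this
# driver.set_window_size(1920, 1080)
driver.implicitly_wait(6)  # wait up to 6 seconds when locating elements
url = 'https://www.taobao.com/'
driver.get(url)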

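2. Enter the product name and run the search

Locate the search box and type the search term; I use iPad as the example here.

Then locate the search button and click it.

# find_element_by_xpath was removed in Selenium 4.3; use find_element(By.XPATH, ...) instead
from selenium.webdriver.common.by import By

driver.find_element(By.XPATH, '//*[@id="q"]').send_keys('iPad')
driver.find_element(By.XPATH, '//*[@id="J_TSearchForm"]/div[1]/button').click()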

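3. Parse out the data to scrape

Use XPath to find the div that wraps all the information for a single product.

Iterate over those divs and extract the fields you need with relative XPath expressions. Here I only extract three fields: the product title, the price, and the shop name. Any other field can be extracted the same way.

Note: the extracted values are strings. If they contain special characters, define a function to clean them.

# Requires: from selenium.webdriver.common.by import By
divs = driver.find_elements(By.XPATH, '//div[contains(@class, "J_MouserOnverReq")]')
for div in divs:
    # The leading "." makes each XPath relative to the current product div
    shop = div.find_element(By.XPATH, './/div[2]/div[3]/div[1]').text
    goods = div.find_element(By.XPATH, './/div[2]/div[2]/a').text.strip()
    price = div.find_element(By.XPATH, './/div[2]/div[1]/div[1]').text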
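
The note above about cleaning the extracted strings can be sketched as follows. This is a minimal sketch, not the author's code: the helper names `clean_text` and `parse_price` are hypothetical, and it assumes prices look like a currency symbol followed by a decimal number.

```python
import re


def clean_text(raw: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()


def parse_price(raw: str):
    """Return the first decimal number in a price string as a float, or None."""
    # Drop thousands separators before searching (assumed format: "¥2,499.00")
    match = re.search(r"\d+(?:\.\d+)?", raw.replace(",", ""))
    return float(match.group()) if match else None
```

Inside the loop you could then write `goods = clean_text(goods)` and `price = parse_price(price)` before saving the row.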

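4. Loop through the following pages

Locate the a tag of the "next page" button. Note that on the last page this a tag cannot be found and Selenium raises NoSuchElementException. Catch that exception and use break to end the loop.

Some readers may see selenium.common.exceptions.StaleElementReferenceException: Message: stale eleme. This usually happens because the page is still loading or re-rendering: the element was located, but the DOM was replaced afterwards, so the stored reference is no longer valid by the time it is used. There are several ways to deal with it:

- Refreshing the page with driver.refresh() works in almost every case, but it is not an option when the page must not be reloaded.
- Adding a fixed wait such as time.sleep(2) helps, but not every time.
- Catching StaleElementReferenceException and then locating the element again is also fairly reliable.

import time
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

while True:
    try:
        next_page = driver.find_element(By.XPATH, '//a[contains(@trace, "srp_bottom_pagedown")]')
    except NoSuchElementException:
        # No "next page" link: this was the last page
        print('Scraping finished!')
        break
    else:
        time.sleep(4)
        print('Finished page {}'.format(driver.find_element(By.XPATH, '//li[@class="item active"]').text))
        next_page.click()
        time.sleep(3)
        driver.refresh()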
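
The third workaround listed above (catch the exception and locate the element again) can be sketched as a small retry helper. This is only an illustration under stated assumptions: `retry` is a hypothetical name, and the exception type is passed in as a parameter so the helper itself does not depend on Selenium.

```python
import time


def retry(action, exceptions, attempts=3, delay=0.5):
    """Call action() until it succeeds or the attempts run out.

    exceptions: the exception class (or tuple of classes) that triggers a retry.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise  # give up and surface the last error
            time.sleep(delay)


# Hypothetical usage: click_next_page is a function you define that locates
# the "next page" link again and clicks it, so every retry uses a fresh element.
# from selenium.common.exceptions import StaleElementReferenceException
# retry(click_next_page, StaleElementReferenceException)
```

The key point is that the element lookup happens inside the callable, so a retry never reuses the stale reference.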

5. The complete code

I write the scraped data straight into a CSV file rather than a database. Writing to a database is not hard, so I have left it out (and not because I was being lazy).

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import csv
import time

driver = webdriver.Chrome()
# driver.set_window_size(1920, 1080)
driver.implicitly_wait(6)
url = 'https://www.taobao.com/'
driver.get(url)

# Type the search term and click the search button
driver.find_element(By.XPATH, '//*[@id="q"]').send_keys('iPad')
driver.find_element(By.XPATH, '//*[@id="J_TSearchForm"]/div[1]/button').click()

# Open the CSV once. 'a' appends, so re-running the script adds another header row; use 'w' to start fresh.
with open('iPad.csv', 'a', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(('product', 'price', 'shop'))

    while True:
        # One div per product on the current results page
        divs = driver.find_elements(By.XPATH, '//div[contains(@class, "J_MouserOnverReq")]')
        for div in divs:
            shop = div.find_element(By.XPATH, './/div[2]/div[3]/div[1]').text
            goods = div.find_element(By.XPATH, './/div[2]/div[2]/a').text.strip()
            price = div.find_element(By.XPATH, './/div[2]/div[1]/div[1]').text
            writer.writerow((goods, price, shop))

        try:
            next_page = driver.find_element(By.XPATH, '//a[contains(@trace, "srp_bottom_pagedown")]')
        except NoSuchElementException:
            # No "next page" link: this was the last page
            print('Scraping finished!')
            break
        else:
            time.sleep(4)
            print('Finished page {}'.format(driver.find_element(By.XPATH, '//li[@class="item active"]').text))
            next_page.click()
            time.sleep(3)
            driver.refresh()

driver.quit()
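
The article says that writing to a database instead of a CSV file is not hard. A minimal sketch of what that might look like follows, assuming SQLite via Python's built-in `sqlite3` module. The table name `products`, the helper `save_rows`, and the sample row are all hypothetical and not part of the original script.

```python
import sqlite3


def save_rows(conn, rows):
    """Insert (goods, price, shop) tuples into a hypothetical `products` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products (goods TEXT, price TEXT, shop TEXT)"
    )
    # Parameterized placeholders keep scraped text from being read as SQL
    conn.executemany("INSERT INTO products VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # pass a file path instead to persist the data
# Placeholder row for illustration only; real rows would come from the scraping loop
save_rows(conn, [("iPad 10.2", "2499.00", "Example Store")])
count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```

In the scraping loop, you could collect the `(goods, price, shop)` tuples for each page in a list and pass that list to `save_rows` once per page.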