How to Fetch 1688 Product Details with a Python Crawler (Code Examples)

Article link: https://blog.youkuaiyun.com/wanbangAPI01/article/details/147424009

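In e-commerce, detailed 1688 product data supports market analysis, product selection and listing, inventory management, and pricing strategy. 1688 is a leading B2B e-commerce platform in China and offers a large catalog of products. With a Python crawler, we can efficiently collect product details from 1688, including the product name, price, images, and description. This article explains how to do that step by step and provides complete code examples.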

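I. Preparation

(1) Install the required libraries

Make sure the following libraries are installed in your development environment:

requests: sends HTTP requests.
BeautifulSoup: parses HTML content.
Selenium: handles dynamically loaded content.

Install them with the following command: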

bash

pip install requests beautifulsoup4 selenium
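
Before moving on, it can help to confirm that the three packages are actually importable. The following is a minimal sketch and not part of the original setup steps: the function name `check_libraries` is made up for illustration, and it only checks that each package can be found, not which version is installed.

```python
import importlib.util

# Import names can differ from pip package names:
# the beautifulsoup4 package is imported as "bs4".
REQUIRED_MODULES = ("requests", "bs4", "selenium")

def check_libraries(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}

if __name__ == "__main__":
    for name, installed in check_libraries().items():
        print(f"{name}: {'installed' if installed else 'missing'}")
```

Any entry reported as missing can be installed by rerunning the pip command above.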

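(2) Download ChromeDriver

To use Selenium, download a ChromeDriver build that matches your browser version and make sure its path is configured correctly. With Selenium 4.6 or later, the bundled Selenium Manager can usually fetch a matching driver automatically, so the manual download is mainly needed for older versions.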
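
If you manage the driver by hand, a quick way to confirm that it is reachable is to look it up on the PATH. This is a small sketch under that assumption: `find_chromedriver` is an illustrative helper name, and the executable is assumed to be called `chromedriver`.

```python
import shutil

def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver executable, or None if it is not on PATH."""
    return shutil.which(name)

if __name__ == "__main__":
    path = find_chromedriver()
    if path:
        print("ChromeDriver found at:", path)
    else:
        print("ChromeDriver is not on PATH; add its folder to PATH first.")
```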

II. Writing the crawler code

(1) Send the HTTP request

Use the requests library to send a GET request and retrieve the HTML of the product page.

Python

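import requests

def get_html(url):
    """Fetch the page at `url` and return its HTML, or None on failure."""
    headers = {
        # Send a browser-like User-Agent instead of the default one from requests
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    try:
        # A timeout keeps the call from hanging on a stalled connection
        response = requests.get(url, headers=headers, timeout=10)
    except requests.RequestException as exc:
        print("Request failed:", exc)
        return None
    if response.status_code == 200:
        return response.text
    print("Failed to retrieve the page, status code:", response.status_code)
    return None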
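
Network requests fail intermittently, so a crawler usually retries a few times before giving up. The sketch below wraps any fetch function that returns the HTML on success and None on failure, as get_html does for non-200 responses. The name `fetch_with_retry` and its parameters (`retries`, `backoff`, `sleep`) are assumptions made for illustration and are not part of requests.

```python
import time
from typing import Callable, Optional

def fetch_with_retry(fetch: Callable[[str], Optional[str]], url: str,
                     retries: int = 3, backoff: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Call fetch(url) up to `retries` times, waiting longer after each failure.

    `fetch` must return the page HTML on success and None on failure.
    `sleep` is injectable so the waiting can be skipped in tests.
    """
    for attempt in range(retries):
        html = fetch(url)
        if html is not None:
            return html
        if attempt < retries - 1:
            # Exponential backoff: backoff, 2*backoff, 4*backoff, ...
            sleep(backoff * (2 ** attempt))
    return None
```

With the function from this section, a call would look like `html = fetch_with_retry(get_html, url)`.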

(2) Parse the HTML

Use BeautifulSoup to parse the HTML and extract the product details.

Python

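from bs4 import BeautifulSoup

def parse_html(html):
    """Extract product fields from the page HTML; a missing field becomes None."""
    soup = BeautifulSoup(html, 'html.parser')
    product_info = {}

    # The tag and class names below may not match the live 1688 page;
    # inspect the page and adjust them to its actual markup.

    # Product name
    name_tag = soup.find('h1', class_='product-title')
    product_info['product_name'] = name_tag.text.strip() if name_tag else None

    # Product price
    price_tag = soup.find('span', class_='price')
    product_info['product_price'] = price_tag.text.strip() if price_tag else None

    # Product description
    desc_tag = soup.find('div', class_='product-description')
    product_info['product_description'] = desc_tag.text.strip() if desc_tag else None

    # Product image URL; .get() avoids a KeyError when src is absent
    image_tag = soup.find('img', class_='main-image')
    product_info['product_image'] = image_tag.get('src') if image_tag else None

    return product_info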
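
The price comes back as display text, which is awkward for the pricing analysis mentioned in the introduction. As a rough sketch, the helper below pulls the numbers out of such a string and returns the lowest and highest value. It assumes the text looks like `¥12.50` or `¥12.50 - ¥18.00`; the real format on 1688 may differ, and `parse_price_range` is a name invented here.

```python
import re
from decimal import Decimal
from typing import Optional, Tuple

def parse_price_range(text: Optional[str]) -> Optional[Tuple[Decimal, Decimal]]:
    """Return (lowest, highest) price found in `text`, or None if there is none."""
    if not text:
        return None
    # Drop thousands separators, then collect every integer or decimal number
    numbers = re.findall(r"\d+(?:\.\d+)?", text.replace(",", ""))
    if not numbers:
        return None
    prices = [Decimal(n) for n in numbers]
    return min(prices), max(prices)
```

Decimal is used instead of float so that prices keep their exact decimal value when compared or summed.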

(3) Put it together

Combine the functions above into a main program to get a complete crawler.

Python

def main():
    # Example offer URL; replace the ID with a real one
    url = "https://detail.1688.com/offer/123456789.html"
    html = get_html(url)
    if html:
        product_info = parse_html(html)
        print("Product name:", product_info['product_name'])
        print("Product price:", product_info['product_price'])
        print("Product description:", product_info['product_description'])
        print("Product image:", product_info['product_image'])

if __name__ == "__main__":
    main()

III. Handling dynamically loaded content

If the product detail page loads its content dynamically, use Selenium to obtain the fully rendered page.

Python

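from selenium import webdriver
import time

def get_html_dynamic(url):
    """Load `url` in headless Chrome and return the rendered HTML."""
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # headless mode: no browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)

        # Fixed wait to give the page's scripts time to render
        time.sleep(3)

        return driver.page_source
    finally:
        # Always close the browser, even if loading raised an exception
        driver.quit()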
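
A fixed `time.sleep(3)` either wastes time or is too short on a slow connection. A common alternative is Selenium's explicit wait, which polls until an element appears. The following is a sketch of that technique rather than the article's method: the function name `get_html_waiting`, the default selector `h1.product-title` (borrowed from the class name used in parse_html), and the 10-second timeout are all assumptions. Selenium is imported inside the function only so the definition can be loaded on a machine without Selenium installed.

```python
def get_html_waiting(url, css_selector="h1.product-title", timeout=10):
    """Load `url` in headless Chrome and return the HTML once `css_selector` is present.

    If the element does not appear within `timeout` seconds, the HTML loaded
    so far is returned anyway so the caller can still inspect it.
    """
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        try:
            # Poll the DOM until the element exists or the timeout expires
            WebDriverWait(driver, timeout).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
            )
        except TimeoutException:
            print("Timed out waiting for", css_selector)
        return driver.page_source
    finally:
        # Close the browser whether or not the wait succeeded
        driver.quit()
```

The selector should point at an element that only exists after the dynamic content has rendered; otherwise the wait returns immediately and offers no benefit over the fixed sleep.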

IV. Complete example code

The complete example, using the dynamic-loading approach to fetch the page, is as follows:

Python

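import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import time

def get_html(url):
    """Fetch the page at `url` with requests and return its HTML, or None on failure."""
    headers = {
        # Send a browser-like User-Agent instead of the default one from requests
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    try:
        # A timeout keeps the call from hanging on a stalled connection
        response = requests.get(url, headers=headers, timeout=10)
    except requests.RequestException as exc:
        print("Request failed:", exc)
        return None
    if response.status_code == 200:
        return response.text
    print("Failed to retrieve the page, status code:", response.status_code)
    return None

def get_html_dynamic(url):
    """Load `url` in headless Chrome and return the rendered HTML."""
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # headless mode: no browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)

        # Fixed wait to give the page's scripts time to render
        time.sleep(3)

        return driver.page_source
    finally:
        # Always close the browser, even if loading raised an exception
        driver.quit()

def parse_html(html):
    """Extract product fields from the page HTML; a missing field becomes None."""
    soup = BeautifulSoup(html, 'html.parser')
    product_info = {}

    # The tag and class names below may not match the live 1688 page;
    # inspect the page and adjust them to its actual markup.

    # Product name
    name_tag = soup.find('h1', class_='product-title')
    product_info['product_name'] = name_tag.text.strip() if name_tag else None

    # Product price
    price_tag = soup.find('span', class_='price')
    product_info['product_price'] = price_tag.text.strip() if price_tag else None

    # Product description
    desc_tag = soup.find('div', class_='product-description')
    product_info['product_description'] = desc_tag.text.strip() if desc_tag else None

    # Product image URL; .get() avoids a KeyError when src is absent
    image_tag = soup.find('img', class_='main-image')
    product_info['product_image'] = image_tag.get('src') if image_tag else None

    return product_info

def main():
    # Example offer URL; replace the ID with a real one
    url = "https://detail.1688.com/offer/123456789.html"
    html = get_html_dynamic(url)  # fetch through Selenium so dynamic content is included
    if html:
        product_info = parse_html(html)
        print("Product name:", product_info['product_name'])
        print("Product price:", product_info['product_price'])
        print("Product description:", product_info['product_description'])
        print("Product image:", product_info['product_image'])

if __name__ == "__main__":
    main()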