[Crawler Example 1] Scraping data with BeautifulSoup under Python 3 and saving it to a txt file

License: CC 4.0 BY-SA

Tags: #crawler #python3-crawler #beautifulsoup

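This article walks through an example of scraping a web page with the BeautifulSoup library under Python 3. It first lists the runtime environment (Python 3.7.0, Windows, PyCharm IDE), then the required libraries (requests and beautifulsoup), then gives the complete crawler code, and finally shows the output. It is only a learning demo, but it is meant to help readers understand the basic principles of a crawler.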

1: Runtime environment:

Python: 3.7.0
OS: Windows
IDE: PyCharm 2017

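2: Libraries to install:

requests and beautifulsoup (published on PyPI as beautifulsoup4 and imported as bs4). The code in section 3 also passes 'lxml' to BeautifulSoup as the parser, so lxml needs to be installed as well.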
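Before running the crawler, it can help to confirm that these libraries are importable. The following is a minimal sketch, not part of the original article: the helper name `missing_packages` and the `REQUIRED` mapping are made up for illustration, and the mapping assumes the usual PyPI names (requests, beautifulsoup4, lxml).

```python
import importlib.util

# Hypothetical mapping: import name -> name to pass to pip.
# Assumes the crawler needs requests, bs4 (beautifulsoup4) and the lxml parser.
REQUIRED = {"requests": "requests", "bs4": "beautifulsoup4", "lxml": "lxml"}


def missing_packages(required=REQUIRED):
    """Return the pip names of the modules in `required` that cannot be found."""
    return [pip_name
            for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All required libraries are available.")
```

`importlib.util.find_spec` only looks a module up without importing it, so the check is cheap and has no side effects.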

3: Complete code:

    # coding:utf-8
    import requests
    from bs4 import BeautifulSoup
    import bs4
    
    
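    def gethtml(url, headers):
        # Fetch the page; return its text on HTTP 200, otherwise None.
        try:
            # The request must sit inside the try block, or network errors escape it.
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                # Set the encoding before reading response.text so it is decoded as UTF-8.
                response.encoding = 'utf-8'
                print('Fetched successfully, page length:', len(response.text))
                return response.text
        except requests.RequestException as e:
            print('Error while fetching:', e)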
    
    def getsoup(html):
        soup = BeautifulSoup(html, 'lxml')
        for tr in soup.find('tbody').children:  # iterate over the child nodes of <tbody>
            if isinstance(tr, bs4.element.Tag):  # keep only tags, skipping whitespace text nodes
                td = tr('td')          # all <td> cells in this row, as a list of tags
                t = [td[0].string, td[1].string, '    ', td[2].string, '   ', td[3].string]   # text of the first four cells