First, construct the request headers and use the requests module to send the request:
import requests

def request_data(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362'
    }
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # The site is served as GBK; ignore any bytes that fail to decode
            return response.content.decode('gbk', 'ignore')
    except requests.RequestException:
        return None
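A quick note on that decode step: `errors='ignore'` silently drops any bytes that are not valid GBK, which is handy when a page contains stray junk bytes. A minimal demo (the sample bytes here are made up for illustration):

```python
# Valid GBK bytes for a prescription name, plus one invalid trailing byte.
raw = '定喘汤'.encode('gbk') + b'\xff'

# errors='ignore' drops the invalid byte instead of raising UnicodeDecodeError.
text = raw.decode('gbk', 'ignore')
print(text)  # 定喘汤
```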
Next, parse the HTML page with bs4:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
Then locate the tags on the page that hold the data we want, and extract it:
def get_item(soup):
    # 'items' instead of 'list' to avoid shadowing the builtin
    items = soup.find(class_='listbox').find_all('li')
    for item in items:
        item_name = item.find('a').string
        if item_name is not None:
            write_item(item_name)
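The selection logic above can be tested offline against a small HTML snippet that mimics the site's listbox structure (the sample markup below is made up, not copied from zhongyoo.com; `html.parser` stands in for lxml so there is no extra dependency):

```python
from bs4 import BeautifulSoup

sample = '''
<div class="listbox">
  <ul>
    <li><a href="/fangji/1.html">定喘汤</a></li>
    <li><a href="/fangji/2.html">二母散</a></li>
    <li><img src="x.png"></li>
  </ul>
</div>
'''

soup = BeautifulSoup(sample, 'html.parser')
names = []
for li in soup.find(class_='listbox').find_all('li'):
    a = li.find('a')
    # mirror the None check in get_item: skip <li> without link text
    if a is not None and a.string is not None:
        names.append(a.string)
print(names)  # ['定喘汤', '二母散']
```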
Then write it out:
def write_item(item):
    print('Writing data =======>' + str(item))
    # 'with' closes the file automatically; no explicit close() needed
    with open('56.txt', 'a', encoding='utf-8') as f:
        f.write(item + '\n')
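Opening in mode 'a' appends one line per call, so repeated calls accumulate results. A quick check of that behavior (the temp-file path below is a stand-in for the author's 56.txt):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'demo_items.txt')
# start clean so the append below is reproducible
if os.path.exists(path):
    os.remove(path)

# append one name per call, as write_item does
for name in ['定喘汤', '二母散']:
    with open(path, 'a', encoding='utf-8') as f:
        f.write(name + '\n')

with open(path, encoding='utf-8') as f:
    print(f.read().splitlines())  # ['定喘汤', '二母散']
```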
def main(page):
    url = 'http://www.zhongyoo.com/fangji/page_' + str(page) + '.html'
    html = request_data(url)
    soup = BeautifulSoup(html, 'lxml')
    get_item(soup)
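To crawl more than one listing page, all that's missing is a small driver loop. A sketch (the page range is an assumption; check the site's actual pagination, and keep a delay between real requests to be polite):

```python
def page_url(page):
    # mirrors the URL pattern used in main()
    return 'http://www.zhongyoo.com/fangji/page_' + str(page) + '.html'

for page in range(1, 4):      # assumed: pages 1-3 exist
    print(page_url(page))      # in the real script: main(page)
    # time.sleep(1)            # uncomment to throttle live requests
```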
And that's a simple little crawler done. Here are the results:

Writing data =======>定喘汤
Writing data =======>射干麻黄汤
Writing data =======>黛蛤散
Writing data =======>二母散
Writing data =======>贝母瓜蒌散
Writing data =======>清燥救肺汤