First, construct the request headers and use the requests module to send the request:
import requests

def request_data(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362'
    }
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # The site is served as GBK; ignore any bytes that fail to decode
            return response.content.decode('gbk', 'ignore')
    except requests.RequestException:
        return None
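A quick note on that decode step: `errors='ignore'` silently drops any bytes that are not valid GBK, which is handy when a page contains stray junk bytes. A minimal demo (the sample bytes here are made up for illustration):

```python
# Valid GBK bytes for a prescription name, plus one invalid trailing byte.
raw = '定喘汤'.encode('gbk') + b'\xff'

# errors='ignore' drops the invalid byte instead of raising UnicodeDecodeError.
text = raw.decode('gbk', 'ignore')
print(text)  # 定喘汤
```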
Next, parse the HTML page with bs4:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
Then locate the tags on the page that hold the data we want, and extract it:
def get_item(soup):
    # 'items' instead of 'list' to avoid shadowing the builtin
    items = soup.find(class_='listbox').find_all('li')
    for item in items:
        item_name = item.find('a').string
        if item_name is not None:
            write_item(item_name)
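The selection logic above can be tested offline against a small HTML snippet that mimics the site's listbox structure (the sample markup below is made up, not copied from zhongyoo.com; `html.parser` stands in for lxml so there is no extra dependency):

```python
from bs4 import BeautifulSoup

sample = '''
<div class="listbox">
  <ul>
    <li><a href="/fangji/1.html">定喘汤</a></li>
    <li><a href="/fangji/2.html">二母散</a></li>
    <li><img src="x.png"></li>
  </ul>
</div>
'''

soup = BeautifulSoup(sample, 'html.parser')
names = []
for li in soup.find(class_='listbox').find_all('li'):
    a = li.find('a')
    # mirror the None check in get_item: skip <li> without link text
    if a is not None and a.string is not None:
        names.append(a.string)
print(names)  # ['定喘汤', '二母散']
```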
Then write it out:
def write_item(item):
    print('Writing data =======>' + str(item))
    # 'with' closes the file automatically; no explicit close() needed
    with open('56.txt', 'a', encoding='utf-8') as f:
        f.write(item + '\n')
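Opening in mode 'a' appends one line per call, so repeated calls accumulate results. A quick check of that behavior (the temp-file path below is a stand-in for the author's 56.txt):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'demo_items.txt')
# start clean so the append below is reproducible
if os.path.exists(path):
    os.remove(path)

# append one name per call, as write_item does
for name in ['定喘汤', '二母散']:
    with open(path, 'a', encoding='utf-8') as f:
        f.write(name + '\n')

with open(path, encoding='utf-8') as f:
    print(f.read().splitlines())  # ['定喘汤', '二母散']
```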
def main(page):
    url = 'http://www.zhongyoo.com/fangji/page_' + str(page) + '.html'
    html = request_data(url)
    soup = BeautifulSoup(html, 'lxml')
    get_item(soup)
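To crawl more than one listing page, all that's missing is a small driver loop. A sketch (the page range is an assumption; check the site's actual pagination, and keep a delay between real requests to be polite):

```python
def page_url(page):
    # mirrors the URL pattern used in main()
    return 'http://www.zhongyoo.com/fangji/page_' + str(page) + '.html'

for page in range(1, 4):      # assumed: pages 1-3 exist
    print(page_url(page))      # in the real script: main(page)
    # time.sleep(1)            # uncomment to throttle live requests
```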
And that's a simple little crawler done. Here are the results:

Writing data =======>定喘汤
Writing data =======>射干麻黄汤
Writing data =======>黛蛤散
Writing data =======>二母散
Writing data =======>贝母瓜蒌散
Writing data =======>清燥救肺汤