Step 1: Fetch the page (import the requests module)
import requests
link='http://tieba.baidu.com/p/5753427007'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
r=requests.get(link,headers=headers)
print(r.text)
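Code explanation: the code above fetches the HTML source of a Tieba thread. The call requests.get(link,headers=headers) retrieves the web page.
- The headers argument of requests makes the request look like it comes from a browser.
- r is the requests Response object; the page's content (its HTML source) can be read from it, for example through r.text.
- To obtain a value for headers, search online for how to find a User-Agent string (for example, by copying it from the request headers shown in your browser's developer tools).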
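As a slightly more defensive variant of the request above, the following is a minimal sketch that wraps it in a function, adds a timeout, and checks the HTTP status before returning the text. The names fetch_html and DEFAULT_HEADERS and the 10-second timeout are assumptions made for illustration; they are not part of the original code.

```python
import requests

# Hypothetical default headers; the User-Agent string is the one used above.
DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36"
    )
}


def fetch_html(link, headers=None, timeout=10):
    """Return the HTML text of `link`, raising on HTTP error status codes."""
    # Fall back to the browser-like headers when the caller passes none.
    response = requests.get(link, headers=headers or DEFAULT_HEADERS, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

Calling fetch_html(link) with the thread URL used above should return the same HTML that r.text holds, while failing loudly if the server answers with an error status instead of silently printing an error page.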
Step 2: Extract the data you need (import the BeautifulSoup module)
import requests
from bs4 import BeautifulSoup
link='http://tieba.baidu.com/p/5753427007'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
r=requests.get(link,headers=headers)
soup=BeautifulSoup(r.text,'lxml')
title=soup.find('h1',class_='core_title_tx