Python Web Crawler Basics

Latest recommended article published on 2020-07-13 15:36:20

License: CC 4.0 BY-SA

Article tags:

#python basics #crawler

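Crawlers:

Targeted crawler: crawls only the domains it has been given
Non-targeted crawler: has no fixed domain and crawls freely by following links, e.g. the crawlers behind Google and Baidu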
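
The difference between the two kinds of crawler comes down to whether discovered links are filtered by domain. Below is a minimal sketch of that filter using only the standard library. The names `in_scope` and `filter_links` and the `example.com` URLs are hypothetical, chosen for illustration; they do not come from any crawler framework.

```python
from urllib.parse import urlparse


def in_scope(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def filter_links(links, allowed_domains=None):
    """Targeted crawl: keep only in-scope links.

    Passing allowed_domains=None models a non-targeted crawl: every link is kept.
    """
    if allowed_domains is None:
        return list(links)
    return [u for u in links if in_scope(u, allowed_domains)]


links = [
    "https://example.com/a",
    "https://blog.example.com/b",
    "https://other.org/c",
]
targeted = filter_links(links, ["example.com"])  # drops other.org
untargeted = filter_links(links)                 # keeps everything
```

A real crawler would apply a filter like this to each batch of links extracted from a page before adding them to its queue.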

Where crawlers are used:

Search engines
Public-opinion monitoring systems
Information gathering

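Basic tools for crawling:

requests:

response = requests.get(url) sends an HTTP GET request for the page
response.text the response body decoded to str using response.encoding
response.content the raw response body as bytes
response.encoding the encoding used to decode response.text, taken from the HTTP headers; it can be reassigned
response.apparent_encoding the encoding detected from the body content itself
response.status_code the status code: 200 = success, 3xx (e.g. 301, 302) = redirect
response.cookies.get_dict() the cookies set by the response, as a dict
requests.get(url, cookies={'x': 'xxx'}) sends cookies with the request (the keyword argument is cookies, not cookie)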
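
The calls above can be combined roughly as follows. This is a sketch rather than a definitive recipe: `fetch_html` and `cookie_header` are hypothetical helper names, the 10-second timeout is an arbitrary choice, and the encoding fallback assumes that a declared encoding of ISO-8859-1 usually means the server sent no charset for a text response.

```python
import requests


def fetch_html(url, cookies=None, timeout=10):
    """Fetch a page and return (status_code, decoded_text)."""
    response = requests.get(url, cookies=cookies, timeout=timeout)
    # When no usable charset was declared, fall back to the encoding
    # that requests detects from the body content.
    if response.encoding is None or response.encoding.lower() == "iso-8859-1":
        response.encoding = response.apparent_encoding
    return response.status_code, response.text


def cookie_header(url, cookies):
    """Build a GET request without sending it, to show how cookies= is encoded."""
    prepared = requests.Request("GET", url, cookies=cookies).prepare()
    return prepared.headers.get("Cookie")


# No network access is needed for this call; the request is only prepared.
header = cookie_header("https://example.com", {"x": "xxx"})
```

Calling `fetch_html("https://example.com")` does perform a real request and returns the status code together with the decoded page text.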

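BeautifulSoup (parses HTML into a tree of objects) or PyQuery (jQuery-style selection and manipulation of HTML in Python)

   BeautifulSoup(html, parser) parses the HTML string with the named parser
   soup.find() returns the first matching Tag, or None if nothing matches
   soup.find_all() returns all matches as a list of Tag objects
   soup.select() takes a CSS selector and returns a list of matching Tags
   obj.text gets the text content of a Tag
   obj.attrs gets the attributes of a Tag as a dict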
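
To make the BeautifulSoup calls concrete, here is a small sketch that parses an inline HTML string, so it runs without any network access. The HTML snippet and variable names are made up for illustration, and it assumes the `beautifulsoup4` package is installed; `"html.parser"` is the parser bundled with Python's standard library.

```python
from bs4 import BeautifulSoup

# Made-up page content for illustration.
HTML = """
<html><body>
  <h1 id="title">Crawler notes</h1>
  <a class="link" href="/page1">First</a>
  <a class="link" href="/page2">Second</a>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

first_link = soup.find("a")          # first <a> Tag only
all_links = soup.find_all("a")       # every <a> Tag
selected = soup.select("a.link")     # CSS selector, returns a list of Tags
title_text = soup.find("h1").text    # text inside the <h1>
hrefs = [a.attrs["href"] for a in selected]  # read one attribute per Tag
missing = soup.find("table")         # no match, so this is None
```

Because `find()` returns None when nothing matches, it is worth checking the result before accessing `.text` or `.attrs` on pages whose structure is not guaranteed.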