A Review of Commands for Web-Scraping Modules

Latest recommended article published on 2025-08-10 21:30:56

weixin_30950607

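License: CC 4.0 BY-SA

Tags: web scraping, python

Original link: http://www.cnblogs.com/sun1994/p/8572644.html

This article shows how to fetch web page content with Python's requests library and how to parse pages with BeautifulSoup to extract the information you need. It covers the basic installation commands, fetching page content, setting the encoding, finding elements, and other practical tips.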

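1. The requests module

    1. pip install requests

    2. import requests
       response = requests.get('http://www.baidu.com/')    # fetch the page at the given URL

    3. response.text    # the response body decoded to text (str)

    4. response.content    # the raw response body as bytes

    5. response.encoding = 'utf-8'    # decode the page content as UTF-8

       response.encoding = response.apparent_encoding    # use whatever encoding is detected from the downloaded page itself

    6. response.cookies    # the cookies returned by the server

       response.cookies.get_dict()    # the cookies as a plain dict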
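
The requests commands above can be combined into one small helper. This is only a minimal sketch: the function names `decode_response` and `fetch_html`, the `timeout` value, and the `raise_for_status()` check are illustrative additions, not part of the original notes.

```python
import requests


def decode_response(response):
    """Return (text, cookies) for an already-fetched response."""
    # Use the encoding detected from the body bytes, as in step 5 above.
    response.encoding = response.apparent_encoding
    return response.text, response.cookies.get_dict()


def fetch_html(url, timeout=10):
    """Fetch a page over HTTP and return its decoded text and cookies.

    The timeout and the status check are assumptions added for safety;
    the original notes call requests.get(url) with no extra arguments.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return decode_response(response)
```

A call such as `text, cookies = fetch_html('http://www.baidu.com/')` would then return the page text and a dict of cookies, assuming the site is reachable.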
2. The BeautifulSoup module

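    1. pip install beautifulsoup4

    2. Convert the text into an object (requires from bs4 import BeautifulSoup)

        1) html.parser is built into Python, so nothing extra needs to be installed

            soup = BeautifulSoup(response.text, features='html.parser')

        2) lxml is a third-party library (pip install lxml), but it performs better (use this one in production)

            soup = BeautifulSoup(response.text, features='lxml')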

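    3. .find() usage: returns a single object (the first match, or None if nothing matches)

        1) Find the div with id="auto-channel-lazyload-article" in the scraped content

            target = soup.find(id="auto-channel-lazyload-article")

        2) Find a div in the scraped content that has the attribute id='i1'

            target = soup.find('div', id='i1')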

    4. .find_all() usage: returns a list of objects

        1) Find all li tags inside the target object obtained above

            li_list = target.find_all('li')

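    5. Get the attributes you want from an object returned by .find() (here a is assumed to be a single a tag, e.g. one returned by li.find('a'))

        a.attrs.get('href')    # the href attribute of this a tag (the link URL)

        a.find('h3').text    # the text of the first h3 tag inside the a tag

        img_url = a.find('img').attrs.get('src')    # the src attribute of the first img tag inside the a tag (the image URL)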
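
The parsing steps above can be sketched end to end as follows. The HTML string is made-up markup standing in for response.text, and the dictionary keys are illustrative; only the id "auto-channel-lazyload-article" and the find / find_all / attrs calls come from the notes. The built-in html.parser is used so the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for response.text; a real page will differ.
html = """
<div id="auto-channel-lazyload-article">
  <ul>
    <li><a href="/news/1.html"><img src="/img/1.jpg"><h3>First headline</h3></a></li>
    <li><a href="/news/2.html"><img src="/img/2.jpg"><h3>Second headline</h3></a></li>
    <li>No link in this item</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, features="html.parser")

# .find() returns the first matching tag.
target = soup.find(id="auto-channel-lazyload-article")

articles = []
# .find_all() returns a list of every matching tag inside target.
for li in target.find_all("li"):
    a = li.find("a")
    if a is None:  # .find() returns None when nothing matches, so skip this li
        continue
    articles.append({
        "url": a.attrs.get("href"),
        "title": a.find("h3").text,
        "img_url": a.find("img").attrs.get("src"),
    })

for article in articles:
    print(article)
```

The `if a is None` check matters in practice: list items without a link would otherwise raise an AttributeError on `a.attrs`.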