Learning web scraping: small snippets, small functions, restated in my own words

Latest recommended article published 2025-04-11 15:47:45

shuyueliang1

Category: Case summaries. Tags: python, web scraping

Article link: https://blog.youkuaiyun.com/shuyueliang1/article/details/86714249

Getting the page source

requests

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}  # It seems this header cannot be casually swapped for a different one

url = 'https://music.163.com/discover/playlist/?cat=欧美&order=hot&limit=35&offset=35'
response = requests.get(url=url, headers=headers)
html = response.text   # the response body (the page source) decoded to a str
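
The URL above carries a non-ASCII query value (cat=欧美). A minimal sketch of building that query string explicitly with the standard library, so the value is percent-encoded before the request is sent. BASE_URL and build_playlist_url are hypothetical names introduced only for this illustration; the parameter names are taken from the URL above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical constant: the part of the URL above that precedes the "?"
BASE_URL = 'https://music.163.com/discover/playlist/'


def build_playlist_url(cat, offset, order='hot', limit=35):
    """Return the playlist URL with its query string percent-encoded (UTF-8)."""
    query = urlencode({'cat': cat, 'order': order, 'limit': limit, 'offset': offset})
    return f'{BASE_URL}?{query}'


url = build_playlist_url('欧美', offset=35)
print(url)

# Decoding the query again recovers the original value.
decoded = parse_qs(urlsplit(url).query)
print(decoded['cat'])
```

With requests, the same dictionary can instead be passed as requests.get(BASE_URL, params={...}, headers=headers), which leaves the encoding to the library.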

Most of the examples I had seen online use requests.

urllib

from urllib.request import urlopen
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')

The book I am currently reading, 《Python网络数据采集》 (Web Scraping with Python), uses the urllib library.
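
Note that urlopen returns a response object rather than a string. A minimal sketch of sending a custom User-Agent with urllib and decoding the body to text, assuming the server accepts a simple User-Agent value. USER_AGENT, build_request, and fetch_html are hypothetical names used only for this illustration.

```python
from urllib.request import Request, urlopen

# Hypothetical User-Agent value; substitute whatever the target site accepts.
USER_AGENT = 'Mozilla/5.0 (example-scraper)'


def build_request(url):
    """Wrap the URL in a Request that carries a User-Agent header."""
    return Request(url, headers={'User-Agent': USER_AGENT})


def fetch_html(url, timeout=10):
    """Download the page and return its source as a str."""
    with urlopen(build_request(url), timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')
```

Calling fetch_html('http://www.pythonscraping.com/pages/warandpeace.html') would then give a string comparable to response.text in the requests version.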

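Extracting information

In Google Chrome, pressing F12 opens the developer tools, where you can inspect the page source directly.

Regular expressions

This is a big topic; I will update this part gradually.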
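
Until that section is filled in, here is a minimal sketch of the idea using the standard re module: pulling the text out of every green span. SAMPLE_HTML is a made-up fragment loosely modeled on the practice page, not its actual source, and the pattern only works when the tag is written exactly this way.

```python
import re

# Made-up HTML fragment used only for this illustration.
SAMPLE_HTML = '''
<p><span class="red">Well, Prince,</span> said
<span class="green">Anna Pavlovna</span> to
<span class="green">Prince Vasili</span>.</p>
'''

# Non-greedy group: capture everything up to the nearest closing </span>.
GREEN_SPAN = re.compile(r'<span class="green">(.*?)</span>', re.DOTALL)

names = GREEN_SPAN.findall(SAMPLE_HTML)
print(names)
```

A pattern like this breaks as soon as the attribute order or quoting changes, which is one reason an HTML parser is usually the more robust choice.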

BeautifulSoup

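Web pages are marked up with tags and styled with CSS so that the browser can present information in a more readable form. As a side effect, related pieces of content tend to share the same tags and attributes (such as class), which is very convenient for scrapers. BeautifulSoup builds on this to locate and extract information. See the official BeautifulSoup documentation for details.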

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html, "html.parser")

find_all()

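find_all(name, attrs, recursive, text, **kwargs)  # attrs = attributes; newer bs4 releases name the text argument string
# findAll(tag, attributes, recursive, text, limit, keywords): the spelling used in earlier versions
# find(tag, attributes, recursive, text, keywords)
You can locate the information you want directly by tag name and attributes.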

nameList = bs.findAll('span', {'class': 'green'})    # class_='green' means the same as {'class': 'green'}
#print(nameList)
for name in nameList:
    print(name)
    print(name.get_text())

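Things to note:
1. class is a reserved keyword in Python, so it cannot be used as a keyword argument; bs4 accepts class_ instead. There is no id_ because id is not a reserved keyword: it can be passed directly as id='text', or through the dictionary form, which works for any attribute.
2. find_all() returns a list of matching elements, so use a for loop to work on each one.
3. The difference between find and find_all is that find returns a single element: the first match, or None if nothing matches.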

soup.find_all('div', {'id':'text'})
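
The notes above can be checked on a small inline document. This is a minimal sketch, assuming bs4 is installed; SAMPLE_HTML is a made-up fragment, not the practice page itself.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment used only for this illustration.
SAMPLE_HTML = '''
<div id="text">
  <span class="green">Anna Pavlovna</span>
  <span class="green">Prince Vasili</span>
  <span class="red">Well, Prince</span>
</div>
'''

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Two equivalent ways to filter on the class attribute.
by_dict = soup.find_all('span', {'class': 'green'})
by_keyword = soup.find_all('span', class_='green')

# id is not a reserved keyword, so both forms work here as well.
div_by_keyword = soup.find_all('div', id='text')
div_by_dict = soup.find_all('div', {'id': 'text'})

# find returns the first match, or None when nothing matches.
first_green = soup.find('span', class_='green')
missing = soup.find('span', class_='blue')

for span in by_keyword:
    print(span.get_text())
```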

get_text()

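get_text() throws away the hyperlinks, tags, and other markup and keeps only the text.
Note: the innermost tags usually look like <span>the text goes here</span> or
<div>the text goes here</div>; a div tag can also contain other nested elements.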
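
A minimal sketch of what get_text() does on nested tags, assuming bs4 is installed. SAMPLE_HTML and the example.com link are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment: a div that nests a link, which in turn nests a bold tag.
SAMPLE_HTML = '<div>Read <a href="http://example.com">the <b>full</b> story</a> here</div>'

div = BeautifulSoup(SAMPLE_HTML, 'html.parser').div

# All nested tags are dropped; only the text pieces are concatenated.
plain = div.get_text()
print(plain)

# Optional arguments: join the pieces with a separator and strip whitespace.
pieces = div.get_text('|', strip=True)
print(pieces)
```

The href value is not part of the text; to keep a link target, read the attribute from the tag itself, for example div.a['href'].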

Example: find_all combined with get_text()

from bs4 import BeautifulSoup  # Beautiful Soup is a Python library whose main job is extracting data from web pages
import requests
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
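}  # It seems this header cannot be casually swapped for a different one
# time.sleep(2)  # pause for 2 seconds to avoid being flagged as a bot
url = 'http://www.pythonscraping.com/pages/warandpeace.html'
response = requests.get(url=url, headers=headers)
html = response.text   # the response body (the page source) decoded to a str
soup = BeautifulSoup(html, 'html.parser')  # parse the text into a tree that mirrors the structure of the page source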

Scraping the full text

body_ = soup.find_all('body')
for content in body_:
    print(content.get_text())

Scraping only the part you want

greenfont = soup.find_all('span', {'class':'green'})
for content in greenfont:
    print(content.get_text())