Python Web Scraping Learning Path, Part 1

Python Web Scraping Basics and Practice: From requests to BeautifulSoup

Latest recommended article published on 2025-12-05 17:02:52


License: CC 4.0 BY-SA


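This article describes the author's first day of learning Python web scraping: sending GET and POST requests with requests, setting the User-Agent header, and parsing page content with regular expressions and BeautifulSoup. It walks through scraping Sogou and Baidu Translate, and ends with a simple example that scrapes Dangdang's five-star book ranking, giving beginners a basic learning path.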

Preface:

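Today is my first day of learning web scraping. Last weekend I played in a CTF competition, and one challenge required a crawler for brute-forcing, so I decided to learn how to write one.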


Using the requests package

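Basic crawler programming in four steps:
(1). Build the URL.
(2). Send the request with the get or post method, which returns a response object.
(3). Extract the data from the response object.
(4). Save the content using ordinary file I/O.
Below is the get method: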

import requests
# Build the URL
url = 'https://www.sogou.com/'
# Send the request with the get method, which returns a response object
response = requests.get(url=url)
# Get the data
page_text = response.text
print(page_text)
# Persistent storage
with open('./sogou.html', 'w', encoding='UTF-8') as fp:
    fp.write(page_text)

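Requests usually need UA spoofing: add a User-Agent header to the request, normally one copied from a real browser.
When sending data with post, put the parameters in a dictionary; the parameter names are usually found by capturing the request the page sends.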

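import json
import requests
# Define the URL.
# Assumption: this is the suggestion endpoint that accepts the 'kw' form field and returns JSON.
# The page URL (https://fanyi.baidu.com/?aldtype=16047#en/zh/) serves HTML, so response.json() would fail on it.
url = 'https://fanyi.baidu.com/sug'
# UA spoofing
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0'
}
# Read the query and put the parameters in a dictionary
scan = input('Enter the word to look up: ')
pa = {
    'kw': scan  # pass the variable itself, not the string 'scan'
}
# Send the post request to the backend and get the response object
response = requests.post(url=url, data=pa, headers=headers)
# Get the response data (use response.text instead when the response is HTML)
dic_obj = response.json()
# Persistent storage; the with block closes the file automatically
with open('./dog.json', 'w', encoding='utf-8') as fp:
    json.dump(dic_obj, fp=fp, ensure_ascii=False)
print('over')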
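
To see what the parameter dictionary turns into, the sketch below form-encodes it by hand. When a dict is passed as data, requests sends it as a form-encoded body; here the standard library's urlencode stands in for that step, so nothing is sent over the network. The helper name build_form_body is hypothetical and used only for illustration.

```python
from urllib.parse import urlencode


def build_form_body(keyword):
    """Form-encode the single 'kw' field the way an HTML form would."""
    payload = {'kw': keyword}  # same shape as the pa dictionary above
    return urlencode(payload)


body_ascii = build_form_body('dog')
body_chinese = build_form_body('你好')  # non-ASCII text is percent-encoded as UTF-8

print(body_ascii)
print(body_chinese)
```

This also shows why the dictionary value must be the variable holding the user's input: a quoted literal would always be encoded as the same fixed string.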

Using regular expressions

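Python regular expression reference: https://www.runoob.com/python/python-reg-expressions.html
Requires import re.
re.match(pattern, string) only matches at the beginning of the string.
.*? lazily matches any characters except the newline \n; pass the re.S flag as the last argument to make . match newlines as well.
re.search returns the first match found anywhere in the string.
re.findall returns all matches as a list.
re.compile wraps a pattern into a reusable pattern object.
Further reading: https://mp.weixin.qq.com/s/t4hXKK-pjA8rIVmJuiyQcw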
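
A minimal sketch of how these functions differ, run on a made-up two-line string (the sample text and variable names are assumptions for illustration only):

```python
import re

text = 'id=7 name=alice\nid=42 name=bob'

# re.match succeeds only if the pattern matches at the very start of the string.
first_id = re.match(r'id=(\d+)', text).group(1)
match_in_middle = re.match(r'name=', text)  # 'name=' is not at position 0

# re.search scans the whole string and returns the first match.
first_name = re.search(r'name=(\w+)', text).group(1)

# re.findall returns every match (here, the captured group) as a list.
all_ids = re.findall(r'id=(\d+)', text)

# re.compile builds a reusable pattern object with the same methods.
name_re = re.compile(r'name=(\w+)')
all_names = name_re.findall(text)

# By default '.' does not match a newline; re.S lets it cross line breaks.
without_flag = re.search(r'alice.*?id', text)
with_flag = re.search(r'alice.*?id', text, re.S)

print(first_id, first_name, all_ids, all_names)
print(match_in_middle, without_flag, with_flag.group(0) == 'alice\nid')
```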

Using BeautifulSoup

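BeautifulSoup handles many cases that are awkward to solve with regular expressions.
lxml is usually used as the parser:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'lxml')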

Get the title text:

print(soup.title.string)

Get the text of the first p tag:

print(soup.p.string)

Get hyperlinks (soup.a is the first a tag, find_all('a') returns all of them):

print(soup.a)
print(soup.find_all('a'))

Get the element with id=link2:

print(soup.find(id='link2'))

Get all the text (get_text is a method, so it must be called):

print(soup.get_text())

select syntax (CSS selectors):

print(soup.select("title"))
data = soup.select('body > div.container.logo-search > div.row > div.col.logo > h1 > a')
print(data)
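
The calls above can be tried without any network access on a small inline document. This is a sketch under assumptions: the HTML string is made up, and the standard library's 'html.parser' is swapped in for lxml so that only bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, small enough to check every result by eye.
html = (
    '<html><head><title>Demo page</title></head><body>'
    '<p class="intro">Hello</p>'
    '<a id="link1" href="/a">A</a>'
    '<a id="link2" href="/b">B</a>'
    '</body></html>'
)

soup = BeautifulSoup(html, 'html.parser')

title = soup.title.string                      # text of the <title> tag
first_p = soup.p.string                        # text of the first <p> tag
hrefs = [a['href'] for a in soup.find_all('a')]  # href of every <a> tag
link2_href = soup.find(id='link2')['href']     # look an element up by id
selected = soup.select('body > a#link2')       # CSS selector, returns a list

print(title, first_p, hrefs, link2_href)
print(selected[0].get_text())
```

select always returns a list, even when only one element matches, so index into it (or use select_one) before reading attributes or text.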

Hands-on: scraping the Dangdang five-star book ranking

import json
import re

import requests

def main(page):
    url = "http://bang.dangdang.com/books/fivestars/01.00.00.00.00.00-recent30-0-0-1-" + str(page)
    # Fetch the page
    html = request_dandan(url)
    if html is None:
        return  # the request failed, skip this page
    # Parse out the fields we want and save each record
    for item in parse_result(html):
        write_item_to_file(item)

def request_dandan(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
    except requests.RequestException:
        pass
    return None

def parse_result(html):
    # Raw string so that \d and \s reach the regex engine unchanged
    pattern = re.compile(r'<li>.*?list_num.*?(\d+).</div>.*?<img src="(.*?)".*?class="name".*?title="(.*?)">.*?class="star">.*?class="tuijian">(.*?)</span>.*?class="publisher_info">.*?target="_blank">(.*?)</a>.*?class="biaosheng">.*?<span>(.*?)</span></div>.*?<p><span\sclass="price_n">&yen;(.*?)</span>.*?</li>', re.S)
    items = re.findall(pattern, html)
    # Each item is a tuple of the seven captured groups; label them
    for item in items:
        yield {
            'range': item[0],
            'image': item[1],
            'title': item[2],
            'recommend': item[3],
            'author': item[4],
            'times': item[5],
            'price': item[6]
        }

def write_item_to_file(item):
    print('Writing record =====>> ' + str(item))
    # The with block closes the file; '\n' (not '/n') ends the line
    with open('./book.txt', 'a', encoding='UTF-8') as fp:
        fp.write(json.dumps(item, ensure_ascii=False) + '\n')

if __name__ == "__main__":
    for i in range(1, 26):
        main(i)
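
The step of turning each re.findall tuple into a labelled record can be tried offline. The sketch below uses a heavily simplified, made-up HTML fragment and a shortened pattern with three fields; the real ranking page has more markup, so treat the sample and the names FIELDS, PATTERN and parse_records as assumptions for illustration.

```python
import json
import re

# Hypothetical, simplified markup standing in for the real ranking page.
SAMPLE_HTML = '''
<li><div class="list_num">1.</div><a class="name" title="Book A"></a><span class="price_n">&yen;10.00</span></li>
<li><div class="list_num">2.</div><a class="name" title="Book B"></a><span class="price_n">&yen;25.50</span></li>
'''

# One name per capture group, in the same order as the groups in PATTERN.
FIELDS = ('rank', 'title', 'price')

PATTERN = re.compile(
    r'<li>.*?list_num">(\d+)\.</div>.*?title="(.*?)".*?price_n">&yen;(.*?)</span>.*?</li>',
    re.S,
)


def parse_records(html):
    """Return one dict per <li>, pairing each captured group with its field name."""
    return [dict(zip(FIELDS, groups)) for groups in PATTERN.findall(html)]


records = parse_records(SAMPLE_HTML)
for record in records:
    # One JSON object per line, keeping non-ASCII characters readable.
    print(json.dumps(record, ensure_ascii=False))
```

Zipping a tuple of field names with each match keeps the names and the capture groups in one place, which avoids mistakes such as quoting item[0] as a string.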