A Python Web Scraper Even Beginners Can Follow

Article link: https://blog.youkuaiyun.com/Nina_ningning/article/details/127521470

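This article walks through building a web scraper in Python that collects the titles of every article in a journal. It first uses the requests package to fetch page content the way a browser would, and explains how to obtain the User-Agent value for headers. It then parses the content with BeautifulSoup, uses find_all() and related functions to locate specific elements, loops over the matches to obtain new URLs to scrape, and finally writes the desired content to a txt file.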


First, here is the code. It scrapes the title of every article in the journal.

import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent, copied from the browser's developer tools
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'
}
url = "https://www.aeaweb.org/journals/aeri/issues"
response = requests.get(url=url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
# Every issue link sits inside a <div style="margin-top:5px;">
all_issues = soup.find_all('div', style='margin-top:5px;')
# The with block closes aeri.txt even if a request fails midway
with open('aeri.txt', 'w', encoding='utf-8') as fp:
    for issue_div in all_issues:
        # Build the issue page's absolute URL and fetch it
        issue = 'https://www.aeaweb.org' + issue_div.find('a')['href']
        new_response = requests.get(url=issue, headers=headers)
        newsoup = BeautifulSoup(new_response.text, 'lxml')
        all_titles = newsoup.find_all('h3', class_='title')
        for title in all_titles:
            try:
                #webs='https://www.aeaweb.org/'+title.find('a')['href']
                #fp.write(webs+'\n')
                fp.write(title.find('a').text + '\n')
            except AttributeError:
                # find('a') returned None: this <h3> has no link
                print("no")

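The first package we need is requests: install it with pip install requests, then import it. Its main job is to let the program request a page the way a person's browser would. url is the address we want to visit, and the other argument is headers. How do you get the headers?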
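step 1: Right-click anywhere on any web page and choose "Inspect".
step 2: Find the Network tab in the top bar of the panel that opens.
step 3: Click entries in the left column until you find one whose headers include user-agent.
step 4: Copy that user-agent value; it is what goes into headers in the code above.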
After the code calls requests.get(), response.text holds the full content of the page.
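The request step can be wrapped in a small helper. This is only a sketch, not part of the original script: the name fetch_html, the 10-second timeout, and the optional session argument are assumptions made here for illustration. The User-Agent string is the one from the code above.

```python
import requests

# User-Agent copied from the browser's Network tab (same value as above).
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/85.0.4183.121 Safari/537.36"
}


def fetch_html(url, session=None, timeout=10):
    """Hypothetical helper: return the HTML text found at `url`."""
    # Any object with a requests-style .get() works; default to requests itself.
    client = session if session is not None else requests
    response = client.get(url, headers=HEADERS, timeout=timeout)
    # Raise on 4xx/5xx instead of silently parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html("https://www.aeaweb.org/journals/aeri/issues") should return the same text the script reads from response.text, assuming the site answers normally. The timeout keeps the script from hanging forever if the server never replies.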
Next we parse that content with BeautifulSoup.
Once it is parsed, we can use find_all() or find() to look up specific content.
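To make the difference between find() and find_all() concrete, here is a small sketch on a hand-written HTML snippet. The snippet and its links are made up for illustration, and the parser is Python's built-in html.parser rather than the lxml parser used in the script, so it runs without an extra install.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; a real page has far more markup.
sample_html = """
<h3 class="title"><a href="/articles?id=1">First Paper</a></h3>
<h3 class="title"><a href="/articles?id=2">Second Paper</a></h3>
<h3 class="other">Not a title</h3>
"""

# html.parser ships with Python; the script above uses 'lxml' instead.
soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first match, or None if nothing matches.
first = soup.find("h3", class_="title")
# find_all() returns every match, in document order.
titles = soup.find_all("h3", class_="title")

first_text = first.find("a").text
all_texts = [t.find("a").text for t in titles]
missing = soup.find("h4")

print(first_text)
print(all_texts)
print(missing)
```

Because find() can return None, calling .text on its result raises AttributeError when the element is absent. That is the case the try block in the script guards against.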
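For example, the volumes we want to locate first all share one pattern: each is a 'div' with style='margin-top:5px;', so we use that condition to find everything that matches. The result behaves like a list. Next, a for loop iterates over each volume.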
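The attribute filter and the loop can be tried on a small stand-in page. The HTML below is invented for illustration (the hrefs are placeholders, not real issue paths), and html.parser is used in place of lxml so the sketch has no extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up listing page; the hrefs are placeholders, not real issue paths.
sample_html = """
<div style="margin-top:5px;"><a href="/issues/101">Issue A</a></div>
<div style="margin-top:5px;"><a href="/issues/102">Issue B</a></div>
<div style="color:red;"><a href="/about">About</a></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Passing style=... filters on the style attribute; a string value
# has to equal the attribute text exactly.
all_issues = soup.find_all("div", style="margin-top:5px;")

# The result supports len(), indexing, and direct iteration like a list.
hrefs = [div.find("a")["href"] for div in all_issues]

print(len(all_issues))
print(hrefs)
```

Since the match is on the exact attribute string, a page that wrote the style with different spacing would presumably not match, so it is worth checking the real markup in the browser's inspector first.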
For each volume we can get its own URL, and then repeat the steps above to scrape the content at that new URL.
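The script builds each new URL by concatenating the site root with the href. An alternative, shown here as a sketch and not as the original method, is urljoin from the standard library. The href values below are placeholders.

```python
from urllib.parse import urljoin

listing_url = "https://www.aeaweb.org/journals/aeri/issues"
# Placeholder relative links of the kind an <a href="..."> might hold.
hrefs = ["/issues/101", "/issues/102"]

# urljoin resolves each href against the page it was found on, so it copes
# with root-relative paths as well as hrefs that are already absolute.
issue_urls = [urljoin(listing_url, href) for href in hrefs]

print(issue_urls)
```

Plain concatenation works as long as every href starts with a slash. urljoin is simply more forgiving if the site mixes relative and absolute links.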
Finally, write the content you want to a txt file.
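The writing step on its own looks roughly like this. The titles are stand-in values and the file goes into a temporary directory, both assumptions made so the sketch can run anywhere without leaving a file behind.

```python
import os
import tempfile

# Stand-in titles; in the script these come from find_all('h3', class_='title').
titles = ["First Paper", "Second Paper"]

# Write into a temporary directory so the sketch leaves no file behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "aeri.txt")
    # The with block closes the file even if a write fails.
    with open(path, "w", encoding="utf-8") as fp:
        for title in titles:
            fp.write(title + "\n")
    # Read the file back to confirm one title per line was written.
    with open(path, encoding="utf-8") as fp:
        written = fp.read()

print(written)
```

Setting encoding='utf-8' explicitly matters here because article titles can contain non-ASCII characters, and the platform default encoding is not always UTF-8.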