Scraping a title and text in Jupyter Notebook and saving them to a TXT file.py

License: CC 4.0 BY-SA

Tags: #python3

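This article presents a small web-scraping project in Python. Working in Jupyter Notebook, it uses the requests and BeautifulSoup libraries to fetch one of Su Shi's ci poems from a specific website and save it as a TXT file. The tutorial is aimed at beginners who want to learn the basics of web scraping.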

# Scrape one of Su Shi's ci poems in Jupyter Notebook

import os
import re

import requests
from bs4 import BeautifulSoup

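url = 'http://www.shicimingju.com/chaxun/list/3710.html'
r = requests.get(url, timeout=10)
r.raise_for_status()  # stop early on an HTTP error status
# When the response headers declare no charset, requests falls back to ISO-8859-1;
# re-encoding with that guess and decoding as UTF-8 recovers the Chinese text
html = r.text.encode(r.encoding).decode('utf-8')
soup = BeautifulSoup(html, 'lxml')
# soup  # uncomment to inspect the parsed page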
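
The encode/decode round trip in the cell above is easy to misread, so here is a minimal offline sketch of what it presumably compensates for. It assumes the server sends UTF-8 bytes without declaring a charset, in which case requests guesses ISO-8859-1 for `r.text`. The sample string is only a stand-in for the page content, and no network request is made.

```python
# Stand-in for the raw bytes of a UTF-8 page (the sample string is an assumption)
raw_bytes = '苏轼的词'.encode('utf-8')

# What a wrong ISO-8859-1 guess produces: unreadable mojibake
wrong_text = raw_bytes.decode('iso-8859-1')

# The round trip: back to the original bytes, then decode as UTF-8
fixed_text = wrong_text.encode('iso-8859-1').decode('utf-8')

# Decoding the raw bytes directly gives the same result
direct_text = raw_bytes.decode('utf-8')

print(fixed_text == direct_text)
```

With a real response, decoding `r.content` as UTF-8, or setting `r.encoding = 'utf-8'` before reading `r.text`, should be equivalent as long as the page really is UTF-8.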

# Poem body
content = soup.find('div', class_='shici-content').text
content = re.sub(r'\s+', '', content)  # data cleaning: strip spaces, newlines, and other whitespace with a regex
# content

# Title
title = soup.find('h1', class_='shici-title').text.strip()
# title

# Save to a TXT file
filedir = os.path.join(os.getcwd(), '苏轼的词')  # folder name means "Su Shi's ci poems"
os.makedirs(filedir, exist_ok=True)
with open(os.path.join(filedir, '%s.txt' % title), mode='w', encoding='utf-8') as f:
    f.write(title + '\n' + content)
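
The cleaning and saving steps above can also be wrapped in small reusable functions. The sketch below is one possible arrangement, not the original author's code: `clean_text`, `safe_filename`, and `save_text` are hypothetical helper names, the sample title and body are made up, and the file goes to a temporary directory so the example runs without touching the working directory. The `safe_filename` helper is an addition that guards against titles containing characters that file systems reject.

```python
import os
import re
import tempfile


def clean_text(text):
    """Remove every whitespace character, as the cleaning step above does."""
    return re.sub(r'\s+', '', text)


def safe_filename(name, replacement='_'):
    """Replace characters that common file systems reject in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, name).strip()
    return cleaned or 'untitled'


def save_text(title, content, directory):
    """Write title and content to <directory>/<title>.txt and return the path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, safe_filename(title) + '.txt')
    with open(path, mode='w', encoding='utf-8') as f:
        f.write(title + '\n' + content)
    return path


# Hypothetical sample values standing in for the scraped title and body
sample_title = 'Sample/Title'
sample_content = clean_text('  first line,\n  second line.\n')

output_dir = tempfile.mkdtemp()  # temporary directory for the demonstration
saved_path = save_text(sample_title, sample_content, output_dir)
print(saved_path)
```

In the notebook itself, calling `save_text(title, content, filedir)` with the scraped values would take the place of the final `with open(...)` cell.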