A web crawler, also called a web spider, fetches page content by its web address (URL). A URL is simply the link you type into the browser, for example https://www.baidu.com/.
Today we will write a simple crawler script. First install the requests library and the Beautiful Soup library; once they are installed we can get started.
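Both can be installed with pip (note that Beautiful Soup is published on PyPI under the name beautifulsoup4):

pip install requests
pip install beautifulsoup4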
Let's take scraping a novel as an example.
import requests

if __name__ == "__main__":
    # URL of the chapter page we want to scrape
    target = 'https://www.biquge7.com/book/486/7.html'
    # send an HTTP GET request and print the raw HTML of the page
    req = requests.get(url=target)
    print(req.text)
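One thing to watch for: requests guesses the page encoding from the response headers, and on some novel sites that guess is wrong, so req.text comes back garbled. If that happens, one common fix (a sketch, not something this particular site necessarily needs) is to switch to the encoding detected from the response body itself:

import requests

if __name__ == "__main__":
    target = 'https://www.biquge7.com/book/486/7.html'
    req = requests.get(url=target)
    # apparent_encoding is guessed from the response content itself and is
    # often more reliable than the header-based default encoding
    req.encoding = req.apparent_encoding
    print(req.text)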
As you can see, we have retrieved the HTML of the page. Next, to extract the part we actually want, we use the Beautiful Soup library.
import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":
    target = 'https://www.biquge7.com/book/486/7.html'
    req = requests.get(url=target)
    html = req.text
    # parse the HTML; naming the parser explicitly avoids
    # BeautifulSoup's "no parser specified" warning
    bf = BeautifulSoup(html, 'html.parser')
    # the chapter text sits in a <div class="content"> element
    texts = bf.find_all('div', class_='content')
    # take the first match and replace each space with a Windows-style
    # line break ('\r\n') so paragraphs end up on separate lines
    texts = texts[0].text.replace(' ', '\r\n')
    print(texts)
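The class name 'content' is specific to this site; if the markup differs from what find_all expects, texts will be an empty list and texts[0] will raise an IndexError. A small guard against that, as a sketch:

import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":
    target = 'https://www.biquge7.com/book/486/7.html'
    req = requests.get(url=target)
    bf = BeautifulSoup(req.text, 'html.parser')
    # find() returns the first matching tag, or None when nothing matches,
    # so we can check the result instead of indexing into an empty list
    content_div = bf.find('div', class_='content')
    if content_div is None:
        print('No <div class="content"> found; the page layout may have changed.')
    else:
        print(content_div.text.replace(' ', '\r\n'))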
The next step is to save the result to a file.
import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":
    target = 'https://www.biquge7.com/book/486/7.html'
    req = requests.get(url=target)
    html = req.text
    bf = BeautifulSoup(html, 'html.parser')
    texts = bf.find_all('div', class_='content')
    texts = texts[0].text.replace(' ', '\r\n')
    print(texts)
    # write the chapter text to a local file
    file = open("D:\\爬虫\\book.txt", 'w', encoding='utf-8')
    file.write(texts)
    file.close()
The line file = open("D:\\爬虫\\book.txt", 'w', encoding='utf-8') opens book.txt inside the 爬虫 folder. Mode 'w' creates the file if it does not exist and overwrites it if it does, so after the script runs we end up with the saved file.
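A small aside on the save step: the more idiomatic way to open a file in Python is the with statement, which closes the file automatically, so the explicit file.close() is no longer needed. The same step could be written as:

# texts is the chapter string produced by the script above
with open("D:\\爬虫\\book.txt", 'w', encoding='utf-8') as file:
    # the with block closes the file when it exits,
    # even if file.write() raises an error partway through
    file.write(texts)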