Environment setup
pip3 install requests
pip3 install lxml
pip3 install chardet
Once the packages are installed, import them in your script.
Scraping the site title and description
import requests
from lxml import etree
import chardet
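# Download the home page
response = requests.get('https://www.csmzxy.edu.cn/')
# Detect the encoding from the raw bytes; requests may guess wrong
# when the server does not declare a charset
encoding = chardet.detect(response.content)['encoding']
if encoding:
    response.encoding = encoding
html = etree.HTML(response.text)
# Page title
title = html.xpath('//title/text()')[0]
print(title)
# Meta description, split on commas into a list
description = html.xpath('//meta[@name="description"]/@content')[0].split(',')
print(description)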
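
The lookups above index the XPath result with [0], which raises IndexError when a page has no title or no description meta tag. A minimal sketch of a more defensive variant follows. The function name extract_title_and_description is hypothetical and not part of the original script, and it assumes you pass in HTML text that has already been decoded (for example response.text).

```python
from lxml import etree


def extract_title_and_description(page_html):
    """Return (title, description) from an HTML string; either may be None.

    Hypothetical helper for illustration only.
    """
    tree = etree.HTML(page_html)
    # xpath() returns a list, which is empty when nothing matches
    titles = tree.xpath('//title/text()')
    descriptions = tree.xpath('//meta[@name="description"]/@content')
    title = titles[0].strip() if titles else None
    description = descriptions[0].strip() if descriptions else None
    return title, description
```

Calling extract_title_and_description(response.text) after the encoding fix-up should give the same title as the script above, with None in place of an exception when a tag is missing.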
Scraping news headlines
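# Download the news list page
response = requests.get('https://www.csmzxy.edu.cn/myxw/myyw/516.htm')
# Detect the encoding from the raw bytes, as before
encoding = chardet.detect(response.content)['encoding']
if encoding:
    response.encoding = encoding
html = etree.HTML(response.text)
# Each headline is stored in the title attribute of a link in the list
titles = html.xpath('//div[@id="areacontent"]/ul/li/a/@title')
for title in titles:
    # Skip links whose title attribute is empty
    if title:
        print(title)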
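
If you also want the link behind each headline, one option is to select the a elements themselves and read both attributes. The sketch below assumes the same div[@id="areacontent"] structure as the XPath above. The function name extract_news_links is hypothetical, and urljoin from the standard library is used on the assumption that the page may contain relative href values.

```python
from urllib.parse import urljoin

from lxml import etree


def extract_news_links(page_html, base_url):
    """Return a list of (title, absolute_url) pairs from the news list.

    Hypothetical helper for illustration only.
    """
    tree = etree.HTML(page_html)
    results = []
    # Same container as the XPath above, but select the <a> elements
    for anchor in tree.xpath('//div[@id="areacontent"]/ul/li/a'):
        title = (anchor.get('title') or '').strip()
        href = anchor.get('href')
        # Keep only links that have both a non-empty title and an href
        if title and href:
            results.append((title, urljoin(base_url, href)))
    return results
```

A possible call is extract_news_links(response.text, response.url), where response.url gives the base address used to resolve relative links.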