First, a note: Windows filenames may not contain the characters \ / : * ? " < > |, so every title has to be sanitized before it is used as a filename. All of the code in this post scrapes the social-news channel of the site in question.
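The code below strips these characters one by one with chained str.replace calls; a single regular expression is a more compact equivalent. This is a sketch of my own (the sanitize_filename helper is not part of the article's code):

import re

def sanitize_filename(title):
    # Remove every character Windows rejects in filenames: \ / : * ? " < > |
    return re.sub(r'[\\/:*?"<>|]', '', title)

print(sanitize_filename('Breaking: who said "no"?'))  # -> 'Breaking who said no'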
Sina:
Sina's news pages are easy to scrape. I parse them directly with BeautifulSoup; the site does not load its articles via asynchronous JavaScript, so a plain HTTP request is enough (a quick way to check this follows the code).
from bs4 import BeautifulSoup
from urllib import request

def download(title, url, m):
    req = request.Request(url)
    response = request.urlopen(req)
    response = response.read().decode('utf-8')
    soup = BeautifulSoup(response, 'lxml')
    # The article body lives in <div class="article">; bail out if it is missing
    tag = soup.find('div', class_='article')
    if tag is None:
        return 0
    # Strip the characters Windows forbids in filenames
    title = title.replace(':', '')
    title = title.replace('"', '')
    title = title.replace('|', '')
    title = title.replace('/', '')
    title = title.replace('\\', '')
    title = title.replace('*', '')
    title = title.replace('<', '')
    title = title.replace('>', '')
    title = title.replace('?', '')
    filename = r'D:\code\python\spider_news\sina_news\sociaty' + '\\' + title + '.txt'
    with open(filename, 'w', encoding='utf-8') as file_object:
        file_object.write(' ')
        file_object.write(title)
        file_object.write(tag.get_text())
    print('Scraping news item', m, ':', title)
    return 0

if __name__ == '__main__':
    target_url = 'http://news.sina.com.cn/society/'
    req = request.Request(target_url)
    response = request.urlopen(req)
    response = response.read().decode('utf-8')
    soup = BeautifulSoup(response, 'lxml')
    y = 0
    for tag in soup.find_all('div', class_='news-item'):
        # Skip items without a link, links with nested markup (whose .string
        # is None), and very short titles such as section headers and ads
        if tag.a is not None and tag.a.string is not None and len(tag.a.string) > 8:
            y += 1
            download(tag.a.string, tag.a.get('href'), y)
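To verify the claim above that the listing is server-rendered rather than loaded by JavaScript, one can count the news-item divs in the raw HTML; a count of zero would mean the content arrives asynchronously and this direct approach would not work. A minimal check of my own, not from the article:

from urllib import request
from bs4 import BeautifulSoup

html = request.urlopen('http://news.sina.com.cn/society/').read().decode('utf-8')
soup = BeautifulSoup(html, 'lxml')
# A non-zero count means the list is present without executing any JavaScript
print(len(soup.find_all('div', class_='news-item')), 'news items in the raw HTML')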

This post walks through a practical scraping exercise against the social-news channels of four sites: Sina, NetEase, Toutiao, and UC. It shows in detail how to parse pages and extract news titles and bodies with Python's BeautifulSoup library and regular expressions.