Scraping headlines from Sina News

License: CC 4.0 BY-SA

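When I first wanted to get started, I planned to scrape Toutiao (今日头条), but it looked hard, so I put it off and kept consulting more experienced people. I didn't want to sit idle in the meantime, so in between I scraped the news on Sina instead. This one is fairly simple: the information is right there in the page source, so let's go straight to the code.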

# coding=utf-8

import re
import urllib.request  # Python 3 replacement for urllib2

from bs4 import BeautifulSoup


class XL:
    def get_text(self):
        """Fetch the Sina society page; return the headline container, or None on failure."""
        url = 'http://news.sina.com.cn/society/'
        try:
            req = urllib.request.Request(url)
            response = urllib.request.urlopen(req)
            webdata = response.read()  # raw bytes; BeautifulSoup detects the encoding later
            # Byte patterns (rb'...') are used because webdata is still undecoded bytes.
            re_script = re.compile(rb'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)  # strip redundant <script> blocks
            webdata = re_script.sub(b'', webdata)
            webdata = re.sub(rb'<div id="latestNewsNotification"[^>]*>[^<]*</div>', b'', webdata)  # strip the "{n} news updates, click to view" notice
            webdata = re.sub(rb'</a> <a target="(.*)</a>', b'', webdata)  # strip the "Details" link
            webdata = re.sub(rb'<div><span class(.*)</a></span>', b'', webdata)  # strip "Close"
            webdata = re.sub(rb'<span class=(.*)</a></span></div>', b'', webdata)  # strip the login link
            webdata = re.sub(rb'<div class="action">(.*)</span></span></div>', b'', webdata)  # strip "Share" / "Comment"
            webdata = re.sub(rb'<div class="time">(.*)</div>', b'\n', webdata)  # strip timestamps

            soup = BeautifulSoup(webdata, 'html.parser')
            tags = soup.find('div', class_="blk6-c")  # everything from this div to its closing tag
            return tags
        except Exception as error:
            print(error)
            return None

    def get_title(self):
        tag = self.get_text()
        if tag is None:  # download failed or the container was not found
            return
        m = 0
        # Append to the output file; the with-block closes it even if an error occurs.
        with open('xinlang.txt', 'a', encoding='utf-8') as file:
            for item in tag:
                if item not in ['\n', '\t', '']:  # skip whitespace-only children
                    m = m + 1
                    item = str(m) + 'title' + item.get_text('\n', strip=True)
                    file.write(item)
                    file.write('\n')


if __name__ == '__main__':
    XL().get_title()
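
To see what the script-stripping pattern in get_text does and does not remove, here is a minimal standard-library sketch. The SAMPLE markup and the strip_scripts helper are made up for illustration and are not taken from the real Sina page. Because the pattern only allows a script body without any '<' character, a script that contains a comparison such as a < b is left in place.

```python
import re

# Same pattern as in get_text, written as a raw string for str input.
RE_SCRIPT = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)

# Hypothetical sample markup (an assumption, not real Sina HTML).
SAMPLE = (
    '<div class="blk6-c">'
    '<script type="text/javascript">var n = 1;</script>'
    '<a href="#">Headline one</a>'
    '<script>if (a < b) { go(); }</script>'
    '</div>'
)


def strip_scripts(html):
    """Remove <script> blocks whose body contains no '<' character."""
    return RE_SCRIPT.sub('', html)


cleaned = strip_scripts(SAMPLE)
print(cleaned)  # the first script is gone, the second one is still there
```

If leftover scripts matter, one option is to let BeautifulSoup parse the page first and remove the script tags from the tree rather than relying on the regex alone.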

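That is the code for scraping the information I wanted, shown in the figure below: each entry has the headline as well as its original headline.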
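
For readers without BeautifulSoup installed, the "everything inside the blk6-c div" idea can also be sketched with the standard-library HTMLParser. This is a swapped-in technique, not what the script above uses. The ContainerLinkParser class and the SAMPLE markup are hypothetical, and the sketch assumes each headline is the text of a link inside the container.

```python
from html.parser import HTMLParser


class ContainerLinkParser(HTMLParser):
    """Collect the text of every <a> inside the first-level div with a given class."""

    def __init__(self, container_class):
        super().__init__()
        self.container_class = container_class
        self.depth = 0         # <div> nesting depth inside the container; 0 means outside
        self.in_link = False   # True while we are between <a> and </a> inside the container
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            classes = (dict(attrs).get('class') or '').split()
            if self.depth > 0:
                self.depth += 1            # nested div inside the container
            elif self.container_class in classes:
                self.depth = 1             # entering the container itself
        elif tag == 'a' and self.depth > 0:
            self.in_link = True
            self.titles.append('')

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth > 0:
            self.depth -= 1
        elif tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.titles[-1] += data


# Hypothetical sample markup (an assumption, not real Sina HTML).
SAMPLE = (
    '<div class="other"><a href="#">Skip me</a></div>'
    '<div class="blk6-c">'
    '<div><a href="#">First headline</a></div>'
    '<a href="#">Second headline</a>'
    '</div>'
)

parser = ContainerLinkParser('blk6-c')
parser.feed(SAMPLE)
parser.close()

titles = [t.strip() for t in parser.titles if t.strip()]
for number, title in enumerate(titles, start=1):
    print(str(number) + 'title' + title)  # same numbering format as get_title
```

The depth counter is what keeps the parser inside the container until its matching closing tag, which is the job soup.find('div', class_="blk6-c") does in the original script.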