Structuring and Saving Data

Latest recommended article published on 2025-09-12 17:31:31

weixin_30723433

License: CC 4.0 BY-SA

Tags: data structures and algorithms, python

Original link: http://www.cnblogs.com/SOLARLKS/p/8810051.html

1. Save the body text of the news article to a text file.

    f = open('content.txt', 'a', encoding='utf-8')

    # content is the body text of the article
    f.write(content)

    f.close()
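
The same step can be sketched with a `with` block, which closes the file even if the write raises. The helper name `save_content` and the newline written after each article are assumptions for illustration, not part of the original script.

```python
def save_content(path, content):
    """Append one article body to a UTF-8 text file (hypothetical helper)."""
    # Mode 'a' appends, so repeated calls accumulate articles in one file.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(content)
        f.write('\n')  # assumption: separate articles with a newline
```

Calling `save_content('content.txt', content)` once per article would then build up the same file as the snippet above.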
        

2. Structure the news data as a list of dictionaries:

    import re
    from datetime import datetime

    import requests
    from bs4 import BeautifulSoup

    # Read the details of one news item.
    # `headers` and `getNewsId` are assumed to be defined elsewhere in the script.
    def getNewDetail(detail, title, description):
        news = {}  # a fresh dict per article, so the caller can append it to a list
        # headers must be passed by keyword; the second positional argument is params
        resDescript = requests.get(detail, headers=headers)
        resDescript.encoding = 'utf-8'
        soupDescript = BeautifulSoup(resDescript.text, 'html.parser')
        news['title'] = soupDescript.select('.show-title')[0].text

        content = soupDescript.select('.show-content')[0].text  # body text
        info = soupDescript.select(".show-info")[0].text  # metadata line
        # Option 1: split the fields with message = info.split()
        # Option 2: use regular expressions
        print('Title: ' + title)
        print('Summary: ' + description)
        print('Link: ' + detail)
        print('Body: ' + content)

        # re.search returns None when nothing matches, so test the match object itself
        match = re.search("发布时间:(.*) \xa0\xa0 \xa0\xa0作者：", info)
        if match:
            time = match.group(1)
            news['time'] = time
            # parse the timestamp only when one was found
            news['dataTime'] = datetime.strptime(time, '%Y-%m-%d %H:%M:%S')
        else:
            news['time'] = "null"

        match = re.search("作者：(.*)\xa0\xa0审核：", info)
        if match:
            author = match.group(1)
            news['author'] = author
            print("Author: " + author)
        else:
            news['author'] = "null"

        match = re.search("审核：(.*)\xa0\xa0来源：", info)
        news['right'] = match.group(1) if match else "null"

        match = re.search('来源：(.*)\xa0\xa0\xa0\xa0摄影：', info)
        news['resource'] = match.group(1) if match else "null"

        match = re.search("摄影：(.*)\xa0\xa0\xa0\xa0点击：", info)
        news['video'] = match.group(1) if match else "null"

        count = getNewsId(detail)
        news['count'] = count  # store the click count, not the body text

        return news
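
The regular-expression approach can be tried without any network access. The sketch below applies the article's patterns to a made-up `.show-info` string. The sample text, the names in it, and the helper `extract_fields` are assumptions for illustration, and the real page layout may differ.

```python
import re

# Hypothetical sample of the ".show-info" text; the real page may differ.
SAMPLE_INFO = (
    "发布时间:2018-04-01 11:57:00 \xa0\xa0 \xa0\xa0作者：张三\xa0\xa0审核：李四"
    "\xa0\xa0来源：学校综合办\xa0\xa0\xa0\xa0摄影：王五\xa0\xa0\xa0\xa0点击：100次"
)

# Field name -> pattern, copied from the patterns used in the article.
PATTERNS = {
    "time": "发布时间:(.*) \xa0\xa0 \xa0\xa0作者：",
    "author": "作者：(.*)\xa0\xa0审核：",
    "right": "审核：(.*)\xa0\xa0来源：",
    "resource": "来源：(.*)\xa0\xa0\xa0\xa0摄影：",
    "video": "摄影：(.*)\xa0\xa0\xa0\xa0点击：",
}

def extract_fields(info):
    """Return a dict of metadata fields, using "null" where a pattern fails."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, info)  # None when the pattern does not match
        fields[name] = match.group(1) if match else "null"
    return fields

fields = extract_fields(SAMPLE_INFO)
```

Keeping the patterns in one table means a missing field falls back to "null" in one place instead of in five separate `if`/`else` branches.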
        

3. Install pandas and use pandas.DataFrame(all_news) to create a DataFrame object df.

df = pandas.DataFrame(all_news)
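
To make the shape of the input concrete, here is a small sketch that builds a DataFrame from a list of dictionaries. The two records are invented placeholders, not data from the site.

```python
import pandas

# Hypothetical records shaped like the dicts built in step 2.
all_news = [
    {'title': 'News A', 'time': '2018-04-01 11:57:00', 'author': 'null'},
    {'title': 'News B', 'time': '2018-04-02 09:30:00', 'author': 'Li'},
]

# Each dict becomes one row; the dict keys become the column names.
df = pandas.DataFrame(all_news)
```

If a dictionary lacks a key that others have, pandas fills that cell with NaN, so storing an explicit placeholder such as "null" is a choice, not a requirement.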

4. Use df to save the extracted data to a CSV or Excel file.

df.to_excel('news.xlsx')
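
For the CSV option, a minimal sketch follows. The sample rows and the temporary file location are assumptions. `utf-8-sig` is used here because it tends to help Excel recognise the encoding of Chinese text, but plain `utf-8` also works for most tools.

```python
import os
import tempfile

import pandas

# Hypothetical sample data standing in for the scraped news.
df = pandas.DataFrame([
    {'title': 'News A', 'clicks': 3500},
    {'title': 'News B', 'clicks': 120},
])

csv_path = os.path.join(tempfile.mkdtemp(), 'news.csv')
# index=False leaves the row index out of the file.
df.to_csv(csv_path, index=False, encoding='utf-8-sig')

# Read the file back to confirm that the data survived the round trip.
restored = pandas.read_csv(csv_path, encoding='utf-8-sig')
```

Note that `to_excel` needs an Excel writer package such as openpyxl to be installed, while `to_csv` has no extra dependency.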

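5. Analyze the data with the functions and methods pandas provides:

Extract the first 6 rows containing the click count, title, and source.
Extract the news published by '学校综合办' (the school's General Office) with more than 3000 clicks.
Extract the news published by '国际学院' (International College) and '学生工作处' (Student Affairs Office).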

    df[['clicks', 'title', 'source']].head(6)

    df[(df['clicks'] > 3000) & (df['source'] == '学校综合办')]

    news_info = ['国际学院', '学生工作处']

    df[df['source'].isin(news_info)]
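
The three queries can be checked against a small hand-made table. The rows below are invented for illustration. The column names follow the queries above, which assume the scraped data was stored under `clicks`, `title`, and `source`.

```python
import pandas

# Hypothetical sample rows; the column names follow the queries in the article.
df = pandas.DataFrame([
    {'clicks': 3500, 'title': 'A', 'source': '学校综合办'},
    {'clicks': 1200, 'title': 'B', 'source': '学校综合办'},
    {'clicks': 4100, 'title': 'C', 'source': '国际学院'},
    {'clicks': 800, 'title': 'D', 'source': '学生工作处'},
    {'clicks': 50, 'title': 'E', 'source': '教务处'},
])

# First 6 rows of three columns (only 5 rows exist here, so all are returned).
top = df[['clicks', 'title', 'source']].head(6)

# Each condition needs its own parentheses because & binds tighter than > and ==.
popular = df[(df['clicks'] > 3000) & (df['source'] == '学校综合办')]

# isin keeps the rows whose source equals any value in the list.
selected = df[df['source'].isin(['国际学院', '学生工作处'])]
```

Here `popular` keeps only the row titled A, and `selected` keeps the rows titled C and D.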