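A few pitfalls. In Python 2, Chinese characters stored in a tuple, list, or dict come out garbled (as escape sequences) when you print the whole container. Print the elements one at a time, or join them into a single string first, and they display correctly. Or just use Python 3.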
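A minimal sketch of the join approach described above. The sample strings are made up for illustration, and the snippet is written for Python 3; the point is the contrast between printing the container and printing the joined string.

```python
# Hypothetical sample data; any list of Chinese strings behaves the same way.
titles = ['国内新闻', '评论', '责任编辑']

# Printing the container shows the repr of each element
# (Python 2 renders byte strings here as \x.. escapes).
print(titles)

# Joining first yields one plain string, which prints as readable text.
joined = ', '.join(titles)
print(joined)

# Printing the elements one by one also avoids the container repr.
for title in titles:
    print(title)
```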
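Tabs and spaces must not be mixed in Python indentation, so if the interpreter reports an indentation error, this may well be the cause. To check, open the file in an editor that can display whitespace characters, such as Notepad++, and look at which lines are indented with tabs and which with spaces.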
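As an alternative to inspecting the file in an editor, the same check can be sketched in a few lines of Python 3. The function name `classify_indentation` and the sample text are hypothetical, introduced only for this illustration.

```python
def classify_indentation(source):
    """Group 1-based line numbers by the kind of leading whitespace they use."""
    kinds = {'tabs': [], 'spaces': [], 'mixed': []}
    for number, line in enumerate(source.splitlines(), start=1):
        # Leading whitespace is whatever lstrip() would remove.
        indent = line[:len(line) - len(line.lstrip(' \t'))]
        if not indent or not line.strip():
            continue  # skip unindented and blank lines
        if ' ' in indent and '\t' in indent:
            kinds['mixed'].append(number)
        elif '\t' in indent:
            kinds['tabs'].append(number)
        else:
            kinds['spaces'].append(number)
    return kinds


# Made-up sample: line 2 uses spaces, line 3 a tab, line 4 both.
sample = "def f():\n    a = 1\n\tb = 2\n \tc = 3\n"
report = classify_indentation(sample)
print(report)
```

If more than one of the three lists is non-empty, the file mixes indentation styles. The standard library's `tabnanny` module (`python -m tabnanny yourfile.py`) performs a related check.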
For the rest, pay attention to the encoding notes in the code comments.
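The encoding note that matters most in the code below is the one on the date format string, which contains Chinese characters. Here is a minimal sketch of that parsing step on its own, assuming Python 3 (where `str` is Unicode, so no default-encoding change is needed) and a made-up timestamp in the same layout as the scraped text.

```python
from datetime import datetime

# Hypothetical sample in the same layout as the scraped timestamp.
timesource = '2016年12月21日09:30'

# The literal characters 年, 月 and 日 are matched verbatim between the directives.
dt = datetime.strptime(timesource, '%Y年%m月%d日%H:%M')
print(dt.isoformat())
```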
#coding:utf-8
import requests
import re
import sys
import json
from datetime import datetime
from bs4 import BeautifulSoup

# Python 2 workaround: make UTF-8 the default encoding instead of ASCII
reload(sys)
sys.setdefaultencoding('utf-8')

# str.format() is a handy string-formatting tool: the news id in the URL has been
# cut out and replaced by {}, so format() can fill in any id
commentURL='http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=gn&newsid=comos-{}&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20'
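def getCommentCount(newsurl): # return the total comment count for an article
    m=re.search(r'doc-i(.*)\.shtml',newsurl) # extract the news id from newsurl
    newsid=m.group(1)
    comments=requests.get(commentURL.format(newsid))
    text=comments.text.strip()
    # strip('var data=') removes a set of characters, not a prefix, so cut the prefix explicitly
    if text.startswith('var data='):
        text=text[len('var data='):]
    jd=json.loads(text)
    return jd['result']['count']['total']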
def getNewsDetail(newsurl):
    result={}
    res=requests.get(newsurl)
    res.encoding='utf-8'
    soup=BeautifulSoup(res.text,'html.parser')
    result['title']=soup.select('#artibodyTitle')[0].text
    timesource=soup.select('.time-source')[0].contents[0].strip()
    # Python 2 handles characters as ASCII by default, so the non-ASCII characters
    # here raise an error; that is why the default encoding is changed at the top of the file
    result['dt']=datetime.strptime(timesource,'%Y年%m月%d日%H:%M')
    result['source']=soup.select('.time-source span a')[0].text
    # one-liner: join the text of every paragraph except the last
    result['article']=''.join([p.text.strip() for p in soup.select('#artibody p')[:-1]])
    # note: lstrip() removes a set of characters, not a prefix
    result['editor']=soup.select('.article-editor')[0].text.lstrip('责任编辑:')
    result['comments']=getCommentCount(newsurl)
    return result
if __name__=="__main__":
    result=getNewsDetail('http://news.sina.com.cn/o/2016-12-21/doc-ifxyxvcr7229277.shtml')
    for i in result:
        print i,result[i]
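The steps inside getCommentCount can be tried offline. The sketch below is written for Python 3 and makes several assumptions: the URL template is a shortened stand-in, the response body is made up in the `var data=<json>` shape the code expects, and `strip_prefix` is a hypothetical helper. It also shows why `str.strip` with a multi-character argument is risky: the argument is treated as a set of characters, not as a prefix.

```python
import json
import re

# Shortened, hypothetical template; the real commentURL carries more query parameters.
template = 'http://example.invalid/info?newsid=comos-{}&page=1'

# Step 1: pull the news id out of the article URL.
newsurl = 'http://news.sina.com.cn/o/2016-12-21/doc-ifxyxvcr7229277.shtml'
newsid = re.search(r'doc-i(.*)\.shtml', newsurl).group(1)

# Step 2: fill the {} placeholder with that id.
url = template.format(newsid)

# Step 3: remove the JavaScript prefix before decoding the JSON.
# strip() with a multi-character argument removes any of those characters
# from both ends, so it can eat into the payload:
eaten = 'var data=true'.strip('var data=')


def strip_prefix(text, prefix):
    """Remove prefix once from the start of text, if it is present."""
    return text[len(prefix):] if text.startswith(prefix) else text


# Made-up response body; a real response would come from requests.get(url).text.
body = 'var data={"result": {"count": {"total": 42}}}'
total = json.loads(strip_prefix(body, 'var data='))['result']['count']['total']

print(newsid, url, repr(eaten), total)
```

With a JSON object as the payload, `strip` happens to work because `{` and `}` are not in the character set, but explicit prefix removal does not depend on that.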