Python seems to be pretty popular lately, especially for scraping website data. I followed a guru's template and wrote a simple program of my own; I'm not really familiar with Python syntax yet, so this is just for the record.
1. Installing Python:
The latest Python source code, binaries, documentation, and news can all be found on the official Python website:
Python official site: https://www.python.org/
You can download the Python documentation from the link below, in HTML, PDF, PostScript, and other formats.
Python documentation downloads: https://www.python.org/doc/
Python may not work right after installation; you need to configure the environment variables first. If the pip command still isn't recognized, add the Scripts directory to PATH as well.
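If you're not sure which directories to add, the interpreter itself can tell you (a minimal sketch; the Scripts subfolder is where a standard Windows install keeps pip):
import os, sys

print(sys.executable)  # full path to python.exe; its folder belongs on PATH
scripts = os.path.join(os.path.dirname(sys.executable), 'Scripts')
print(scripts)         # pip lives here, so add this folder to PATH as well
print(scripts in os.environ.get('PATH', ''))  # True once PATH is set correctly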
2. A word about editors: UltraEdit works too, but I never managed to solve its garbled-Chinese problem, so I switched to Eclipse and installed the PyDev plugin:
- Eclipse menu -> Help -> Install New Software... -> Work with (Add..)
- Name: PyDev
- Location: http://pydev.org/updates
Associating Eclipse with Python:
Eclipse menu -> Window -> Preferences -> PyDev -> Interpreters -> Python Interpreter.
Click the New button and select the path to python.exe (the install path from step 1). A window with a lot of checkboxes will open; just click OK to finish.
If you get the error: Unable to create the selected preference page. An error occurred while automatically activating bundle org.python.pydev, try installing an older version of the plugin from Location = https://dl.bintray.com/fabioz/pydev/old/
3. Create a new project:
4. I set up a SQLite3 database; it's tiny and trivial to install. I also downloaded SQLiteStudio 3.1.1 to manage it, created books.db, and created a comments table:
CREATE TABLE comments (
moveId VARCHAR (20),
nickname VARCHAR (1000),
comment VARCHAR (2000),
rate VARCHAR (20),
city VARCHAR (50),
start_time VARCHAR (50)
);
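If you'd rather not use SQLiteStudio, the same database and table can also be created straight from Python's built-in sqlite3 module (a sketch; d:/books.db matches the path used in the scraper below):
import sqlite3

conn = sqlite3.connect('d:/books.db')  # creates the file if it doesn't exist yet
conn.execute('''CREATE TABLE IF NOT EXISTS comments (
    moveId     VARCHAR (20),
    nickname   VARCHAR (1000),
    comment    VARCHAR (2000),
    rate       VARCHAR (20),
    city       VARCHAR (50),
    start_time VARCHAR (50)
)''')
conn.commit()
conn.close()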
I'm scraping Maoyan, which means pulling data from Maoyan's API. I don't know yet how to discover the endpoints myself, so for now I borrowed the guru's, which looks like this:
"http://m.maoyan.com/mmdb/comments/movie/1208282.json?_v_=yes&offset=15"
The program itself is nothing fancy: fetch the JSON, loop through it to pull out the values we need, and store them in the database.
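Before looking at the full program, here's a quick probe of a single page to see the shape of the response (a sketch; the field names are exactly the ones the parser below reads):
import requests

url = "http://m.maoyan.com/mmdb/comments/movie/1208282.json?_v_=yes&offset=15"
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like mac OS X)"})
first = resp.json()['cmts'][0]  # 'cmts' is the list of comment objects
print first['nickName'], first['score'], first['cityName'], first['startTime']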
import requests
import sqlite3
import json
import time
from datetime import datetime

def getMovieInfo(url):
    # Fetch one page of comments; return the raw JSON text, or None on a non-200 response
    session = requests.Session()
    headers = {
        "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like mac OS X)"
    }
    response = session.get(url, headers=headers)
    if response.status_code == 200:
        #print response.json()
        return response.text
    return None

def saveData(moveId, nickname, comment, rate, city, start_time):
    # Insert one comment row into the comments table
    conn = sqlite3.connect('d:/books.db')
    conn.text_factory = str
    cursor = conn.cursor()
    ins = "insert into comments values (?,?,?,?,?,?)"
    v = (moveId, nickname, comment, rate, city, start_time)
    cursor.execute(ins, v)
    cursor.close()
    conn.commit()
    conn.close()

def parseInfo():
    # Walk backwards in time from now down to end_time, one page of comments per request
    start_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    end_time = '2018-11-20 00:00:00'
    i = 0
    while start_time > end_time:  # both are 'YYYY-MM-DD HH:MM:SS' strings, so string comparison works
        print start_time
        url = 'http://m.maoyan.com/mmdb/comments/movie/1203084.json?_v_=yes&offset=0&startTime=' + start_time.replace(' ', '%20')
        try:
            html = getMovieInfo(url)
        except Exception as e:
            # On a network error, wait briefly and retry once
            time.sleep(0.5)
            html = getMovieInfo(url)
        else:
            time.sleep(1)  # throttle between successful requests
        data = json.loads(html)['cmts']
        for item in data:
            i += 1
            saveData(i, item['nickName'], item['content'], item['score'], item['cityName'], item['startTime'])
            start_time = item['startTime']  # move the time window back to the oldest comment seen

parseInfo()
The requests page backwards through the data by time. To keep from hitting the API too frequently, there's a wait after every request.
For graphical output you need the pyecharts module:
pip install pyecharts
Map packages (install whichever ones you need):
$ pip install echarts-countries-pypkg
$ pip install echarts-china-provinces-pypkg
$ pip install echarts-china-cities-pypkg
$ pip install echarts-china-counties-pypkg
$ pip install echarts-china-misc-pypkg
$ pip install echarts-united-kingdom-pypkg
Reference: https://www.jianshu.com/p/e0b2851672cd
# -*- coding:utf-8 -*-
from pyecharts import Geo
from collections import Counter
from pyecharts import Style

def render():
    # Hardcoded sample data: (city, count) pairs; in the real run these
    # would come from the comments table
    data = [(u'广州', 80), (u'漳州', 180)]
    #handle(cities)
    #data = Counter(cities).most_common()
    style = Style(
        title_color='#fff',
        title_pos='center',
        width=1200,
        height=600,
        background_color='#404a59'
    )
    geo = Geo('位置分布', '数据来源:111', **style.init_style)  # title and subtitle
    attr, value = geo.cast(data)  # split the (name, value) pairs into two lists
    geo.add('', attr, value, visual_range=[0, 3500], visual_text_color='#fff',
            symbol_size=15, is_visualmap=True, is_piecewise=True, visual_split_number=10)
    geo.render('11111111112.html')

render()
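geo.render() writes a standalone HTML file, so after running this you just open 11111111112.html in a browser to see the interactive map.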
Word cloud display:
jieba is a Python library for word segmentation; it has excellent support for Chinese and is quite powerful.
pip install jieba
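For a feel of what jieba.cut does, here's a tiny example (it returns a lazy generator of tokens, so join them into a string; the sample sentence is made up):
# -*- coding:utf-8 -*-
import jieba

tokens = jieba.cut(u"这部电影真的很好看", cut_all=False)  # precise mode, same as in the program below
print ' '.join(tokens)  # prints the sentence split into words, roughly: 这部 电影 真的 很 好看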
Matplotlib is a 2D plotting library for Python that produces high-quality figures; it can quickly generate plots, histograms, power spectra, bar charts, error charts, scatter plots, and more.
pip install matplotlib
wordcloud is a Python library for generating word cloud images.
pip install wordcloud
The program:
# -*- coding:utf-8 -*-
# Import jieba for Chinese word segmentation
import jieba
# Import matplotlib for displaying the image
import matplotlib.pyplot as plt
# Import wordcloud for building the word cloud image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Collect all the comments (hardcoded samples here; the commented-out block
# below reads them from an exported file instead)
comments = []
comments.append(u"很好哦")
comments.append(u"完美")
comments.append(u"很美")
comments.append(u"大美")
print(comments)
# with open('comments.txt', mode='r', encoding='utf-8') as f:
#     rows = f.readlines()
#     for row in rows:
#         comment = row.split(',')[3]
#         if comment != '':
#             comments.append(comment)

# Segment the text (precise mode, cut_all=False). In Python 2, plain str(comments)
# leaves the Chinese as \uXXXX escapes, so decode them back into characters first
comment_after_split = jieba.cut(str(comments).decode("unicode-escape"), cut_all=False)
words = ' '.join(comment_after_split)  # join the tokens with spaces
print(words)

# Words to leave out of the cloud
stopwords = STOPWORDS.copy()
for w in [u'电影', u'一部', u'一个', u'没有', u'什么', u'有点', u'这部', u'这个',
          u'不是', u'真的', u'感觉', u'觉得', u'还是', u'但是', u'就是', u'一出', u'好戏']:
    stopwords.add(w)

# Load the background image that gives the cloud its shape
bg_image = plt.imread('bg.jpg')
# Word cloud parameters: canvas size, background color, mask shape, font,
# stopwords, maximum font size
wc = WordCloud(width=1024, height=768, background_color='white', mask=bg_image,
               font_path='STKAITI.TTF', stopwords=stopwords, max_font_size=400,
               random_state=50)
# Feed the segmented text into the cloud
wc.generate_from_text(words)
plt.imshow(wc)
plt.axis('off')  # hide the axes
plt.show()
# Save the result locally
wc.to_file('词云图.jpg')
One problem I ran into: the Chinese wouldn't display, only its Unicode escape codes. When processing the list you need to convert it first:
str(comments).decode("unicode-escape")
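To see why this is needed: in Python 2, str() on a list of unicode strings renders each element's repr, i.e. \uXXXX escapes, and decode("unicode-escape") turns them back into real characters. A minimal demonstration:
comments = [u"很好哦", u"完美"]
print str(comments)                           # [u'\u5f88\u597d\u54e6', u'\u5b8c\u7f8e']
print str(comments).decode("unicode-escape")  # [u'很好哦', u'完美']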