Second update: on Ubuntu, PySide can be installed directly from packages:
sudo apt-get install python-pyside
sudo apt-get install python3-pyside
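Update: building PySide on Ubuntu is documented at http://pyside.readthedocs.io/en/latest/building/linux.html
ghost.py (built on WebKit) makes it easy to scrape data that a page generates with JavaScript, for example content loaded through JavaScript API calls.
Installing ghost.py
Step 1: Install PySide (Ubuntu). On CentOS, follow the instructions on the PySide website (yum install qtwebkit qtwebkit-devel).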
sudo apt-get install cmake
sudo apt-get install libqt4-dev
sudo apt-get install qt4-dev-tools
sudo apt-get install qtmobility-dev
sudo apt-get install python2.7-dev
sudo apt-get install libphonon-dev
pip install wheel
wget https://pypi.python.org/packages/source/P/PySide/PySide-1.2.2.tar.gz
tar -xvzf PySide-1.2.2.tar.gz
cd PySide-1.2.2
python setup.py bdist_wheel --qmake=/usr/bin/qmake-qt4
pip install dist/PySide-1.2.2-*.whl
python pyside_postinstall.py -install
Step 1b: To use ghost.py on a Linux system without an X server, you also need to install Xvfb (apt-get on Ubuntu, yum on CentOS):
sudo apt-get install xvfb
yum install xorg-x11-server-Xvfb
Run the script under Xvfb:
xvfb-run --auto-servernum --server-args="-screen 0 1280x760x24" python x.py
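If the scraper is launched from another Python program (a scheduler, for instance) rather than from a shell, the same xvfb-run invocation can be assembled as an argument list. This is only a sketch: the helper names build_xvfb_command and run_under_xvfb and their defaults are illustrative, not part of ghost.py or Xvfb, and only the xvfb-run flags from the command above are used.

```python
import subprocess


def build_xvfb_command(script, width=1280, height=760, depth=24, python="python"):
    """Return the argv list for running `script` under xvfb-run.

    Hypothetical helper: it only mirrors the shell command shown above.
    """
    screen = "-screen 0 {0}x{1}x{2}".format(width, height, depth)
    return [
        "xvfb-run",
        "--auto-servernum",
        "--server-args=" + screen,  # no shell quoting needed in an argv list
        python,
        script,
    ]


def run_under_xvfb(script, **kwargs):
    """Run the script under Xvfb and return its exit code (needs xvfb-run installed)."""
    return subprocess.call(build_xvfb_command(script, **kwargs))
```

Calling run_under_xvfb("x.py") would then be equivalent to the shell command above. Passing a list instead of a single string avoids having to quote the --server-args value.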
Step 2: Install ghost.py
pip install ghost.py
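After both steps it can be worth confirming that the modules actually import before writing any scraping code. The following is a minimal sketch; the function names is_importable and report are made up for this example, and the module names (PySide, ghost, lxml) are the ones used elsewhere in this post.

```python
import importlib


def is_importable(module_name):
    """Return True if `module_name` can be imported in this interpreter."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


def report(modules=("PySide", "ghost", "lxml")):
    """Map each module name to whether it imports."""
    return dict((name, is_importable(name)) for name in modules)


if __name__ == "__main__":
    for name, ok in sorted(report().items()):
        print("{0}: {1}".format(name, "ok" if ok else "MISSING"))
```

A module reported as MISSING usually means the corresponding install step failed or ran against a different Python interpreter than the one running the check.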
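Analysis of the App Annie site shows that the game list data is generated by JavaScript. A plain requests fetch therefore returns HTML that does not yet contain the list, so XPath cannot match it directly. With ghost.py the page is rendered first, and XPath can then be applied to the result.
Using ghost.py together with lxml to scrape an app listing from App Annie: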
# -*- coding: utf-8 -*-
from ghost import Ghost
import lxml.html

agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36'
ghost = Ghost(user_agent=agent, wait_timeout=120)
ghost.set_proxy('socks5', '192.168.1.111', 1080)  # route traffic through a SOCKS5 proxy
page, extra_resources = ghost.open('https://www.appannie.com/apps/google-play/publisher/20200000600489/?&page=2')
ghost.wait_for_text('data-ref="main"', timeout=60)  # wait until 'data-ref="main"' appears in the page
html = lxml.html.fromstring(ghost.content)
# tbody of the table that holds the app list
e = html.xpath('//*[@id="container"]/div[2]/div[2]/div/div[2]/div/div[2]/div[1]/div[2]/table/tbody')[0]
for tr in e.getchildren():
    print(tr.getchildren()[3].text)  # text of the fourth cell in each row
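To see offline what the loop above does, and why the same kind of query finds nothing in the raw response, the sketch below runs the row extraction against two hard-coded snippets. Both snippets are invented sample data, not App Annie's real markup. The standard library's xml.etree.ElementTree stands in for lxml here so the example runs without extra packages; it supports only a subset of XPath and requires well-formed markup, so it is not a replacement for lxml on real pages.

```python
import xml.etree.ElementTree as ET

# Invented sample: what a plain HTTP client might receive before scripts run.
STATIC_HTML = '<div id="container"><div data-ref="loading"></div></div>'

# Invented sample: the same container after JavaScript has filled in the table.
RENDERED_HTML = """
<div id="container">
  <div data-ref="main">
    <table>
      <tbody>
        <tr><td>1</td><td>icon</td><td>Game A</td><td>Free</td></tr>
        <tr><td>2</td><td>icon</td><td>Game B</td><td>Paid</td></tr>
        <tr><td>3</td><td>icon</td></tr>
      </tbody>
    </table>
  </div>
</div>
"""


def fourth_cells(markup):
    """Return the text of the 4th cell of every table row, skipping short rows."""
    root = ET.fromstring(markup.strip())
    values = []
    for tr in root.findall(".//table/tbody/tr"):
        cells = list(tr)  # direct children, like tr.getchildren() in lxml
        if len(cells) > 3:
            values.append(cells[3].text)
    return values


print(fourth_cells(STATIC_HTML))    # []
print(fourth_cells(RENDERED_HTML))  # ['Free', 'Paid']
```

The length check is a small addition over the original loop: a row with fewer than four cells is skipped instead of raising an IndexError.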