https://github.com/Santostang/PythonScraping
http://pan.baidu.com/s/1c2w9rck
www.santostang.com
http://www.taobao.com/robots.txt
Anaconda
Jupyter
#!/usr/bin/python
# coding: UTF-8
import requests
link = 'http://www.santostang.com/'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US;'
' rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers= headers)
print(r.text)
The returned content is:
C:\Python37\python.exe E:/python/work/scratch/tst.py
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
<title>Santos Tang</title>
<meta name="description" content="Python网络爬虫:从入门到实践 官方网站及个人博客" />
<meta name="keywords" content="Python网络爬虫, Python爬虫, Python, 爬虫, 数据科学, 数据挖掘, 数据分析, santostang, Santos Tang, 唐松, Song Tang" />
<link rel="apple-touch-icon" href="http://www.santostang.com/wp-content/themes/SongStyle-Two/images/icon_32.png">
<link rel="apple-touch-icon" sizes="152x152" href="http://www.santostang.com/wp-content/themes/SongStyle-Two/images/icon_152.png">
<link rel="apple-touch-icon" sizes="167x167" href="http://www.santostang.com/wp-content/themes/SongStyle-Two/images/icon_167.png">
<link rel="apple-touch-icon" sizes="180x180" href="http://www.santostang.com/wp-content/themes/SongStyle-Two/images/icon_180.png">
<link rel="icon" href="http://www.santostang.com/wp-content/themes/SongStyle-Two/images/icon_32.png" type="image/x-icon">
<link rel="stylesheet" href="http://www.santostang.com/wp-content/themes/SongStyle-Two/css/bootstrap.min.css">
<link rel="stylesheet" href="http://www.santostang.com/wp-content/themes/SongStyle-Two/css/fontawesome.min.css">
<link rel="stylesheet" href="http://www.santostang.com/wp-content/themes/SongStyle-Two/style.css">
<link rel="pingback" href="http://www.santostang.com/xmlrpc.php" />
<style type="text/css">
a{color:#1e73be}
a:hover{color:#2980b9!important}
#header{background-color:#1e73be}
.widget .widget-title::after{background-color:#1e73be}
.uptop{border-left-color:#1e73be}
#titleBar .toggle:before{background:#1e73be}
</style>
</head>
<body>
<header id="header">
<div class="avatar"><a href="http://www.santostang.com" title="Santos Tang"><img src="http://www.santostang.com/wp-content/uploads/2019/06/me.jpg" alt="Santos Tang" class="img-circle" width="50%"></a></div>
<h1 id="name">Santos Tang</h1>
<div class="sns">
<a href="https://weibo.com/santostang" target="_blank" rel="nofollow" data-toggle="tooltip" data-placement="top" title="Weibo"><i class="fab fa-weibo"></i></a> <a href="https://www.linkedin.com/in/santostang" target="_blank" rel="nofollow" data-toggle="tooltip" data-placement="top" title="Linkedin"><i class="fab fa-linkedin"></i></a> <a href="https://www.zhihu.com/people/santostang" target="_blank" rel="nofollow" data-toggle="tooltip" data-placement="top" title="Zhihu"><i class="fab fa-zhihu"></i></a> <a href="https://github.com/santostang" target="_blank" rel="nofollow" data-toggle="tooltip" data-placement="top" title="GitHub"><i class="fab fa-github-alt"></i></a> </div>
<div class="nav">
<ul><li><a href="http://www.santostang.com/">首页</a></li>
<li><a href="http://www.santostang.com/aboutme/">关于我</a></li>
<li><a href="http://www.santostang.com/python%e7%bd%91%e7%bb%9c%e7%88%ac%e8%99%ab%e4%bb%a3%e7%a0%81/">爬虫书代码</a></li>
<li><a href="http://www.santostang.com/%e5%8a%a0%e6%88%91%e5%be%ae%e4%bf%a1/">加我微信</a></li>
<li><a href="https://santostang.github.io/">EnglishSite</a></li>
</ul> </div>
<div class="weixin">
<img src="http://www.santostang.com/wp-content/uploads/2019/06/qrcode_for_gh_370f70791e19_258.jpg" alt="微信公众号" width="50%">
<p>微信公众号</p>
</div>
</header>
<div id="main">
<div class="row box">
<div class="col-md-8">
<h2 class="uptop"><i class="fas fa-arrow-circle-up"></i> <a href="http://www.santostang.com/2018/07/11/%e3%80%8a%e7%bd%91%e7%bb%9c%e7%88%ac%e8%99%ab%ef%bc%9a%e4%bb%8e%e5%85%a5%e9%97%a8%e5%88%b0%e5%ae%9e%e8%b7%b5%e3%80%8b%e4%b8%80%e4%b9%a6%e5%8b%98%e8%af%af/" target="_blank">《网络爬虫:从入门到实践》一书勘误</a></h2>
<article class="article-list-1 clearfix">
<header class="clearfix">
<h1 class="post-title"><a href="http://www.santostang.com/2018/07/15/4-3-%e9%80%9a%e8%bf%87selenium-%e6%a8%a1%e6%8b%9f%e6%b5%8f%e8%a7%88%e5%99%a8%e6%8a%93%e5%8f%96/">第四章 – 4.3 通过selenium 模拟浏览器抓取</a></h1>
<div class="post-meta">
<span class="meta-span"><i class="far fa-calendar-alt"></i> 07月15日</span>
<span class="meta-span"><i class="far fa-folder"></i> <a href="http://www.santostang.com/category/python-%e7%bd%91%e7%bb%9c%e7%88%ac%e8%99%ab/" rel="category tag">Python 网络爬虫</a></span>
<span class="meta-span"><i class="fas fa-comments"></i> <a href="http://www.santostang.com/2018/07/15/4-3-%e9%80%9a%e8%bf%87selenium-%e6%a8%a1%e6%8b%9f%e6%b5%8f%e8%a7%88%e5%99%a8%e6%8a%93%e5%8f%96/#respond">没有评论</a></span>
<span class="meta-span hidden-xs"><i class="fas fa-tags"></i> </span>
</div>
</header>
<div class="post-content clearfix">
<p>4.3 通过selenium 模拟浏览器抓取
在上述的例子中,使用Chrome“检查”功能找到源地址还十分容易。但是有一些网站非常复杂,例如前面的天猫产品评论,使用“检查”功能很难找到调用的网页地址。除此之外,有一些数据...</p>
</div>
</article>
<article class="article-list-1 clearfix">
<header class="clearfix">
<h1 class="post-title"><a href="http://www.santostang.com/2018/07/14/4-2-%e8%a7%a3%e6%9e%90%e7%9c%9f%e5%ae%9e%e5%9c%b0%e5%9d%80%e6%8a%93%e5%8f%96/">第四章 – 4.2 解析真实地址抓取</a></h1>
<div class="post-meta">
<span class="meta-span"><i class="far fa-calendar-alt"></i> 07月14日</span>
<span class="meta-span"><i class="far fa-folder"></i> <a href="http://www.santostang.com/category/python-%e7%bd%91%e7%bb%9c%e7%88%ac%e8%99%ab/" rel="category tag">Python 网络爬虫</a></span>
<span class="meta-span"><i class="fas fa-comments"></i> <a href="http://www.santostang.com/2018/07/14/4-2-%e8%a7%a3%e6%9e%90%e7%9c%9f%e5%ae%9e%e5%9c%b0%e5%9d%80%e6%8a%93%e5%8f%96/#respond">没有评论</a></span>
<span class="meta-span hidden-xs"><i class="fas fa-tags"></i> <a href="http://www.santostang.com/tag/ajax/" rel="tag">ajax</a>,<a href="http://www.santostang.com/tag/python/" rel="tag">python</a>,<a href="http://www.santostang.com/tag/%e7%bd%91%e7%bb%9c%e7%88%ac%e8%99%ab/" rel="tag">网络爬虫</a>,<a href="http://www.santostang.com/tag/%e7%bd%91%e9%a1%b5%e7%88%ac%e8%99%ab/" rel="tag">网页爬虫</a>,<a href="http://www.santostang.com/tag/%e8%a7%a3%e6%9e%90%e5%9c%b0%e5%9d%80/" rel="tag">解析地址</a></span>
</div>
</header>
<div class="post-content clearfix">
<p>由于网易云跟帖停止服务,现在已经在此处中更新了新写的第四章。请参照文章:
4.2 解析真实地址抓取
虽然数据并没有出现在网页源代码中,我们也可以找到数据的真实地址,请求这个真实地址也可以获得想要的数据。...</p>
</div>
</article>
<article class="article-list-1 clearfix">
<header class="clearfix">
<h1 class="post-title"><a href="http://www.santostang.com/2018/07/14/%e7%ac%ac%e5%9b%9b%e7%ab%a0%ef%bc%9a%e5%8a%a8%e6%80%81%e7%bd%91%e9%a1%b5%e6%8a%93%e5%8f%96-%e8%a7%a3%e6%9e%90%e7%9c%9f%e5%ae%9e%e5%9c%b0%e5%9d%80-selenium/">第四章- 动态网页抓取 (解析真实地址 + selenium)</a></h1>
<div class="post-meta">
<span class="meta-span"><i class="far fa-calendar-alt"></i> 07月14日</span>
<span class="meta-span"><i class="far fa-folder"></i> <a href="http://www.santostang.com/category/python-%e7%bd%91%e7%bb%9c%e7%88%ac%e8%99%ab/" rel="category tag">Python 网络爬虫</a></span>
<span class="meta-span"><i class="fas fa-comments"></i> <a href="http://www.santostang.com/2018/07/14/%e7%ac%ac%e5%9b%9b%e7%ab%a0%ef%bc%9a%e5%8a%a8%e6%80%81%e7%bd%91%e9%a1%b5%e6%8a%93%e5%8f%96-%e8%a7%a3%e6%9e%90%e7%9c%9f%e5%ae%9e%e5%9c%b0%e5%9d%80-selenium/#respond">没有评论</a></span>
<span class="meta-span hidden-xs"><i class="fas fa-tags"></i> <a href="http://www.santostang.com/tag/ajax/" rel="tag">ajax</a>,<a href="http://www.santostang.com/tag/javascript/" rel="tag">javascript</a>,<a href="http://www.santostang.com/tag/python/" rel="tag">python</a>,<a href="http://www.santostang.com/tag/selenium/" rel="tag">selenium</a>,<a href="http://www.santostang.com/tag/%e7%bd%91%e7%bb%9c%e7%88%ac%e8%99%ab/" rel="tag">网络爬虫</a></span>
</div>
</header>
<div class="post-content clearfix">
<p>由于网易云跟帖停止服务,现在已经在此处中更新了新写的第四章。请参照文章:
前面爬取的网页均为静态网页,这样的网页在浏览器中展示的内容都在HTML源代码中。但是,由于主流网站都使用JavaScript展现网页内容,...</p>
</div>
</article>
<article class="article-list-1 clearfix">
<header class="clearfix">
<h1 class="post-title"><a href="http://www.santostang.com/2018/07/04/hello-world/">Hello world!</a></h1>
<div class="post-meta">
<span class="meta-span"><i class="far fa-calendar-alt"></i> 07月04日</span>
<span class="meta-span"><i class="far fa-folder"></i> <a href="http://www.santostang.com/category/python-%e7%bd%91%e7%bb%9c%e7%88%ac%e8%99%ab/" rel="category tag">Python 网络爬虫</a></span>
<span class="meta-span"><i class="fas fa-comments"></i> <a href="http://www.santostang.com/2018/07/04/hello-world/#comments">1条评论</a></span>
<span class="meta-span hidden-xs"><i class="fas fa-tags"></i> </span>
</div>
</header>
<div class="post-content clearfix">
<p>Welcome to WordPress. This is your first post. Edit or delete it, then start writing!
各位读者,由于网易云跟帖在本书出版后已经停止服务,书中的第四章已经无法使用。所以我将本书的评论系统换成了来必力...</p>
</div>
</article>
<nav style="float:right">
</nav>
</div>
<div class="col-md-4 hidden-xs hidden-sm">
<aside class="widget clearfix">
<form id="searchform" action="http://www.santostang.com">
<div class="input-group">
<input type="search" class="form-control" placeholder="搜索…" value="" name="s">
<span class="input-group-btn"><button class="btn btn-default" type="submit"><i class="fas fa-search"></i></button></span>
</div>
</form>
</aside>
<aside class="widget clearfix">
<h4 class="widget-title">文章分类</h4>
<ul class="widget-cat">
<li class="cat-item cat-item-2"><a href="http://www.santostang.com/category/python-%e7%bd%91%e7%bb%9c%e7%88%ac%e8%99%ab/" >Python 网络爬虫</a> (5)
</li>
</ul>
</aside>
<aside class="widget clearfix">
<h4 class="widget-title">热门文章</h4>
<ul class="widget-hot">
</ul>
</aside>
<aside class="widget clearfix">
<h4 class="widget-title">随机推荐</h4>
<ul class="widget-hot">
<li><a href="http://www.santostang.com/2018/07/04/hello-world/" title="Hello world!">Hello world!</a></li>
<li><a href="http://www.santostang.com/2018/07/14/4-2-%e8%a7%a3%e6%9e%90%e7%9c%9f%e5%ae%9e%e5%9c%b0%e5%9d%80%e6%8a%93%e5%8f%96/" title="第四章 – 4.2 解析真实地址抓取">第四章 – 4.2 解析真实地址抓取</a></li>
<li><a href="http://www.santostang.com/2018/07/15/4-3-%e9%80%9a%e8%bf%87selenium-%e6%a8%a1%e6%8b%9f%e6%b5%8f%e8%a7%88%e5%99%a8%e6%8a%93%e5%8f%96/" title="第四章 – 4.3 通过selenium 模拟浏览器抓取">第四章 – 4.3 通过selenium 模拟浏览器抓取</a></li>
<li><a href="http://www.santostang.com/2018/07/11/%e3%80%8a%e7%bd%91%e7%bb%9c%e7%88%ac%e8%99%ab%ef%bc%9a%e4%bb%8e%e5%85%a5%e9%97%a8%e5%88%b0%e5%ae%9e%e8%b7%b5%e3%80%8b%e4%b8%80%e4%b9%a6%e5%8b%98%e8%af%af/" title="《网络爬虫:从入门到实践》一书勘误">《网络爬虫:从入门到实践》一书勘误</a></li>
<li><a href="http://www.santostang.com/2018/07/14/%e7%ac%ac%e5%9b%9b%e7%ab%a0%ef%bc%9a%e5%8a%a8%e6%80%81%e7%bd%91%e9%a1%b5%e6%8a%93%e5%8f%96-%e8%a7%a3%e6%9e%90%e7%9c%9f%e5%ae%9e%e5%9c%b0%e5%9d%80-selenium/" title="第四章- 动态网页抓取 (解析真实地址 + selenium)">第四章- 动态网页抓取 (解析真实地址 + selenium)</a></li>
</ul>
</aside>
<aside class="widget clearfix">
<h4 class="widget-title">标签云</h4>
<div class="widget-tags">
<a href="http://www.santostang.com/tag/ajax/" class="tag-cloud-link tag-link-3 tag-link-position-1" style="color:#8c569b;font-size: 22pt;" aria-label="ajax (2个项目);">ajax</a>
<a href="http://www.santostang.com/tag/javascript/" class="tag-cloud-link tag-link-4 tag-link-position-2" style="color:#6263b3;font-size: 8pt;" aria-label="javascript (1个项目);">javascript</a>
<a href="http://www.santostang.com/tag/python/" class="tag-cloud-link tag-link-5 tag-link-position-3" style="color:#f7137;font-size: 22pt;" aria-label="python (2个项目);">python</a>
<a href="http://www.santostang.com/tag/selenium/" class="tag-cloud-link tag-link-6 tag-link-position-4" style="color:#a32540;font-size: 8pt;" aria-label="selenium (1个项目);">selenium</a>
<a href="http://www.santostang.com/tag/%e7%bd%91%e7%bb%9c%e7%88%ac%e8%99%ab/" class="tag-cloud-link tag-link-8 tag-link-position-5" style="color:#e085d;font-size: 22pt;" aria-label="网络爬虫 (2个项目);">网络爬虫</a>
<a href="http://www.santostang.com/tag/%e7%bd%91%e9%a1%b5%e7%88%ac%e8%99%ab/" class="tag-cloud-link tag-link-9 tag-link-position-6" style="color:#987bc2;font-size: 8pt;" aria-label="网页爬虫 (1个项目);">网页爬虫</a>
<a href="http://www.santostang.com/tag/%e8%a7%a3%e6%9e%90%e5%9c%b0%e5%9d%80/" class="tag-cloud-link tag-link-10 tag-link-position-7" style="color:#91be70;font-size: 8pt;" aria-label="解析地址 (1个项目);">解析地址</a> </div>
</aside>
<aside class="widget clearfix">
<h4 class="widget-title">友情链接</h4>
<ul class="widget-links">
</ul>
</aside>
</div>
</div>
</div>
<div class="footer_search visible-xs visible-sm">
<form id="searchform" action="http://www.santostang.com">
<div class="input-group">
<input type="search" class="form-control" placeholder="搜索…" value="" name="s">
<span class="input-group-btn"><button class="btn btn-default" type="submit"><i class="fas fa-search"></i></button></span>
</div>
</form>
</div>
<footer id="footer">
<div class="copyright">
<p><i class="far fa-copyright"></i> 2019 <b>数据科学@唐松 | 粤ICP备19068356号</b></p>
<p>Powered by <b>WordPress</b>. Theme by <a href="https://tangjie.me/jiestyle-two" data-toggle="tooltip" data-placement="top" title="WordPress 主题模板" target="_blank"><b>JieStyle Two</b></a> | </p>
</div>
<div style="display:none;"></div>
</footer>
<script type="text/javascript" src="http://www.santostang.com/wp-content/themes/SongStyle-Two/js/jquery.min.js"></script>
<script type="text/javascript" src="http://www.santostang.com/wp-content/themes/SongStyle-Two/js/bootstrap.min.js"></script>
<script type="text/javascript" src="http://www.santostang.com/wp-content/themes/SongStyle-Two/js/skel.min.js"></script>
<script type="text/javascript" src="http://www.santostang.com/wp-content/themes/SongStyle-Two/js/util.min.js"></script>
<script type="text/javascript" src="http://www.santostang.com/wp-content/themes/SongStyle-Two/js/nav.js"></script>
<script type='text/javascript' src='http://www.santostang.com/wp-includes/js/jquery/jquery.js?ver=1.12.4'></script>
<script type='text/javascript' src='http://www.santostang.com/wp-includes/js/jquery/jquery-migrate.min.js?ver=1.4.1'></script>
<script type='text/javascript' src='http://www.santostang.com/wp-content/plugins/captcha-bank/assets/global/plugins/custom/js/front-end-script.js?ver=4.9.10'></script>
<script>
$(function() {
$('[data-toggle="tooltip"]').tooltip()
});
</script>
<script>
(function(){
var bp = document.createElement('script');
var curProtocol = window.location.protocol.split(':')[0];
if (curProtocol === 'https') {
bp.src = 'https://zz.bdstatic.com/linksubmit/push.js';
}
else {
bp.src = 'http://push.zhanzhang.baidu.com/push.js';
}
var s = document.getElementsByTagName("script")[0];
s.parentNode.insertBefore(bp, s);
})();
</script>
<script>
var _hmt = _hmt || [];
(function() {
var hm = document.createElement("script");
hm.src = "https://hm.baidu.com/hm.js?752e310cec7906ba7afeb24cd7114c48";
var s = document.getElementsByTagName("script")[0];
s.parentNode.insertBefore(hm, s);
})();
</script>
</body>
</html>
Process finished with exit code 0
Next, the code is as follows:
#!/usr/bin/python
# coding: UTF-8
import requests
from bs4 import BeautifulSoup
link = 'http://www.santostang.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US;'
' rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)
# print(r.text)
soup = BeautifulSoup(r.text, 'lxml')
title = soup.find("h1", class_="post-title").a.text
print (title)
print (title.strip())
第四章 – 4.3 通过selenium 模拟浏览器抓取
第四章 – 4.3 通过selenium 模拟浏览器抓取
Chrome browser:
http://www.w3cschool.cn/python/python-100-examples.html
#!/usr/bin/python
# coding: UTF-8
import requests
from bs4 import BeautifulSoup
link = 'http://www.santostang.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US;'
' rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)
# print(r.text)
soup = BeautifulSoup(r.text, 'lxml')
title = soup.find("h1", class_="post-title").a.text
print (title)
print (title.strip())
# Append the title to a file; the with block closes the file automatically
with open('title.txt', 'a+', encoding='utf-8') as f:
    f.write(title.strip())
Chapter 3
http://httpbin.org/get?key1=value1
url: http://httpbin.org/get?key1=value1&key2=value2
text: {
"args": {
"key1": "value1",
"key2": "value2"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.22.0"
},
"origin": "124.128.63.186, 124.128.63.186",
"url": "https://httpbin.org/get?key1=value1&key2=value2"
}
code:
import requests
key_dict = {'key1': 'value1', "key2": "value2"}
r = requests.get('http://httpbin.org/get', params = key_dict)
print("url:", r.url)
print("text:", r.text)
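The `params` argument above ends up as the query string shown in the `url:` line. As a rough offline illustration of that encoding step, the sketch below uses the standard library's `urllib.parse` rather than requests itself. The helper name `build_url` is made up for this note.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_url(base, params):
    """Append params to base as a percent-encoded query string."""
    return base + "?" + urlencode(params)

key_dict = {'key1': 'value1', 'key2': 'value2'}
url = build_url('http://httpbin.org/get', key_dict)
print(url)

# Parse the query string back into a dict to confirm the round trip.
print(parse_qs(urlsplit(url).query))
```

Passing a dict instead of concatenating strings by hand means special characters in the values get percent-encoded for you.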
https://movie.douban.com/top250
page 38
Provisional headers are shown
Referer: https://movie.douban.com/top250
Sec-Fetch-Mode: no-cors
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3824.6 Safari/537.36
Code:
import requests
from bs4 import BeautifulSoup
# https://movie.douban.com/top250
def get_movies():
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3824.6 '
                      'Safari/537.36',
        'Host': 'movie.douban.com'
    }
    movie_list = []
    # 10 pages of 25 movies each: start = 0, 25, ..., 225
    for i in range(0, 10):
        link = 'http://movie.douban.com/top250?start=' + str(i*25)
        r = requests.get(link, headers=headers, timeout=10)
        print("url:", r.url)
        print(str(i+1), "status_code:", r.status_code)
        # print(r.text)
        soup = BeautifulSoup(r.text, 'lxml')
        div_list = soup.find_all('div', class_='hd')
        for each in div_list:
            # The first <span> inside the link holds the movie title
            movie = each.a.span.text.strip()
            movie_list.append(movie)
    return movie_list
movies = get_movies()
print(movies)
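The paging logic in the loop above can be looked at on its own, without sending any requests. This is a small sketch; the generator name `top250_page_urls` and its parameters are invented for illustration, and the URL pattern is the one used in the code above.

```python
def top250_page_urls(pages=10, page_size=25):
    """Yield one list-page URL per page, stepping the start offset by page_size."""
    for i in range(pages):
        yield 'http://movie.douban.com/top250?start=' + str(i * page_size)

urls = list(top250_page_urls())
print(len(urls))
print(urls[0])
print(urls[-1])
```

Generating the URLs up front makes it easy to check the offsets (0 through 225) before crawling.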
Chapter 4:
page 40
http://www.santostang.com/2018/07/04/hello-world/
AJAX (Asynchronous JavaScript and XML)
AJAX is used to load dynamic web pages
Selenium
https://www.anaconda.com/
page 10
https://github.com/mozilla/geckodriver/releases
geckodriver
Download Firefox
How to open .ipynb files
https://blog.youkuaiyun.com/qq_16633405/article/details/80198648
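Recently I came across files with the .ipynb extension. At first I did not pay much attention to the format. Opening one in Notepad++ showed JSON-like content, so I assumed it served some other file (something like a configuration file for HTML files). Later I found out that it is itself a text-based representation, except that it needs a special tool to open and display it (I am a novice and had never even seen this file format before).
OK, enough chatter. Here are three ways to open a .ipynb file:
1. GitHub can open .ipynb files directly.
2. Paste the download link of the .ipynb file into https://nbviewer.jupyter.org/ to view it.
3. Install Anaconda and launch the Jupyter Notebook shortcut from the Start menu (running the same command in a prompt works the same way). The default start directory is a folder like C:\Users\yourname. Copy the .ipynb file into that directory, then find and open it to view it.
This was a small detour at work. When I have time I will summarize the other detours I have run into recently, so that those who come after can avoid them.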
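Since an .ipynb file is JSON underneath, as noted above, its cells can also be read with Python's `json` module. This is a minimal sketch, assuming the nbformat 4 layout (a top-level `cells` list whose entries carry `cell_type` and `source`). The notebook content and the helper name `code_cells` are invented for illustration.

```python
import json

# A tiny, invented notebook in the nbformat 4 layout.
notebook_text = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo\n"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["import requests\n", "print('hi')"]},
    ],
})

def code_cells(ipynb_text):
    """Return the source of every code cell, one string per cell."""
    nb = json.loads(ipynb_text)
    result = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            src = cell.get("source", "")
            # "source" may be a list of lines or a single string.
            result.append("".join(src) if isinstance(src, list) else src)
    return result

print(code_cells(notebook_text))
```

For a real file, read its text with `open(path, encoding='utf-8').read()` and pass it to `code_cells`.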
import requests
link = """https://api-zero.livere.com/v1/comments/list?callback=jQuery112403473268296510956_1531502963311&limit=10&repSeq=4272904&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1531502963313"""
headers = {'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers= headers)
print (r.text)
How to capture AJAX-loaded content in code
import requests
link = """https://api-zero.livere.com/v1/comments/list?callback=jQuery112403473268296510956_1531502963311&limit=10&repSeq=4272904&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1531502963313"""
headers = {'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers= headers)
print (r.text)
# Get the JSON string
json_string = r.text
# Keep everything from the first left brace; drop the last two characters, the ")" and ";"
json_string = json_string[json_string.find('{'):-2]
print(json_string)
import json
json_data = json.loads(json_string)
comment_list = json_data['results']['parents']
for eachone in comment_list:
    message = eachone['content']
    print(message)
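The slice `[json_string.find('{'):-2]` above relies on the response ending in exactly `);`. A slightly more tolerant sketch follows, run on an invented JSONP string rather than a live response. The helper name `strip_jsonp` and the sample payload are assumptions made for this note.

```python
import json

def strip_jsonp(text):
    """Return the JSON object wrapped inside a JSONP call such as cb({...});"""
    start = text.find('{')
    end = text.rfind('}')
    if start == -1 or end == -1 or end < start:
        raise ValueError('no JSON object found in response')
    # Slice from the first "{" to the last "}" so trailing ");" or newlines do not matter.
    return json.loads(text[start:end + 1])

# Invented sample that mimics the shape used above: results -> parents -> content.
sample = ('jQuery1124_1531502963311({"results": {"parents": '
          '[{"content": "first"}, {"content": "second"}]}});')
data = strip_jsonp(sample)
for parent in data['results']['parents']:
    print(parent['content'])
```

Locating the last `}` instead of counting characters from the end keeps the parsing working if the server adds whitespace or omits the semicolon.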
Code 2:
import requests
import json
def single_page_comment(link):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    r = requests.get(link, headers=headers)
    # Get the JSON string, stripping the JSONP wrapper
    json_string = r.text
    json_string = json_string[json_string.find('{'):-2]
    json_data = json.loads(json_string)
    comment_list = json_data['results']['parents']
    for eachone in comment_list:
        message = eachone['content']
        print(message)
for page in range(1, 4):
    # The page number goes into the offset parameter, between the two URL halves
    link1 = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112407875296433383039_1506267778283&limit=10&offset="
    link2 = "&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1506267778285"
    page_str = str(page)
    link = link1 + page_str + link2
    print(link)
    single_page_comment(link)
Selenium
code1:
from selenium import webdriver
import time
driver = webdriver.Firefox(executable_path = r'C:\ProgramData\Anaconda3\Scripts\geckodriver.exe')
driver.implicitly_wait(20) # implicit wait, up to 20 seconds
# Change the path above to the location of geckodriver.exe on your computer
driver.get("http://www.santostang.com/2018/07/04/hello-world/")
driver.switch_to.frame(driver.find_element_by_css_selector("iframe[title='livere']"))
comment = driver.find_element_by_css_selector('div.reply-content')
content = comment.find_element_by_tag_name('p')
print (content.text)
code2 :
from selenium import webdriver
import time
driver = webdriver.Firefox(executable_path = r'C:\ProgramData\Anaconda3\Scripts\geckodriver.exe')
driver.implicitly_wait(20) # implicit wait, up to 20 seconds
# Change the path above to the location of geckodriver.exe on your computer
driver.get("http://www.santostang.com/2018/07/04/hello-world/")
time.sleep(5)
for i in range(0,3):
    # Scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Switch into the iframe, then find the "load more" button and click it
    driver.switch_to.frame(driver.find_element_by_css_selector("iframe[title='livere']"))
    load_more = driver.find_element_by_css_selector('button.more-btn')
    load_more.click()
    # Switch back out of the iframe
    driver.switch_to.default_content()
    time.sleep(2)
driver.switch_to.frame(driver.find_element_by_css_selector("iframe[title='livere']"))
comments = driver.find_elements_by_css_selector('div.reply-content')
for eachcomment in comments:
    content = eachcomment.find_element_by_tag_name('p')
    print (content.text)
https://zh.airbnb.com/s/Shenzhen--China?page=1
Chapter 5: Parsing web pages
Learn regular expressions with the Regular Expression 101 site
Site address:
https://regex101.com/
(to be added later)
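Until these notes are filled in, here is a small regular-expression sketch in the spirit of the earlier title extraction. It uses Python's `re` module on an invented HTML fragment. The pattern is an illustration written for this note, not taken from the book.

```python
import re

# Invented fragment shaped like the blog's post-title headings.
html = '''
<h1 class="post-title"><a href="http://www.santostang.com/a/">Post A</a></h1>
<h1 class="post-title"><a href="http://www.santostang.com/b/">Post B</a></h1>
'''

# Capture the link target and the link text of each post-title heading.
# (.*?) is a non-greedy group, so each match stops at the nearest closing quote or tag.
pattern = re.compile(r'<h1 class="post-title"><a href="(.*?)">(.*?)</a></h1>')

for href, title in pattern.findall(html):
    print(title, href)
```

A pattern like this can be tried interactively on regex101.com before it goes into a script. For anything beyond simple fragments, an HTML parser such as BeautifulSoup is usually less fragile than a regular expression.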