Using Python Crawlers to Inflate Blog Page Views (Views, Likes, and Image Scraping)

Latest recommended article published on 2021-07-24 21:13:15

Original


License: CC 4.0 BY-SA

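This article explains the principle and method of inflating blog page views with a Python crawler, including capturing request header data with Fiddler and using regular expressions to scrape CSDN blog view counts and images. It also shares a script that sends QQ messages automatically and mentions new features of the Markdown editor, such as image drag-and-drop, KaTeX math formulas, and Gantt chart support.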



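Tools you will need:

Python. Download it from https://www.python.org/

Fiddler, a packet-capture tool: http://blog.youkuaiyun.com/qq_21792169/article/details/51628123

The principle behind inflating blog page views: every time the page is opened, the view count goes up by one. (Blogs on Sina, Sohu, and similar sites behave this way.)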

count.py

import os
import random
import time
import webbrowser as web

# Run one or two rounds, chosen at random.
count = random.randint(1, 2)

for _ in range(count):
    # Open the page in 9 new tabs; each page load counts as one view.
    for _ in range(9):
        web.open_new_tab('http://blog.sina.com.cn/s/blog_552d7c620100aguu.html')  # replace with your own URL
        time.sleep(3)  # seconds; tune to how fast your machine loads the page
    time.sleep(10)  # seconds; give the last tabs time to finish loading
    # Close the browser (Windows only). This targets Google Chrome;
    # change the image name for other browsers.
    os.system('taskkill /F /IM chrome.exe')

Inflating likes requires Fiddler to capture the request header data, such as Cookie, Host, Referer, and User-Agent.

sina.py

import sys
import urllib.request

# Number of requests to send; can be overridden on the command line.
points = 2
if len(sys.argv) > 1:
    points = int(sys.argv[1])

article_url = ''  # fill in the request URL captured with Fiddler
point_header = {
    'Accept': '*/*',
    'Cookie': '',   # fill in your cookie
    'Host': '',     # host name
    'Referer': '',
    'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36',
}

# Send the same request `points` times with the captured headers.
for i in range(points):
    point_request = urllib.request.Request(article_url, headers=point_header)
    point_response = urllib.request.urlopen(point_request)

The header values above can be taken from the captured traffic; the script is only meant to illustrate the approach.
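As a minimal sketch of that step, the raw request text copied from a capture tool can be turned into a headers dict instead of being typed in by hand. The helper name parse_raw_headers, the example.com URL, and every header value below are hypothetical placeholders, not data from the original article. Building the Request object does not send anything over the network.

```python
import urllib.request


def parse_raw_headers(raw):
    """Turn a block of 'Name: value' lines into a dict of headers."""
    headers = {}
    for line in raw.strip().splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        # A real header name contains no spaces; this skips blank lines and
        # the request line ("GET http://... HTTP/1.1").
        if not sep or not name or " " in name:
            continue
        headers[name] = value.strip()
    return headers


# Hypothetical capture; the URL, cookie, and other values are placeholders.
RAW_CAPTURE = """
GET http://example.com/like?id=1 HTTP/1.1
Host: example.com
Accept: */*
Referer: http://example.com/post/1
User-Agent: Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36
Cookie: session=PLACEHOLDER; user=demo
"""

headers = parse_raw_headers(RAW_CAPTURE)

# Constructing the Request only prepares it; urlopen() would send it.
request = urllib.request.Request("http://example.com/like?id=1", headers=headers)
print(sorted(headers))
print(request.get_header("Host"))
```

Splitting on the first colon only keeps values that contain colons themselves, such as the Referer URL, intact.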

Scraping images from a web page:

getimg.py

#coding=utf-8
# Python 2 script (it uses urllib2 and the print statement).
import urllib
import urllib2
import re

def getHtml(url):
    # Send a browser-like User-Agent and return the page source.
    headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    req = urllib2.Request(url, headers=headers)
    page = urllib2.urlopen(req)
    html = page.read()
    return html

def getImg(html):
    # Capture every src value that starts with "h" and ends with "g".
    reg = r'src="(h.*?g)"'
    #reg = r'<img src="(.+?\.jpg)"'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    print imglist
    # Save each image as 0.jpg, 1.jpg, ... in the current directory.
    x = 0
    for imgurl in imglist:
        urllib.urlretrieve(imgurl, '%s.jpg' % x)
        x += 1

html = getHtml("http://pic.yxdown.com/list/0_0_1.html")
getImg(html)

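1. The three characters .*? together match any number of arbitrary characters, as few as possible (the match stops at the first point where the rest of the pattern can match).

2. \. escapes the dot, so it matches a literal . in the HTML instead of any character.

3. The parentheses ( ) form a capture group: only the part inside them is returned, and the rest of the match is discarded.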
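The three points above can be checked on a small string without downloading anything. This is a sketch in Python 3; SAMPLE_HTML, the example.com URLs, and the STRICT pattern are assumptions added for illustration, while LOOSE is the pattern used in getimg.py.

```python
import re

# Hypothetical HTML fragment standing in for a downloaded page.
SAMPLE_HTML = (
    '<img src="http://example.com/a.jpg" alt="a">'
    '<img src="http://example.com/b.png">'
    '<script src="http://example.com/lib/config"></script>'
    '<img src="/relative/c.jpg">'
)

# Pattern from getimg.py: starts with "h", then lazily runs to the first g".
LOOSE = re.compile(r'src="(h.*?g)"')

# Stricter variant (an assumption, not from the original article):
# only http(s) URLs that end in a common image extension.
STRICT = re.compile(r'src="(https?://[^"]+?\.(?:jpg|jpeg|png|gif))"')


def extract(pattern, html):
    """Return the captured group of every match of pattern in html."""
    return pattern.findall(html)


loose_urls = extract(LOOSE, SAMPLE_HTML)
strict_urls = extract(STRICT, SAMPLE_HTML)

# Point 2: an unescaped dot matches any character, an escaped one only ".".
any_char = re.findall(r'a.jpg', 'aXjpg a.jpg')
literal_dot = re.findall(r'a\.jpg', 'aXjpg a.jpg')

# Point 3: without parentheses, findall returns the whole match.
whole_match = re.findall(r'src="h.*?g"', SAMPLE_HTML)[0]

print(loose_urls)
print(strict_urls)
print(any_char, literal_dot)
print(whole_match)
```

On this sample the loose pattern also returns the script URL ending in "config", because it only requires the value to start with h and end with g; the stricter variant returns just the two absolute image URLs.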

Scraping CSDN view counts:

csdn.py


<code class="language-html">#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2
import re
# Current page number of the blog list
page_num = 1
# Truthy while the current page is not the last list page
notLast = 1
fs = open('blogs.txt','w')
account = str(raw_input('Input csdn Account:'))
while notLast:
    # Home page URL
    baseUrl = 'http://blog.youkuaiyun.com/'+account
    # Append the page number to build the URL of the list page to crawl
    myUrl = baseUrl+'/article/list/'+str(page_num)
    # Pretend to be a browser; CSDN rejects direct requests
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent':user_agent}
    # Build the request
    req = urllib2.Request(myUrl,headers=headers)
    # Fetch the page
    myResponse = urllib2.urlopen(req)
    myPage = myResponse.read()
    # The page is not the last one if it contains a '尾页' ("last page") link
    notLast = re.findall('<a href=".*?">尾页</a>',myPage,re.S)
    # '第%d页' means "page %d"
    print '-----------------------------第%d页---------------------------------' % (page_num,)
    fs.write('--------------------------------第%d页--------------------------------\n' % page_num)
    # Use a regular expression to get each post's href
    title_href = re.findall('<span class="link_title"><a href="(.*?)">',myPage,re.S)
    titleListhref=[]
    for items in title_href:
        titleListhref.append(str(items).lstrip().rstrip())
    # Use a regular expression to get each post's title
    title= re.findall('<span class="link_title"><a href=".*?">(.*?)</a></span>',myPage,re.S)
    titleList=[]
    for items in title:
        titleList.append(str(items).lstrip().rstrip())
    # Use a regular expression to get each post's view count
    view = re.findall('<span class="link_view".*?><a href=".*?" title=