A Python Baidu Tieba Crawler, Plus a Brief Introduction to Web Crawling

Article link: https://blog.youkuaiyun.com/Kindle_code/article/details/46974373

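This article gives practical advice to beginners writing web crawlers in Python, covering good programming habits, tips for using regular expressions, and a worked example that scrapes information from Tieba pages.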


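A bit of advice: if you are just starting out with crawlers, I don't recommend jumping straight to frameworks such as scrapy as soon as you finish learning Python. My own fundamentals were fairly weak, so what I took away from reading about frameworks was patchy. In general, beginners get a much better feel for the essence of Python crawling by working with basic libraries first.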

Prerequisites

1. When writing Python, get into the habit of starting every file with these lines:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

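2. The most important part of a crawler is matching against the source you have fetched to find what you need. We mainly use regular expressions for this, and as a beginner I suggest sticking to regular expressions entirely. You may hear people talk about "magic tools"; yes, there are also xpath and beautifulsoup. Get familiar with those two as well, and then use whichever is simplest for the job. Here is a tutorial for learning regular expressions: http://blog.youkuaiyun.com/pleasecallmewhy/article/details/8929576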

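3. Your Python basics need to be solid; in particular, make sure you understand tuples, lists, and dictionaries. A recommended site for learning Python: http://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000/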

4. Learn to write comments. Whatever the language, everyone knows how much comments matter.
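
Points 2 and 3 can be sketched together in a few lines. This is only a minimal illustration, assuming Python 3; the HTML fragment and the names sample_html, pattern, rows, and titles_by_id are made up for the example and do not come from a real Tieba page:

```python
import re

# Hypothetical HTML fragment, invented for this example.
sample_html = '''
<ul>
  <li><a href="/p/1001">First thread</a></li>
  <li><a href="/p/1002">Second thread</a></li>
</ul>
'''

# (\d+) captures the digits after /p/; (.*?) lazily captures the link text.
pattern = re.compile(r'<a href="/p/(\d+)">(.*?)</a>', re.S)

# With two capture groups, findall returns a list of tuples.
rows = pattern.findall(sample_html)

# Each tuple unpacks into two names; the list collects the full links.
links = []
for thread_id, title in rows:
    links.append('http://tieba.baidu.com/p/' + thread_id)

# A dict maps each id to its title for quick lookup.
titles_by_id = dict(rows)

print(rows)
print(links)
print(titles_by_id['1002'])
```

The same three containers show up again in the crawler: findall hands back a list, each match with several groups is a tuple, and a dict is handy once you want to look results up by id.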

Starting the crawler

Fetching the page source

import requests
# This crawls the Tieba forum of my own university
url = 'http://tieba.baidu.com/f?kw=%E5%8D%97%E5%8D%8E%E5%A4%A7%E5%AD%A6&ie=utf-8&pn=1'
html = requests.get(url)
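
The kw parameter in that URL is simply the forum name, encoded as UTF-8 and then percent-escaped. As a rough sketch, assuming Python 3, the URL can be rebuilt from the readable name; build_list_url is a hypothetical helper introduced only for this example:

```python
from urllib.parse import quote, unquote


def build_list_url(forum_name, offset=0):
    """Hypothetical helper: build a Tieba listing URL from a forum name."""
    # quote() percent-encodes the UTF-8 bytes of the name.
    return ('http://tieba.baidu.com/f?kw=' + quote(forum_name)
            + '&ie=utf-8&pn=' + str(offset))


# Decoding the kw value from the URL above gives back the forum name.
forum_name = unquote('%E5%8D%97%E5%8D%8E%E5%A4%A7%E5%AD%A6')
print(forum_name)

list_url = build_list_url(forum_name, 1)
print(list_url)
```

Building the URL this way makes it easy to point the same crawler at a different forum by changing only the name.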

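With just these few lines we have fetched the entire source of the forum's first page. Don't celebrate too early, though: the most important work is only beginning. The first task is to analyze the page source. I noticed that all thread links look alike, of the form http://tieba.baidu.com/p/3802073149?pn=1, where the string of digits is the thread's id and pn=1 means page one of that thread.
So, looking at the source of the first listing page, we can use a regular expression to match the id in every link on the page, then prepend the common part of the URL and store the results in a list.
Each page turns out to have 30 id links, so fetching the first 10 pages gives 300 id links in total. Here is the source: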

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# __author__ = 'Mr Cai'

# Import the regular expression module
import re
import requests

# Return the URL of each listing page
def getPage(first):
    url = []
    # pn takes the offsets 0, 50, ..., 450: ten page URLs in total
    for i in range(0, 451, 50):
        url.append(first + str(i))
    return url

# Collect the thread id links from each page
def getId(url_list):

    # Send a browser User-Agent to get past anti-crawler checks
    headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.132 Safari/537.36'}
    Id_list = []
    # Fetch the ids one page at a time
    for url in url_list:
        html = requests.get(url, headers=headers)  # fetch the page source
        html.encoding = 'utf-8'  # this page is encoded as utf-8
        # Match the id numbers; html.text is the body decoded with html.encoding
        id_temp = re.findall(r';,&quot;id&quot;:(.*?),&quot;', html.text, re.S)
        Id_list += id_temp
    # Turn each id into a full link
    for i in range(0, len(Id_list)):
        Id_list[i] = 'http://tieba.baidu.com/p/' + Id_list[i]

    return Id_list  # the link of every thread

if __name__ == '__main__':
    first = "http://tieba.baidu.com/f?kw=%E5%8D%97%E5%8D%8E%E5%A4%A7%E5%AD%A6&ie=utf-8&pn="  # the part of the URL shared by every listing page
    allPage = getPage(first)  # build the first ten page URLs and return them as a list
    Id_url = getId(allPage)  # get the links of all threads
    print(Id_url)
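
The id-matching step can be tried offline before hitting the real site. The sketch below reuses the regular expression from getId on a made-up snippet; the sample_source layout (a data-field attribute holding escaped JSON) is an assumption modeled on what that regex expects, and extract_thread_links is a hypothetical name:

```python
import re

# Same pattern as in getId: the id sits between '&quot;id&quot;:' and ',&quot;'.
ID_PATTERN = re.compile(r';,&quot;id&quot;:(.*?),&quot;', re.S)


def extract_thread_links(page_source):
    """Hypothetical helper: turn every matched id into a full thread link."""
    ids = ID_PATTERN.findall(page_source)
    return ['http://tieba.baidu.com/p/' + thread_id for thread_id in ids]


# Made-up listing source; the real page layout may differ.
sample_source = (
    '<li data-field="{&quot;author_name&quot;:&quot;user_a&quot;,'
    '&quot;id&quot;:3802073149,&quot;reply_num&quot;:12}">...</li>\n'
    '<li data-field="{&quot;author_name&quot;:&quot;user_b&quot;,'
    '&quot;id&quot;:3900000001,&quot;reply_num&quot;:3}">...</li>'
)

links = extract_thread_links(sample_source)
print(links)
```

Saving one real page to disk and running the pattern against it is a cheap way to check the regex without sending repeated requests.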

Scraping information from a single thread

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

import requests
from lxml import etree


def getsource(url):
    headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.132 Safari/537.36'}
    html = requests.get(url, headers=headers)
    html.encoding = 'utf-8'
    # .text decodes the body with the encoding set above; .content would return the raw bytes
    return html.text

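def getValue(html):

    selector = etree.HTML(html)
    # Stores the posting time of each reply
    time = []
    # Stores the content of each reply
    content = []
    # Counts how many posts are on the page (mainly for pages with fewer than 30)
    count = len(re.findall(r'<div class="l_post l_post_bright j_l_post clearfix"',html,re.S))

    # List of post contents (do not use!)
    # content = selector.xpath('//*[starts-with(@id,"post_content_")]/text()')  # first try with the "magic tool" (it caused problems)

    # Grab the big blocks first
    content_list = re.findall(r'<div id="post_content.*?</div>',html,re.S)
    # Use sub to replace everything inside <> with a space, leaving only the actual text
    for con in content_list:
        content.append(re.sub(r'<[^>]+>',' ',con))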

    # Extract the posting times
    # Grab the big blocks first
    time_list = re.findall(r'<div class="post-tail-wrap">.*?</div>',html,re.S)
    # Then pull the time out of each block
    for tail in time_list:
        time_selector = etree.HTML(tail)
        temp = time_selector.xpath('//*[@class="tail-info"]/text()')
        time.append(temp[-1])

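    # Show the results
    # Loop over the posts actually found instead of assuming there are 30
    for i in range(min(count, len(content), len(time))):
        print(content[i])
        print(time[i])
        print('~~~~~~~~~~~~~~~~~~~~~~~')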

    # f = open('show.txt','w+')
    # for con in time:
    #      f.writelines(con + '\n')
    # f.close()

if __name__ == '__main__':
    url = 'http://tieba.baidu.com/p/3802073149'
    html = getsource(url)
    getValue(html)
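
The "grab the big block first, then strip the tags" idea from getValue can also be tested without any network access. This is a small sketch on an invented fragment; sample_thread and strip_tags are hypothetical names, and the whitespace cleanup at the end is an addition that the original code does not perform:

```python
import re

# Invented fragment shaped like two reply blocks; real pages are messier.
sample_thread = '''
<div id="post_content_1" class="d_post_content">Hello <br>world<a href="#">link</a></div>
<div id="post_content_2" class="d_post_content">Second reply</div>
'''


def strip_tags(block):
    """Hypothetical helper: replace each tag with a space, then tidy whitespace."""
    text = re.sub(r'<[^>]+>', ' ', block)
    # Collapse runs of whitespace left behind by the removed tags.
    return ' '.join(text.split())


# Step 1: grab each whole reply block (lazy match up to the first </div>).
blocks = re.findall(r'<div id="post_content.*?</div>', sample_thread, re.S)

# Step 2: strip the markup inside each block.
texts = [strip_tags(block) for block in blocks]
print(texts)
```

One caveat: because the match stops at the first </div>, a reply that contains a nested div would be cut short, which is a case where xpath or beautifulsoup may be the simpler tool.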