Python Case Study 3

License: CC 4.0 BY-SA

Simulating a Sina Weibo login and scraping data with Python's cookielib and urllib2

2012/07/04 by crazyant · No comments yet

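We all know that HTTP is a stateless protocol, yet the client and the server still need to keep some shared information, such as cookies. With a cookie, the server knows that this user has just logged in and grants the client access to certain pages.

When you use Sina Weibo in a browser, you must log in first; only after a successful login can you open the other pages. Logging in to Sina Weibo, or to any other site that requires authentication, from a program hinges on the same thing: save the cookie, then send it along with every later request.

This is where Python's cookielib and urllib2 work together: bind a cookielib cookie jar to urllib2, and the cookies are attached automatically whenever a page is requested.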
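On Python 3 the same modules live under different names: cookielib became http.cookiejar and urllib2 was folded into urllib.request. A minimal sketch of the binding described above, assuming Python 3 (build_cookie_opener is a hypothetical helper name, not something from the original code):

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, jar); every request made through opener shares jar."""
    jar = http.cookiejar.LWPCookieJar()
    processor = urllib.request.HTTPCookieProcessor(jar)
    opener = urllib.request.build_opener(processor)
    return opener, jar


opener, jar = build_cookie_opener()
# opener.open(url) would send the cookies held in `jar` and store any new
# ones the server sets; no request is made in this sketch.
print(len(jar))  # number of cookies currently held by the fresh jar
```

Returning the opener instead of installing it globally keeps the cookie state explicit; urllib.request.install_opener(opener) would reproduce the global behavior used in the article.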

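Concretely, the first step is to use the HttpFox add-on for Firefox: open the Sina Weibo home page in the browser, log in, and check in the HttpFox log which data was sent to which URL at each step. Then reproduce that process in Python: use urllib2.urlopen to send the username and password to the login page, obtain the post-login cookie, and then visit other pages to fetch Weibo data.

The code comes from an article on Douban.

I have added a few comments; enjoy this author's excellent code: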

      
# coding=utf8
import urllib
import urllib2
import cookielib
import base64
import re
import json
import hashlib

# Create an object that stores cookies
cj = cookielib.LWPCookieJar()
# Bind the cookie jar to an HTTP cookie processor
cookie_support = urllib2.HTTPCookieProcessor(cj)
# Build an opener from the cookie processor and a handler that opens HTTP URLs
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
# Install the opener globally so that urllib2.urlopen sends and stores cookies
urllib2.install_opener(opener)

postdata = {
    'entry': 'weibo',
    'gateway': '1',
    'from': '',
    'savestate': '7',
    'userticket': '1',
    'ssosimplelogin': '1',
    'vsnf': '1',
    'vsnval': '',
    'su': '',
    'service': 'miniblog',
    'servertime': '',
    'nonce': '',
    'pwencode': 'wsse',
    'sp': '',
    'encoding': 'UTF-8',
    'url': 'http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack',
    'returntype': 'META'
}

def get_servertime():
    url = 'http://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&su=dW5kZWZpbmVk&client=ssologin.js(v1.3.18)&_=1329806375939'
    data = urllib2.urlopen(url).read()
    # The response is JSONP; capture the JSON inside the parentheses
    p = re.compile(r'\((.*)\)')
    try:
        json_data = p.search(data).group(1)
        data = json.loads(json_data)
        servertime = str(data['servertime'])
        nonce = data['nonce']
        return servertime, nonce
    except (AttributeError, ValueError, KeyError):
        print 'Get servertime error!'
        return None

def get_pwd(pwd, servertime, nonce):
    # sha1 three times, mixing servertime and nonce into the last round
    pwd1 = hashlib.sha1(pwd).hexdigest()
    pwd2 = hashlib.sha1(pwd1).hexdigest()
    pwd3_ = pwd2 + servertime + nonce
    pwd3 = hashlib.sha1(pwd3_).hexdigest()
    return pwd3

def get_user(username):
    # URL-quote the username, then base64-encode it (dropping the trailing newline)
    username_ = urllib.quote(username)
    username = base64.encodestring(username_)[:-1]
    return username

def main():
    username = 'www.crazyant.net'  # Weibo account
    pwd = 'xxxx'  # Weibo password
    url = 'http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.3.18)'
    prelogin = get_servertime()
    if prelogin is None:
        return
    servertime, nonce = prelogin
    # Fill in a copy so the module-level postdata template stays a dict
    fields = dict(postdata)
    fields['servertime'] = servertime
    fields['nonce'] = nonce
    fields['su'] = get_user(username)
    fields['sp'] = get_pwd(pwd, servertime, nonce)
    body = urllib.urlencode(fields)
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686; rv:8.0) Gecko/20100101 Firefox/8.0'}
    # Send the login request; once it and the redirect below succeed,
    # urllib2 can request any Sina content
    req = urllib2.Request(
        url=url,
        data=body,
        headers=headers
    )
    result = urllib2.urlopen(req)
    text = result.read()
    # print text
    p = re.compile(r"location\.replace\('(.*?)'\)")
    try:
        login_url = p.search(text).group(1)
        print login_url
        urllib2.urlopen(login_url)
        print "login success"
    except (AttributeError, urllib2.URLError):
        print 'Login error!'
    # Test reading data; the URL below can be replaced with any address and the content can still be read
    req = urllib2.Request(url='http://e.weibo.com/aj/mblog/mbloglist?page=1&count=15&max_id=3463810566724276&pre_page=1&end_id=3458270641877724&pagebar=1&_k=134138430655960&uid=2383944094&_t=0&__rnd=1341384513840')
    result = urllib2.urlopen(req)
    text = result.read()
    # The body has already been read once, so measure text rather than reading again
    print len(text)
    # Decode the \uXXXX escapes in the response without calling eval on remote data
    print text.decode('unicode_escape')

main()
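
The three pure-data steps in the listing (unwrapping the JSONP prelogin response, encoding the username for the su field, and hashing the password for the sp field) can be tried without touching the network. The following is a sketch assuming Python 3; the function names and the sample response are made up for illustration, and it only mirrors the scheme in the code above without checking whether the site still accepts it:

```python
import base64
import hashlib
import json
import re
from urllib.parse import quote


def parse_jsonp(body):
    """Extract the JSON object wrapped in callback(...)."""
    match = re.search(r'\((.*)\)', body, re.S)
    if match is None:
        raise ValueError('no JSONP payload found')
    return json.loads(match.group(1))


def encode_username(username):
    """URL-quote the username, then base64-encode it (the 'su' field)."""
    return base64.b64encode(quote(username).encode('ascii')).decode('ascii')


def hash_password(pwd, servertime, nonce):
    """sha1(sha1(sha1(pwd)) + servertime + nonce), using hex digests (the 'sp' field)."""
    def sha1_hex(value):
        return hashlib.sha1(value.encode('utf-8')).hexdigest()
    return sha1_hex(sha1_hex(sha1_hex(pwd)) + servertime + nonce)


# Made-up sample response, for illustration only
sample = 'sinaSSOController.preloginCallBack({"servertime": 1329806375, "nonce": "ABC123"})'
data = parse_jsonp(sample)
print(encode_username('undefined'))
print(hash_password('xxxx', str(data['servertime']), data['nonce']))
```

Encoding the string undefined yields the same value as the su parameter hard-coded in the prelogin URL of the listing, which suggests that request is sent before any username is known.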

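Once you have a urllib2 with a simulated login, you can scrape data or do anything else; you could even write a multithreaded crawler to crawl all of Sina Weibo. I have always had this idea but never implemented it. If you make any progress, please contact me so we can improve together.

Posted in: python Tagged: data scraping, data collection, simulated login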

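Installing PostgreSQL and python-psycopg2 on Ubuntu: a basic tutorial (with error fixes)

2012/06/27 by Crazyant · 3 comments

Django supports four databases: PostgreSQL (pgsql), SQLite 3, MySQL, and Oracle. PostgreSQL and MySQL are the most closely watched open-source databases; MySQL is more popular in China, which has a lot to do with the PHP community's strong promotion of LAMP. There are many comparisons of MySQL and PostgreSQL online and no need to repeat them here, but one thing is certain: PostgreSQL offers stronger support for Django's GIS features. This article installs the PostgreSQL database for Python Django on Ubuntu, along with pgadmin3, python-psycopg2, and so on.

Installing the PostgreSQL database

sudo apt-get install postgresql postgresql-client postgresql-contrib

Output during installation: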

The following NEW packages will be installed:
libossp-uuid16 libpq5 postgresql postgresql-8.4 postgresql-client
postgresql-client-8.4 postgresql-client-common postgresql-common
postgresql-contrib postgresql-contrib-8.4
……
Adding user postgres to group ssl-cert
……
Creating new cluster (configuration: /etc/postgresql/8.4/main, data: /var/lib/postgresql/8.4/main)…
Moving configuration file /var/lib/postgresql/8.4/main/postgresql.conf to /etc/postgresql/8.4/main…
Moving configuration file /var/lib/postgresql/8.4/main/pg_hba.conf to /etc/postgresql/8.4/main…
Moving configuration file /var/lib/postgresql/8.4/main/pg_ident.conf to /etc/postgresql/8.4/main…
Configuring postgresql.conf to use port 5432…
……
* Starting PostgreSQL 8.4 database server [ OK ]
Setting up postgresql (8.4.8-0ubuntu0.11.04) …
Setting up postgresql-client (8.4.8-0ubuntu0.11.04) …
Setting up postgresql-contrib-8.4 (8.4.8-0ubuntu0.11.04) …
Setting up postgresql-contrib (8.4.8-0ubuntu0.11.04) …
Processing triggers for libc-bin …

So the configuration files are created in /etc/postgresql/8.4/main/
The service script is:

sudo /etc/init.d/postgresql {start|stop|restart|reload|force-reload|status}
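
Since the goal is to use this server from Django through python-psycopg2, the settings entry might look like the sketch below. Every value is a placeholder assumption (database name, user, password), and only the port is taken from the install log above:

```python
# settings.py fragment; all values below are placeholders
DATABASES = {
    'default': {
        # Older Django releases used 'django.db.backends.postgresql_psycopg2'
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'mydb',
        'USER': 'postgres',
        'PASSWORD': 'change-me',
        'HOST': 'localhost',
        'PORT': '5432',  # the port reported in the install log
    }
}
```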

Posted in: python

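Publishing code remotely from Eclipse (automatic sync over SSH)

2012/06/27 by Crazyant · 1 comment

Eclipse has a plug-in called Eclipse Remote System Explorer (RSE). To use it:

1. Download RSE

URL: http://download.eclipse.org/tm/downloads/drops/R-3.3.2-201202061000/

2. Install it into Eclipse (version 3.4 or later)

Unpack the RSE archive and copy its contents straight into the Eclipse root directory

3. Open Eclipse

New -> Project -> RSE -> Connection
Fill in the IP address and a name

4. Switch the Eclipse workspace view to RSE

5. Right-click to create a new connection, then enter the IP address

6. Right-click and connect, then enter the username and password; synchronization completes

7. Under Sftp Files, create a new filter and enter the folder path to filter on, for example /home/crazyant

Finally, the corresponding server folder appears in the tree on the left and can be edited directly

Note: this article is original to www.crazyant.net; please credit the source when reposting.

Posted in: python Tagged: eclipse, python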

Installing Python on Linux (resolving a conflict with the old version)

2012/06/26 by Crazyant · No comments yet

1. Download the source code: http://www.python.org/ftp/python/2.5.2/Python-2.5.2.tar.bz2

2. Install

$ tar -jxvf Python-2.5.2.tar.bz2

$ cd Python-2.5.2

$ ./configure

$ make

$ make install

3. Test

Type python at the command line; if the Python interpreter starts, the installation succeeded.

On SUSE 10 or RHEL 5 (ES 5), the system ships with Python by default, but at version 2.4.x. After this installation, typing the following in the shell

#python

shows:

# python

Python 2.4.3 (#1, Dec 11 2006, 11:38:52)

[GCC 4.1.1 20061130 (Red Hat 4.1.1-43)] on linux2

Type "help", "copyright", "credits" or "license" for more information.

>>>

The version is still 2.4.x.

Fix:

#cd /usr/bin

#ll |grep python //list the python entries in this directory

#rm -rf python

#ln -s PREFIX/Python-2.5.2/python ./python //PREFIX is the directory where you unpacked Python

#python

# python

Python 2.5.2 (#1, Dec 11 2006, 11:38:52)

[GCC 4.1.1 20061130 (Red Hat 4.1.1-43)] on linux2

Type "help", "copyright", "credits" or "license" for more information.

>>>

OK, problem solved!
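
Besides reading the banner, the interpreter can report which binary is actually running. The sketch below assumes a modern Python 3 (shutil.which does not exist on the 2.x versions discussed here), and interpreter_report is a hypothetical helper name:

```python
import shutil
import sys


def interpreter_report():
    """Collect the running version, its binary, and what `python` resolves to on PATH."""
    return {
        'version': '.'.join(str(part) for part in sys.version_info[:3]),
        'executable': sys.executable,
        'python_on_path': shutil.which('python'),  # None if no `python` is on PATH
    }


for key, value in interpreter_report().items():
    print(key, '=', value)
```

If executable and python_on_path disagree, the shell is still resolving python to a different installation, which is the situation the symlink above is meant to fix.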

Posted in: python Tagged: python

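About Python's apply

2012/06/10 by Crazyant · 6 comments

I used Python's apply function today and found it very handy.

What apply does:

apply(function, args[, keywords])

It calls a function indirectly when the arguments already exist in a tuple or dictionary. args is a tuple containing the positional arguments to pass to the function; if args is omitted, no arguments are passed. keywords is a dictionary containing the keyword arguments.

The return value of apply() is the return value of function(). The tuple argument is ordered: its elements must be in the same order as function()'s formal parameters. A few examples:

Calling a function that takes no arguments:

def say():
    print 'say in'
apply(say)

The output is 'say in'.

A function that takes only positional (tuple) arguments:

def say(a, b):
    print a, b
apply(say, ("hello", "老王python"))

A function that takes keyword arguments:

def say(a=1, b=2):
    print a, b
def haha(**kw):
    # say(kw)
    apply(say, (), kw)
haha(a='a', b='b')

The output is: a b

The following link shows a classic use of apply; it lets you write less code and spend more time with friends.
URL:
http://bbs.cnpythoner.com/viewthread.php?tid=139&extra=

apply has been deprecated since Python 2.3 in favor of the extended call syntax, function(*args, **keywords), and it was removed in Python 3.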
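
For reference, here are the same calls written with argument unpacking, which works on both Python 2 and Python 3. This is a small sketch in which say returns a string instead of printing, so the result is easy to inspect:

```python
def say(a=1, b=2):
    return '%s %s' % (a, b)


args = ('hello', 'world')
kwargs = {'a': 'a', 'b': 'b'}

print(say(*args))     # the call apply(say, args) would have made
print(say(**kwargs))  # the call apply(say, (), kwargs) would have made
```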

Posted in: python

Python basics: what are *args and **kwargs?

2012/06/10 by Crazyant · No comments yet

Let's start with an example:

def foo(*args, **kwargs):
    print 'args = ', args
    print 'kwargs = ', kwargs
    print '---------------------------------------'

if __name__ == '__main__':
    foo(1, 2, 3, 4)
    foo(a=1, b=2, c=3)
    foo(1, 2, 3, 4, a=1, b=2, c=3)
    foo('a', 1, None, a=1, b='2', c=3)

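The output shows, for each call, args as a tuple and kwargs as a dict.

As you can see, these two are Python's variable-argument forms. *args collects any number of unnamed (positional) arguments into a tuple; **kwargs collects keyword arguments into a dict. When both are used, *args must come before **kwargs in the parameter list, and in a call the positional arguments must come before the keyword arguments: a call such as foo(a=1, b='2', c=3, 'a', 1, None) raises "SyntaxError: non-keyword arg after keyword arg".

So now you know what *args and **kwargs are. There is also a rather elegant use for them: creating a dictionary.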
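
A minimal sketch of that idea, using a hypothetical helper name kw_dict (an assumption, not a name from the original post):

```python
def kw_dict(**kwargs):
    # kwargs is already a dict built from the keyword arguments
    return kwargs


print(kw_dict(a=1, b=2, c=3) == {'a': 1, 'b': 2, 'c': 3})
```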

In fact Python already provides the dict class; dict(a=1, b=2, c=3) creates a dictionary directly.

Posted in: python

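Operator overloading in Python

2012/06/09 by Crazyant · 2 comments

For operator overloading in Python, see the 2.7.3 documentation:

http://docs.python.org/reference/datamodel.html#special-method-names

or a detailed Chinese tutorial:

A Guide to Python's Magic Methods

Classes can overload Python's operators, which makes our objects behave like built-in ones. Methods with names of the form __X__ are special hooks; Python uses this naming convention to intercept operators and so implement overloading. Python calls these methods automatically when it evaluates an operator: for example, if an object defines or inherits an __add__ method, that method is called when the object appears in a + expression. Through overloading, user-defined objects behave just like built-in ones.

Overloading operators in a class

Operator overloading lets classes intercept standard Python operations.
Classes can overload all Python expression operators.
Classes can also overload object operations such as printing, function calls, and attribute qualification.
Overloading makes class instances behave more like built-in types.
Overloading is implemented through specially named class methods.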
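
To make the hooks concrete, here is a small illustrative class. Vector2 and its methods are an assumption made for this sketch, not code from the original post; it overloads +, ==, printing, and function-call syntax:

```python
class Vector2(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __add__(self, other):
        # Called for: self + other
        return Vector2(self.x + other.x, self.y + other.y)

    def __eq__(self, other):
        # Called for: self == other
        return (self.x, self.y) == (other.x, other.y)

    def __repr__(self):
        # Called by repr(), and by print when no __str__ is defined
        return 'Vector2(%r, %r)' % (self.x, self.y)

    def __call__(self, scale):
        # Called for: self(scale)
        return Vector2(self.x * scale, self.y * scale)


v = Vector2(1, 2) + Vector2(3, 4)
print(v)
print(v(2))
```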

Posted in: python Tagged: data collection

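Data collection techniques: installing the Libxml (lxml) module and using XPath in Python

2012/06/09 by Crazyant · 7 comments

To use XPath to extract data (titles, body text, and so on) from web pages fetched by a crawler, I installed the libxml2 module on Windows (what you actually use after installation is the lxml module), which includes XPath support.

Preparation

Required software:

Python 2.7
lxml-2.3.4.win32-py2.7.exe: it is best to install from the prepackaged exe, which installs lxml automatically and makes it ready to use

Installation

Installing Python 2.7 is not covered here.

To install lxml, just run the exe; it finds the Python 2.7 directory automatically and installs there.

Extracting with XPath

The following example verifies the setup. The program comes from this article on redice's Blog:

Installing the libxml2 library and using XPath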

      
#coding:utf-8
import codecs
import sys
# Without the following line, Unicode characters cannot be printed and a UnicodeEncodeError is raised.
sys.stdout = codecs.lookup('iso8859-1')[-1](sys.stdout)
from lxml import etree

html = r'''<div>
    <div>redice</div>
    <div id="email">redice@163.com</div>
    <div name="address">中国</div>
    <div>http://www.redicecn.com</div>
</div>'''
tree = etree.HTML(html)

# Get the email: the div that holds it has the id "email"
nodes = tree.xpath("//div[@id='email']")
print nodes[0].text

# Get the address: the div that holds it has the name "address"
nodes = tree.xpath("//div[@name='address']")
print nodes[0].text

# Get the blog URL: it is the second sibling div after the email div
nodes = tree.xpath("//div[@id='email']/following-sibling::div[2]")
print nodes[0].text

Output:

redice@163.com
中国
http://www.redicecn.com
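
If lxml is not available, the same three lookups can be approximated with the standard library's xml.etree.ElementTree. This is a swapped-in technique, not the lxml API used above: ElementTree supports only a subset of XPath and has no following-sibling axis, so the sibling step is done by list index. The sketch assumes Python 3 and a well-formed fragment:

```python
import xml.etree.ElementTree as ET

doc = '''<div>
    <div>redice</div>
    <div id="email">redice@163.com</div>
    <div name="address">中国</div>
    <div>http://www.redicecn.com</div>
</div>'''

root = ET.fromstring(doc)

# Attribute predicates are part of ElementTree's XPath subset
email = root.find(".//div[@id='email']")
address = root.find(".//div[@name='address']")

# No following-sibling axis here: take the second child after the email div
children = list(root)
blog = children[children.index(email) + 2]

print(email.text)
print(address.text)
print(blog.text)
```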

Posted in: python

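Python and MySQL by example: a code tutorial (quick reference)

2012/06/08 by Crazyant · No comments yet

This article covers working with MySQL from Python: executing SQL statements, fetching and iterating over result sets, reading a single field, getting a table's column names, inserting images into the database, running transactions, and more, each with code examples and detailed explanation. It is mostly code, a rich feast of it.

Example 1: getting the MySQL version

To install the mysql module for Python development on Windows, see my other article:

MySQL-python: downloading the EXE installer for Windows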

      
# -*- coding: UTF-8 -*-
# Requires the MySQLdb (MySQL-python) package
import MySQLdb as mdb

con = None
try:
    # How to connect to MySQL: connect('ip', 'user', 'password', 'dbname')
    con = mdb.connect('localhost', 'root', 'root', 'test')
    # All queries run on a cursor obtained from the connection con
    cur = con.cursor()
    # Execute a query
    cur.execute("SELECT VERSION()")
    # Fetch the result of the previous query; it is a single row
    data = cur.fetchone()
    print "Database version : %s " % data
finally:
    if con:
        # Whatever happens, remember to close the connection
        con.close()

Output:

Database version : 5.5.25
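
The connect, cursor, execute, fetchone, close sequence is the generic Python DB-API pattern, so it can be tried without a MySQL server. The sketch below swaps in the standard library's sqlite3 module as a stand-in (an assumption made for illustration; it is not MySQLdb, and SELECT sqlite_version() replaces SELECT VERSION()):

```python
import sqlite3


def database_version(path=':memory:'):
    """Open a connection, run one query on a cursor, and always close the connection."""
    con = None
    try:
        con = sqlite3.connect(path)
        cur = con.cursor()
        cur.execute('SELECT sqlite_version()')
        row = cur.fetchone()  # a one-element tuple
        return row[0]
    finally:
        if con:
            con.close()


print('Database version : %s' % database_version())
```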
