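A Collection of Web-Related Operations in Python

This article shows how to fetch web page content with Python, covering GET requests that retrieve a page and POST requests that simulate a form submission. It also shows how to use Python with the IE component on Windows to obtain the source of a page that includes Ajax-loaded content.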
<p>1. Fetching page content from a URL</p>
<p></p>
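<p># Python 2 code (urllib.urlopen and unicode() do not exist in Python 3)<br>
import urllib<br><br>
# Download the raw bytes of the page<br>
sock = urllib.urlopen("http://www.hao123.com/")<br>
strhtml = sock.read()<br>
sock.close()<br>
# Decode the bytes as gb2312, then re-encode them as UTF-8, dropping anything that cannot be converted<br>
strhtml = unicode(strhtml, 'gb2312', 'ignore').encode('utf-8', 'ignore')<br>
print(strhtml)</p>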
<p>Reposted from: http://hi.baidu.com/kopla/blog/item/591335afde167ce8fbed505a.html</p>
<p>That blog has quite a few good posts on fetching web content with Python.</p>
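<p>For readers on Python 3, the fetch in section 1 might be sketched roughly as follows. This is not the original author's code: the function names decode_page and fetch_page and the timeout parameter are made up for illustration, and the page is still assumed to be served as gb2312, as in the original.</p>

```python
import urllib.request


def decode_page(raw, source_encoding="gb2312"):
    """Decode raw page bytes, skipping undecodable bytes, and return UTF-8 bytes."""
    text = raw.decode(source_encoding, errors="ignore")
    return text.encode("utf-8", errors="ignore")


def fetch_page(url, source_encoding="gb2312", timeout=10):
    """Download url and return its content re-encoded as UTF-8 bytes."""
    # Hypothetical helper: the timeout value is an arbitrary choice.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return decode_page(response.read(), source_encoding)
```

<p>Calling fetch_page("http://www.hao123.com/") would then return the page as UTF-8 bytes. Keeping the decoding step in its own function makes it possible to check it without touching the network.</p>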
<p></p>
<p>2. POST requests</p>
<p></p>
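<p>This is code I wrote a while ago to post to a forum automatically; the data part is whatever you want to submit.<br>
The best approach is to submit the form once by hand (a registration form, for example), capture the traffic to see exactly what gets POSTed, and then change the data part to what you want to submit. Pay attention to the message format, and that is all there is to it.<br>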
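#!/usr/bin/python<br>
# Python 2 code (httplib became http.client in Python 3, and the legacy httplib.HTTP class was removed)<br><br>
import httplib<br><br>
# Placeholders: fill these in from a captured request<br>
cookie = '' # Cookie header value of a logged-in session<br>
data = '' # multipart/form-data body; it must use the same boundary as the Content-Type header below<br><br>
http = httplib.HTTP('the host you want to connect to')<br><br>
# write header<br>
http.putrequest("POST", '/phpwind/post.php?')<br>
http.putheader("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.2; MAXTHON 2.0)")<br>
http.putheader("Referer", 'http://10.16.62.100/phpwind/post.php?fid=2')<br>
http.putheader("Host", '10.16.62.100')<br>
http.putheader("Cookie", cookie)<br>
http.putheader("Content-Type", 'multipart/form-data; boundary=---------------------------7d91d42da0af0')<br>
http.putheader("Content-Length", str(len(data)))<br>
http.endheaders()<br><br>
# write body<br>
http.send(data)<br><br>
# get response<br>
errcode, errmsg, headers = http.getreply()<br><br>
if errcode != 200:<br>
&nbsp;&nbsp;&nbsp;&nbsp;# The original raised an undefined Error class; RuntimeError is used here instead<br>
&nbsp;&nbsp;&nbsp;&nbsp;raise RuntimeError('POST failed: %s %s' % (errcode, errmsg))<br>
file = http.getfile()<br>
print file.read()</p>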
<p></p>
<p>Reposted from: http://topic.youkuaiyun.com/u/20101012/14/51a74db4-fad7-4d05-ba64-69f6b0149a44.html</p>
<p></p>
<p>http://pleac.sourceforge.net/pleac_python/webautomation.html</p>
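<p>The POST code in section 2 leaves the data body for the reader to fill in from a packet capture. As a rough illustration of what such a body looks like, here is a minimal Python 3 sketch that builds a multipart/form-data body for plain text fields and sends it with urllib.request. The names build_multipart and post_form, and the field names in the usage note, are hypothetical; a real forum will expect whatever fields the capture shows.</p>

```python
import urllib.request


def build_multipart(fields, boundary):
    """Encode a dict of text fields as a multipart/form-data body (bytes)."""
    lines = []
    for name, value in fields.items():
        # Each part starts with "--" followed by the boundary from the Content-Type header
        lines.append("--" + boundary)
        lines.append('Content-Disposition: form-data; name="%s"' % name)
        lines.append("")  # blank line separates part headers from the value
        lines.append(value)
    # The closing delimiter has an extra "--" after the boundary
    lines.append("--" + boundary + "--")
    lines.append("")
    return "\r\n".join(lines).encode("utf-8")


def post_form(url, fields, boundary, cookie=""):
    """POST the fields to url and return (status, response body)."""
    body = build_multipart(fields, boundary)
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Content-Type", "multipart/form-data; boundary=" + boundary)
    if cookie:
        request.add_header("Cookie", cookie)
    # urllib sets Content-Length from the body; non-2xx replies raise HTTPError
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status, response.read()
```

<p>For example, build_multipart({"title": "hi"}, "XYZ") produces a single part delimited by --XYZ and terminated by --XYZ--. Comparing that output against a captured request is a quick way to check the message format before sending anything.</p>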
<p></p>
<p>Example:</p>
<p>Using Python on Windows with the IE component to get a page's source after its Ajax has run</p>
<p></p>
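<p>Nobody replied when I asked on the forum, so in the end I solved it myself. Here is the test code. One step closer to my slightly naughty goal, oh yeah!<br><br>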
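#!/usr/bin/env python<br>
#coding=utf-8<br>
# Python 2 with classic wxPython; wx.lib.iewin is available on Windows only<br>
import wx<br>
import wx.lib.iewin<br><br>
class MyFrame(wx.Frame):<br>
&nbsp;&nbsp;&nbsp;&nbsp;def __init__(self):<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;wx.Frame.__init__(self, parent=None, id=-1, pos=wx.DefaultPosition, title=u'iewin窗口')<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;panel = wx.Panel(self)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# Embed the IE browser control and start loading the page<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;self.html = wx.lib.iewin.IEHtmlWindow(panel, -1, pos=wx.DefaultPosition, style=0, name='OK')<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;self.html.LoadUrl('http://www.cnbeta.com/articles/105719.htm')<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;sizer = wx.BoxSizer(wx.HORIZONTAL)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;sizer.Add(self.html, 1, wx.ALL|wx.EXPAND, 0)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;panel.SetSizer(sizer)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;sizer.Fit(self)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# Register this frame as an event sink so that DocumentComplete below gets called<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;self.html.AddEventSink(self)<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;def DocumentComplete(self, pDisp, URL):<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# IE event fired when a document has finished loading<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;print isinstance(self.html.GetText(), unicode)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;s = self.html.GetText().encode('utf8')<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;fi = open('1.txt', 'w')<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# Replace one string with another, then save the result to 1.txt<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;t = s.replace('乔布斯', '片子')<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;fi.write(t)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;fi.close()<br><br>
if __name__ == '__main__':<br>
&nbsp;&nbsp;&nbsp;&nbsp;app = wx.PySimpleApp()<br>
&nbsp;&nbsp;&nbsp;&nbsp;frame = MyFrame()<br>
&nbsp;&nbsp;&nbsp;&nbsp;frame.Show()<br>
&nbsp;&nbsp;&nbsp;&nbsp;app.MainLoop()<br><br>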
</p>