How scraping Google links with Python works, explained in detail

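This article shows how to use Python with the urllib2 and simplejson libraries to build a simple web crawler: constructing a request, sending it, receiving the response, and parsing the JSON data. A worked example runs a search, fetches the results, and extracts key fields such as the URL of each hit.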


# coding=utf-8
# Python 2 script: query the Google AJAX Search API and print the result URLs.
import urllib
import urllib2

import simplejson

searchstr = 'hello'

for x in range(6):
    print "page:%s" % (x + 1)
    page = x * 8  # offset of the first result on this page (8 results per page)

    url = ('https://ajax.googleapis.com/ajax/services/search/web'
           '?v=1.0&q=%s&rsz=8&start=%s') % (urllib.quote(searchstr), page)
    try:
        # class urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])
        request = urllib2.Request(
            url, None, {'Referer': 'http://www.sina.com'})
        response = urllib2.urlopen(request)  # urlopen returns a file-like object

        # Parse the JSON body.
        results = simplejson.load(response)
        infoaaa = results['responseData']['results']
    except Exception as e:
        print e
    else:
        for minfo in infoaaa:
            print minfo['url']

How it works:

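1. The Google API

https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=hello&rsz=2&start=1

q: the search query

rsz: the number of results per page

start: the index of the first result to return. This is a result offset rather than a page number, which is why the script passes x * 8 when requesting 8 results per page.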
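How the three parameters combine can be sketched as follows. This is a minimal Python 3 sketch (the script above targets Python 2); the helper name build_search_url and its page_size argument are hypothetical, introduced only for illustration. It only builds the URLs and sends nothing, since this legacy endpoint may no longer respond.

```python
from urllib.parse import quote, urlencode

BASE_URL = 'https://ajax.googleapis.com/ajax/services/search/web'


def build_search_url(query, page, page_size=8):
    """Return the search URL for a zero-based page number (hypothetical helper)."""
    params = [
        ('v', '1.0'),
        ('q', query),
        ('rsz', page_size),
        ('start', page * page_size),  # result offset, not a page number
    ]
    # quote_via=quote encodes spaces as %20, matching urllib.quote in the script.
    return BASE_URL + '?' + urlencode(params, quote_via=quote)


for page in range(3):
    print(build_search_url('hello world', page))
```

Keeping the offset arithmetic in one place makes it harder to confuse start with a page index.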

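2. request = urllib2.Request(url, None, {'Referer': 'http://www.sina.com'})

The second argument (data) is None, so the request is sent as a GET; the third argument is a dict of extra HTTP headers, here a Referer.

print request shows:

<urllib2.Request instance at 0x02527670>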
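The printed repr says little about what the request holds. As a rough illustration, the sketch below builds the same request in Python 3, where urllib2.Request lives at urllib.request.Request, and inspects it without sending anything. The variable names are only for this example.

```python
from urllib.request import Request

url = ('https://ajax.googleapis.com/ajax/services/search/web'
       '?v=1.0&q=hello&rsz=8&start=0')

# Same three positional arguments as in the script: url, data, headers.
request = Request(url, None, {'Referer': 'http://www.sina.com'})

print(request.full_url)               # the URL the request will be sent to
print(request.get_method())           # GET, because data is None
print(request.get_header('Referer'))  # the extra header passed in
```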

3. response = urllib2.urlopen(request)

print response shows:

<addinfourl at 38972272 whose fp = <socket._fileobject object at 0x025249F0>>

This is a file-like object (a file handle) from which the response body can be read.
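"File-like" here mainly means the object has a read() method, and that the body can be consumed only once. The Python 3 sketch below uses io.BytesIO as a stand-in for the real response, on the assumption that the real object behaves the same way for read(); no network request is made.

```python
import io

# Stand-in for the object returned by urlopen (assumption: a real response
# exposes read() the same way; nothing is fetched here).
response = io.BytesIO(b'{"responseStatus": 200}')

body = response.read()      # first read returns the whole body as bytes
leftover = response.read()  # second read returns nothing: the stream is spent

print(body.decode('utf-8'))
print(repr(leftover))
```

If the body is needed more than once, it is simplest to keep the bytes from the first read in a variable.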

4. results = simplejson.load(response)

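This parses the response body as JSON into a Python dict. Running the output through an online JSON validator gives a more readable layout.

print results shows: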

{
    u'responseData': {
        u'cursor': {
            ... omitted ...
        },
        u'results': [
            {
                u'GsearchResultClass': u'GwebSearch',
                u'visibleUrl': u'www.hellodesign.com',
                u'titleNoFormatting': u'Hello',
                u'title': u'<b>Hello</b>',
                u'url': u'http://www.hellodesign.com/',
                u'cacheUrl': u'http://www.google.com/search?q=cache:2q-UgYT2lOsJ:www.hellodesign.com',
                u'unescapedUrl': u'http://www.hellodesign.com/',
                u'content': u'<b>Hello</b> is an interactive design agency. We create intelligent, living systems for people to experience. We believe in building systems that are useful, usable, and <b>...</b>'
            },
            {
                u'GsearchResultClass': u'GwebSearch',
                u'visibleUrl': u'www.youtube.com',
                u'titleNoFormatting': u'Hello by Lionel Richie - YouTube',
                u'title': u'<b>Hello</b> by Lionel Richie - YouTube',
                u'url': u'http://www.youtube.com/watch%3Fv%3Db_ILDFp5DGA',
                u'cacheUrl': u'http://www.google.com/search?q=cache:yrhDbK-_ZcEJ:www.youtube.com',
                u'unescapedUrl': u'http://www.youtube.com/watch?v=b_ILDFp5DGA',
                u'content': u'Jul 8, 2009 <b>...</b> The music video for Lionel Richie\'s "<b>Hello</b>" directed by Bob Giraldi, attracts attention as it tells the story of a music teacher (played by Lionel <b>...</b>'
            }
        ]
    },
    u'responseDetails': None,
    u'responseStatus': 200
}
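The parsing step can be tried offline. The Python 3 sketch below uses the standard library json module, whose load function takes a file-like object just as simplejson.load does. The payload is a trimmed copy of the response shown above, and io.BytesIO stands in for the object that urlopen would return; both are assumptions made only for this example.

```python
import io
import json

# Trimmed copy of the response above, wrapped in BytesIO as a stand-in
# for the file-like object that urlopen would return.
raw = b'''{
  "responseData": {
    "results": [
      {"url": "http://www.hellodesign.com/",
       "titleNoFormatting": "Hello"},
      {"url": "http://www.youtube.com/watch%3Fv%3Db_ILDFp5DGA",
       "titleNoFormatting": "Hello by Lionel Richie - YouTube"}
    ]
  },
  "responseDetails": null,
  "responseStatus": 200
}'''
response = io.BytesIO(raw)

results = json.load(response)  # reads from the file-like object and parses

print(type(results).__name__)
print(results['responseStatus'])
print(json.dumps(results, indent=2))  # readable layout without an online tool
```

json.dumps with indent gives a readable layout locally, which can replace the online validator for quick inspection.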

5. infoaaa = results['responseData']['results']

This takes the 'results' list out of results['responseData']; each element is a dict describing one search hit.

6. for minfo in infoaaa:

    print minfo['url']

Finally, the url field of each entry in infoaaa is printed:

http://www.hellodesign.com/

http://www.youtube.com/watch%3Fv%3Db_ILDFp5DGA
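Note that the url field comes back percent-escaped (watch%3Fv%3D...), while unescapedUrl holds the decoded form. The Python 3 sketch below decodes the url field and also guards against a missing responseData. The helper name extract_urls is hypothetical, and the guard assumes that a failed query sets responseData to null.

```python
from urllib.parse import unquote


def extract_urls(results):
    """Return decoded result URLs, or an empty list if there is no data.

    Hypothetical helper; the None check assumes a failed query sets
    responseData to null.
    """
    data = results.get('responseData')
    if not data:
        return []
    return [unquote(item['url']) for item in data.get('results', [])]


sample = {
    'responseData': {
        'results': [
            {'url': 'http://www.hellodesign.com/'},
            {'url': 'http://www.youtube.com/watch%3Fv%3Db_ILDFp5DGA'},
        ],
    },
    'responseStatus': 200,
}

for link in extract_urls(sample):
    print(link)
```

Returning an empty list instead of raising lets the paging loop move on to the next page when one request yields no data.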
