Python Challenge Level 4 in Depth: A Tiny Crawler

Approach:

Viewing the page source reveals the following:
<html>
<head>
  <title>follow the chain</title>
  <link rel="stylesheet" type="text/css" href="../style.css">
</head>
<body>
<!-- urllib may help. DON'T TRY ALL NOTHINGS, since it will never 
end. 400 times is more than enough. -->
<center>
<a href="linkedlist.php?nothing=12345"><img src="chainsaw.jpg" border="0"/></a>
<br><br><font color="gold"></center>
Solutions to previous levels: <a href="http://wiki.pythonchallenge.com/"/>Python Challenge wiki</a>.
<br><br>
IRC: irc.freenode.net #pythonchallenge
</body>
</html>
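
The two useful pieces of that source, the hint comment and the link, can also be pulled out programmatically. The following is only a sketch using the standard library: the class name HintParser and the shortened PAGE string are illustrative assumptions, not part of the original solution.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, parse_qs

# Assumed address of the level page, used to resolve the relative link.
PAGE_URL = "http://www.pythonchallenge.com/pc/def/linkedlist.php"

# Shortened copy of the page source shown above (illustrative).
PAGE = """<body>
<!-- urllib may help. DON'T TRY ALL NOTHINGS, since it will never
end. 400 times is more than enough. -->
<a href="linkedlist.php?nothing=12345"><img src="chainsaw.jpg" border="0"/></a>
</body>"""


class HintParser(HTMLParser):
    """Collect HTML comments and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.links = []

    def handle_comment(self, data):
        # Collapse line breaks so the hint reads as one sentence.
        self.comments.append(" ".join(data.split()))

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


parser = HintParser()
parser.feed(PAGE)
parser.close()

# Resolve the relative link and read its query string.
first_url = urljoin(PAGE_URL, parser.links[0])
params = parse_qs(urlsplit(first_url).query)

print(parser.comments[0])
print(first_url)
print(params)
```

The query string carries a single parameter, nothing, which is the value the rest of the solution keeps replacing.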

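Key information: the link linkedlist.php?nothing=12345 shows that the page takes a query parameter named nothing, and following the chain means passing a new value for it each time.
Visiting http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=12345 returns: the next nothing is 93781. Each response only hands over the next value to pass in, so following the chain by hand would take far too long.
This clearly calls for a loop (the page comment says 400 requests is more than enough), so let's go straight to the code.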
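
Before any network code, the parsing step can be tried on its own. This is a minimal sketch: the helper name parse_next_nothing is hypothetical, the first sample string is the response quoted above, and "example.html" is a made-up stand-in for a page that contains no next value.

```python
import re

# Matches the number that follows the phrase "next nothing is".
NEXT_NOTHING = re.compile(r"next nothing is (\d+)")


def parse_next_nothing(text):
    """Return the next value as a string, or None if the text has none."""
    match = NEXT_NOTHING.search(text)
    return match.group(1) if match else None


print(parse_next_nothing("the next nothing is 93781"))
print(parse_next_nothing("example.html"))
```

Anchoring the pattern on the phrase rather than on any run of digits keeps the parser from picking up unrelated numbers that a response might contain.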

Solution and code:

# -*- coding:utf-8 -*-
# **********************************
# ** http://weibo.com/lixiaodaoaaa #
# ** create at 2017/6/14   20:49 ***
# ****** by:lixiaodaoaaa ***********



import re
import urllib.request

BASE_URL = "http://www.pythonchallenge.com/pc/def/linkedlist.php"
MAX_REQUESTS = 400  # the page comment says 400 requests is more than enough


def getNumberAndContent(nothing):
    """Fetch the page for `nothing`; return (next nothing, page text)."""
    sourceUrl = BASE_URL + "?nothing="
    urlContent = urllib.request.urlopen(sourceUrl + nothing).read().decode("utf-8")
    match = re.search(r"next nothing is (\d+)", urlContent)
    newNothing = match.group(1) if match else ""
    # A page may give no number and ask to divide the current value by two.
    if len(newNothing) < 1 and "Divide" in urlContent:
        newNothing = str(int(nothing) // 2)
    return newNothing, urlContent


if __name__ == '__main__':
    nothing = "12345"
    theEndSkip = ""
    for requestCount in range(1, MAX_REQUESTS + 1):
        nothing, urlContent = getNumberAndContent(nothing)
        if len(nothing.strip()) > 0:
            print("Request %d: nothing is %s, content is %s" % (requestCount, nothing, urlContent))
        if "html" in urlContent:
            # The chain ends with a page name instead of a number.
            print(urlContent)
            theEndSkip = urlContent.strip()
            break
    # Swap the page name into the URL to get the address of the next level.
    theNextStepUrl = BASE_URL.replace("linkedlist.php", theEndSkip)
    print(theNextStepUrl)
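
The loop logic can be checked without touching the server by passing the fetch step in as a function. This is only a sketch under stated assumptions: follow_chain and FAKE_PAGES are hypothetical names, and the fake responses are invented stand-ins for what the site returns.

```python
import re


def follow_chain(fetch, start, max_steps=400):
    """Follow 'next nothing' values using fetch(nothing) -> page text.

    Returns the text of the first page that neither names a next value
    nor asks for a division, or None if max_steps is used up first.
    """
    nothing = start
    for _ in range(max_steps):
        text = fetch(nothing)
        match = re.search(r"next nothing is (\d+)", text)
        if match:
            nothing = match.group(1)
        elif "Divide" in text:
            # No number given: halve the current value and continue.
            nothing = str(int(nothing) // 2)
        else:
            return text
    return None


# Invented responses standing in for the real server (assumption).
FAKE_PAGES = {
    "12345": "and the next nothing is 16",
    "16": "Yes. Divide by two and keep going.",
    "8": "and the next nothing is 3",
    "3": "example.html",
}

print(follow_chain(FAKE_PAGES.__getitem__, "12345"))
```

Against the real site, the fetch argument would be a small function wrapping the HTTP request, and the max_steps cap follows the hint that 400 requests is more than enough.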



