Avoid URLs Matching Any of a Set of Patterns (Chilkat/Python Study Notes, Part 4): Filtering URLs

Article link: https://blog.youkuaiyun.com/Xiao_Qiang_/article/details/2821032

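This post shows one way to crawl web pages with Python, and how to set avoid patterns so the crawler skips useless or irrelevant links, such as links to particular file types or links about particular programming languages.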

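As everyone knows, the hyperlinks on a page can point to all sorts of things: useful, very useful, boring, useless, even broken, empty, or just plain baffling. Being a crawler is hard work, and you keep getting your feelings toyed with by href. So what do you do when you run into this stuff? Filter it out and kick it far, far away. A crawler like me has plenty of feelings, but definitely does not like to sleep around.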

Code:

import chilkat

spider = chilkat.CkSpider()
#  The spider object crawls a single web site at a time.  As you'll see
#  in later examples, you can collect outbound links and use them to
#  crawl the web.  For now, we'll simply spider 10 pages of chilkatsoft.com
spider.Initialize("www.chilkatsoft.com")
#  Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/")
#  Avoid URLs matching these patterns:
spider.AddAvoidPattern("*java*")
spider.AddAvoidPattern("*python*")
spider.AddAvoidPattern("*perl*")
#  Begin crawling the site by calling CrawlNext repeatedly.
for i in range(0, 10):
    success = spider.CrawlNext()
    if success:
        #  Show the URL of the page just spidered.
        print(spider.lastUrl())
        #  The HTML is available in the LastHtml property
    else:
        #  Did we get an error or are there no more URLs to crawl?
        if spider.get_NumUnspidered() == 0:
            print("No more URLs to spider")
        else:
            print(spider.lastErrorText())
    #  Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000)

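As the code shows, it filters out URLs containing "java", "python", and "perl". In practice, though, what we should be filtering out is more like "dtd, xsd, javascript, (, zip, rar" and so on, depending on what the situation actually calls for.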
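
To get a feel for what such wildcard patterns would and would not catch before handing them to the spider, here is a minimal standalone sketch. It does not use Chilkat at all: it approximates the matching with Python's standard fnmatch module, so it is an assumption about how "*...*" patterns behave, not Chilkat's actual implementation. The pattern list, the helper names should_avoid and filter_urls, the case-insensitive matching, and every sample URL other than the site root are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical pattern list derived from the items named in the text
# (dtd, xsd, javascript, "(", zip, rar). Adjust to your own needs.
AVOID_PATTERNS = ["*.dtd", "*.xsd", "*javascript*", "*(*", "*.zip", "*.rar"]


def should_avoid(url, patterns=AVOID_PATTERNS):
    """Return True if the URL matches any wildcard pattern.

    Matching is case-insensitive here by choice; that is an assumption,
    not something taken from the Chilkat documentation.
    """
    lowered = url.lower()
    return any(fnmatchcase(lowered, p.lower()) for p in patterns)


def filter_urls(urls, patterns=AVOID_PATTERNS):
    """Keep only the URLs that match none of the patterns."""
    return [u for u in urls if not should_avoid(u, patterns)]


if __name__ == "__main__":
    # Sample links of the kind a crawler might pull out of href attributes.
    candidates = [
        "http://www.chilkatsoft.com/",
        "http://www.chilkatsoft.com/download/files.zip",
        "javascript:void(0)",
        "http://www.chilkatsoft.com/schema/page.xsd",
    ]
    for url in filter_urls(candidates):
        print(url)
```

Once a pattern list behaves the way you want on sample links, each entry can be passed to spider.AddAvoidPattern() in the same way as "*java*" in the code above.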

Note: Chilkat has many functions for restricting which URLs get crawled. For details, see http://www.example-code.com/python/pythonspider.asp