Everyone knows that a page's hyperlinks can point to all kinds of things: useful ones, very useful ones, but also boring ones, useless ones, even broken ones, empty ones, and plain baffling ones. Writing a crawler is hard work, and your feelings get toyed with by href after href. So what do you do when you run into this stuff? Filter it out and kick it far, far away. My crawler has plenty of feelings, but it absolutely refuses to be promiscuous about its links.
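Before getting to chilkat, here is a minimal sketch of that kind of filtering in plain Python, assuming you already have the raw href strings in hand; the is_junk_href helper and its pattern lists are my own illustration, not part of any library:

from urllib.parse import urlparse

# Hypothetical helper: decide whether an href is worth crawling.
JUNK_SCHEMES = ("javascript", "mailto", "tel")
JUNK_SUFFIXES = (".zip", ".rar", ".dtd", ".xsd")

def is_junk_href(href):
    href = (href or "").strip()
    if not href or href.startswith("#"):    # empty or fragment-only link
        return True
    parsed = urlparse(href)
    if parsed.scheme in JUNK_SCHEMES:       # javascript:, mailto:, tel:
        return True
    if parsed.path.lower().endswith(JUNK_SUFFIXES):  # archives, schema files
        return True
    return False

links = ["", "#top", "javascript:void(0)",
         "http://example.com/a.zip",
         "http://www.chilkatsoft.com/faq.asp"]
print([u for u in links if not is_junk_href(u)])  # keeps only the last one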
Here is the chilkat version:
import chilkat

spider = chilkat.CkSpider()

# The spider object crawls a single web site at a time. As you'll see
# in later examples, you can collect outbound links and use them to
# crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com.
spider.Initialize("www.chilkatsoft.com")

# Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/")

# Avoid URLs matching these patterns:
spider.AddAvoidPattern("*java*")
spider.AddAvoidPattern("*python*")
spider.AddAvoidPattern("*perl*")

# Begin crawling the site by calling CrawlNext repeatedly.
for i in range(10):
    success = spider.CrawlNext()
    if success:
        # Show the URL of the page just spidered.
        print(spider.lastUrl())
        # The HTML is available in the LastHtml property.
    else:
        # Did we get an error, or are there no more URLs to crawl?
        if spider.get_NumUnspidered() == 0:
            print("No more URLs to spider")
        else:
            print(spider.lastErrorText())
    # Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000)
"dtd,xsd,javascript,(,zip,rar"等等,看实际情况需要了;
Note: chilkat has many functions for restricting which URLs get crawled; see http://www.example-code.com/python/pythonspider.asp for the details.
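Two more of those restriction methods, sketched from memory (check the page above for the authoritative list and signatures; the pattern strings are just examples):

# Only spider URLs matching this wildcard pattern.
spider.AddMustMatchPattern("*chilkatsoft.com*")

# Ignore outbound links (links to other domains) matching this pattern.
spider.AddAvoidOutboundLinkPattern("*ads*")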