Extract HTML Title, Description, Keywords (Chilkat/Python Study Notes, Part 2)


Xiao_Qiang_


License: CC 4.0 BY-SA

Category: python. Tags: html, python

Article link: https://blog.youkuaiyun.com/Xiao_Qiang_/article/details/2820784


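This post shows how to use the Chilkat library from Python to do basic web crawling and how to fix garbled Chinese output. A worked example retrieves the page title, description, and keywords from a given site.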

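Since I have set out to learn Chilkat, I will keep writing about it.

Let's begin.
To follow this post you need to know Python syntax. Python is simple, but what you can do with it is not, which is why I am learning it. You also need to have Chilkat installed; for the details, see my earlier post: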

Getting Started Spidering a Site: a crawler exercise using Chilkat (Python) (from: http://www.example-code.com)

http://blog.youkuaiyun.com/Xiao_Qiang_/archive/2008/08/23/2820293.aspx

1. Source code

# Python 2 script (uses print statements and unicode()).
from extra import chilkat
# The Chilkat Spider component/library is free.
spider = chilkat.CkSpider()

# The spider object crawls a single web site at a time.  As you'll see
# in later examples, you can collect outbound links and use them to
# crawl the web.  For now, we'll simply spider 10 pages of www.vtchina.com
spider.Initialize("http://www.vtchina.com/")

# Add the 1st URL:
spider.AddUnspidered("http://www.vtchina.com/")


# Begin crawling the site by calling CrawlNext repeatedly.

for i in range(0,10):

    success = spider.CrawlNext()
    if success:
        # Show the URL of the page just spidered.
        print spider.lastUrl()

        # The HTML META keywords, title, and description are available in these properties:
        print spider.lastHtmlTitle()
        
        info = spider.lastHtmlDescription()
        HtmlDescription = unicode(info,"utf-8")
        print HtmlDescription
        print spider.lastHtmlKeywords()

        # The HTML is available in the LastHtml property
    else:
        # Did we get an error or are there no more URLs to crawl?
        if (spider.get_NumUnspidered() == 0):
            print "No more URLs to spider"
        else:
            print spider.lastErrorText()

    # Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000)

 

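Note that I am crawling http://www.vtchina.com/, which is a Chinese-language site. When the program runs, the statement print spider.lastHtmlTitle() prints garbled text. To fix this, go to the directory from which the Chilkat Python module is loaded and first rename chilkat.cpy; any name other than chilkat will do, as long as Python no longer loads that file in place of chilkat.py. Then edit chilkat.py: find the function def lastHtmlTitle(*args): and change it to: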

    def lastHtmlTitle(*args):
        # Decode the UTF-8 bytes returned by the extension into a unicode object (Python 2).
        utfchar = _chilkat.CkSpider_lastHtmlTitle(*args)
        info = unicode(utfchar, "utf-8")
        return info

 

With this change the output is no longer garbled.
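The same decoding can also be done at the call site, as the script above already does for lastHtmlDescription(), which avoids editing chilkat.py. The following is only a minimal sketch, assuming the spider methods return UTF-8 encoded byte strings; decode_utf8 is a hypothetical helper name and is not part of Chilkat. It is written to run under both Python 2 and Python 3.

```python
def decode_utf8(raw, errors="replace"):
    """Return text for a UTF-8 byte string.

    Values that are already text are passed through unchanged, so the
    helper is safe to call on a method that has already been patched.
    """
    if isinstance(raw, bytes):  # in Python 2, bytes is an alias of str
        return raw.decode("utf-8", errors)
    return raw


# Stand-in for a value such as spider.lastHtmlTitle(): UTF-8 bytes of a Chinese string.
sample = u"\u4e2d\u6587\u6807\u9898".encode("utf-8")
decoded = decode_utf8(sample)
```

In the crawl loop this would be used as, for example, print decode_utf8(spider.lastHtmlKeywords()), so that the title, description, and keywords are all handled the same way.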

Since this is a very introductory example, there is not much to say about the code itself; it simply retrieves the page title.
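To make clear which parts of a page the title, description, and keywords come from, here is a rough sketch that pulls the same three values out of an HTML string using only the Python 3 standard library. This is not how Chilkat does it internally; MetaExtractor and extract_meta are hypothetical names introduced for illustration, and the sample HTML is made up.

```python
from html.parser import HTMLParser  # Python 3 standard library


class MetaExtractor(HTMLParser):
    """Collect the <title> text and the description/keywords <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Attribute names arrive lowercased; the name value may not be.
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract_meta(html):
    """Return (title, description, keywords) for an HTML string."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return (
        parser.title.strip(),
        parser.meta.get("description", ""),
        parser.meta.get("keywords", ""),
    )


page = (
    '<html><head><title> Demo </title>'
    '<meta name="Description" content="A page">'
    '<meta name="keywords" content="a, b"></head><body></body></html>'
)
title, description, keywords = extract_meta(page)
```

Because the input here is already a Python 3 str, no manual UTF-8 decoding step is needed in this variant.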