MiniCrawler

MiniCrawler

GitHub path:

https://github.com/LixinZhang/miniCrowler

Introduction:

MiniCrawler is a simple web crawler implemented in Python.
It uses a thread pool to speed up page fetching.
To configure the crawler, edit config.py, then start the crawl with python run.py.
Fetched web pages are stored in the pages folder.
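
As an illustration of the thread pool idea, here is a minimal sketch built on the standard library's concurrent.futures. It is not MiniCrawler's actual code: the names fetch_page and fetch_all, the max_workers default, and the timeout are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch_page(url, timeout=10):
    """Download one page and return its raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch=fetch_page, max_workers=8):
    """Fetch every URL concurrently; return {url: bytes or Exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all downloads up front so the worker threads overlap
        # their network waits instead of fetching one page at a time.
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failed page should not stop the rest of the crawl.
                results[url] = exc
    return results
```

Because fetching is mostly waiting on the network, threads help here even with Python's global interpreter lock. Passing fetch as a parameter makes it easy to swap in a stub when testing without network access.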
check_status.py reports the job's status, as follows:

Rank  Hostname        Times
---------------------------
   1  buaa.edu.cn        40
   2  baixing.com        32
   3  cnblogs.com        29
   4  hao123.com          5
   5  xinhuanet.com       2
   6  visionplaza.cn      2
   7  people.com.cn       2
   8  org.cn              2
   9  news.cn             2
  10  most.gov.cn         2
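
A report of this shape can be produced by counting hostnames over the fetched URLs. The sketch below is one possible way to do it, not the actual check_status.py: rank_hostnames and format_report are hypothetical names, and the sketch groups by the full hostname, which may differ from how the original script derives its Hostname column.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_hostnames(urls, top=10):
    """Return the `top` most frequent hostnames as (hostname, count) pairs."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname  # None when the string has no host part
        if host:
            counts[host] += 1
    return counts.most_common(top)


def format_report(ranking):
    """Render the ranking as a plain-text table with right-aligned columns."""
    lines = [f"{'Rank':>4}  {'Hostname':>20}  {'Times':>5}", "-" * 33]
    for rank, (host, times) in enumerate(ranking, start=1):
        lines.append(f"{rank:>4}  {host:>20}  {times:>5}")
    return "\n".join(lines)
```

Calling print(format_report(rank_hostnames(urls))) on a list of crawled URLs prints the table to the terminal.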